
Teaching and Teacher Education 65 (2017) 48–60


journal homepage: www.elsevier.com/locate/tate

Subjectivity of teacher judgments: Exploring student characteristics that influence teacher judgments of student ability

Kane Meissel*, Frauke Meyer, Esther S. Yao, Christine M. Rubie-Davies

University of Auckland, Faculty of Education and Social Work, Private Bag 92601, Symonds St, Auckland, New Zealand

Highlights

• Explored alignment of standardized achievement results with teacher judgments.
• Marginalized students received lower judgments after controlling for achievement.
• Classroom and school achievement composition inversely related to teacher judgments.
• Robust moderation of teacher judgments needed, both within and between schools.
• Professional development may assist teachers to make fair and consistent judgments.

Article history: Received 22 June 2016; Received in revised form 16 January 2017; Accepted 25 February 2017; Available online 22 March 2017

Keywords: Teacher judgments; Standardized testing; Student achievement; Social justice

Abstract

Teacher judgments of student achievement are increasingly used for high-stakes decision-making, making it imperative that judgments be as fair and reliable as possible. Using a large national database from New Zealand, we explored the relation between psychometrically designed standardized achievement results and teacher judgments in reading (N = 4771 students) and writing (N = 11,765 students) using hierarchical linear modelling. Our findings indicated that judgments were systematically lower for marginalized learners after controlling for standardized achievement differences. Additionally, classroom and school achievement composition were inversely related to teacher judgments. These discrepancies are concerning, with important implications for equitable educational opportunities.

© 2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

The ability of teachers to accurately gauge student achievement is considered an important aspect of teachers' professional competence, as teacher judgments are often the primary source of information about student achievement (Ready & Wright, 2011; Südkamp, Kaiser, & Möller, 2012; Südkamp, Kaiser, & Möller, 2014). Teacher judgments are determinations made by teachers about students' current achievement (see Section 2 for more detail), and can impact teachers' ongoing instructional decision-making within the classroom, including instructional pace, level of support, and level of task difficulty (Alvidrez & Weinstein, 1999; Clark & Peterson, 1986; Hoge & Coladarci, 1989). For example, students judged to be more capable are more likely to receive higher quality learning opportunities than students judged less able (Clark & Peterson, 1986; Rubie-Davies, 2014; Rubie-Davies, Hattie, & Hamilton, 2006; Sharpley & Edgar, 1986). Furthermore, teacher judgments have implications for placement decisions in programs or ability groups, grade retention, and ultimately for students' future academic pathways (Begeny, Eckert, Montarello, & Storie, 2008; Begeny, Krouse, Brown, & Mann, 2011; Francis et al., 2016; Harlen, 2005; Parsons & Hallam, 2014; Wiliam & Bartholomew, 2004).

Internationally, much research has focused on teacher judgment alignment, mainly investigating the relations between teacher judgments and measured student performance. Reviews of this body of research have shown broad agreement between judgments and standardized assessments on average (r = 0.63, Südkamp et al., 2012), but the relations have been vastly inconsistent with a wide range of correlations reported (0.03 to 0.92; Hoge & Coladarci, 1989; Südkamp et al., 2012). Südkamp et al. (2012) noted that teacher judgments showed higher correlations with measured

* Corresponding author. University of Auckland, Faculty of Education and Social Work, Private Bag 92601, Symonds St, Auckland 1150, New Zealand. E-mail address: k.meissel@auckland.ac.nz (K. Meissel).

http://dx.doi.org/10.1016/j.tate.2017.02.021
0742-051X/© 2017 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

achievement when teachers were informed about what measure their judgment was being compared with. Correlations were also higher when judgments and measures addressed the same domain or aspect of ability. Other test characteristics such as the number of points on the judgment scale did not impact the degree of alignment between judgments and measured student achievement.

Within the New Zealand context, overall teacher judgments (OTJs) were introduced as a specific achievement measure in 2010, and are assessed in relation to expected curriculum standards in reading, writing, and mathematics. These judgments are commonly referred to as National Standards (NS), and are intended to reflect a student's achievement in relation to the standard expected of students at the same year level nationally. Teachers are asked to consider a range of data, including observations of student learning, learning conversations, and formal assessments such as standardized achievement tests, to reach a decision on whether a student meets the demands of the New Zealand curriculum (Ministry of Education, 2011). Determination of whether a student meets the standard is up to the teacher, with no mandate with respect to which of these forms of evidence is utilized, nor the degree of weighting of specific types of data. The judgment, however, should focus solely on a student's achievement at that point in time and should not include construct irrelevant information such as a student's behavior or perceived potential ability.

A perfect correlation is unlikely and arguably undesirable since the two measures should be used for different purposes – standardized assessments often focus on a specific aspect of a student's learning whereas teacher judgments should take into account a number of aspects of a student's achievement within a whole subject area. Nonetheless, while previous research has investigated the relation between judgments and standardized achievement, the properties of teacher judgments and what informs these decisions remain relatively unexplored. The question remains whether lower correlations simply reflect differences in the nature of the assessments, or whether there are construct irrelevant factors which influence teachers when they make judgments about student performance. For example, although neither ethnicity nor special needs status should affect a judgment about a student's achievement, previous research has indicated that such factors might indeed influence teachers' judgments (see e.g., Glock, Krolak-Schwerdt, & Pit-ten Cate, 2015; Martínez, Stecher, & Borko, 2009; Ready & Wright, 2011). Although discrepancies in teacher judgments are to be expected given that there is random error in all assessments of student performance, systematic differences relating to specific subgroups would suggest a degree of bias.

Alignment between standardized tests and teacher judgments may also be affected by the inherently different interpretive approaches (Hattie & Brown, 2003; Hattie et al., 2003). Standardized tests are specifically designed to maximize reliability and consistency across students, classrooms, schools, and regions. In contrast, individual teachers typically make evaluations of student performance in relation to local (class or school) level evidence. That is, although teacher judgments may be defined as criterion-referenced, judgments are likely to be influenced by normative evaluations, such as how well each student is performing in relation to other students within the teacher's class (Angoff, 1974).

Despite considerable work investigating the properties of teacher judgments, the majority of these studies have been conducted in a North American context. Notable exceptions include the earlier studies by Doherty and Conolly (1985) and Sharpley and Edgar (1986), which were undertaken in Australia and the UK respectively, as well as the more recent research undertaken in Germany by Kaiser, Retelsdorf, Südkamp, and Möller (2013).

The current study extends previous work investigating the nature of teacher judgments in several ways. It most closely aligns with the work of Ready and Wright (2011), but draws on a sample of older students (approximately 9–13 years old) in both reading and writing. Ready and Wright's (2011) study focused on kindergarten students and research has shown that the alignment between teacher judgments and measured achievement can differ across grade levels, highlighting the need for further study with respect to older students (Südkamp et al., 2014). In addition, teachers in Ready and Wright's (2011) study had no access to standardized assessment results of students, whereas teachers in the current study had access to each student's standardized achievement results, and were advised by the New Zealand Ministry of Education that this was a source of evidence that could be drawn on when making judgments about students' achievement. The meta-analysis by Südkamp et al. (2012) indicated that research has yet to examine the way in which teacher judgments are affected by knowledge of standardized assessment results prior to making a holistic judgment about students' achievement within a learning domain. Furthermore, relatively few studies have utilized data collected as part of regular school routine. The current study uses teacher judgments and standardized achievement results collected in actual classroom contexts where data collection was not an imposed measure for schools.

Previous research has frequently focused on relatively small, localized samples of students; the average sample size of the 75 studies reported in a recent meta-analysis by Südkamp et al. (2012) was 518 students. The current study drew on data from a large-scale teacher professional development project with almost 5000 students represented in reading, and around 12,000 additional students for writing. Since teacher judgments are inherently likely to violate statistical assumptions of independence because one teacher determines the judgments for all students in his/her class, we employed three-level hierarchical linear modelling in the analyses with students nested within classrooms and nested within schools. This enabled the inherent clustering of the data to be accounted for.
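The three-level nesting described here (students within classrooms within schools) can be sketched with a mixed model. The sketch below is illustrative only, not the authors' analysis: it simulates toy data with hypothetical variable names (`otj`, `std_score`, `school`, `classroom`) and fits a random intercept for school plus a variance component for classroom nested within school, using statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulate students nested in classrooms nested in schools.
# The teacher judgment (otj) depends on the standardized score plus
# school- and classroom-level intercepts and student-level noise.
rows = []
for school in range(12):
    school_effect = rng.normal(0, 0.3)
    for room in range(4):
        class_effect = rng.normal(0, 0.4)
        for _ in range(15):
            score = rng.normal(0, 1)
            otj = 0.6 * score + school_effect + class_effect + rng.normal(0, 0.5)
            rows.append({"school": school,
                         "classroom": f"s{school}c{room}",
                         "std_score": score,
                         "otj": otj})
data = pd.DataFrame(rows)

# Random intercept for each school (groups), plus a variance component
# for classrooms nested within schools (vc_formula).
model = smf.mixedlm("otj ~ std_score", data, groups="school",
                    vc_formula={"classroom": "0 + C(classroom)"})
fit = model.fit()
print(fit.fe_params["std_score"])  # recovers a slope near the true 0.6
```

With real data, the fixed-effects part of the formula would also carry the student, classroom, and school covariates examined in this study.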
Furthermore, the majority of teacher judgment studies have not attended to between-group differences with respect to student characteristics. The extent to which students' characteristics influence teachers' overall judgments of achievement remains largely inconclusive. Due to the importance of equitable educational opportunities, this is a key focus of the current study.

2. Review of teacher judgment literature

The following sections provide a review of the extant literature on teacher judgments – their alignment with standardized achievement results and the impact of student characteristics and school composition on these judgments. Due to the overlap of teacher expectations and teacher judgments, the review begins with a brief discussion of this issue.

2.1. Teacher expectations and teacher judgments

Teacher expectations and teacher judgments are similar in that both represent subjective teacher estimates about student achievement. They mainly differ in that expectations are typically predictions about future achievement while judgments are a current estimate of a student's performance. The latter are mostly made in circumstances in which the teacher has taught the student for some time and therefore can take into account a range of information. In contrast, teacher expectations focus on expected improvement or performance over a future time period, and are predictions of the possible academic progression of a student rather than an assessment of their existing skills and knowledge (see for example Rubie-Davies, Peterson, Sibley, & Rosenthal, 2015).

Südkamp et al. (2014) point out that research considering group differences based on student characteristics has largely focused on teacher expectations rather than teacher judgments. Teacher expectation research has concluded that teacher expectations about student abilities are subject to bias related to students' ethnicity, socioeconomic status (SES), gender, special needs, and English for Speakers of Other Languages (ESOL) status (McKown & Weinstein, 2008; Rubie-Davies et al., 2012; Südkamp et al., 2012). However, authors in this area have noted that students' actual achievement has in earlier studies often not been controlled for, limiting the power of such studies (Jussim & Harber, 2005; Rubie-Davies et al., 2012).

Teacher expectation researchers also note that negatively biased expectations are likely to be problematic for reasons beyond simple prejudice. It is argued that when teachers underestimate students' current achievement level, they are likely to plan lower level learning opportunities for those students (Rubie-Davies et al., 2006). In turn, this directly affects how much students learn, because these differential learning opportunities accumulate over time and ultimately reduce students' life chances (Rubie-Davies, 2014). Teacher judgments seem likely to be subject to similar biases, but since teacher judgments are more frequently formalized for high stakes decisions such as students' placements in ability groups or admission to a particular educational track, the ramifications of any bias in these judgments are particularly serious.

2.2. Teacher judgments and the influence of student characteristics

Research focusing on teacher judgments has tended to focus more strongly on the overall degree of correspondence between teacher judgments and measured student achievement, and less on differences in these judgments for specific student groups. Teacher judgment research considering the influence of student characteristics on those judgments remains largely inconclusive. Previous reviews and meta-analyses indicate difficulties with the aggregation of results as information on student characteristics has been scarce or reported in aggregated ways (e.g., Hoge & Coladarci, 1989; Ready & Wright, 2011; Südkamp et al., 2014). Furthermore, studies considering student characteristics have often used teacher judgments to validate rating scales and measures for certain domains and groups of students (e.g., Lembke, Foegen, Whittaker, & Hampton, 2008; Li, Pfeiffer, Petscher, Kumtepe, & Mo, 2008; Methe, Hintze, & Floyd, 2008), rather than to examine discrepancies in teacher judgments for specific learner groups. However, several student characteristics warrant examination in terms of whether these characteristics inform or influence teachers' judgments, including gender, ethnicity, socioeconomic background, and students' special needs or ESOL status.

Martínez et al. (2009) found that the gap between minority and non-minority kindergarten students' academic abilities was judged to be smaller by teachers than suggested by standardized test results in mathematics. The authors concluded that teachers may have been compensating for perceived inequities. In contrast, Ready and Wright (2011) concluded that teacher judgments for kindergarten students from low SES or minority backgrounds were negatively biased in the domain of literacy, whereas Feinberg and Shapiro (2003) found no effect of student ethnicity on teacher judgments.

Previous research has generally indicated that student gender has no significant effect on teachers' judgments (Doherty & Conolly, 1985; Hecht & Greenfield, 2002; Helwig, Anderson, & Tindal, 2001; Hoge & Butcher, 1984; Hoge & Coladarci, 1989; Sharpley & Edgar, 1986). However, Ready and Wright (2011) found a negative bias in teacher judgments for boys in the domain of literacy, while Martínez et al. (2009) showed that teachers marked girls higher on a criterion-referenced scale in mathematics than suggested by standardized achievement test results. Other studies have found that teachers were indirectly influenced by student gender when judging academic skills (Bennett, Gottesman, Rock, & Cerullo, 1993; Beswick, Willms, & Sloat, 2005). Foremost, perceptions of students' behavior affected teachers' judgments in these North American studies; boys were often rated lower in academic or literacy skills because their behavior was perceived to be more problematic than that of girls. A few studies have indicated that behavioral factors such as student engagement and motivation can influence teacher judgments (Benner & Mistry, 2007; Dompnier, Pansu, & Bressoux, 2006; Kaiser et al., 2013).

Few studies have examined the effect that English language learner or special needs status has on teachers' judgments, but research thus far has indicated negative relations. Hurwitz, Elliott, and Braden (2007) argued that teachers consistently underestimated the performance of students with special needs status, whereas Martínez et al. (2009) found that teachers marked students with English language learner or special needs status lower in mathematics than standardized achievement test results suggested.

2.3. Methodological differences in studies of teacher judgments

As described in the previous section, studies of teacher judgments have typically not considered group differences based on student characteristics. Where group differences have been considered, objective measures of student achievement have seldom been employed as a means of assessing teacher judgment discrepancies for specific subgroups, thereby neglecting to control for differences in student achievement between these groups. In general, these studies have followed a design where teachers have judged students' current ability on a scale provided by the researchers (e.g., Kaiser et al., 2013; Ready & Wright, 2011).

In addition, data have seldom been collected within the usual classroom context, comparing between-group differences in teacher judgments of their own students. This is important because when teachers make judgments of students in experimental studies, the descriptions they read are not of their own students. Therefore, it is not possible to determine whether the responses of teachers in experimental studies would be the same as in naturalistic studies where they know their students well and interact with them daily (Rubie-Davies, 2014).

The studies by Kaiser et al. (2013) and Ready and Wright (2011) provide the most comprehensive naturalistic studies to date in actual classroom settings. In total, we could only identify three studies that had used naturalistic data of teacher judgments made in the classroom context while also examining standardized achievement results (Kaiser et al., 2013; Martínez et al., 2009; Ready & Wright, 2011). Kaiser et al. (2013) reported on three studies examining the reciprocal relationship between students' reading achievement, engagement, and teacher judgments. The first study drew on data from 52 teachers and 1135 students taking part in a German literacy development project, whereas the other two studies utilized simulated classrooms to gather experimental data. The first study, which is of most relevance to the current study since it collected data in the field, found a moderate correlation of r = 0.49 between teacher judgments and student achievement. The authors pointed out that the low level of alignment could stem from teachers not being informed about the achievement measure with which their judgment was compared. The study used a reading test from the German PIRLS study and teacher responses to two four-point scales in regard to students' ability in literacy. Thus, although teachers knew the students they were rating, their judgments were made in regard to two specific items of

achievement without the use of recent standardized achievement data, and were uninformed in relation to the comparison measure. Besides the low correlation between teacher judgments and actual achievement, the authors reported an effect of student engagement on teacher judgments of student achievement and vice versa. The study did not examine differences in judgments in regard to student characteristics.

The studies by Martínez et al. (2009) and Ready and Wright (2011) both used data from the Early Childhood Longitudinal Survey, which followed 22,000 children in the United States from kindergarten through to fifth grade. Martínez et al. (2009) analysed teacher judgments and math achievement whereas Ready and Wright (2011) focused on the literacy domain. Their analytic samples included around 10,000 students each. Martínez et al. (2009) concluded that teachers perceived smaller achievement gaps than indicated by the standardized assessments for female students, minority students, and students from low SES backgrounds. They argued that this could reflect a bias within the tests, or a deliberate effort by teachers to compensate for disadvantages faced by these student groups. However, achievement was not specifically controlled for. In direct contrast, Ready and Wright (2011) controlled for measured student achievement and concluded that a negative bias appeared to be evident. Although roughly half of the group differences were accounted for by between-group differences indicated by the standardized assessments, the remaining differences indicated systematic biases. Teachers in the study had overestimated girls' performance and underestimated Black, Asian, and Hispanic students, and especially students from a low socioeconomic background (over half a standard deviation) in their judgments. The authors indicated concerns that their results suggested a systematic bias among teachers about already marginalized learners, potentially exacerbating existing disparities.

2.4. Teacher judgments, classroom, and school composition effects

Teacher judgments are made within the context of individual classrooms nested within schools. Martínez et al. (2009) and Ready and Wright (2011) took this nestedness into consideration using hierarchical linear modelling and both found high variation between classrooms and less variation between schools. Whereas Martínez et al. (2009) examined overall differences between classrooms with respect to teachers' practice, Ready and Wright (2011) employed different classroom and school level variables in regard to teacher and student characteristics. At the classroom level, variables included classroom composition (e.g., classroom average SES and literacy ability, and high minority proportion) and teacher characteristics (e.g., teacher ethnicity, educational attainment, and experience). Ready and Wright (2011) concluded that classroom composition had a stronger effect on teacher judgments than teacher characteristics. Teachers in higher-achieving and higher-SES classrooms overestimated students' literacy abilities, even after controlling for child-level SES and measured achievement, and underestimated students' skills in lower achieving and socioeconomically disadvantaged classrooms. In contrast, teacher characteristics were unrelated to the degree of alignment between teacher judgments and measured achievement once the authors controlled for student characteristics, except for beginning teachers who tended to overestimate students' ability. School level measures included school average SES, school location (e.g., urban, rural), school sector (e.g., private, public), and school size. Ready and Wright (2011) found that teachers in urban schools tended to underestimate students' ability, whereas teachers in small schools tended to overestimate students' literacy skills.
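A common way to express "high variation between classrooms and less variation between schools" is the intraclass correlation: the share of total variance attributable to each level. The variance components below are invented purely for illustration, not taken from any of the cited studies.

```python
# Hypothetical variance components from a three-level model of teacher
# judgments: school intercepts, classroom intercepts, student residual.
var_school, var_class, var_resid = 0.05, 0.20, 0.75
total = var_school + var_class + var_resid  # 1.00 by construction

# Proportion of judgment variance lying between schools vs between
# classrooms (intraclass correlations).
icc_school = var_school / total
icc_class = var_class / total

# The pattern reported in these studies: more classroom-level than
# school-level variation.
print(icc_class > icc_school)  # True
```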
classrooms with respect to teachers' practice, Ready and Wright overall running and performance of the school. Staffing numbers
(2011) employed different classroom and school level variables in (including teacher aides and special education staffing) are deter-
regard to teacher and student characteristics. At the classroom mined nationally, with funding allocated by the Ministry of Edu-
level, variables included classroom composition (e.g., classroom cation, but each board is individually responsible for hiring staff,
average SES and literacy ability, and high minority proportion) and allocating operational funding and ensuring the budgets are met
teacher characteristics (e.g., teacher ethnicity, educational attain- (Wylie et al., 2016). Schools use a national curriculum which pro-
ment, and experience). Ready and Wright (2011) concluded that vides a common framework of learning areas, values, and key
classroom composition had a stronger effect on teacher judgments competencies but schools have the flexibility to design and adapt
than teacher characteristics. Teachers in higher-achieving and the curriculum to their particular school community.
higher-SES classrooms overestimated students' literacy abilities, New Zealand's population is approximately 4.5 million, with
even after controlling for child-level SES and measured achieve- 85% of the population residing in urban areas. The school-age
ment, and underestimated students' skills in lower achieving and population comprises 52% New Zealand (NZ) European, 24%
socioeconomically disadvantaged classrooms. In contrast, teacher Ma ori, 10% Pasifika, and 11% Asian students (Statistics New Zealand,
characteristics were unrelated to the degree of alignment between 2016). M aori are the indigenous group; Pasifika are those with
teacher judgments and measured achievement once the authors Pacific Island ancestry (e.g., Samoa, Tonga, Cook Islands); and Asian
controlled for student characteristics, except for beginning teachers students are those from North Asia to the Indian sub-continent.
who tended to overestimate students' ability. School level mea- New Zealand's educational system has been described as a high
sures included school average SES, school location (e.g. urban, ru- performance, but low equity system. While the highest achieving
ral), school sector (e.g., private, public), and school size. Ready and students excel, and mean performance is comparatively high,
Wright (2011) found that teachers in urban schools tended to un- achievement data typically show large disparities, with particular
derestimate students' ability, whereas teachers in small schools groups increasingly over-represented in the lowest quartile of the
tended to overestimate students' literacy skills. distribution (OECD, 2005, 2013; Ogle et al., 2003). Within New

Zealand, the largest disparities exist for Māori and Pasifika students, and those attending schools with low socio-economic catchment areas.

The introduction of overall teacher judgments (OTJs) in 2010 was, ostensibly, intended as a way to measure and hopefully reduce these disparities. Performance is judged 'holistically' based on the evidence deemed appropriate for each student by the student's teacher (Ministry of Education, 2010). There has been considerable commentary for and against the policy, with advocates arguing that holistic judgments retain breadth in the curriculum, while opponents raise concerns about the potential for bias and unreliability (Courtney, 2010; Eames, 2010; Ministry of Education, 2011, 2010; Özerk & Whitehead, 2012; Smith, Anderson, & Blanch, 2016; Thrupp, 2013). These judgments have important implications for students' schooling. While students in New Zealand are not held back, within-class ability grouping is an entrenched practice, despite research suggesting it contributes to a perpetuation of disparities (Schmidt, Burroughs, Zoido, & Houang, 2015; Wilson, Madjar, & McNaughton, 2016). Since ability groups are determined by teachers' judgments, reliable and fair determinations are essential. In addition, aggregated National Standards (NS) data are published on the Ministry of Education website (Ministry of Education, 2016) and media agencies provide NS school league tables (Fairfax New Zealand Limited, 2012, 2016). Despite the importance of these judgments, minimal empirical research has been conducted to assess the properties of the measure to date.

3.2. The professional learning project

This study draws on data collected in a large national professional learning and development project, called the Consortium for Professional Learning (CPL; for complete details see cpl.org.nz). The project was funded by the New Zealand Ministry of Education, and aims to ensure that the professional learning produced measurable gains in student achievement, with a particular focus on improving equity. Improvement was evaluated in terms of gains in both standardized achievement and OTJ results, but there was no focus on ensuring alignment of standardized results with OTJs. Since New Zealand schools are self-governing, participation within the project was voluntary, meaning that schools chose whether to "opt-in". Due to the project's improvement focus, the majority of schools choosing to take part did so out of a self-perceived need to improve student achievement results. This self-selection resulted in an over-representation of schools with a low SES profile, as well as an over-representation of Pasifika students relative to the national student profile. Despite this over-representation, a complete range of school types participated within the project.

3.3. Purpose of this study

The current study examined the relations between teacher judgments and psychometrically designed standardized achievement test results in reading and writing. As previously noted, these measures are extremely unlikely to align perfectly, as both have inherent error, while teacher judgments are intended to evaluate a

studying teacher judgments and their alignment with standardized achievement measures. New Zealand schools also typically use standardized assessments, and teachers are advised that these assessments can be used to inform teacher judgments, but there is no requirement to do so.

The overarching research question was whether there were systematic differences in the judgments that teachers made about the attainment of students belonging to priority learner groups, which were not explained by differences indicated by standardized achievement. Priority learner groups in New Zealand are defined as those traditionally marginalized/underserved within the New Zealand education system (i.e., males, Māori and Pasifika, students with special needs, and students for whom English is their second language (ESOL)). In addition, of interest was whether there were contextual factors that mitigated or exacerbated any systematic discrepancies detected within teacher judgments. Thus the relation between teacher judgments and student characteristics was examined by taking into account differences in standardized achievement results.

The specific research questions examined in this study were as follows:

1. What is the correlation between standardized achievement results and teacher judgments of student achievement?
2. After accounting for standardized achievement differences, are there residual differences in teacher judgments of student achievement that vary systematically by student characteristics (i.e., gender, ethnicity, ESOL status, and special needs status)?
3. After accounting for standardized achievement differences, are there residual differences in teacher judgments of student achievement that vary systematically by contextual factors (i.e., classroom and school achievement composition, ethnic composition, school socioeconomic profile, school size, and region)?

4. Method

4.1. Participants

This study utilized reading and writing achievement data collected at the end of the 2012 and 2013 academic years as part of an ongoing large-scale professional development project operating across New Zealand. We selected all students from Years 4–8 (approximately 8–13 years old; Grades 3 to 7) who had an end of year standardized achievement score from the same subject domain, in addition to their OTJ. The OTJs (see Section 4.2.1 for full description) determined at the end of the year represented teacher judgments after having worked with each student for a complete year. Data for students in Years 1–3 (aged approximately 5–8; K to G2) were unable to be included since most schools did not provide standardized assessment data for younger students. The data specified which classroom students were in, but did not provide teacher-level information. The final sample for reading comprised 4771 students nested within 194 classrooms attending 44 schools, and for writing, 11,765 students nested within 561 class-
student's achievement more holistically than a single test (Ministry rooms across 105 schools. Note that while the majority of schools
of Education, 2012). Determining whether there are systematic provided data for only one domain, around 20% of schools (n ¼ 26)
discrepancies between teacher judgments and standardized chose to provide data for both reading and writing, introducing
achievement tests is, however, important since any discrepancies some overlap across the two samples. Hence, the data are not
would suggest that evidence unrelated to actual achievement dif- strictly independent. Participating schools tended to be in lower
ferences was playing a role in teachers' judgments. As New Zealand socioeconomic catchment areas and have an overrepresentation of
teachers are required to report OTJs for each student in Years 1 Pasifika students compared to the national student population.
through 8 (aged approximately 5e13 years old) for reading, writing, Table 1 presents the demographic information for the reading and
and mathematics at the end of each year (see Section 4.2.1 for full writing samples in more detail.
details), the educational system provides a naturalistic context for Within both subject domains, more than half the schools
K. Meissel et al. / Teaching and Teacher Education 65 (2017) 48e60 53

Table 1
Student-level demographic characteristics by subject domain.

Characteristic Reading Writing New Zealand schools


(N ¼ 4771) (N ¼ 11,765)

n % n % %

Gender Male 2386 50 5995 51 51


Female 2385 50 5770 49 49

Ethnicity NZ European 999 21 3602 31 52


NZ Maori 1327 28 3107 26 24
Pasifika 1730 36 3431 29 10
Other 715 15 1625 14 14

Year level 4 685 14 2483 21 21


5 747 16 2604 22 20
6 694 15 2365 20 19
7 1496 31 2148 18 22
8 1149 24 2165 18 19

ESOL 327 7 976 8 e


Special needs 147 3 776 7 e

Decile group Low 3673 77 7062 60 24


Mid 764 16 3136 27 38
High 334 7 1567 13 38

(reading: n ¼ 28; writing: n ¼ 62) were situated in low SES areas observations of student learning, conversations with students,
(decile1 rating 1e3), around a third in mid SES areas (decile rating classroom tests, and standardized achievement results (Ministry of
4e7; reading: n ¼ 12; writing: n ¼ 31), and comparatively few in Education, 2011). Thus, OTJs are complex judgments which argu-
more affluent zones (decile rating 8e10; reading: n ¼ 4; writing: ably encapsulate a broader range of student ability than what a
n ¼ 12). Geographically, participating schools were mainly located single achievement measure can indicate. OTJs are a Ministry of
in the North Island of New Zealand (reading: n ¼ 42; writing: Education requirement for all schools working with students in
n ¼ 92). A much smaller number of schools were from the South Years 1e8 (5e13 year olds), and are made at the end of each year of
Island (reading: n ¼ 2; writing: n ¼ 13). The majority of the pop- schooling. Unfortunately, no research has yet been conducted to
ulation lives in the North Island (~75%), and is comparatively less investigate the reliability or validity of OTJs, so these metrics are not
affluent than the South Island; 91% of low SES schools are in the available.
North Island. The average school roll was around 250 students,
although this varied considerably (reading: M ¼ 247.12,
4.2.2. Standardized achievement scores
SD ¼ 239.55; writing: M ¼ 244.64, SD ¼ 203.11). On average, just
Most schools in the teacher professional development project
over half the students in each school were of either NZ M aori or
conducted a standardized achievement test towards the end of the
Pasifika descent (reading: 60%; writing: 56%), although these fig-
school year, close to when the OTJs were made. In most cases,
ures are higher in the northern region. Because the sample is drawn
teachers had access to the results of these tests prior to determining
from a professional development project into which schools opt
the appropriate OTJ for each student. Schools chose to use either
based on their own assessment of need, the sample is more heavily
the Assessment Tools for Teaching and Learning (e-asTTle) or the
weighted toward schools with lower achievement and SES profiles.
Progressive Achievement Tests (PATs). Both of these tests were
However, the diversity of the sample means that a complete range
constructed specifically for the New Zealand educational context
of students and schools remains represented.
using item-response theory (Brown, 2013; Darr, McDowall, Ferral,
Twist, & Watson, 2008; Darr, Neill, Stephanou, & Ferral, 2006).
4.2. Measures The use of item-response theory ensures that achievement is
measured on a common scale irrespective of the specific items used
4.2.1. Dependent variable within each test. However, while this ensures equivalence across
OTJ was entered as the dependent variable. OTJs for reading and different forms of the same test, it does not necessarily follow that
writing are each made on a four-point curriculum-referenced scale. PAT and e-asTTle are equivalent. Therefore, the alignment between
The scale is intended to reflect a student's achievement in relation OTJs and each test was assessed separately to determine whether
to the standard expected of their year level nationally: well below the relation differed. These tests are both curriculum-referenced
standard (coded as 1), below standard (coded as 2), at standard (allowing comparison against the curriculum expectations for
(coded as 3), and above standard (coded as 4). The standards are students at each year level) and norm-referenced (allowing com-
aligned to the New Zealand curriculum. Teachers are expected to parison against typical achievement nationally). The reliability of e-
determine the appropriate OTJ for each student using their own asTTle was reported to be a ¼ 0.96 (Ministry of Education & NZCER,
professional judgment of best fit, but are provided with suggestions 2012), and the reliability of PAT Reading Comprehension is a ¼ 0.90
of possible sources of achievement evidence; for example, (Darr et al., 2006, 2008).
Each standardized test was scored on its own equal-interval
common scale. Because older students should typically be ex-
1
A school decile indicates the extent to which a school intakes students from low pected to achieve higher scores, the end-of-school-year norm for
socioeconomic communities. Decile 1 schools are the 10% of schools with the students' respective year levels was subtracted from their actual
highest proportion e and decile 10 schools the 10% with the lowest proportion e of
students from low socio-economic communities. Decile is used primarily for
score to eliminate confounding maturational effects. To standardize
funding purposes; schools with poorer students receive more funding than those the e-asTTle and PAT scores for reading and writing, students' test
with students from more affluent backgrounds. scores relative to the norm were divided by the sample standard
deviation of the corresponding test. This placed all achievement scores on the same scale regardless of the test administered.

The absolute values of skewness (reading: 0.25, SE = 0.04; writing: 1.06, SE = 0.02) and kurtosis (reading: 0.86, SE = 0.07; writing: 3.39, SE = 0.05) for the standardized scores were below 2 and 7, respectively, and thus were well within the typically acceptable range for normality (Kim, 2013; Kline, 2005). However, as the Kolmogorov-Smirnov test showed statistically significant deviation from normality, Markov chain Monte Carlo (MCMC) estimation was used for the Hierarchical Linear Models (HLMs) to mitigate this departure (Gill, 2002). The negatively skewed distribution is likely a reflection of the over-representation of low decile schools and minority ethnic groups in the sample, as these groups typically have lower achievement (Hattie, 2008).

4.2.3. Student characteristics

Student-level characteristics were dummy-coded and incorporated in the HLMs as Level 1 predictors. Ethnicity was coded with New Zealand European as the reference group; and Māori (indigenous New Zealanders), Pasifika (Pacific Islanders), and "other" ethnicities (e.g., Asian, Middle Eastern, Latin American, and African) as comparison groups. Other demographic variables were coded as binary dummy variables, including students' gender (male = 0, female = 1), ESOL status (no = 0, yes = 1), and special needs status (no = 0, yes = 1).

4.2.4. Classroom characteristics

Classroom achievement composition, measured using the average standardized achievement of the students in each classroom, was examined as a Level 2 variable. As student-level standardized achievement scores had already been standardized, there was no need to center the aggregate measure. We did not have access to other classroom-level data such as teacher gender or ethnicity.

4.2.5. School characteristics

Contextual effects were also examined by including school-level characteristics at Level 3 in the HLMs. These characteristics included: the school's decile band (low [1–3], mid [4–7] or high [8–10] decile); school region (central south, southern, or northern); school size (the number of students in the school roll); the percentage of minority students attending each school ('minority' was defined as students who identified as Māori and/or Pasifika; students of 'other' ethnicities were not included as there is typically no achievement gap between this group and New Zealand Europeans; Satherley, 2006); and school achievement composition (average standardized achievement of the school). Decile band and school region were entered into the models as polytomous dummy-coded variables in the same way as explained for ethnicity, with low decile and the Northern region as the reference groups. These groups were selected as the reference since both represented the largest proportion of students. School size and minority percentage were grand-mean centered to improve interpretability, but retained the same scale (Kreft, de Leeuw, & Aiken, 1995). School achievement composition was not centered as student-level standardized achievement scores had already been standardized.

4.3. Analytic approach

We utilized three-level HLMs (Raudenbush & Bryk, 2002; Woltman, Feldstain, MacKay, & Rocchi, 2012), with students nested in classrooms nested within schools. These models account for clustering in the data and were necessary since OTJs are made by classroom teachers and thus may vary in interpretation across classrooms (Ready & Wright, 2011). Hierarchical linear modelling is an extension of conventional regression, and therefore estimates the degree to which predictor variables relate to differential outcomes. As with regression, these models establish relations between variables and cannot explain causality. The HLMs were estimated using MLwiN 2.26 with MCMC estimation, since MCMC tends to outperform likelihood methods (e.g., maximum likelihood) when data are non-normal (Gill, 2002).

We conducted separate HLMs for each time point for reading and writing (i.e., Reading 2012, 2013; Writing 2012; and Writing 2013). Each model was built in the same way. First, the unconditional model was specified as:

Y_ijk = γ_000 + u_00k + r_0jk + e_ijk    (1)

where Y_ijk is the OTJ for student i in classroom j of school k, γ_000 is the grand mean, u_00k is the variance at the school level, r_0jk the variance at the classroom level, and e_ijk the variance at the student level. The unconditional model allows variance partitioning, which provides an estimate of the degree of variance at each of the levels within the model. If the unconditional model shows non-significant variance at the higher levels within the specified hierarchy, conventional regression may be sufficient, though some authors argue the hierarchy should be specified even for low levels of clustering (e.g., Dorman, 2008).

Next, standardized achievement scores were added as a Level 1 predictor to account for student achievement as measured by a standardized test. Thereafter, the models were built iteratively; student-, classroom-, and school-level predictors were examined separately, then included in the full model, in order to investigate the variables which explained a significant amount of the variance in OTJs after standardized achievement was accounted for. Factors can be entered allowing only the intercept term to vary at each level, or the individual slopes. It made theoretical sense to allow the relation between standardized achievement and OTJs to vary across classrooms, so this was investigated. Interaction effects were also explored to determine whether discrepancies were compounded for students who were members of more than one priority learner group, such as Pasifika students who were also English language learners.

Parameter estimates are reported in terms of standard deviation units (SDU). Note that interpretation of SDU differences is similar to Cohen's (1988) d effect size, in that both provide an indicator of difference in terms of standard deviations. However, Cohen's d is typically calculated as a single-level bivariate comparison, so the magnitude of SDU differences in a multilevel framework tends to be smaller and needs to be interpreted with an awareness of what has been taken into account.

5. Results

In our preliminary analyses for reading and writing, we built separate models for each collection phase (2012 or 2013) and standardized tool (e-asTTle or PAT). There were no significant differences in the parameter estimates across the models, so the data were aggregated into a single reading dataset and a single writing dataset, then re-analysed using the same approach to simplify interpretation. The results of these models are presented below.

5.1. Variance partitioning

In the unconditional models (see Table 2), the majority of the variability in OTJs was at the student level for both reading (75%) and writing (78%). This is unsurprising given that students' academic ability had not yet been accounted for. The remainder of the variance was partitioned fairly evenly between the classroom and school levels (12% and 10%, respectively, for reading; and 13% and
Table 2
Variance decomposition from the unconditional models for reading and writing.

Level                          Reading                        Writing
                               Absolute variance  % variance  Absolute variance  % variance
School (level 3), u_00k        0.11***            12%         0.09***            11%
Classroom (level 2), r_0jk     0.12***            13%         0.10***            12%
Student (level 1), e_ijk       0.70***            75%         0.66***            78%
Deviance (-2*loglikelihood)    11822.34                       28506.66

***p < 0.001.
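The percentage columns in Table 2 are simply each level's share of the total variance, an ICC-style decomposition. As a quick arithmetic check, the sketch below reproduces the printed percentages from the published absolute variance components alone; it is an illustration, not the authors' own code.

```python
# Absolute variance components from the unconditional models in Table 2.
VARIANCES = {
    "reading": {"school": 0.11, "classroom": 0.12, "student": 0.70},
    "writing": {"school": 0.09, "classroom": 0.10, "student": 0.66},
}

def variance_shares(components):
    """Each level's proportion of the total variance (ICC-style decomposition)."""
    total = sum(components.values())
    return {level: value / total for level, value in components.items()}

for domain, components in VARIANCES.items():
    shares = variance_shares(components)
    print(domain, {level: round(share, 2) for level, share in shares.items()})
# reading -> school 0.12, classroom 0.13, student 0.75
# writing -> school 0.11, classroom 0.12, student 0.78
```

The recovered shares match the percentage columns of Table 2, confirming the table's internal consistency.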
12%, respectively, for writing). The moderate and significant clustering at both the classroom and school levels indicated the necessity of a three-level HLM for both domains.

5.2. Descriptive results

Descriptive statistics for standardized achievement scores and OTJs are shown in Table 3. As students' test scores have had the appropriate year level norm subtracted, with the result divided by the standard deviation of the achievement test, the standardized achievement scores reflect students' achievement relative to the national norm in terms of SDU.

The average standardized achievement scores for reading and writing were both below the national norm, with the reading sample further from the national norm (-0.42 SDU; equivalent to approximately one academic year) than the writing sample (-0.19 SDU). The average OTJs for both domains were between 'below standard' (coded 2) and 'at standard' (coded 3; M = 2.69 in reading; M = 2.57 in writing), but were lower for writing than for reading, suggesting a degree of discrepancy between the standardized measures and the OTJs. That is, teachers considered reading achievement to be closer to the standard than writing achievement, while the standardized test results suggested the opposite.

The correlation between the two achievement measures was slightly greater than 0.70 for both reading and writing, both overall at the student level and as an average across schools. However, there was considerable variation across schools; school-level correlations ranged between -0.50 and 0.94 for reading, and between -0.07 and 0.94 for writing. For both reading and writing, the correlation was negative in two schools. Negative correlations are surprising, since the two measures should fundamentally assess the same domain, even though they capture different aspects of achievement.

Table 3
Descriptive statistics and correlation of standardized achievement and OTJ by domain.

                                           Reading (SD)     Writing (SD)
Standardized achievement^a                 -0.42 (1.04)     -0.19 (0.99)
OTJ                                        2.69 (0.97)      2.57 (0.92)
Overall correlation (rs)                   0.73***          0.72***
Average school-level correlation (rs)      0.68*** (0.24)   0.72*** (0.16)
Range of school-level correlations (rs)    -0.50 to 0.94    -0.07 to 0.94

***p < 0.001.
^a Shows achievement scores relative to the norm in standard deviation units.

5.3. Standardized achievement

Standardized achievement scores were added to the unconditional model as a Level 1 predictor. As shown in the 'Standardized Achievement' columns in Table 4, there was a positive and statistically significant relationship between standardized achievement and OTJ, such that on average an increase of one standard deviation in standardized achievement score was associated with a 0.61 increase in OTJ for reading, and a 0.67 increase in OTJ for writing. The inclusion of individual standardized achievement explained half of the student-level variance in OTJs in writing, and 42% of the student-level variance in reading. It also reduced much of the variance at the school level for writing (43%) and at the classroom level for reading (59%). Of the residual variance, the proportion at
Table 4
Three-level HLM estimates for the standardized achievement, student characteristics, and final models for reading and writing.

Parameter                           Standardized achievement       Student characteristics        Final
                                    Reading        Writing         Reading        Writing         Reading        Writing
Fixed effects
  Intercept                         3.01(0.05)***  2.74(0.03)***   3.08(0.06)***  2.75(0.03)***   3.02(0.06)***  2.70(0.03)***
  Standardized achievement          0.61(0.01)***  0.67(0.01)***   0.58(0.01)***  0.64(0.01)***   0.58(0.01)***  0.65(0.01)***
  Female                                                           0.05(0.02)**   0.10(0.01)***   0.06(0.02)**   0.10(0.01)***
  Māori                                                            -0.10(0.03)**  -0.08(0.02)***  -0.10(0.03)**  -0.09(0.02)***
  Pasifika                                                         -0.15(0.03)*** -0.10(0.02)***  -0.16(0.03)*** -0.10(0.02)***
  Other                                                            -0.10(0.04)**  -0.01(0.02)     -0.11(0.04)**  -0.01(0.02)
  ESOL                                                             -0.14(0.04)*** -0.12(0.02)***  -0.13(0.04)**  -0.12(0.02)***
  Special needs                                                    -0.55(0.06)*** -0.24(0.03)***  -0.53(0.06)*** -0.23(0.03)***
  Classroom average achievement                                                                   -0.09(0.05)    -0.16(0.03)***
  School average achievement                                                                      -0.28(0.11)**  -0.14(0.06)*
Random effects
  School                            0.09 [17%]     0.05 [11%]      0.09 [17%]     0.04 [9%]       0.08 [15%]     0.05 [11%]
  Classroom                         0.05 [9%]      0.08 [17%]      0.05 [9%]      0.07 [16%]      0.05 [10%]     0.07 [16%]
  Student                           0.40 [74%]     0.33 [72%]      0.39 [74%]     0.32 [74%]      0.39 [75%]     0.32 [73%]
Model summary
  Deviance (-2*loglikelihood)       9210.29        20259.46        9095.39        20034.54        9099.81        20014.00
  Number of estimated parameters    5              5               11             11              13             13

Note. (Standard errors); [% residual variance]. *p < 0.05. **p < 0.01. ***p < 0.001.
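To make the Table 4 estimates concrete, the fixed effects can be combined into a linear predictor for the expected OTJ. The sketch below is hypothetical: it reuses the 'Final' writing coefficients from Table 4 (signs as described in the Results), ignores the school, classroom, and student random effects, and the two student profiles are invented for illustration.

```python
# 'Final' writing-model fixed effects from Table 4 (illustrative use only).
# Negative coefficients correspond to the systematically lower OTJs reported
# for Māori, Pasifika, ESOL, and special-needs students.
FIXED_EFFECTS_WRITING = {
    "intercept": 2.70, "std_achievement": 0.65, "female": 0.10,
    "maori": -0.09, "pasifika": -0.10, "other": -0.01,
    "esol": -0.12, "special_needs": -0.23,
    "classroom_avg_achievement": -0.16, "school_avg_achievement": -0.14,
}

def expected_otj(effects, **predictors):
    """Linear predictor: intercept plus coefficient * value for each predictor."""
    otj = effects["intercept"]
    for name, value in predictors.items():
        otj += effects[name] * value
    return otj

# Two invented students with identical standardized achievement (0 SDU, i.e.,
# exactly at the national norm) in otherwise average classrooms and schools:
otj_reference = expected_otj(FIXED_EFFECTS_WRITING, std_achievement=0.0)
otj_priority = expected_otj(FIXED_EFFECTS_WRITING, std_achievement=0.0,
                            female=1, pasifika=1, esol=1)
print(round(otj_reference, 2), round(otj_priority, 2))  # 2.7 2.58
```

With identical test evidence, the model expects a judgment about 0.12 OTJ points lower for the second student (the female and Pasifika terms offset each other here, so the ESOL term drives the gap), which is exactly the kind of systematic discrepancy the study highlights.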
each level remained relatively stable, with 72–75% of the variance at the student level in each model. Allowing the standardized achievement slope to vary did not significantly improve model fit, suggesting a fairly consistent relation between standardized achievement and OTJs across classrooms (p > 0.05 for both reading and writing); thus, this slope was estimated as a fixed effect in all subsequent analyses.

5.4. Student characteristics

At the student level, we explored whether there were systematic differences in the OTJs assigned to priority learners, after controlling for differences in standardized achievement scores. As shown in the 'Student Characteristics' columns of Table 4, these student characteristics were all significant predictors of OTJs even after differences in standardized achievement were accounted for. To quantify the magnitude of the under- or overestimate relative to standardized achievement, effect sizes in SDU based on the HLM parameter estimates are provided in parentheses. Specifically, for both reading and writing, even when the standardized achievement evidence was the same, females typically received significantly higher OTJs than males (SDU = 0.06 for reading; SDU = 0.11 for writing), Māori (SDU = -0.10 for reading; SDU = -0.09 for writing) and Pasifika (SDU = -0.16 for reading; SDU = -0.10 for writing) students received significantly lower OTJs than NZ European students, and ESOL students (SDU = -0.14 for reading; SDU = -0.13 for writing) and those with special needs (SDU = -0.57 for reading; SDU = -0.26 for writing) received markedly lower OTJs than those not in these categories. The degree of discrepancy was typically similar for reading and writing, except for students with special needs, for whom the discrepancy was much larger in reading than in writing.

Many students belong to more than one of these learner groups, so we also explored interaction effects between these groups. For example, whereas OTJs were typically lower for boys and Māori students than suggested by the standardized achievement results, we wanted to determine whether being a Māori boy led to an even greater discrepancy. We found no significant interaction between gender and ethnicity for either domain. Interactions including ESOL or special needs status could not be explored because comparatively few students were identified as being in these learner groups when delineated by gender or ethnicity.

5.5. Achievement composition

Classroom and school achievement composition were added into the model containing standardized achievement and student characteristics to test whether mean achievement in the classroom and school explained additional variance in OTJs. As shown in the two rightmost columns of Table 4, school achievement composition had a significant, inverse relationship with OTJs; that is, after controlling for individual standardized achievement and student characteristics, when the average achievement of a school was comparatively high, the OTJs made in that school were typically lower for a student with the same standardized achievement as a student in a school with lower average achievement. This effect was stronger in reading (SDU = -0.29) than in writing (SDU = -0.16). In writing, however, there was also an inverse compositional effect relating to achievement at the classroom level, such that a student in a classroom with comparatively high average achievement was typically assigned a lower OTJ than a student with equivalent standardized achievement in a classroom with lower average achievement (SDU = -0.17). This classroom effect was not significant in reading (SDU = -0.10). Incorporation of the achievement composition variables only very minimally altered the existing standardized achievement and student characteristic parameters.

5.6. School characteristics

We also investigated the degree to which additional school-level contextual factors explained the variance in OTJs after controlling for standardized achievement, student-level characteristics, and achievement composition. These contextual factors included the socioeconomic profile of the school (low, mid, or high decile), the region where the school was located, the school size, and the proportion of minority students attending the school. None of the contextual factors investigated was found to be significant. In the New Zealand context, the socioeconomic profile of the school has frequently been found to be significantly associated with achievement, but in the current study school SES did not explain additional variance, suggesting that controlling for standardized achievement differences already encapsulated variability associated with decile, and also that average levels of bias were similar irrespective of the socioeconomic profile of the school. Interaction effects between student and school characteristics were also explored. None of the interaction effects explained a significant proportion of the variance in OTJs after the inclusion of student-level predictors, further emphasizing the consistency of these results across contexts.

6. Discussion

The current study explored the relations between standardized tests and teachers' judgments about student achievement in reading and writing. The first research question investigated the correlation between the two measures. It was expected that the correlation would be high since, in contrast with much of the previous research, teachers had access to the standardized achievement results and could use these if desired. However, correlations were only marginally stronger than those previously reported. An extensive meta-analysis by Südkamp et al. (2012) reported an average correlation of 0.63, with a wide range of correlations reported in individual studies (-0.03 to 0.92). The overall correlation in the current study was stronger (rs = 0.72–0.73), but also with a wide range of correlations in individual schools. In the current study, more than two-thirds of reading schools, and almost three-quarters of writing schools, had correlations below 0.8, indicating that in most schools teachers did not rely on the test as the sole source of evidence for determining the OTJ for each student. This suggests that, provided possible bias is addressed and reliability improved, OTJs could provide useful information encapsulating much more than an individual test score.

The second research question investigated whether there were residual differences in teachers' judgments of student achievement as a function of specific student characteristics, after controlling for standardized achievement differences. The results indicated that there were significant differences in the OTJs made by teachers about students of different groups, with priority learners being systematically assigned lower OTJs even when standardized achievement was the same. The effect size of these differences may at first appear small. For example, Hattie (2008) indicated that an intervention effect size of less than 0.4 would be considered less than average. However, Hattie further notes that maturation typically accounts for an effect size of approximately 0.25 per year, suggesting interventions with an effect of 0.15 are sizable if maturation has been deducted. Effect sizes are also typically reported as bivariate comparisons; effects are typically attenuated when more than one covariate is included (e.g., SES and ethnicity within a single model: individually each is much larger than when both are estimated together, due to confounding). The effect sizes reported in the current study represent the unique variance explained by each variable after controlling for achievement differences and any confounding, and yet remain comparable to the intervention effect reported by Hattie, suggesting these effects remain important.

Our third research question asked whether contextual factors, such as the achievement context of the classroom or school, were related to teachers' judgments. It was found that the aggregate SES of the school had no significant effect on students' OTJs after controlling for standardized achievement, nor did school location, roll size, or the proportion of students from minority backgrounds. However, the aggregate achievement profile of the school did affect teachers' judgments in both reading and writing, with the same standardized achievement result typically resulting in a lower overall teacher judgment in schools with high average achievement. In writing, there was also an achievement context effect at the classroom level, with students again typically receiving lower OTJs if their classroom had higher average achievement. This is an important finding because it indicates that teacher judgments are context-specific, suggesting that while teachers are being asked to make these judgments against specific standards, judgments are at least partially norm-referenced. In other words, if a teacher is working with a class of students in which all students are above the standard, the teacher's knowledge of differential ability within the classroom may well result in a tendency to report OTJs that attempt to reflect this differentiation. This 'localization' of judgments points to the complexities inherent in the creation of teacher judgments that are comparable at a national level, and suggests the need for improved means of aligning expectations across contexts. However, the contextual effects do not explain why results are lower for priority learners; the inclusion of the compositional effects had little effect on the parameter estimates for these groups.

6.1. Possible explanations and response

Internationally, standardized tests are often argued to disadvantage cultural minorities (e.g., Erwin & Worrell, 2012; Kim & Zabelina, 2015; McGrady & Reynolds, 2013), but to date, no research has quantitatively determined the presence or absence of cultural bias within the standardized tests used in the current study, despite discussion both for (e.g., Fothergill & Taylor-

It could be argued that some students may perform comparatively well within the relatively narrow bounds of a standardized assessment, while still having lower performance across the full curriculum. Arguably, only the classroom teacher would have the necessary insight to be able to draw on a wide enough range of evidence to determine achievement across the curriculum. The possibility that OTJs draw on wider but concurrently different evidence than what is captured by a standardized test provides a justifiable reason for discrepancies between an individual OTJ and the achievement level indicated by the standardized test; indeed, previous research has shown that primary teachers in New Zealand often consider standardized tests to be assessments of surface learning (Brown, 2009). However, if priority learners systematically have lower performance across the full curriculum even when they are able to perform equally well on standardized achievement tests, this would provide further evidence that these students are underserved by the current educational system, and suggests that teachers' judgments may be informed by construct-irrelevant information. Therefore, an understanding of the source of these discrepancies is imperative.

It is possible that the discrepancies reported are at least partly explained by different behavior among students within the priority learner groups, especially among those with special needs (see e.g., Chang & Sue, 2003), though this would be a particularly concerning finding: behavior is not a construct-relevant component of OTJs, nor should it be. Unfortunately, as behavioral indicators were not available, this possibility could not be explored.

OTJs may also be partly informed by differential teacher expectations. Expectation research has typically focused on teachers' initial impressions of students, whereas judgment research usually looks at teachers' evaluations of student achievement later in the academic year. There is a substantial body of evidence, both internationally and in the New Zealand context, that indicates that teachers typically have lower expectations of marginalized learners even after controlling for achievement (Rubie-Davies et al., 2012). However, the teacher judgment research has mostly focused on the alignment of teachers' judgments with achievement measures, rather than the specific implications for different groups of students, so it has been less clear whether teacher beliefs about different groups of students might play a role in the judgments they make about students' achievement levels. It is plausible that
Jorgensen, n.d.; Haitana, 2007; May, n.d.) and against (e.g. teachers' initial impressions of students, which are influenced by
Keegan, Brown, & Hattie, 2014) the notion of bias within stan- the learner group of the student, may be sufficiently enduring that
dardized tests used in New Zealand. If standardized tests are indeed they impact on teachers' judgments, even after working with the
culturally biased, the misalignment between teacher judgments student for some time.
and standardized results for priority learners is especially con- Students evaluated to be less able than their peers are typically
cerning since teachers' judgments typically suggested considerably given fewer learning opportunities, and are generally provided
larger achievement gaps for priority learners than potentially with more restricted learning experiences, contributing to a self-
biased tests did. fulfilling prophecy of underachievement (Rubie-Davies, 2010).
It is worth noting that these relative differences were very much Subjectivity in teacher judgments risks exacerbating these effects
in line with typical societal understandings of which groups of further, since each subsequent teacher receives the OTJ from the
students do well in New Zealand schools, suggesting that these previous classroom teacher. Although prior judgment could pro-
results may be partly explained by a societal-level bias against vide useful information about a student's performance that sub-
certain learner groups, particularly given the consistency of the sequent teachers can use to inform their teaching, any inherent bias
results across contexts. Previous research has suggested that cul- in the OTJ being provided to the student's new teacher risks a
tural biases might be present in the teaching workforce, with recent compounding effect (Rubie-Davies et al., 2014), reducing expecta-
research highlighting the unconscious bias persisting against Maori tions even further.
within the educational system (Bishop & Berryman, 2006; Blank,
Houkamau, & Kingi, 2016; Peterson, Rubie-Davies, Osborne, & 6.2. Implications for teaching and teacher education
Sibley, 2016). In New Zealand as well as elsewhere, the majority
of teachers (79% in NZ) belong to the cultural majority group In the absence of a coherent and robust moderation process,
(Chubbuck & Zembylas, 2016; Education Counts, 2005). Chubbuck dependable and consistent teacher judgments are unlikely
and Zembylas (2016) report that teachers' bias or deficit views in (Raphael, Au, & Goldman, 2009). It should be noted that these re-
regard to diversity were persistently found in research examining sults reflect a system during early implementation, with facilitators
pre-service teacher dispositions. reporting that moderation was variable, and, in some schools
absent. Therefore, one possible response would be improved moderation, for example, by facilitating discussions within and between schools about what evidence would be considered necessary for a student to meet each of the OTJ levels. In addition, we believe that clear guidelines should be provided to all schools and teachers about how to moderate effectively. Effective moderation hinges on sharing and discussing the evidence within a high-trust environment (Wyatt-Smith, Klenowski, & Gunn, 2010), which remains the exception rather than the norm. The Ministry of Education already provides moderation guidelines (see http://assessment.tki.org.nz/Moderation), but the compositional effects found regarding average achievement in each school suggested that many teachers and schools were making judgments in relation to the achievement within their own class. Considering that moderation was very limited at the time these data were collected, it could be argued that these results represent 'unfiltered' teacher perceptions. If teacher bias is the cause of the differences found in the current study, then as moderation processes improve and a framework around the setting of OTJs is developed further, these differences should reduce or disappear. Indeed, in 2015 the Ministry of Education released an optional tool called the Progress and Consistency Tool (PaCT; https://pactinfo.education.govt.nz/), which provides a framework to support teachers to make more consistent judgments, though uptake has been comparatively low, with around 15% of schools nationally opting in during the first year (Gerritsen, 2016). However, if teacher bias is indeed the cause of the results described in this paper, more robust processes that ensure construct-irrelevant information is reduced from OTJs would be insufficient in and of themselves, as they would not address the underlying bias.

Biases are deep cognitive and emotional responses, and it has been argued that cultural competency training focusing on differences might have little effect (Blank et al., 2016). Changing someone's dispositions has been described as a "life-long journey of transformation" (Nieto, 2000, p. 183). Recent studies of interventions to develop social justice dispositions in pre-service teachers continue to show mixed results (Chubbuck & Zembylas, 2016). Lai et al. (2014) conducted a meta-analysis of mainly US-based studies, showing that only eight of seventeen interventions were effective at reducing bias. Effective studies usually used a form of counter-stereotype intervention in which participants were primed to pair positive characteristics with groups that were usually subject to bias. The researchers also found that increasing participants' critical thinking and moral reasoning tended to help reduce bias. In New Zealand, some previous research has indicated that teacher expectations about specific subgroups can be changed via targeted professional learning interventions (Rubie-Davies et al., 2015). However, the consistency of bias across contexts suggests that professional development should not be restricted to specific contexts, and it is unclear whether increased teacher expectations would actually translate to reduced bias in teacher judgments. Initial teacher education programs would provide a potential avenue for wider delivery, but there is clearly a need for further research in this area.

6.3. Limitations

Although we believe that these results paint a clear picture of an underlying systematic bias within the New Zealand education system, there are some limitations and alternative explanations that should be noted. In particular, the lack of previous research about OTJs meant that reliability and validity statistics were not available, making it difficult to ascertain precisely what OTJs measure. The current study provides some insight into factors that have an effect on OTJs, but more research is needed. Although the sample is large, especially for the New Zealand context, it is not a representative sample, since the data were collected as part of a professional learning initiative. In addition, the priority learner categories used are the standard categories used by the Ministry of Education, but may mask certain underlying patterns for subgroups within these categories. For example, the category Pasifika is an umbrella term that includes a wide range of backgrounds (NZ-born people of Pacific descent, as well as those born in the Pacific islands, e.g., Tonga, New Caledonia, Samoa, Fiji [Indian and indigenous], among others). The use of an aggregate, school-level SES indicator is another limitation, but since New Zealand schools do not collect individual-level SES information, we were restricted to the school-level indicator.

6.4. Conclusions

Our results indicate that priority learners received systematically lower teacher judgments than other students in 2012 and 2013, even when their standardized achievement was the same. However, it is not possible to be certain about the cause of these differences in teacher judgments for specific student groups without further research. One possible explanation is that teachers are biased against priority learners. However, it could also be that the standardized achievement tests are positively biased, or that the narrow focus of standardized tests measures an aspect of achievement distinct from that measured by teacher judgments. There may also be other possible causes that we have not yet considered. Future research should investigate alternative explanations for these results to develop a better understanding of how teachers make OTJs, and why these judgments indicate larger achievement gaps than standardized achievement tests do. Removing bias from teacher judgments should be a priority within education. All students deserve equitable educational opportunities, and the elimination of teacher bias would be one way to reduce the perpetuation of the social structure and enable all students to have the opportunity to succeed within the educational system.

Acknowledgements

We would like to acknowledge the Consortium for Professional Learning (Evaluation Associates and The University of Auckland's Team Solutions), who made this study possible by trusting us to explore and develop an understanding of the data gathered during their professional learning project. The dedication of every facilitator is hugely admired, and the provision of consistently high-quality data greatly appreciated. We also acknowledge the New Zealand Ministry of Education for funding the work of the Consortium for Professional Learning.

References

Alvidrez, J., & Weinstein, R. S. (1999). Early teacher perceptions and later student academic achievement. Journal of Educational Psychology, 91(4), 731–746. http://dx.doi.org/10.1037/0022-0663.91.4.731.
Angoff, W. H. (1974). Criterion-referencing, norm-referencing and the SAT. The College Board Review, 92, 3–5, 21.
Begeny, J. C., Eckert, T. L., Montarello, S. A., & Storie, M. S. (2008). Teachers'
perceptions of students' reading abilities: An examination of the relationship between teachers' judgments and students' performance across a continuum of rating methods. School Psychology Quarterly, 23(1), 43.
Begeny, J. C., Krouse, H. E., Brown, K. G., & Mann, C. M. (2011). Teacher judgments of students' reading abilities across a continuum of rating methods and achievement measures. School Psychology Review, 40(1), 23–38.
Benner, A. D., & Mistry, R. S. (2007). Congruence of mother and teacher educational expectations and low-income youth's academic competence. Journal of Educational Psychology, 99(1), 140–153. http://dx.doi.org/10.1037/0022-0663.99.1.140.
Bennett, R. E., Gottesman, R. L., Rock, D. A., & Cerullo, F. (1993). Influence of behavior perceptions and gender on teachers' judgments of students' academic skill. Journal of Educational Psychology, 85(2), 347–356. http://dx.doi.org/10.1037/0022-0663.85.2.347.
Beswick, J. F., Willms, J. D., & Sloat, E. A. (2005). A comparative study of teacher ratings of emergent literacy skills and student performance on a standardized measure. Education, 126, 116–137.
Bishop, R., & Berryman, M. (2006). Culture speaks: Cultural relationships & classroom learning. Wellington, NZ: Huia.
Blank, A., Houkamau, C., & Kingi, H. (2016). Unconscious bias and education: A comparative study of Māori and African American students. Retrieved from http://apo.org.au/resource/unconscious-bias-and-education-comparative-study-maori-and-african-american-students.
de Boer, H., Bosker, R. J., & van der Werf, M. P. C. (2010). Sustainability of teacher expectation bias effects on long-term student performance. Journal of Educational Psychology, 102(1), 168–179. http://dx.doi.org/10.1037/a0017289.
Brown, G. (2013). asTTle – a national testing system for formative assessment: How the national testing policy ended up helping schools and teachers. In M. Lai, & S. Kushner (Eds.), A national developmental and negotiated approach to school and curriculum evaluation (pp. 39–56). Bingley, UK: Emerald Group Publishing.
Brown, G. T. L. (2009). Teachers' self-reported assessment practices and conceptions: Using structural equation modelling to examine measurement and structural models. In T. Teo, & M. S. Kline (Eds.), Structural equation modelling in educational research: Concepts and applications (pp. 243–266). Rotterdam, NL: Sense Publishers.
Chamberlain, M. (2010). Blueprint for national standards. Retrieved from New Zealand Education Gazette http://www.edgazette.govt.nz/Articles/Article.aspx?ArticleId=8187.
Chang, D. F., & Sue, S. (2003). The effects of race and problem type on teachers' assessments of student behavior. Journal of Consulting and Clinical Psychology, 71(2), 235–242. http://dx.doi.org/10.1037/0022-006X.71.2.235.
Chubbuck, S. M., & Zembylas, M. (2016). Social justice and teacher education: Context, theory, and practice. In J. Loughran, & M. L. Hamilton (Eds.), International handbook of teacher education (Vol. 2, pp. 463–501). Singapore: Springer.
Clark, C. M., & Peterson, P. L. (1986). Teachers' thought processes. In M. C. Wittrock (Ed.), Third handbook of research on teaching (pp. 255–296). New York, NY: Macmillan.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Laurence Erlbaum.
Courtney, B. (2010). National standards: A parent's perspective. New Zealand Journal of Teachers' Work, 7, 8–14.
Darr, C., McDowall, S., Ferral, H., Twist, J., & Watson, V. (2008). Progressive Achievement Test: Reading – teacher manual. Wellington, NZ: New Zealand Council for Educational Research.
Darr, C., Neill, A., Stephanou, A., & Ferral, H. (2006). Progressive Achievement Test: Mathematics – teacher manual. Wellington, NZ: New Zealand Council for Educational Research.
Doherty, J., & Conolly, M. (1985). How accurately can primary school teachers predict the scores of their pupils in standardized tests of attainment? A study of some non-cognitive factors that influence specific judgements. Educational Studies, 11(1), 41–60. http://dx.doi.org/10.1080/0305569850110105.
Dompnier, B., Pansu, P., & Bressoux, P. (2006). An integrative model of scholastic judgments: Pupils' characteristics, class context, halo effect and internal attributions. European Journal of Psychology of Education, 21(2), 119–133. http://dx.doi.org/10.1007/BF03173572.
Dorman, J. P. (2008). The effect of clustering on statistical tests: An illustration using classroom environment data. Educational Psychology, 28(5), 583–595.
Eames, D. (2010). National standards policy: How parents mark it. Retrieved from nzherald.co.nz http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10624503.
Education Counts. (2005). Teacher census 2004. Retrieved 24/11/2016, from www.educationcounts.govt.nz/publications/schooling/teacher_census.
Erwin, J. O., & Worrell, F. C. (2012). Assessment practices and the underrepresentation of minority students in gifted and talented education. Journal of Psychoeducational Assessment, 30(1), 74–87.
Fairfax New Zealand Limited. (2012). How New Zealand schools rate. Retrieved from stuff.co.nz http://www.stuff.co.nz/national/education/7715044/How-New-Zealand-schools-rate.
Fairfax New Zealand Limited. (2016). School report. Retrieved from stuff.co.nz http://www.stuff.co.nz/national/education/school-report.
Feinberg, A. B., & Shapiro, E. S. (2003). Accuracy of teacher judgments in predicting oral reading fluency. School Psychology Quarterly, 18(1), 52–65.
Fothergill, J., & Taylor-Jorgensen, T. (n.d.). Assessment in New Zealand primary schools. Retrieved June 2016, from https://sites.google.com/site/ttjjmf/home.
Francis, B., Archer, L., Hodgen, J., Pepper, D., Taylor, B., & Travers, M.-C. (2016). Exploring the relative lack of impact of research on 'ability grouping' in England: A discourse analytic account. Cambridge Journal of Education, 1–17. http://dx.doi.org/10.1080/0305764X.2015.1093095.
Gerritsen, J. (2016). School groups wary of national standards computer system. Radio New Zealand. Retrieved from www.radionz.co.nz/news/national/295621/school-groups-wary-of-national-standards-computer-system.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. Boca Raton, FL: Chapman and Hall.
Glock, S., Krolak-Schwerdt, S., & Pit-ten Cate, I. M. (2015). Are school placement recommendations accurate? The effect of students' ethnicity on teachers' judgments and recognition memory. European Journal of Psychology of Education, 30(2), 169–188. http://dx.doi.org/10.1007/s10212-014-0237-2.
Haitana, T. N. (2007). Testing Tamariki: How suitable is the PPVT-III? Master's dissertation. Hamilton, NZ: University of Waikato.
Hanushek, E. A., Link, S., & Woessmann, L. (2013). Does school autonomy make sense everywhere? Panel estimates from PISA. Journal of Development Economics, 104, 212–232.
Harlen, W. (2005). Trusting teachers' judgement: Research evidence of the reliability and validity of teachers' assessment used for summative purposes. Research Papers in Education, 20(3), 245–270. http://dx.doi.org/10.1080/02671520500193744.
Hattie, J. A. C. (2008). Narrow the gap, fix the tail, or close the curves: The power of words. In C. Rubie, & C. Rawlinson (Eds.), Challenging thinking about teaching and learning. Nova Science.
Hattie, J. A., & Brown, G. T. L. (2003). Standard setting for asTTle reading: A comparison of methods. asTTle Tech. Rep. #21. University of Auckland/Ministry of Education.
Hattie, J. A. C., Brown, G. T. L., Keegan, P., Irving, S. E., MacKay, A. J., Sutherland, T., et al. (2003). Validation evidence of asTTle reading assessment results: Norms and criteria. asTTle Tech. Rep. #22. University of Auckland/Ministry of Education.
Hecht, S. A., & Greenfield, D. B. (2002). Explaining the predictive accuracy of teacher judgments of their students' reading achievement: The role of gender, classroom behavior, and emergent literacy skills in a longitudinal sample of children exposed to poverty. Reading and Writing, 15(7–8), 789–809. http://dx.doi.org/10.1023/A:1020985701556.
Helwig, R., Anderson, L., & Tindal, G. (2001). Influence of elementary student gender on teachers' perceptions of mathematics achievement. The Journal of Educational Research, 95(2), 93–102. http://dx.doi.org/10.1080/00220670109596577.
Hoge, R. D., & Butcher, R. (1984). Analysis of teacher judgments of pupil achievement levels. Journal of Educational Psychology, 76(5), 777–781. http://dx.doi.org/10.1037/0022-0663.76.5.777.
Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of literature. Review of Educational Research, 59(3), 297–313. http://dx.doi.org/10.3102/00346543059003297.
Hopkins, K. D., George, C. A., & Williams, D. D. (1985). The concurrent validity of standardized achievement tests by content area using teachers' ratings as criteria. Journal of Educational Measurement, 22, 177–182.
Hurwitz, J. T., Elliott, S. N., & Braden, J. P. (2007). The influence of test familiarity and student disability status upon teachers' judgments of students' test performance. School Psychology Quarterly, 22(2), 115.
Jussim, L., & Harber, K. D. (2005). Teacher expectations and self-fulfilling prophecies: Knowns and unknowns, resolved and unresolved controversies. Personality and Social Psychology Review, 9(2), 131–155. http://dx.doi.org/10.1207/s15327957pspr0902_3.
Kaiser, J., Retelsdorf, J., Südkamp, A., & Möller, J. (2013). Achievement and engagement: How student characteristics influence teacher judgments. Learning and Instruction, 28, 73–84. http://dx.doi.org/10.1016/j.learninstruc.2013.06.001.
Keegan, P., Brown, G. T., & Hattie, J. A. (2014). A psychometric view of sociocultural factors in test validity: The development of standardised test materials for Māori medium schools in New Zealand/Aotearoa. In S. Phillipson, K. Y. L. Ku, & S. N. Phillipson (Eds.), Constructing achievement: A sociocultural perspective (pp. 42–54). London: Routledge.
Kim, H. Y. (2013). Statistical notes for clinical researchers: Assessing normal distribution using skewness and kurtosis. Restorative Dentistry & Endodontics, 38(1), 52–54. http://dx.doi.org/10.5395/rde.2013.38.1.52.
Kim, K. H., & Zabelina, D. (2015). Cultural bias in assessment: Can creativity assessment help? The International Journal of Critical Pedagogy, 6(2).
Kline, T. J. (2005). Psychological testing: A practical approach to design and evaluation. Thousand Oaks, CA: Sage.
Kreft, I. G. G., de Leeuw, J., & Aiken, L. S. (1995). The effect of different forms of centering in hierarchical linear models. Multivariate Behavioral Research, 30, 1–21.
Lai, C. K., Marini, M., Lehr, S. A., Cerruti, C., Shin, J. L., Joy-Gaba, J. A., … Nosek, B. A. (2014). Reducing implicit racial preferences: I. A comparative investigation of 17 interventions. Journal of Experimental Psychology: General, 143(4), 1765–1785. http://dx.doi.org/10.1037/a0036260.
Lembke, E. S., Foegen, A., Whittaker, T. A., & Hampton, D. (2008). Establishing technically adequate measures of progress in early numeracy. Assessment for Effective Intervention, 33(4), 206–214. http://dx.doi.org/10.1177/1534508407313479.
Li, H., Pfeiffer, S. I., Petscher, Y., Kumtepe, A. T., & Mo, G. (2008). Validation of the Gifted Rating Scales – School Form in China. Gifted Child Quarterly, 52(2), 160–169. http://dx.doi.org/10.1177/0016986208315802.
Martínez, J. F., Stecher, B., & Borko, H. (2009). Classroom assessment practices, teacher judgments, and student achievement in mathematics: Evidence from the ECLS. Educational Assessment, 14(2), 78–102. http://dx.doi.org/10.1080/
10627190903039429.
May, S. (n.d.). Assessment: What are the cultural issues in relation to Pasifika, Asian, ESOL, immigrant and refugee learners? University of Waikato. Retrieved June 2016, from http://assessment.tki.org.nz/Media/Files/May-S.-Assessment-what-are-the-cultural-issues-in-relation-to-Pasifika-Asian-ESOL-immigrant-and-refugee-learners-University-of-Waikato.
McGrady, P. B., & Reynolds, J. R. (2013). Racial mismatch in the classroom beyond Black-white differences. Sociology of Education, 86(1), 3–17.
McKown, C., & Weinstein, R. S. (2008). Teacher expectations, classroom context, and the achievement gap. Journal of School Psychology, 46(3), 235–261.
Meisels, S. J., Bickel, D. D., Nicholson, J., Xue, Y., & Atkins-Burnett, S. (2001). Trusting teachers' judgments: A validity study of a curriculum-embedded performance assessment in kindergarten to grade 3. American Educational Research Journal, 38(1), 73–95. http://dx.doi.org/10.3102/00028312038001073.
Methe, S. A., Hintze, J. M., & Floyd, R. G. (2008). Validation and decision accuracy of early numeracy skill indicators. School Psychology Review, 37(3), 359–373.
Ministry of Education. (2010). National standards. Retrieved from TKI: Te Kete Ipurangi http://nzcurriculum.tki.org.nz/National-Standards.
Ministry of Education. (2011). Overall teacher judgment. Retrieved from http://www.nzcurriculum.tki.org.nz/National-Standards/Key-information/Fact-sheets/Overall-teacher-judgment/.
Ministry of Education. (2012). National Standards: Key information. Retrieved from http://www.nzcurriculum.tki.org.nz/National-Standards/.
Ministry of Education, & NZCER. (2012). e-asTTle technical manual. Wellington, NZ: Ministry of Education.
Ministry of Education. (2016). Find a school. Retrieved from Education Counts https://www.educationcounts.govt.nz/find-school.
Nieto, S. (2000). Placing equity front and center: Some thoughts on transforming teacher education for a new century. Journal of Teacher Education, 51(3), 180–187.
Ogle, L., Sen, A., Pahlke, E., Jocelyn, L., Kastberg, D., Roey, S., & Williams, T. (2003). International comparisons in fourth grade reading literacy: Findings from the Progress in International Reading Literacy Study (PIRLS) of 2001 (NCES 2003-073). Washington, DC: U.S. Government Printing Office.
Organisation for Economic Co-operation and Development (OECD). (2005). PISA 2003 technical report. Paris, France: OECD Publications.
Organisation for Economic Co-operation and Development (OECD). (2013). PISA 2012 results: Excellence through equity: Giving every student the chance to succeed (Vol. 2). Paris, France: OECD Publications. http://dx.doi.org/10.1787/9789264201132-en.
Özerk, K., & Whitehead, D. (2012). The impact of national standards assessment in New Zealand, and national testing protocols in Norway on indigenous schooling. International Electronic Journal of Elementary Education, 4(3), 545.
Parsons, S., & Hallam, S. (2014). The impact of streaming on attainment at age seven: Evidence from the Millennium Cohort Study. Oxford Review of Education, 40(5), 567–589. http://dx.doi.org/10.1080/03054985.2014.959911.
Peterson, E. R., Rubie-Davies, C. M., Osborne, D., & Sibley, C. (2016). Implicit and explicit teacher expectations: Relations with ethnicity and achievement. Learning and Instruction, 42, 123–140. http://dx.doi.org/10.1016/j.learninstruc.2016.01.010.
Raphael, T. E., Au, K. H., & Goldman, S. R. (2009). Whole school instructional improvement through the standards-based change process. In J. Hoffman, & Y. Goodman (Eds.), Changing literacies for changing times (pp. 198–229). London: Routledge.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage Publications.
Ready, D. D., & Wright, D. L. (2011). Accuracy and inaccuracy in teachers' perceptions of young children's cognitive abilities: The role of child background and classroom context. American Educational Research Journal, 48(2), 335–360. http://dx.doi.org/10.3102/0002831210374874.
Rubie-Davies, C. M. (2010). Teacher expectations and perceptions of student attributes: Is there a relationship? British Journal of Educational Psychology, 80(1), 121–135.
Rubie-Davies, C. (2014). Becoming a high expectation teacher: Raising the bar. New York, NY: Routledge.
Rubie-Davies, C. M., Hattie, J., & Hamilton, R. (2006). Expecting the best for students: Teacher expectations and academic outcomes. British Journal of Educational Psychology, 76(3), 429–444.
Rubie-Davies, C. M., Peterson, E. R., Flint, A., Garrett, L., McDonald, L. M., Watson, P., & O'Neill, H. (2012). Ethnicity and teacher expectations in New Zealand. Procedia - Social and Behavioral Sciences, 69, 256–261.
Rubie-Davies, C. M., Peterson, E. R., Sibley, C. G., & Rosenthal, R. (2015). A teacher expectation intervention: Modelling the practices of high expectation teachers. Contemporary Educational Psychology, 40, 72–85.
Rubie-Davies, C. M., Weinstein, R. S., Huang, F. L., Gregory, A., Cowan, P. A., & Cowan, C. P. (2014). Successive teacher expectation effects across the early school years. Journal of Applied Developmental Psychology, 35(3), 181–191.
Satherley, P. (2006). Student outcome overview 2001–2005: Research findings on student achievement in reading, writing and mathematics in New Zealand schools. Wellington, NZ: Ministry of Education, Research Division.
Schmidt, W., Burroughs, N., Zoido, P., & Houang, R. (2015). The role of schooling in perpetuating educational inequality: An international perspective. Educational Researcher, 44(7), 371–386. http://dx.doi.org/10.3102/0013189X15603982.
Sharpley, C. F., & Edgar, E. (1986). Teachers' ratings vs standardized tests: An empirical investigation of agreement between two indices of achievement. Psychology in the Schools, 23(1), 106–111. http://dx.doi.org/10.1002/1520-6807(198601)23:1<106::AID-PITS2310230117>3.0.CO;2-C.
Smith, L. A., Anderson, V., & Blanch, K. (2016). Five beginning teachers' reflections on enacting New Zealand's national standards. Teaching and Teacher Education, 54, 107–116.
Statistics New Zealand. (2016). Time series data for student numbers. Retrieved 23/11/2016, from Education Counts www.educationcounts.govt.nz/statistics/schooling/student-numbers/6028.
Südkamp, A., Kaiser, J., & Möller, J. (2012). Accuracy of teachers' judgments of students' academic achievement: A meta-analysis. Journal of Educational Psychology, 104(3), 743–762. http://dx.doi.org/10.1037/a0027627.
Südkamp, A., Kaiser, J., & Möller, J. (2014). Teachers' judgments of students' academic achievement. In S. Krolak-Schwerdt, S. Glock, & M. Böhmer (Eds.), Teachers' professional development: Assessment, training, and learning (pp. 5–25). Rotterdam, NL: Sense Publishers.
Thrupp, M. (2013). National standards for student achievement: Is New Zealand's idiosyncratic approach any better? Australian Journal of Language and Literacy, 36(2), 99–110.
Wiliam, D., & Bartholomew, H. (2004). It's not which school but which set you're in that matters: The influence of ability grouping practices on student progress in mathematics. British Educational Research Journal, 30(2), 279–293. http://dx.doi.org/10.1080/0141192042000195245.
Wilson, A., Madjar, I., & McNaughton, S. (2016). Opportunity to learn about disciplinary literacy in senior secondary English classrooms in New Zealand. The Curriculum Journal, 27(2), 204–228. http://dx.doi.org/10.1080/09585176.2015.1134339.
Woltman, H., Feldstain, A., MacKay, J. C., & Rocchi, M. (2012). An introduction to hierarchical linear modeling. Tutorials in Quantitative Methods for Psychology, 8(1), 52–69.
Wyatt-Smith, C., Klenowski, V., & Gunn, S. (2010). The centrality of teachers' judgment practice in assessment: A study of standards in moderation. Assessment in Education: Principles, Policy & Practice, 17, 59–75.
Wylie, C., Cosslett, G., & Burgon, J. (2016). New Zealand principals: Autonomy at a cost. In A decade of research on school principals (pp. 269–290). Springer International Publishing.
