
On the Chopping Block: Examining the Fairness of Observational Data of Teacher Effectiveness

Cody B. Wilson, Rebekah J. Thomas, & Laura P. Capps, Purdue University

Introduction
Teacher accountability has recently become one of the greatest concerns in educational policy. No Child Left Behind, the 2002 reauthorization of the Elementary and Secondary Education Act, added extensive accountability requirements that teachers and schools must adhere to or risk the loss of federal funding. The current federal education initiative, Race to the Top, also requires that states use teacher effectiveness (TE) measures to assess educators in order to receive full funding. Measures of TE have been used in research for decades but are only now reaching the forefront of policy and decision making. This calls for rigorous research and evaluation of these measures in order to determine their appropriateness for teacher accountability purposes. The accuracy of measures of TE needs to be supported and substantiated by research, since they are often used to make high-stakes decisions in the careers of teachers. These decisions could include (but are not limited to) compensation, promotion, public labeling, and termination. Use of TE measures for purposes beyond the ones for which they were originally developed could lead to erroneous decisions, some of which may have devastating consequences for teachers. A number of currently used measures of TE suggest that observing teachers once or twice a semester can yield a reliable score of TE (Pianta, 2008; RISE Evaluation Development System, 2013; Sawada & Piburn, 2000). These claims assume that teachers do not have lower-than-average or exceptional days, and that a small window of time can represent a teacher's overall effectiveness. Our research investigates this particular issue with the Classroom Assessment Scoring System (CLASS; Pianta, LaParo, & Hamre, 2008), a measure of TE now used in classrooms in over 40 states (Teachstone, 2013).
Curby, Grimm, and Pianta (2010) previously investigated the stability of TE scores over time, although their study focused primarily on observing and evaluating teacher-student interactions during the first two hours of a preschool day. When analyzing the data, that study also averaged TE scores observed within the same day across a large number of teachers. Because the real-world implications of a teacher's score remain at the individual level, our research uses a single-subject design to evaluate individual teachers' variability in TE scores over time. We hypothesize that if CLASS can be used as a reliable observational assessment of TE, scores should remain stable for an individual teacher over time.

Results
Analytic Strategy
Previous tests of CLASS have used large samples of teachers to obtain averages across the three domains in order to generalize their findings to the entire population of teachers. However, aggregated scores are minimally informative when used for real-world purposes (i.e., accountability) because they provide no information about an individual teacher's progress or patterns of performance over time. Thus, to evaluate our data, we chose a time-series, single-case research design (Gast, 2010). This allowed for an in-depth examination of the two teachers over time and is consistent with how the scores are used for teacher evaluation in educational contexts. We examined trends in the data by evaluating two pieces of evidence for each of the three CLASS domains. First, we examined the plots for evidence of changes in slope (i.e., decline or growth in scores over time). Next, we evaluated the variability (fluctuations) in each teacher's daily scores by creating a "stability envelope" (Gast, 2010) around each teacher's average score. To construct the stability envelope, we followed recommendations (Gast & Spriggs, 2010) pertaining to stability ranges for free operant behaviors (i.e., teacher behaviors that are shaped by classroom consequences and may recur in the course of instruction) when the data are based on more than five observations per participant. We thus created a 10% stability envelope around each teacher's mean score in each CLASS domain. Specifically, after calculating and plotting each teacher's score for each lesson, we computed the envelope by adding 10% of the mean to the mean score and subtracting 10% of the mean from it, and plotted these two constants. The space between these two lines is the stability range. To measure how stable a teacher's scores were, we documented the percentage of times that her scores fell within her own stability range.
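The envelope arithmetic described above can be sketched in a few lines of Python. The per-lesson scores below are hypothetical placeholders for illustration only, not the study's data:

```python
def stability_envelope(scores, pct=0.10):
    """Return (lower, upper) bounds of a stability envelope:
    the mean minus/plus the given percentage of the mean (Gast, 2010)."""
    mean = sum(scores) / len(scores)
    margin = pct * mean
    return mean - margin, mean + margin

def percent_within(scores, pct=0.10):
    """Percentage of a teacher's scores falling inside her own envelope."""
    lo, hi = stability_envelope(scores, pct)
    inside = sum(1 for s in scores if lo <= s <= hi)
    return 100.0 * inside / len(scores)

# Hypothetical per-lesson domain scores on the CLASS 1-7 scale:
lesson_scores = [5.5, 6.0, 4.5, 5.8, 6.2, 5.0, 5.6, 5.4]
lo, hi = stability_envelope(lesson_scores)
print(f"envelope: {lo:.2f}-{hi:.2f}, within: {percent_within(lesson_scores):.0f}%")
```

With these placeholder scores the mean is 5.5, so the envelope runs from roughly 4.95 to 6.05, and 6 of the 8 scores (75%) fall inside it.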

Teacher B

Table 2.1: Descriptive Data (Teacher B)

                         Mean   Median  SD    Min   Max   Avg. +10%  Avg. -10%
Emotional Support        5.58   5.88    0.50  4.75  6.25  6.14       5.03
Classroom Organization   5.62   5.67    0.66  3.83  6.50  6.18       5.05
Instructional Support    3.96   4.33    0.78  2.33  4.83  4.36       3.57

[Figure: Teacher B, Emotional Support Stability Range; x-axis: Lesson (1-11), y-axis: CLASS Score]

Table 2.2: Percentage of Scores within Stability Range (Teacher B)

Emotional Support        73%
Classroom Organization   73%
Instructional Support    27%

Teacher A

Table 1.1: Descriptive Data (Teacher A)

                         Mean   Median  SD    Min   Max   Avg. +10%  Avg. -10%
Emotional Support        5.18   5.07    0.53  4.50  6.13  5.70       4.66
Classroom Organization   5.31   5.33    0.31  4.67  5.78  5.84       4.78
Instructional Support    3.74   3.67    0.53  3.11  4.67  4.12       3.37

Table 1.2: Percentage of Scores within Stability Range (Teacher A)

Emotional Support        62.5%
Classroom Organization   88%
Instructional Support    50%

[Figure: Teacher B, Classroom Organization Stability Range; x-axis: Lesson, y-axis: CLASS Score]
[Figure: Teacher A, Classroom Organization Stability Range; x-axis: Lesson, y-axis: CLASS Score]
Description of CLASS
CLASS is an observation-based system in which observers rate teachers numerically (1-7) on ten dimensions grouped into three broader domains: Emotional Support, Classroom Organization, and Instructional Support (see Figure 1). Ratings on each dimension are derived from specific behavior markers that are observable during classroom events. CLASS scoring is based on 15-20 minute observation cycles within the lesson. That is, following the first cycle, observers provide ratings on each of the ten dimensions, then proceed to the next observation cycle, and so on. Raters first decide whether a teacher is in the High, Mid, or Low range for a particular dimension. A rating in the High category means that many of the behavior markers for that dimension are nearly constantly present, while a Low rating indicates that few, if any, of the behavior markers are present, or that they are rarely present. Once an effectiveness range is determined, raters give a specific numerical rating (1-7): scores of 1-2 are in the Low range of TE, whereas scores of 3-5 and 6-7 are in the Mid and High ranges, respectively. This process is repeated for each 15-20 minute observation cycle. Once all cycles have been completed and scored, dimension scores are averaged to determine a composite score in each of the three domains.

Emotional Support (Teacher B):
- 9 / 11 lessons were in the mid-range
- 2 / 11 lessons were in the high range
- 3 / 11 lessons were outside the teacher's stability range

[Figure: Teacher A, Emotional Support Stability Range; x-axis: Lesson, y-axis: CLASS Score]

Classroom Organization (Teacher B):
- 8 / 11 lessons were in the mid-range
- 3 / 11 lessons were in the high range
- 3 / 11 lessons were outside the teacher's stability range

[Figure: Teacher B, Instructional Support Stability Range; x-axis: Lesson, y-axis: CLASS Score]

Instructional Support (Teacher B):
- 9 / 11 lessons were in the mid-range, but all scores were lower than the upper end (5) of the mid-range
- 2 / 11 lessons were in the low range
- 7 / 11 lessons were outside the teacher's stability range

[Figure: Teacher A, Instructional Support Stability Range; x-axis: Lesson, y-axis: CLASS Score]

Figure 1: Overview of CLASS Domains and Dimensions

Method
Participants
Two teachers (Teacher A and Teacher B) were purposively chosen from a database of 12 teachers who participated in a study of early science learning (Mantzicopoulos, Patrick, & Samarapungavan, 2008). The specific criteria for selection were as follows:
- gender (both teachers are female);
- years of experience (both teachers have over 20 years of experience);
- school context (both teachers have worked in the same school for over 20 years);
- grade-level assignment (both Teacher A and Teacher B taught half-day kindergarten in the morning);
- student background (students taught by Teacher A and Teacher B come from comparable socioeconomic backgrounds and comparable achievement levels);
- time of year (both teachers were observed during the spring semester, to avoid confounds such as teacher inexperience with specific student behavior and other situations unique to the beginning of the school year);
- curriculum (the content of the curriculum presented by the teachers was comparable: each teacher taught a sequence of lessons on life science, and we observed and scored every lesson from the beginning of the unit to its conclusion).

Observation Measure and Procedures
The CLASS, version K-3, was used to code a sequence of video-recorded lessons from each teacher. Eight consecutive lessons were observed for Teacher A and 11 for Teacher B. Teacher A's lessons ranged from 19 to 51 minutes (M = 34), whereas Teacher B's lessons ranged from 25 to 68 minutes (M = 46). Consistent with the CLASS observation protocol, each lesson was divided into cycles of approximately equal duration (15-20 minutes each). To ensure reliability, the lessons were scored by trained and certified CLASS observers. Formal certification requires observers to agree with a theoretical true score at least 80% of the time; this theoretical true score is determined by Teachstone using master coders for the purposes of reliability measurement (Teachstone, 2013). The observers in our study were also trained to remain within 90% reliability of each other.

Emotional Support (Teacher A):
- 1 / 8 lessons was in the high range
- 7 / 8 lessons were in the mid-range
- 3 / 8 lessons were outside the teacher's stability range

Classroom Organization (Teacher A):
- All lessons were in the mid-range
- 1 / 8 lessons were outside the teacher's stability range

Instructional Support (Teacher A):
- All lessons were in the mid-range, but all scores were lower than the upper end (5) of the mid-range
- 4 / 8 lessons were outside the teacher's stability range

Conclusions
- Both Teacher A and Teacher B show some degree of instability in all three domains. The most pronounced instability is in the Instructional Support domain: only 50% of Teacher A's lesson scores and only 27% of Teacher B's lesson scores fell within each teacher's own stability range. The data therefore do not support the inference that teachers' instructional strategies are highly stable from lesson to lesson, even when the lessons follow a thematic sequence within a single unit.
- Our findings gain some external validity from comparable studies reporting similar results. In a study of 38 teachers in Germany, Praetorius et al. (2014) of the University of Augsburg found that observational scores on certain dimensions of instructional quality (comparable to what CLASS operationalizes as Instructional Support) were highly variable. This supports our finding of high instability in the Instructional Support domain.
- Of interest, and consistent with other studies of the CLASS (e.g., Curby et al., 2010; Plank et al., 2013), scores on Instructional Support were not only variable from one lesson to the next but also lower, on average, than scores on Emotional Support and Classroom Organization. This is concerning, considering that Instructional Support reflects strategies that support student learning (e.g., higher-level questions, use of advanced language, supporting connections with the real world) and should be predictive of achievement.
- The documented variability of the Instructional Support domain is enough to reject our original hypothesis: that if these measures are reliable sources of information for high-stakes decisions, scores for individual teachers should remain stable over time. Recall that Teacher A and Teacher B were selected specifically for their predicted stability (each having taught for over 20 years within the same school and at the same grade level, teaching a thematically grouped set of lessons in a single unit). Our findings therefore raise many concerns regarding the fairness of this, and similar, observation-based assessments of TE.

Figure 1 source: Reproduced from Table 2.1 in Pianta, LaParo, and Hamre (2008, p. 17).
