
A Multi-Disciplined Review of the Student Teacher Evaluation Process

There have been divisive issues within academics in the past, but few have been as well researched and long lasting as the debate over the student teacher evaluation process (SET). Seldin (1993) reported that four of five (80%) campuses use some sort of student evaluation of instructors. Business education appeared to be ahead of the curve; one year later about 95% of the deans of accredited business schools used the evaluations as a source of teaching information (Crumbley 1995). Shortly thereafter, a study by the Carnegie Foundation found 98% of universities were using some form of student evaluation of classroom teaching (Magner 1997). About the same time, another study found that 99.3% of business schools used some form of student evaluation of teaching, and that deans placed a higher importance on these evaluations than on either administrative or peer evaluations (Comm & Manthaisel 1998). On many campuses, student evaluations of instruction are the most important, and in many cases the only, measure of teaching ability (Wilson 1998). Seldin (1999) reports a California dean as saying, "If I trust one source of data on teaching performance, I trust the students" (p. 15).

As would be expected, an assessment that could determine tenure, promotion, and merit pay has been extensively researched. One source estimated that close to 3,000 articles were published on SET from 1990 to 2005 alone (Al-Issa & Sulieman 2007). Reports on the topic are so voluminous that many researchers have for some time been using the method of meta-analysis, in which the case is not a subject but an entire published article (see Clayson 2009; Cohen 1980; Cohen 1981; Feldman 1989 as examples). Nevertheless, little agreement has been reached on key points.

The defenders of the system are usually found in the colleges of education, in the national teachers' unions, and among those who consult in the area. Some have defended the evaluations almost as if they were religious tenets. These advocates typically have an advantage in the publication process, since instructional research is the essential academic work of their profession. Other disciplines generally look upon research on instruction as less prestigious, and those opposed to the evaluation process are more dispersed among academic disciplines and more isolated in their publication sources. They are, however, equally emphatic. In such an environment, it becomes relatively easy to select research findings that reinforce a point of view, or to find reports that counter the positive findings of the opposition. The following summary of the evaluation process is not free of these problems, but it does attempt to present information from a wider assortment of venues than much of what is found in the traditional educational discipline outlets.

Gender
Do men get better evaluations than women?

Research on the question of whether gender creates differences in student ratings of professors and instruction has provided mixed results. Defenders of the process have generally stated that there is little evidence of gender related effects (Feldman 1993), while admitting that there may be a difference in the way male and female students rate faculty (Cashin 1995). Differences were found early in gender research, leading researchers to generally conclude that male instructors were evaluated more highly than females (Bernard et al. 1981). Others (Bennett 1982), however, suggested that the existing literature base before the 1980s offered little evidence that women received systematically lower marks from students than men, but that students did seem to prefer same gender instructors. It was found that sex role stereotypes more strongly influence evaluation of female instructors than those of male professors. This finding was reinforced by a later study (Kierstead et al. 1988) suggesting that women instructors had to work harder to attain the same ratings given to men. Basow and Silberg (1987) reported that male students gave lower ratings to female professors. There is also a tendency for female students to select women instructors as better teachers (Basow 1998). Female students have been shown to give female instructors higher evaluations than they did to male instructors, and higher evaluations than males for the same female instructors. Both male and female students thought that instructors of their own gender showed more interest in students (Elmore & LaPointe 1975). Basow (2000) later published findings that female professors were chosen as the best instructor more by women students than male students, who chose more males as the best instructor. She found no gender differences in selections of the worst professors. The best female instructors were considered to be much more helpful than the best male professors, indicating a subtle gender stereotyping by the students. In a large study of almost 3,000 respondents, Langbein (1994) found that female instructors were rewarded, relative to men, for being supportive and displaying nurturing behavior. Women instructors were punished, relative to men, for objective and authoritarian behavior. She also found that women were given less of a boost in their evaluation for good grades than were men. In a study of introductory management and marketing classes, Foote, Harmon and Mayo (2003) found that student gender role attitudes did not affect student evaluation of instructors. The only marketing study using students to look at the gender of instructors (Clayson & Glynn 1993) found no gender main effects of the global ratings either by sex of instructor or by sex of the student, however significant interaction effects were found. Students perceived instructors of their own gender to be better teachers. Moreover, male and female students used different combinations of personality traits in their determination. Male students indicated that a good male instructor would be experienced, approachable, likable, direct, and hard working. Male students perceived a female professor as being a better instructor if she was confident, fair, and not talkative. Female students found a good male instructor to be ambitious, yet sensitive and concerned. A good female instructor was judged on only two social interactive variables; concerned and likable. Students, both male and female, did, however, expect male professors to be more successful. 
When students were allowed to pick professors by looking at their pictures, a male photo was chosen 73% of the time as representing the best professor. A male photo was chosen 80% of the time as representing the person most likely to be making more consulting money. A female photo was chosen 71% of the time as representing a professor who was denied tenure (Clayson 1992). A recent unpublished study found a positive relationship between the judged attractiveness of faculty and the evaluations they receive; the effect seems to be stronger for male instructors than for female instructors (Hamermesh & Parker 2004).

Summary
1. Although historically male professors may have been preferred in evaluations, there currently appear to be few global differences between the mean scores for male and female instructors. However, students seem to believe that male professors will be more successful.
2. Students do seem to prefer instructors of the same gender.
3. Female instructors may need to display more gender-stereotypic behaviors than males to receive high evaluations.
4. If a male and a female instructor receive the same global evaluation, the reasons for that evaluation are highly likely to be different.

Research
Do professors who do research get better student evaluations? Many faculty will argue that doing research makes them better teachers. There are logical reasons for holding this belief. A researcher is more likely to be current with the discipline, to have more information to impart to students, and should be better able to guide students within the community of the discipline. On the other hand, some seem to believe that being a good teacher is incompatible with being a good researcher, primarily because of time constraints. Are these beliefs reflected in student evaluations of instruction? The answer appears to be no. There is no evidence that faculty research harms the evaluations, but it has been found that creative researchers and effective teachers have distinctively different personality profiles, with the teachers' pattern more conducive to higher evaluations (Jackson 1994). Feldman (1987) concluded that "the likelihood that research productivity actually benefits teaching is extremely small or that the two, for all practical purposes, are essentially unrelated... Productivity in research and scholarship does not seem to detract from being an effective teacher" (p. 274). This opinion was confirmed by Hattie and Marsh (1996), who performed a meta-analysis of 58 studies looking at the relationship between research and teaching. The average correlation was close to zero, with a median of zero. Zero relationships were more likely to occur at research universities and when the rapport aspects of the evaluation were emphasized. Good researchers were rated as more enthusiastic and more knowledgeable. This literature did find a negative relationship between time spent on research and time spent on teaching. Later, Marsh and Hattie (2002) conducted a study of over 12,000 student evaluations and again found no association between teaching evaluations and research productivity: "In contrast to the apparent academic myth that research productivity and teaching effectiveness are complementary constructs, results of the present investigation coupled with the findings of Hattie and Marsh's (1996) meta-analysis provide strong support for the typical finding that the teaching-research relationship is close to zero" [p. 628]. Marsh and Hattie suggest several explanations, but never hypothesize that the evaluations may not be valid.

Summary
1. Research does not seem to benefit or harm student evaluations of instruction.
2. Except for some personality-related issues, research productivity itself appears to be unrelated to the evaluation of instruction.

Personality
Do personality traits influence the evaluation process? To the extent that personality is considered to be an intrinsic, personal, and long-lasting variable, its influence on the student evaluation of instruction can create troubling questions. For example, if personality is highly related to the evaluation that an instructor receives, is it possible to make long-term changes in an individual instructor's evaluations? Can traits that would raise evaluations be taught? If not, does the current evaluation process suggest that teaching should be thought of as a vocation, and if so, what is the purpose of a college of education? It is understandable, therefore, that defenders of the evaluation process would be sensitive to this issue. In general, researchers publishing from within educational disciplines report finding few personality traits that correlate with student ratings (Boice 1992; Braskamp & Ory 1994; Centra 1993). Yet it has also been reported that students form their opinions of a class and instructor very early in a course, and subsequent class and learning experiences do little to change that opinion (Ortinau & Bush 1987; Hewett, Chastain & Thruber 1988; Sauber & Ludlow 1988). Further, some studies have manipulated classroom conditions and found interesting effects of instructors' personalities (Naftulin, Ware, & Donnelly 1973; Widmeyer & Loy 1988). One study by Harvard psychologists (Ambady & Rosenthal 1993) investigated students' reactions to randomly selected 30-second clips of soundless videotapes of actual classroom instruction and found them highly correlated with end-of-course evaluations. Evaluations based on 30-second exposures were not significantly better than judgments based on 6-second clips. Personality traits identified by independent raters were highly correlated with the evaluations. In order, these traits were: optimistic, confident, dominant,
active, enthusiastic, and likeable. The raters judgment of professional was not significantly correlated with the student evaluations. Given its centrality to the issue of validity, surprisingly few research articles have attempted to statistically relate personality to the evaluations. Those that have been published have found a large association between personality variables and the evaluation outcomes, estimating that personality accounts for over 50% of the total variance of the evaluations (Erdle, Murray, & Rushton 1985; Feldman 1986; Murray, Rushton, & Paunonen 1990; Sherman & Blackburn 1975, Marks 2000)! Feldman (1986) summarized the research through the mid-80s. He found that there was no evidence for a personality/evaluation association if personality was measured from the instructors perspective. He did find that the instructors colleagues perceptions were more related to the evaluations given by students. The students perceptions, however, were strongly related to the evaluations. The total associative measures of the students perception of personality, according to Feldman, range from 0.77 to 0.88 accounting for from 60 to 75% of the variance. Feldman (1986) gives three interpretations for his findings. 1) The faculty personality is validly related to their teaching effectiveness. The instructor may show a different personality in class than in other environments thus accounting for the lack of associations with the faculty members own evaluations. 2) Both the measures of personality and teaching effectiveness are contaminated by other variables. For example, an instructor may be liked for any number of reasons, so students rate them as a good teacher and then state that they have a pleasing personality. 3) This explanation agrees that the results of the studies do say something valid about personality, but only the students perception of personality. As an example, an instructor presents well-ordered lectures. The lectures actually help the students to learn more. Thus the instructor gains a high evaluation and the students also believe that the instructor is an organized person. In a study of business students, Clayson and Sheffet (2006) compared measures of personality and evaluations at four different times during a term. They compared change in the students perception of personality with change in the evaluations in the last six weeks of the term. The changes were highly related, and changes were found in all instructors going up and down during the same time period. In other words, even after the midterm, changes in evaluations (both negative and positive) for each instructor were highly related to changes in the students perception of the instruction. The study ruled out the possibility that the personality/evaluation association was a statistical artifact resulting from insufficient control of secondary variables. The researchers maintained that these findings are incompatible with Feldmans explanation one and three. One could still argue that true personality should not be changing over short periods of time, therefore Clayson et al. (2005) merely showed that the students perception of personality is contaminated, but this explanation only reinforces the contention that student see instruction primarily as an aspect of personality. One large study of marketing students has been conducted using a path analysis. 
This research found that the total effect of personality on the student evaluation of faculty was very high, with each standard deviation change in personality resulting in a 0.83 standard
deviation change in the evaluations. Personality was also found to be significantly related to every other factor in the study, including the students perception of the instructors knowledge and fairness. It was negatively related to rigor, and positively related to the students perception of how much they had learned (Clayson & Haley 1990). Further research reinforced these findings. The evaluations have been found to be remarkably consistent for each instructor, even over periods of time as long as 13 years. Previous teaching experience was not related to this consistency (Marsh & Hocevar 1991). Since most professionals are assumed to improve their performance with constant practice, what could the evaluations be measuring that would not change? Business students were asked to evaluate instructors, based on their own experience, on a number of traits in terms of how those traits would change over time. Instructor and instructional related attributes were perceived as changing over time with experience. The knowledge of the instructor, and the ability to organize a class was seen as increasing over time. The students felt that they would learn more from an instructor with these characteristics. They also felt that experienced instructors would be fairer in testing and grading than less experienced professors. Characteristics that reflected some aspect of interpersonal relationships were least likely to change. Instructor descriptions that contained words such as responsive, interesting, cares, stimulating, and open were selected as characteristics that the students had not seen changing with more experienced instructors. Since the mean scores change very little over time, the students were essentially reinforcing the notion that the evaluations are heavily biased towards personality variables and are less influenced by the instructors perceived knowledge, fairness, or even the perception of student learning (Clayson 1999).

[Figure: cross-lagged model of instructor personality (PW0, PW10, PW16) and evaluation of instruction (EVW0, EVW10, EVW16), measured before the class, at midterm, and at the end of the term; path coefficients range from 0.05 to 0.78 (** p < .01, *** p < .001).]

To put these relationships into perspective, the longitudinal data of Clayson and Sheffet (2006) are presented in the figure above as a cross-lagged model. Measurements were made of the instructor's personality (P) and the evaluation of the instruction (EV) before the class began (PW0 and EVW0), at midterm (PW10 and EVW10), and at the end of the term (PW16 and EVW16). The measures shown are associational coefficients, with ** meaning p < .01 and *** meaning p < .001. The model accounts for 66 percent of the total variance of personality at the end of the term and 70 percent of the total variance of the final evaluation (AGFI = 0.87). The pertinent aspect for this discussion is the set of coefficients between the personality measures and the evaluation. Note that the initial evaluation is a significant predictor of the measure of future personality at midterm, but that the initial measure of personality is not directly related to the future measure of evaluation at midterm. The same pattern can be found between the midterm measures and the final evaluations. This model suggests that the evaluation of an instructor is more predictive of subsequent measures of personality than personality is of the evaluations. In other words, students appear to be making future judgments of an instructor's personality from their current impression of the instructor as a teacher, not the other way around. It is also interesting to note, in light of later discussions of validity, that this model predicts that the initial evaluations of the instructor, made before any instruction and even before the syllabus was distributed, are significantly related to the final evaluations (Final Eval = 0.06 PW0 + 0.16 EVW0; t-values for the coefficients: 0.06, t = 1.20, and 0.19, t = 3.89; R2 = 0.06). Many educators simply assume, a priori, a relationship between effective teaching and personality characteristics. Lantos (1997) encourages instructors to use humor, fun and games, learning students' names, and being genuine as methods of motivating students. After reviewing the literature and their own study, Foote, Harmon, and Mayo concluded that "those [instructors] who score highly on evaluations may do so not because they teach well, but simply because they get along well with students" (2003, p. 17).

Summary
1. The issue of personality as a causal factor in the outcome of the evaluations has divided the defenders and the critics of the process. Defenders have suggested that the effects of personality are small and that, in general, those traits that do influence the evaluations can be modified. Critics maintain that the effect is large and that certain traits that influence the outcome of the evaluations may be relatively permanent and ultimately not pertinent to what the evaluations should be measuring.
2. Although there are fewer studies in this area than in others, those that exist have consistently found highly significant effects of student perceptions of personality. To a certain extent, personality seems to be the lens through which students look at other instructor variables when making their evaluations of instruction. In studies of business students, the impact of personality has been so strong and consistent that the SET instrument could be replaced with a personality inventory with little loss of predictive validity.
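The cross-lagged comparison above can be made concrete with ordinary regressions. The sketch below is only an illustration: it generates synthetic wave data (the generating weights are invented, not the published coefficients) and then regresses each later measure on both earlier measures, which is the comparison the argument turns on. A full structural model with fit statistics such as AGFI would require an SEM package; plain least squares is used here as a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # hypothetical number of student respondents

# Synthetic wave data standing in for the Clayson & Sheffet style measures:
# P = perceived personality, EV = evaluation of instruction, at weeks 0 and 10.
# The generating weights below are illustrative only, not the published values.
P0 = rng.normal(size=n)
EV0 = rng.normal(size=n)
P10 = 0.5 * P0 + 0.25 * EV0 + rng.normal(scale=0.8, size=n)
EV10 = 0.05 * P0 + 0.20 * EV0 + rng.normal(scale=0.9, size=n)


def standardize(x):
    return (x - x.mean()) / x.std(ddof=1)


def cross_lagged(y, x1, x2):
    """OLS of standardized y on standardized x1 and x2 (plus an intercept)."""
    y, x1, x2 = map(standardize, (y, x1, x2))
    X = np.column_stack([np.ones_like(x1), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1], beta[2]  # standardized path coefficients for x1, x2


# Does the earlier evaluation predict later personality more strongly than
# the earlier personality predicts the later evaluation?
p_from_ev, p_from_p = cross_lagged(P10, EV0, P0)
ev_from_p, ev_from_ev = cross_lagged(EV10, P0, EV0)
print(f"EV0 -> P10 path: {p_from_ev:.2f}   P0 -> P10 path: {p_from_p:.2f}")
print(f"P0 -> EV10 path: {ev_from_p:.2f}   EV0 -> EV10 path: {ev_from_ev:.2f}")
```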

Rigor
Does academic rigor lower the evaluations?

According to Cashin (1995), there is a correlation between rigor (defined as workload and difficulty) and the evaluations, but contrary to faculty belief the correlation is positive (Sixbury & Cashin 1995). Others, especially those studying business students (Clayson & Haley 1990; Clayson 2002), have found significant negative associations between rigor and the evaluations. A survey of students from three universities across the United States found that 30% admitted to purposely inflating evaluations because an instructor gave good grades; another 30% indicated that they had purposely lowered evaluations because the tests in the course were too hard. Fifty percent had done one or the other. Students did not believe that a demand for rigor was an important characteristic of a good teacher (Clayson 2001, 2005). Part of the difference may be due to methodology. For over twenty years, research on the evaluation process has been criticized for jumping to simple conclusions from the data (Howard & Maxwell 1980), using simple OLS solutions (Seiver 1983), and being subject to interpretation errors created by correlational studies (Gaski 1987). Another part may be due to definitions of rigor. Marsh and Roche (2000), defining rigor as light to heavy workload, found a positive relationship between rigor and the instructor evaluation, but a negative relationship with grades. Greenwald and Gillmore (1997) also found a negative workload/grade relationship. The results are counter-intuitive at two levels. First, as Greenwald and Gillmore point out, workload should be related to how much, on average, students learn. Second, learning is related to evaluations, so one would expect that a workload reported to be positively related to the evaluations would also be positively related to learning. Clayson and Haley (1990) found a non-significant relationship between rigor (defined primarily as the course and time demands of a class) and the evaluation using a relational LISREL model that essentially replicated a multiple regression between concepts, but found a significant negative relationship when the data were cast into a model consistent with theory, nomological networks, and data from students. Rigor, even defined in these harsh terms, was found to be significantly related to the student perception of learning, but was negatively linked to fairness, which made its total effect on the evaluation negative (a toy illustration of this path arithmetic appears at the end of this section). Marks (2000) essentially replicated this study and found a significant negative relationship between workload difficulty and fairness in grading, which was in turn significantly related to the overall evaluation. In another study (Clayson 1994), it was found that rigor had a negative OLS correlation with the evaluations (t = -5.87). When the same data were analyzed with a multiple regression, rigor still had a negative association with the evaluations, but its contribution was greatly reduced, to marginal significance (t = -2.09). A path analysis model that maximized the chi-square value of the data also resulted in a significant association with rigor, but at a much stronger level (t = -4.58). A later and interesting piece of research (Bacon & Novotny 2002) indicated that rigor interacts with a student personality variable. Their study found that a lenient instructor would increase his or her evaluations by attracting students low in achievement striving, but less so among students high in achievement motivation.
The overall effect of leniency among undergraduates, although relatively small combined across all respondents, was enough to
make a teacher appear distinctly above or below average because of the highly skewed nature of most evaluations. Students are attracted to classes that have a reputation for not being rigorous in grading. Wilhelm (2004) compared course evaluations, course worth, grading leniency, and course workload as factors in business students' choice of classes. A conjoint analysis showed that "students are 10 times more likely to choose a course with a lenient grader, all else being equal" (p. 24). Johnson (2003), after looking at the results of a very large study completed at Duke University, concluded, "The influence of grading policies on student course selection decision is substantial; when choosing between two courses within the same academic field, students are about twice as likely to select a course with an A- mean course grade as they are to select a course with a B mean course grade, or to select a B+ mean course grade over a B- mean course grade" [p. 193]. To a large extent, rigor's relationship to the evaluation process depends upon how it is measured and analyzed. A hint of what may be happening is found in two unpublished studies conducted by the writer. When rigor was measured during the course of the term, the relationship with the evaluation was directly negative. When the measurement was made after the fact (students had completed the course), the association between rigor and the evaluations was positive. There is also a hint in the data, reinforced by the findings of Marsh and Roche (2000), that the relationship has both linear and curvilinear components, which may suggest that different samples could be measuring differing effects depending upon the unique characteristics of the sample. Irrespective of the actual effect of rigor on SET, if faculty believe that a negative association exists, that belief alone will change behavior. Crumbley and Fliedner (2002) surveyed accounting administrators; almost 40% of the respondents were aware of instructors who had reduced their grading standards and course work content to improve SET scores. Given the pressure on administrators to measure classroom performance, this figure is probably an underestimate. There is, however, one aspect of rigor that has considerable research support: rigor can also be seen as reflected in grades. The grade-evaluation relationship will be discussed in the next section.

Summary
1. Rigor is a controversial variable in the evaluation debate. Some have claimed that there is a positive relationship between the two; others claim the relationship is negative.
2. Even though the degree of impact may depend upon the achievement motivation of the student, research with business students has shown a consistent negative relationship between rigor and evaluations.
3. The apparent impact of rigor is highly dependent on research and analysis methodology.
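The Clayson and Haley (1990) pattern described above, in which rigor is positively related to perceived learning yet carries a negative total effect on the evaluation, comes down to simple path arithmetic: the total effect is the direct path plus the products along each mediated path. The coefficients below are hypothetical, chosen only to show how the signs can combine this way; they are not the published estimates.

```python
# Illustrative only: hypothetical standardized path coefficients showing how
# rigor can be positively related to perceived learning yet have a negative
# total effect on the evaluation once the mediated paths are summed.
b_rigor_learning = 0.30   # rigor -> perceived learning
b_rigor_fairness = -0.40  # rigor -> perceived fairness
b_rigor_direct = 0.00     # direct rigor -> evaluation path (assumed zero here)
b_learning_eval = 0.25    # perceived learning -> evaluation
b_fairness_eval = 0.45    # perceived fairness -> evaluation

indirect_via_learning = b_rigor_learning * b_learning_eval   # +0.075
indirect_via_fairness = b_rigor_fairness * b_fairness_eval   # -0.180
total_effect = b_rigor_direct + indirect_via_learning + indirect_via_fairness

print(f"indirect via learning: {indirect_via_learning:+.3f}")
print(f"indirect via fairness: {indirect_via_fairness:+.3f}")
print(f"total effect of rigor on evaluation: {total_effect:+.3f}")  # negative
```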

Grades
Do the grades given to students change the evaluations? Many faculty believe that there is a relationship between the grades they give and evaluations students give to them. A survey conducted at a Western university found that over 65% of the surveyed faculty believed that higher standards for grades in classes would lower student evaluations (Birnbaum 2000). When asked if the student evaluation process encourages faculty to water down the content of their courses, 72% responded in the affirmative. Almost 49% of the faculty said they present less material in class than they used to, and about one third said they have lowered standards for students to get a passing grade (only 7% said they had raised standards). These perceptions are nothing new; over 30 years ago, 39% of the faculty at a Midwestern university reported that other faculty had indicated to them personally that they had changed grading standards, while 13% had heard reports of raising grading standards, as a response to being evaluated by students (Ryan, Anderson & Birchler 1980). Over ten years ago, 22% of academic psychologists admitted to giving easy courses or tests to ensure popularity with students (Redding 1998). Surveys of students have reinforced the facultys perceptions of a grade/evaluation connection. In one study, 70% of the students indicated that the grade they thought they would receive influenced the level at which they rated their instructors (Goldman 1985). A survey of members of the Academy of Marketing Science found that many faculty appear to believe that student evaluation of instruction has encouraged faculty to lower academic standards (Simpson & Siguaw 2000). Although educators admit there is a grade/evaluation association (Braskamp & Ory 1994; Marsh & Dunkin 1992), many continue to state that rigorous academic standards do NOT significantly change student teacher evaluations, especially if other variables like rigor and prior student interest are taken into account (Cashin 1995; Marsh & Dunkin 1992; Kaplan, Marsh & Roche 2000; Mets & Cook 2000). Marsh and Roche (1999) referred to the idea that academic rigor would result in lower student teacher evaluations as a presumption that is not supported by the research. They state that the grade/evaluation expectation correlation is small (r = 0.2, approximately 4% of the variance). The authors suggest that instructors use proven methods of improving their evaluation ratings rather than, . the ethically dubious, counterproductive tactics (easy workloads and grading standards) that apparently do not improve teaching effectiveness of SETs (p. 518). The debate has been complicated by research findings. Expected grades have been found to create a highly significant difference in the evaluations of business instructors (Goldberg & Callahan 1991). The final course grade has also been shown to have a negative impact on the evaluations (Bharadwaj et al. 1993). Gillmore and Greenwald (1999) reported that out of six published studies that manipulated grading leniency in actual classrooms, all found higher evaluations from students in the more lenient conditions. Marsh, Hau, Chung, and Siu (1997) recognized a highly significant difference between the grades
students indicated they received from those instructors chosen as good and poor teachers. They interpreted this finding as a reflection of course mastery and therefore as more evidence of the validity of the instruments. While they found that course grades were positively correlated with the students perception of learning, they were negatively correlated with rigor. An earlier study using meta-analysis (Cohen 1981) reported a correlation of 0.43 between achievement (measured by grades) and the instructor ratings, but found no correlation (r = -0.02) between difficulty and achievement. Clayson and Haley (1990) established that in marketing classes, academic rigor was not significantly related directly to teaching evaluations. However, they found that academic rigor was significantly positively related to learning, which was positively related to the evaluations. Rigor was negatively related to personality and fairness, both of which were positively related to the final evaluation outcome. The combined overall effect of rigor on the evaluations was significant and negative. In other words, students believed that they would learn more in a course with rigor, but they also reported that rigor would decrease their evaluation of the instructors personality and his or her fairness. In this study, personality accounted for so much of the variation of the final evaluation outcome that the overall effect of rigor was negative. Theories As it stands today, the major debate about the grade/evaluation association is not whether it exists, but what causes it. Five theories have been proposed (Clayson, Frost & Sheffet 2006; Greenwald & Gillmore 1997; Marsh & Roche 1997; Marsh & Roche 2000; Stumpf & Freedman 1979). 1. Leniency: The essence of this hypothesis is that students reward professors who give good grades. In other words, lenient professors will get better evaluations than more rigorous graders. It is important to this hypothesis to recognize that, it is not the grades per se that influence SETs, but the leniency with which grades are assigned (March & Roche 2000, p. 204). The validity of the evaluation process is compromised to the extent that this theory is true. 2. Interaction with prior characteristics: The leniency effect exists, but it is not real. It results either as a statistical artifact of other determining variables, or is largely modified to the point of practical insignificance by other variables (Sevier 1983). Variables would include the rigor of the instructors grading policies (Powell 1977; Stumpf & Freedman 1979), class workloads (Greenwald & Gillmore 1997; Marsh & Roche 2000; Schwab 1976), and prior student interest in the class (Marsh & Roche 2000). 3. Teaching effectiveness: Teaching effectiveness influences both the evaluations and grades. Good instructors create positive learning environments that are reflected in more positive grades. Defenders of the evaluation system tend to support this hypothesis (Cohen 1981; Marsh & Roche 1997) with some statistical support (Sevier 1983). 4. Motivation: The students level of motivation influences both evaluations and grades. More highly motivated students are expected to do better academically and to more
appreciate the efforts of the instructor. Certain instructors may attract motivated students or be better at motivating students than other instructors (Greenwald & Gillmore 1997). 5. Attribution: Since learning and achievement are difficult to evaluate, students may infer the ability of the instructor to teach and their level of learning from the grade they receive. This can be looked at in two different ways. Greenwald and Gillmore (1997) describe attribution as grades providing information to students about course quality and ability. Thus a student getting a good grade would attribute it to good performance and to good teaching. A poor grade would indicate a lack of learning and a poor instructor. Marsh and Roche (2000) looked at attribution as a psychological variable that predisposes students to attribute good grades to themselves and poor grades to an external source, i.e., the teacher. Thus students getting a good grade would attribute it to their own good performance while the teachers role is minimized. On the other hand, a low grade would be attributed to the instructor who becomes a poor teacher. 6. Reciprocity: There is another option that at first glance appears to be a leniency explanation, but which is distinctly different. The reciprocity hypothesis states that students have a tendency to give the instructor what they receive. Students reward instructors that reward them and punish those who apparently punish them. The general level of grading leniency of the instructor is not relevant, only the individual students grade. This difference is also reflected in methodology. As Marsh and Roche (2000) point out, the appropriate case for a study of leniency is a class. The appropriate case to study reciprocity is the student. These are statistically distinct concepts (Clayson 2007; Stumpf & Freedman 1979). Greenwald and Gillmore (1997) addressed most of these hypotheses. The teaching effectiveness explanation was the easiest to deal with. The hypothesis generated by this theory would be supported by a between-class positive correlation between evaluations and grades, but invalidated by a within-class positive correlation. Both of these have been found. Numerous studies conducted before 1977 were reviewed by Stumpf and Freedman (1979). The weighed average of these studies was r = 0.39 for the grade/evaluation relationship when the correlation was calculated between-classes and r = 0.11 when the correlation was calculated within-classes. Further, the theory would be damaged by a difference in correlations between relative vs. absolute grades, halo effects that exist within-class but not with between-class evaluations (Tang & Tang 1987; Orsini 1988), and negative between-class correlations between grades and workloads/difficulty, all of which have been found (Greenwald & Gillmore 1997). The halo effects and the negative gradeworkload/difficulty correlations create problems for both the motivation and attribution theories. The only hypotheses supported by all the data are from the reciprocity theories. Johnson (2003) reports on the results of a large study conducted at Duke University that found strong evidence of a leniency effect by using between class data. He also found evidence for a grade/evaluation effect with between student data suggesting the presence of a reciprocity effect.
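Because leniency is a class-level claim and reciprocity a student-level one, the two are usually examined with between-class and within-class correlations, as in the Stumpf and Freedman (1979) figures cited above. A minimal sketch of the two calculations follows, on synthetic data in which each student's evaluation responds only to his or her own grade; all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 40 classes, 30 students each.  Each student's evaluation
# depends on his or her own grade (a reciprocity-style effect); class mean
# grades differ, but no separate class-level leniency effect is built in.
n_classes, n_students = 40, 30
class_mean_grade = rng.normal(3.0, 0.4, size=n_classes)
grades = class_mean_grade[:, None] + rng.normal(0, 0.6, size=(n_classes, n_students))
evals = 0.4 * grades + rng.normal(0, 0.5, size=grades.shape)


def corr(x, y):
    return np.corrcoef(x, y)[0, 1]


# Between-class correlation: correlate class means (the "leniency" framing).
between = corr(grades.mean(axis=1), evals.mean(axis=1))

# Within-class correlation: center each class at zero, then pool students
# (the "reciprocity" framing).
gw = (grades - grades.mean(axis=1, keepdims=True)).ravel()
ew = (evals - evals.mean(axis=1, keepdims=True)).ravel()
within = corr(gw, ew)

print(f"between-class r = {between:.2f}")
print(f"within-class  r = {within:.2f}")
```

On this synthetic data the between-class correlation comes out considerably larger than the within-class one even though only a student-level mechanism was simulated, which is the kind of confounding the later methodological discussion warns about.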

Unfortunately, Greenwald and Gillmore (1997) did not differentiate clearly between a leniency and a reciprocity effect. According to Marsh and Roche (2000), Greenwald and Gillmore's test of the attribution effect was not complete. Further, Marsh and Roche did not find the leniency effect in their own study. Clayson (2002) conducted a between-student study of 700 students. The correlation between the evaluations after the class was completed and actual grades was r = 0.500 (Eval = 0.825*Grade - 0.262; F(1,698) = 232.65, p < 0.001). Knowing the actual grade given accounted for 25% of the evaluation variance. The relationship was stronger for instructors who were women than for men (r = 0.600 for women, r = 0.460 for men; Z = 2.35, p = 0.019). There was a perfect rank-order correlation between the grades students received and the average faculty evaluation given at each grade level. The rank order was maintained between grades and even within grades (i.e., the average faculty evaluation given by students receiving a C+ was higher than the average evaluation given by students receiving a C). The relationship was maintained across the sex of the instructor, the academic area, and the area of the country. Another study found that even after ten weeks of instruction, students continued to change their evaluations systematically with changes in their expected grades. The effect was found to be universal and not related to the instructor's general level of leniency, or to student demographics, including age, gender, or even GPA. In other words, the very best and the very worst students (as measured by their own demographics) reacted in a similar manner. The methodology of this study ruled out any explanation of a reciprocity effect being an artifact of uncontrolled variables (Clayson, Frost & Sheffet 2006). These studies strongly suggest that the association is largely due to a reciprocity effect. Large correlations are consistent with Greenwald and Gillmore's (1997) finding of an average standardized path coefficient of 0.45 at the University of Washington. A correlation of this magnitude indicates that if an instructor were to lower grades by one standard deviation, s/he could expect a decrease in the evaluation of almost half a standard deviation. Evaluations are highly negatively skewed, so a small change can have dramatic effects. For example, at the writer's university, a half standard deviation drop in the evaluations would take a 90th percentile teacher to the 40th percentile.

Methodological Issues

The magnitude of the grade-evaluation relationship depends upon how it is measured. The correlation for between-class data ranges from 0.1 to about 0.3 (Feldman 1997), while within-class correlations are slightly smaller, if measured before the class is finished. If data are collected after the class is completed, the correlations between grades and evaluations become more robust. The writer looked at 2,500 evaluations taken from the web site Teacher Review (www.teacherreview.com) and found a correlation of 0.49 between the grade given to the student and the evaluation given by the student. Another sample of 750 evaluations produced a correlation of 0.47. One study found a correlation of 0.44 between given grades and evaluations that remained highly significant after other learning and motivational variables were controlled (Clayson 2004).
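The comparison of the correlations for female and male instructors reported above (r = 0.600 vs. r = 0.460, Z = 2.35) is the kind of result produced by a Fisher r-to-z test of two independent correlations. A minimal sketch follows; the subgroup sizes are hypothetical, since the male/female split of the 700 respondents is not reported here.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two independent correlations."""
    z1 = math.atanh(r1)  # Fisher transform of r1
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Hypothetical split of the 700 respondents; the actual subgroup sizes are
# not given in the text.
z = compare_correlations(0.600, 200, 0.460, 500)
print(f"Z = {z:.2f}")
```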

When looking at grade effects on the evaluations, it is important to step back and review the basics. Classes do not fill out student teaching evaluations, students do (Clayson 2009). While Marsh and Roches (1997, 2000) insistence that classes are the appropriate cases for studying leniency is accurate, the requirement comes with inherent statistical problems that make such studies problematic. Insisting on between-class data as cases increases the probability of Type I errors. Leniency can appear to exist when it does not and the findings are severely confounded by reciprocity effects. The magnitude of the leniency effects can be shown to be influenced by the variance of the grades given and the size of the class. As grade variance and class size increases, the correlation also increases in size even when no association exists. Grade boundary effects also come into play. Given grades typically are calculated on a zero to four scale, and many modern institutions give almost half of their grades in the A range. Classes that award most of their grades near a boundary will find smaller grade/evaluation associations than classes whose average grades are more in the middle of the grade range. Larger introductory classes where grade inflation is controlled and students show wide differences in ability would mathematically expect larger apparent grade/evaluation effects than smaller, more homogenous classes with small variation of high grades. This would be true even if the true leniency effect in both types of classes was identical (Clayson 2007). In other words, researchers looking at a large number of intro classes in business would expect to find a larger leniency effect than a similar study of education classes. The difference could simply be a statistical artifact, with no evidence about whether the true condition is more likely to be the business or education classes. Only an analysis of the within-class reciprocity effect could clarify the issue. Summary 1. The correlation between grades and the evaluations has traditionally been assumed to be between 0.10 to 0.30. Recent research has found an even higher value of between 0.45 to 0.50. 2. It appears that the cause of the relationship is complex. There is some evidence that part of it is due to attribution. In other words, a student receiving a grade lower than was expected may blame it on the instructor, but take credit themselves when receiving a good grade. However, the reciprocity effect is the only hypothesis that conforms to all the data. In blunt terms, students have a tendency to give an instructor what they received from that instructor. 3. The results of studies about the grade/evaluation effect are variable and are subject to methodological error. The metric of grading must be carefully selected. Reciprocity effects confound leniency effects and grading averages and variability along with boundary effects create statistical artifacts. Further, measures of expected grades give different results than do actual grades, which can modify results based on measurement before or after a class is completed.
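The claim above, that apparent class-level leniency effects grow with class size and grade variance even when no class-level effect exists, can be checked with a small simulation. In the sketch below the only built-in effect is a student-level (reciprocity-style) response to one's own grade; the spread of class mean grades and the class size are then varied. All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

def between_class_r(n_students, grade_sd_between, n_classes=200):
    """Apparent class-level grade/evaluation correlation when the only real
    effect is each student responding to his or her own grade."""
    mean_grade = rng.normal(3.0, grade_sd_between, size=n_classes)
    grades = mean_grade[:, None] + rng.normal(0, 0.6, size=(n_classes, n_students))
    evals = 0.4 * grades + rng.normal(0, 0.5, size=grades.shape)
    return np.corrcoef(grades.mean(axis=1), evals.mean(axis=1))[0, 1]

for n in (10, 30, 100):
    for sd in (0.2, 0.4):
        r = between_class_r(n, sd)
        print(f"class size {n:3d}, between-class grade SD {sd}: apparent r = {r:.2f}")
```

With these settings the apparent between-class correlation climbs toward 1.0 as classes get larger, because averaging strips out the student-level noise while leaving the grade-driven component intact.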

Reliability

Are the evaluations reliable? In general, the evaluations are considered to be remarkably consistent. The questions answer is, however, dependent upon what is meant by reliability. The evaluation instruments are extraordinarily reliable in a statistical sense (Feldman 1983, Cashin 1995, Marsh & Roche 1997). Some evaluation forms have created inter-rater reliabilities between 0.70 to 0.90 (Sixbury & Cashin 1995), leading some consultants to suggest that forms creating coefficients less than 0.70 be viewed with caution (Cashin 1995). Reliability can also be looked at in terms of stability. A longitudinal study (Overall & Marsh 1980) compared ratings at the end of a course with ratings given by students a year later and found a correlation of 0.83. This finding is reinforced by the findings in a careful investigation by Marsh and Hocevar (1991). They reviewed a study by Feldman (1983) in which student evaluations of teaching effectiveness were found to be so consistent that they were only weakly related to seniority; in other words, they seemed not to change with experience. Marsh and Hocevar (1991) investigated this suggestion by looking at 195 teachers from 31 different academic departments who taught 6024 classes over a period of 13 years. They found in their longitudinal study that for both undergraduate and graduate level courses, there were almost no changes over time in the means for any of nine content-specific dimensions, overall course rating, or instructor rating. The results between ratings and experience were essentially linear and not the result of a "U" type relationship. The findings were consistent for teachers who had little, moderate, or extensive amounts of experience at the beginning of the study. They concluded that over a 13-year longitudinal period, teaching effectiveness as perceived by students was stable. Further research, however, has indicated that the consistency of the evaluations is over class averages and not necessarily from agreement among individual class members. It has been found, for example, that the correlation between the same instructor teaching the same class is between 0.70 and 0.80, but for the same instructor teaching two different classes, the correlations range from 0.33 to 0.48 (Gillmore, Kane & Naccarato 1978). These findings indicate that knowing present evaluations from a class can account for about 50 to 60% of the variability in the evaluations if the same class is taught again by the same instructor. The knowledge of instructors present evaluations would account for only about 10 to 20% of the variability of teaching a new class. This is strong evidence that not all the variation accounted for in the correlation analysis can be attributed to the instructor. At the very least, some of it is attributable to a class effect. Gillmore et al. (1978) found that about 40% of the variation in their study could be attributed to the teaching effect, six percent to a class effect, and the remaining 54% to unexplained interactions. Since these correlations are from class averages of the same instructor, the data implies that a considerable amount of inter-class (i.e. individual student rater) variability exists. To minimize these effects, Marsh and Roche (1997) suggest that using data from individuals instead of class averages is inappropriate.

Yet this approach covers up an interesting phenomenon. Follman (1984) reported that a consistent 20% of teachers are rated as both the best and the worst instructor by members of the very same class. This finding has been reported consistently from grade school through graduate programs. In other words, the evaluations seem to have a certain amount of between-class reliability, but suffer from within-class unreliability. Also, it has been found that students exposed to identical stimuli and standardized procedures produced data of such wide variation that some statistical techniques became unusable (Clayson & Frost 1997). Extreme reliability has created another interesting problem. Langbein (1994) used an established instrument to measure student satisfaction with instruction. The form had 19 questions, which formed one factor and resulted in a Cronbach's alpha of 0.99 when summed. Yet sixteen different student, faculty, and situational factors together accounted for only approximately 12% of the total variance of the evaluations. This implies that random measurement error is not likely to be the major source of the unexplained variance. Students are responding to something consistently, but it is not readily apparent what that something is, or whether, when found, the consistency would be compatible with instructional theory. In other words, the instruments may be suffering from a lack of discriminant validity. One study looked at students from across the United States (Clayson 2002). Exactly one-third of the professors had evaluations with at least one student awarding an A and at least one student giving an F, and 60% of the instructors had at least one instance in which a three-letter grade range (A to F, A to D, B to F) was found. Nine percent of the instructors got the same evaluation grade from all respondents. If X represents the probability, per student, of an instructor obtaining an A-to-F spread, and Y the probability of not obtaining one, so that (X + Y) = 1, then from these data X ranges from 2 to 9%, with a regression-based average of 5%. Extrapolating to classes of 30 students, it would be expected that 79% of instructors would receive an A-to-F range from the same class [1 - 0.95^30 ≈ 0.79].

Summary
1. There is strong consensus that the evaluation process is remarkably consistent for individual instructors.
2. The consistency, however, does not take into account a surprising amount of within-class difference between raters.
3. The strong long-term consistency of individual instructors' ratings may be due to a rather consistent percentage of students giving good and/or bad evaluations, not to a strong consensus among raters.
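A quick check of the extrapolation above, which implicitly assumes that students act independently, each with a 5% chance of producing the extreme spread:

```python
# If each student independently has a 5% chance of producing the extreme
# (A-to-F) spread, the chance that a class of 30 contains at least one such
# rating is 1 - 0.95**30.
p_per_student = 0.05
class_size = 30
p_at_least_one = 1 - (1 - p_per_student) ** class_size
print(f"{p_at_least_one:.3f}")  # ~0.785, i.e. roughly 79% of classes
```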

Are Students Truthful

Do students tell the truth when they fill out the evaluations? The answer to this question is both simple and complex. Generally, students seem willing to report information on the evaluations that they know is not true (Clayson & Haley 2011). Whether this is malicious, or even done consciously, is not always evident. It is known that there is a very strong halo effect on the evaluations (Tang & Tang 1987; Orsini 1988). As a demonstration of this, in an analysis of 506 separate evaluations from a large on-line database, the writer found a correlation of 0.64 between ratings of the instructor's handwriting and ratings of adequate office hours. Researchers looking at almost 7,000 professors found a correlation of 0.64 between instructors' "Hotness" and the quality of the class (Felton et al. 2004). The honesty problem was demonstrated in an experiment conducted by Stanfel (1995). The evaluation used by his university contained several questions that could be corroborated: "The instructor explains clearly to students how they are evaluated," and "Tests and written assignments are graded and returned in a reasonable period of time." Stanfel explained his evaluation procedures to the students and then gave them a quiz to test their knowledge of his instruction. One hundred percent of the students were able to correctly outline exactly what his evaluation procedures were. All assignments were returned at the first possible opportunity, and students signed documents indicating that they had received their graded work. In other words, students could not have received their graded material any earlier, and each student acknowledged that fact. How did the students later respond on the instructor evaluations? Only 3% of the students strongly agreed that the instructor had explained his evaluation procedure; sixty-four percent either disagreed or strongly disagreed. Only 3% strongly agreed that assignments were handed back in a reasonable period of time; over 46% disagreed or strongly disagreed. Stanfel concluded that the students either forgot what had been clearly presented to them, did not understand what the evaluation was asking, or intentionally made false responses. He maintains that the last is the most logical explanation, but the other explanations would also call into question the accuracy of student responses. Sproule (2000) recounts how 50% of his students in one class would not acknowledge that work was returned reasonably promptly when all work was marked and returned the very next class period. The examples are not always negative for the instructor; an anecdotal account is warranted here. The writer once taught at a small elite private school. One term he taught an elective course to seniors. No student needed to take the class, and all students were thoroughly familiar with the instructor after three and a half years of interaction. No student who disliked the instructor would have taken the course. The class was held directly after the lunch hour, across the campus from the instructor's office. That term the instructor would meet with the campus chaplain and a sociology professor for lunch, and a vigorous debate was commonplace. The instructor realized that he was showing up to class late at almost every meeting. On the college's SET was an unambiguous question: "Does the instructor show up to class on time?" As a test of the accuracy of student responses, the instructor continued to come to class a few minutes late for the remainder of the term.
On the evaluation, 100% of the students reported that the instructor showed up to class on time.

Some evidence suggests that a majority of students, if asked, will evaluate class presentations and instructors that dont exist. Reynolds (1977) reported an incident when nearly 1000 students completed an evaluation of a course in which there were ten invited speakers. One speaker never appeared. Nevertheless, 80% of the students evaluated the non-existing lecture, ranking it worse than six, but better than three. Some of the same students were shown a film in class, while others did not see the film. About 55% of the students who did not see the film evaluated it anyway, giving it a slightly above average evaluation. Reynolds notes, with tongue-in-cheek, an interesting strategy that could be adopted if the evaluations are to be taken at face value: further reflection reveals a subtle means of cost cutting without curtailing academic programs. Without prior announcement, a few lectures or films can be quietly eliminated. Gradually, on a random basis, further lectures can simply disappear, first in one department and then in another. Who knows how long this may go on before anyone notices? (Reynolds 1977, p. 83) Stanfel (1995) makes an interesting observation comparing most institutions policies towards test taking and the evaluations. It would be highly inconsistent of university officials to act on the one hand as if students were likely to seize every opportunity at dishonesty and on the other, to attribute to their opinions about course and instructor quality an unimpeachable purity (p. 121). A recent study found that students believe that 30% of all evaluations contain scores and/or written comments that the responding student knows is untrue. Furthermore, the majority of students did not feel that purposeful misreporting of an instructor was a form of cheating (Clayson & Haley 2011). Summary 1. There is a strong halo effect indicating that students will ignore the actual content of a question or statement on the evaluation by answering them in a manner consistent with a more global student concern or issue. 2. Research is surprisingly limited on this topic, but the research that does exist is consistent in finding that students will purposely falsify answers.

Validity
Are the evaluations valid? This is a difficult question to answer. There are a number of issues that have not been resolved.

What is the purpose of the evaluation? An instrument found to be valid in one application does not necessarily ensure validity when that instrument is applied to other uses. SET instruments have a number of applications. They could be utilized for: 1) instructional improvement, 2) evaluation of performance, with personnel and managerial implications, and/or 3) necessary feedback to comply with legislative, administrative, or student demands. One question that needs to be addressed early in any discussion of validity is the use to which an instrument will be put.

Lack of Actionable Feedback

Research into the evaluations has shown that the instruments generally fail to improve instruction. Although it has not been investigated as thoroughly as some other aspects of SET, most research has failed to find improvement of instruction from using feedback from the instruments (as measured by the same instruments) over a period of time, or improvement in the instructors' ability to modify or even to understand their own performance when the instruments have been used for feedback. Cohen (1980) used a meta-analysis to look at faculty who received feedback from evaluations at midterm and then at the end of the term. Significant improvements (as compared to control groups in which instructors did not receive feedback) were found with only two of six outcomes. In addition, students whose instructors received midterm rating feedback did not learn more than students whose instructors did not receive feedback. Miron (1988) found that there was only a modest relationship between instructors' self-ratings of their classes and the ratings given by students. Further, various training in teaching did not improve the instructors' accuracy, nor did years of teaching experience. In fact, beginning teachers were more accurate in judging their own instruction (as seen by students) than were more experienced teachers. Another study found that new teachers typically get higher scores than more experienced instructors (Carrell & West 2010). Finkelstein (1995) reviews some literature to suggest that the instruments will not lead to instructional improvement unless skilled consultants help the faculty interpret the ratings, a point reinforced by even ardent defenders of the SET system (Marsh 2007). The literature seems to suggest that the instruments may be valid, but that the experience that faculty members have with these instruments is invalid, especially if the purpose of the evaluation is the improvement of instruction. This raises a logical problem, which led Becker and Watts (1999) to ask, "How can something that has little or no information value for the agent have great information value for the principal?" [p. 347]. Clayson and Haley (2011) asked students several interesting questions. When asked, "Do you think that instructors who are evaluated become better teachers?" two-thirds of the students said yes. The students were also asked if they thought that the faculty and administrators read the written comments on the evaluations. There was no statistically significant relationship between the students' opinion about the usefulness of the evaluation and whether or not they thought the comments were actually read. In other words, the students believed that SET improved instruction in some global sense, but did not relate that to anything specific. This seems to sum up the feelings of many faculty and

20

administrators who continue to suggest that the SET system is necessary by alluding to vague and ethereal performance improvements which they have never actually measured. Evaluation of Performance Even though the evidence for teaching improvement utilizing SET is weak, educators can still ironically salvage the system because of unresolved philosophical and definitional issues. Educators cannot agree whether teaching should be measured by what it produces, or if effective teaching should be demonstrated by the acts of teaching rather than the consequences of those actions (Abrami, dApollonia & Rosenfield 2007). Effective teaching could be measured in terms of student performance, in which case the unit of measure should be student behavior, or as a process, i.e., something the instructor does. The latter is the most commonly utilized paradigm in SET systems, but that choice does not always appear to have been selected based on sound pedagogical concerns. There are a number of philosophical issues involved in this choice. If, for examples, students are customers, then measuring student satisfaction would be a logical goal of the evaluations. If students are products produced by the institution, then the instruments should be centered on student performance. If students are partners, as suggested by Clayson and Haley (2005), then evaluations centered only in a classroom would be inadequate and generally invalid. Combining scales on the evaluations does not resolve the issue. Extracting information about several objectives at once increases the probability of obtaining invalid measures because it raises the issue of diagnostic validity, or the ability of the instruments to clearly identify any given dimension of instruction as distinct from others, something SET arguably does not do well (Kember & Doris 2008). This is related to other concerns. It has been shown that when the instructional process is the object of measure, students are likely to invoke teacher attributes rather than instructional actions (Moore & Kuol 2007), and SET is often utilized as if the scales produced an effective representation of comparative professor value even though the SET contains measures made about areas beyond the instructors control (Mason, Steagall & Fabritius 1995). Undergirding these concerns is a fundamental problem. No one has given a widely accepted definition of what good teaching is (Germain & Scandura 2005; Marsh & Roche 1997). Consequently, instruments designed to measure the construct are hampered by definitional and methodological restraints. Further, few studies can be found which involve student input as part of the analysis (Kember & Leung 2008). Qualitative data is generally absent. Surprisingly, students are generally not asked how they interpret questions or what they are thinking when they fill out the evaluation instruments. In the extensive debate about the nature of the grade/evaluation relationship between Marsh and Roche (1997, 1999, 2000) and Greenwald and Gillmore (1997) that spread over several journals for several years, there is no evidence that anyone simply asked the students what they do or think when they respond to the instruments. The halo effect is another example of the problem. Students have a tendency to ignore individual items on the evaluation instruments to answer all items under a single rubric. The halo effect seems to be related to some variable that the students believe override
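To make the halo pattern concrete, the short sketch below simulates entirely hypothetical rating data with numpy; it is not drawn from any study cited here. A single overall impression, partly tied to the expected grade, is assumed to drive every item on the form, and the script prints the statistics a halo check would typically examine: the mean inter-item correlation, the share of item variance carried by the first principal component, and the correlation of the collapsed overall rating with the expected grade.

import numpy as np

rng = np.random.default_rng(0)
n_students = 300                                   # hypothetical respondents

# A single overall impression drives every answer; it is partly
# tied to the grade the student expects (all values are simulated).
expected_grade = rng.normal(3.0, 0.7, n_students)
impression = 0.6 * (expected_grade - 3.0) + rng.normal(0.0, 1.0, n_students)

# Six SET items, each the shared impression plus item-specific noise.
items = np.column_stack(
    [impression + rng.normal(0.0, 0.5, n_students) for _ in range(6)]
)

R = np.corrcoef(items, rowvar=False)               # 6 x 6 item correlation matrix
mean_r = R[np.triu_indices_from(R, k=1)].mean()
eigvals = np.linalg.eigvalsh(R)[::-1]              # largest eigenvalue first

halo_index = items.mean(axis=1)                    # the collapsed overall rating
r_halo_grade = np.corrcoef(halo_index, expected_grade)[0, 1]

print(f"mean inter-item correlation: {mean_r:.2f}")
print(f"variance on first component: {eigvals[0] / eigvals.sum():.2f}")
print(f"r(overall rating, expected grade): {r_halo_grade:.2f}")

With the weights assumed here, the items correlate at roughly .8 with one another, most of their variance sits on a single component, and the collapsed rating correlates noticeably with the expected grade, which is the kind of pattern the halo literature describes.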


Learning

One attempt to establish validity has been the general assumption that students will learn more from a "good" instructor. As Cohen (1981) affirms, "Even though there is a lack of unanimity on a definition of good teaching, most researchers in this area agree that student learning is the most important criterion of teaching effectiveness" [p. 283]. Some researchers have reported finding a positive relationship between learning and student ratings of instructors (Baird 1987; Lundsten 1986; Marlin & Niss 1980). Dowell and Neal (1982) did a careful review of the literature and concluded that the correlation between learning and the evaluations was about 0.20 to 0.26, accounting for about four to six percent of the variance. Nevertheless, prior to the mid-1970s, about half of the previous studies had found no correlation, or a negative correlation, between some measure of learning and the evaluations (Sullivan & Skanes 1974).

Other data, however, are more disturbing. The most widely cited and debated example was reported by Rodin and Rodin (1972) in Science, who found a negative correlation of -0.75. A study from the UK, utilizing the evaluations of over 30,000 students, found a correlation no different from zero between the evaluations and student learning in economics (Attiyeh & Lumsden 1972). A recent meta-analysis (Clayson 2009) found no published findings after 1990 that contained a statistically significant positive association between learning and the evaluations. Student learning has been found to be negatively associated with mean prerequisite course grades and consequent SET (Johnson 2003). In a study of accounting students, it was found that a statistically significant negative relationship existed between student evaluations of their instructors in introductory classes and how well they performed in a subsequent class, even when controlling for student GPA and ACT scores (Yunker & Yunker 2003).

In an interesting study from the field of physics, what students knew about certain principles in physics was compared before and after a class was completed. The researchers (Halloun & Hestenes 1985) found that the common-sense beliefs held by the students at the beginning of the term were significant predictors of their performance in class, but that class instruction induced only a small change toward correcting errors. The students' failure to improve was not related to their instructors' class evaluations. The researchers concluded that "the basic knowledge gain under conventional instruction is essentially independent of the professor" [italics theirs]. These findings were re-confirmed in a large study of over 6,500 students (Hake 1998).

In Clayson's (2009) meta-analysis, it was found that the association between SET and student learning decreased as measurements of learning became more objective, and as research methods of controlling for secondary variables became more sophisticated. The research concluded that while there was a relationship between SET and perceived learning, there was none between objective measures of learning and SET. If learning is seen as an improvement in subsequent performance, then recent findings suggest that learning may actually be negatively related to SET.


As noted above, a study of accounting students found a significant negative relationship between student evaluations of their instructors in introductory classes and how well they performed in a subsequent class (Yunker & Yunker 2003). Johnson (2003), utilizing a university-wide database, reported that "stringent grading is associated with higher levels of achievement in follow-up courses" (p. 161), but that stringent grading was strongly associated with lower evaluations. At the United States Air Force Academy, students in calculus classes, in which learning can be objectively measured, gave higher evaluations to instructors of classes in which they were getting higher grades, but lower evaluations to instructors who produced students who did well in subsequent calculus classes. The authors concluded, "the correlation between introductory calculus professor value added in the introductory and follow-on courses is negative. Students appear to reward contemporaneous course value added but punish deep learning" (Carrell & West 2010, p. 429). Consistent with this, they found that inexperienced instructors got better evaluations in introductory classes than did more seasoned instructors, who produced students who did better in subsequent classes.

There may be an exception. When students enroll in a class designed to prepare them to pass an external instrument of importance, students may appreciate teachers who prepare them well. For example, accounting students take certain classes to prepare them to take the CPA exam. One study found a small but statistically significant association between the course evaluation and student grades on an external posttest that had serious consequences for the students (Beleche, Fairris & Marks 2010).
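Several of the findings above (Cohen 1981; Clayson 2009) come from meta-analyses that pool study-level correlations between SET and some measure of learning. A minimal sketch of the standard fixed-effect pooling through Fisher's z-transform is given below; the correlations and sample sizes are invented for illustration and reproduce no published analysis.

import numpy as np

# Hypothetical study-level correlations between SET and learning,
# with their sample sizes (illustrative values only).
r = np.array([0.43, 0.20, -0.04, 0.11, -0.30])
n = np.array([120, 450, 2100, 300, 90])

z = np.arctanh(r)            # Fisher z-transform of each correlation
w = n - 3.0                  # weight = 1 / Var(z) = n - 3

z_pooled = np.sum(w * z) / np.sum(w)
se = np.sqrt(1.0 / np.sum(w))

r_pooled = np.tanh(z_pooled)
ci_low, ci_high = np.tanh([z_pooled - 1.96 * se, z_pooled + 1.96 * se])

print(f"pooled r = {r_pooled:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")

Because studies are weighted approximately by their sample sizes, a single small study with an extreme correlation, such as the -0.75 reported by Rodin and Rodin (1972), carries relatively little weight against larger near-zero results.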

Attribution Concerns

There is also a problem of attribution. Tang and Tang (1987) found that students would transfer their own behavior to the instructor. For example, students who reported reading the textbook before coming to class rated the instructor's knowledge higher. In other words, students associated their own level of participation and involvement in the classroom with faculty performance. They concluded that the evaluations might give a better indication of the students' self-concept than of the instructor's actual performance.

Consistent with this, there is evidence that students ignore their own contribution to a class when making an evaluation of the course. Gremler and McCollough (2002), using structural modeling, reported that the students' rating of their adequacy as a student, along with their satisfaction with their own effort and participation in class, were only weakly related to course evaluations. Grimes, Millea, and Woodriff (2004) reported that students with an external locus of control gave their instructors lower evaluations than students with an internal orientation. Benz and Blatt (1996) found in an analysis of student comments that students satisfied with their grades described their experiences with "I" (for example, "I get good grades for doing good work"), while students dissatisfied with their grades tended to use "he" or "she" (for example, "He never gives good grades").


Types of Validity

SET instruments are typically utilized to measure a hypothetical construct: i.e., good and/or effective teaching. Just as a person can be said to have a certain personality structure based on the results of paper-and-pencil tests, instructors are said to be good or bad teachers based on the evaluations. Unless users of SET want to fall into the tautological trap of suggesting that good teaching is whatever the instrument says it is, the evaluations must be shown to exhibit a valid relationship with the construct they attempt to measure. But since hypothetical constructs have no form or dimension that can be objectively measured, a number of different associations must be established. This raises the possibility that an instrument may be valid in certain ways and simultaneously invalid in others. This has been the general finding with SET.

Face Validity

Face validity exists when an instrument appears to be measuring what the respondent thinks it should be measuring. It is related to the respondent's experience with the instrument. Since SET instruments are created and sanctioned by the institution, ask questions generally associated with instruction, and are administered in formalized manners, it can be assumed that SET has face validity.

Content Validity

Content validity is said to exist if the questions on an instrument can be logically said to cover the domain of the construct that the instrument is intended to measure. Since most institutions do not have a clearly defined notion of what they are attempting to measure (Onwuegbuzie et al. 2009; Ory & Ryan 2001), the content validity of SET is suspect. It has been pointed out that in many institutions there is a clear gap between what developers of SET consider to be characteristics of effective instructors and what students believe (Onwuegbuzie et al. 2009). Chonko (2004), for example, listed six characteristics that students expect from a good instructor, but students did not put the same importance on these aspects of instruction as did professors. For example, the proportion of students who thought that enthusiasm was an important characteristic was more than twice that of instructors (38% of students vs. 18% of instructors). The same discrepancy was found in other research for communication skills (34% vs. 20%) and caring/empathy (24% vs. 14%) (Kelley, Conant & Smart 1991).

When Clayson and Haley (2011) asked students, "Do you think that the questions on the evaluation allow you to express what you really want to say on the evaluations?" 45% of the students said no. One source began their own investigation by stating, "Researchers do not know if what they are asking students to evaluate has any relative importance to them when assessing their course or instructor" (Hills, Naegle & Bartkus 2009, p. 297). Even when learning is defined as the end result of effective teaching, definitions of learning must be carefully considered (Clayson 2009). This problem is so common and so difficult to resolve that some critics have claimed that the lack of specific definitions is purposeful.


Moore and Flinn (2009) state that "many people do not want a definitive measurement of student learning, and they will fight long and hard to prevent one from coming into being" [p. 102].

Concurrent and Predictive Validity

SET results are related to other measures of teaching effectiveness and seem to be predictive of measures made by former students, self-reports, and even trained observers (Feldman 1989; Howard, Conway & Maxwell 1985; Marsh & Dunkin 1992; Onwuegbuzie et al. 2009).

Construct Validity: Convergent, Discriminant, and Divergent

As could be indicated by the finding of concurrent validity, SET appears to have convergent validity in that the results are related to other measures that could logically be assumed to be associated with the domain of effective teaching. Convergent validity is not sufficient, however, to establish construct validity. As an extreme example, a measure that was correlated with everything would, by default, be related to the characteristics of any randomly selected construct. Consequently, a demonstration of construct validity must also show that the instrument is not associated with aspects of unrelated constructs (divergent validity), and that it shows differences between the construct of interest and other constructs that are distinct but similar (discriminant validity).

After reviewing the results of a study of over 2,000 business students using path analysis, Marks (2000) concluded that "student evaluations lack discriminant validity. No matter how reliable the measures, student evaluations are no more than perceptions and impressions" [p. 117]. Greenwald and Gillmore (1997) earlier pointed out that while evaluations of instructors have convergent validity, they are lacking in discriminant validity. More recent critics suggest that the evaluations also lack divergent and outcome validity (Onwuegbuzie et al. 2009; Sproule 2002). In other words, SET is correlated with attributes that a concept of good teaching would be expected to be related with, but it is also correlated with numerous attributes with which it should not be related.
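The convergent/discriminant distinction can be illustrated with a small simulation (again with hypothetical data and numpy; this is not a reanalysis of any cited study). The global SET rating below is generated mostly from instructor likeability and only weakly from an objective learning gain, and the printed correlations then display the pattern the critics describe: a weak link to the construct of interest and strong links to constructs from which the rating should be distinct.

import numpy as np

rng = np.random.default_rng(1)
n = 500                                        # hypothetical class sections

likeability = rng.normal(size=n)               # instructor personality
learning = rng.normal(size=n)                  # objective learning gain
expected_grade = 0.5 * likeability + rng.normal(size=n)

# Assumed data-generating weights: mostly likeability, little learning.
set_rating = 0.8 * likeability + 0.1 * learning + rng.normal(0.0, 0.5, n)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"r(SET, learning)       = {r(set_rating, learning):.2f}   (convergent check)")
print(f"r(SET, likeability)    = {r(set_rating, likeability):.2f}   (discriminant check)")
print(f"r(SET, expected grade) = {r(set_rating, expected_grade):.2f}   (discriminant check)")

Construct validity would require the first correlation to dominate the other two; with data generated under these assumptions it does not, which mirrors the pattern Marks (2000) and Greenwald and Gillmore (1997) describe.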


This lack of discrimination is partially a cause of, and is complicated by, the halo effect (Orsini 1988). Further, teaching is a complex activity and the construct of good teaching is multidimensional (Feldman 1977, 1989; Marsh & Roche 1993). Convergent validity and discriminant and divergent invalidity would be expected if the evaluations were effectively measuring one, or a few, of these dimensions, but not all. Some researchers have suggested that what the evaluation most effectively produces is a likeability scale (Clayson & Haley 1990; Marks 2000; Tang & Tang 1987). Sproule (2002) insists that the problems with construct validity cannot be remedied in a SET system because the model utilized is underdetermined in at least two ways: SET does not provide a unique or an unequivocal explanation of a body of data, and two or more models of SET can result in equally plausible explanations of the same data.

The lack of discriminant and divergent validity can result in situations that are almost farcical. As indicated previously, scales on SET instruments create a strong association between teaching and "hotness", and large correlations between handwriting clarity and office hours. In another case, students in three classes taught by the same instructor were required to attend discussion sections taught by a teaching assistant. The faculty instructor was evaluated by the students in the discussion classroom rather than in class. Half of the students received chocolates prior to filling out the evaluation and half did not. The SET administrator emphasized that he was the source of the chocolate, not the instructor. Eight of nine scales about the instructor were rated higher by the chocolate group than by the control group. The chocolate group stated that the class was more intellectually challenging, that students were encouraged to participate more in class, that the class was better compared to other classes, and that the instructor was friendlier toward students than did the control group (Youmans & Jee 2010).

In still another example, two classes were combined into one. One class was composed of honors students and the other of second-semester majors. Both groups were in the same classroom and received the same lectures from the same instructor. For administrative purposes, the evaluations were separated as if there had been two classes taught. The honors students rated the instructor and class higher than the majors on 19 of 22 measures, which may make sense, but the honors students also stated that the exams were more representative of the material covered, and that the instructor demonstrated more enthusiasm, both measures that should not be influenced by anything except being in the class (Oliver-Hoyo 2008).

Nomological Validity

Nomological validity is the degree to which a construct behaves as it should within a system (network) of related constructs. This network would include the theoretical framework for what one is trying to measure, an empirical framework for how it is to be measured, and specification of the linkages among and between these two frameworks (Cronbach & Meehl 1955). A number of attempts have been made in business education to investigate such a network (Clayson & Haley 1990; Marks 2000; Paswan & Young 2002). Clayson and Haley's (1990) study is an early attempt. Students were asked to describe their best instructor. These data were then utilized to identify underlying factors that were then linked to an actual SET instrument utilizing structural modeling techniques. It was found that a factor that most closely represented faculty personality had almost double the influence on the instrument of any other factor. They concluded that the evaluations created what could most accurately be described as a likeability scale.

As shown in this example, these attempts to establish validity have encountered two difficulties. First, they attempt to establish a network from students' perceptions of teaching, either by gathering primary data and/or by looking at SET instruments that already exist. It could be argued that instead of finding that A is related to B, the procedures simply demonstrate that A is related to A. At the very best, these findings can demonstrate that the SET instrument is compatible with student perceptions. This occurs because of the second difficulty: again, the question arises, what is good or effective teaching? Predictably, instead of finding a logical nomological network, the research reaffirms that SET lacks validity if a standard of teaching excellence is not present in student cognitive structures and perceptions.

Utilitarian Validity

An instrument could, hypothetically, be useful as a tool to achieve an end irrespective of any validity with respect to related theoretical constructs.


For example, a politician may get elected by making claims about himself which cannot be validated, but which, nevertheless, allow him to achieve office. If SET is going to be utilized as an evaluation tool even with validity issues, then we would be justified in asking if the measures are of value in achieving some goal, and if that goal is valid for education (Clayson 2008). Like other standards of validity, this could also result in contradictory findings.

There are real and compelling reasons why SET is universally utilized. 1) It provides a method of providing feedback about the quality of instruction, which is increasingly demanded by accreditation and funding bodies. 2) It simplifies administrative decision making. 3) It involves the student in a customer-like relationship with the institution and with its instructors. 4) It is believed by many administrators that the instruments are useful in identifying the very best and the very worst instructors. One could argue that SET has good utilitarian validity in an administrative context.

However, the balance of evidence seems to suggest that on a more global level, the utilitarian validity of SET may be questionable. In other words, the total negative effects created by SET in the institutional and educational process may outweigh the total positive good that could be achieved. This conclusion is suggested by inspecting the following issues:

1. Professors are generally considered to be experts in their fields of study. The SET process implies, at least in part, that a student is in a position to evaluate the content of that expert. This implication is not only illogical, it lowers the perception of the status of the instructor. The author recently took an advanced statistics course from two internationally known experts. No one expected that we, as students, would evaluate their class, and no evaluation was offered to us.

2. There are negative consequences of the procedure utilized for SET. Gray and Bermann (2003) reviewed how the evaluations are administered and noted that the faculty member is explicitly forbidden to touch the evaluation sheets after they are completed. This procedure, they claim, tells the students that the teacher is more than likely to be a cheat and a sneak, who will cook the books if given a chance. Both the students and teacher pretend not to notice the shaming involved, but it is palpable in such a situation (p. 56). Although the evaluations are utilized to make important administrative decisions that can have lasting effects on careers, less care is typically called for in the process than would be required to use human subjects for studies at the same institution (Sproule 2000).

3. Ironically, while the SET process assumes that the faculty may be dishonest, it ascribes to the students impeccable trustworthiness. As shown previously, this assumption is not justified by the evidence.

4. As previously discussed, there is a tendency for SET to create a collapsed scale.


While it is true that carefully selected questions and large forms will create multidimensional scales, it is also true that, without very careful manipulation, the evaluations have a tendency to form one or two dimensions, irrespective of what is asked (Greenwald & Gillmore 1997b; Langbein 1994). In an academic setting, we sometimes have to remind our students that words have meaning. Occasionally, we need to remind ourselves that statistics also have meaning. Philosophically, if SET is a valid method of assessing instruction, then collapsing scales suggests that there is only one correct way to teach. It could be argued that the present evaluation system reinforces teaching done in one particular way and punishes other approaches. Not only does this have negative potential for learning, it also creates a system in which teaching is defined by the instrument rather than the instrument being defined by good teaching. It has a tendency to create a system in which the students, rather than the professionals, define the process, which can change at any time, not by new insights into learning and pedagogy, but by the whim of the students.

5. There is a potential civil rights and diversity issue that has gone largely unaddressed (Clayson & Haley 2005; Haskell 1997). As stated by Edmundson (1997), "A controversial teacher can send students hurrying to the deans and counselors, claiming to have been offended" (p. 45). Research has shown that political science students who perceive their professors to be politically similar to themselves rate courses more favorably. As the perceived political difference gets larger, the evaluations get progressively lower (Kelley-Woessner & Woessner 2006). A writer from a law school echoed these concerns, relating them to race and gender: "Few studies engage the eloquent critiques that individual minority professors have raised, and schools do not seem to have examined their practices in response to these concerns" (Merritt 2007, p. 5). Research at a southern university in the United States found that students rated both the teacher and the course higher for white instructors than for black instructors (Smith 2007). Sinclair and Kunda (2000) reviewed literature showing that people will judge a member of a stigmatized group who evaluates them negatively to be less competent than they would judge a person from a group that is not considered stigmatized. They hypothesized that students would evaluate a female instructor who gave a low grade more negatively than they would a male instructor who gave a low grade. In a study of almost 200 students, their hypotheses were statistically affirmed. It is not surprising that international and cultural differences have also been found relating to gender issues in SET (Al-Issa & Sulieman 2007).

As indicated earlier, there is a very consistent 20 percent of instructors who are rated by students in the same class as the best or the worst teachers that a student has ever had (Follman 1984). This extreme reaction may be due to a number of factors. The instructors who consistently fall into this 20 percent are highly likely to be different in some way, particularly in personality and in teaching styles, but in some instances, perhaps even in geographical or racial backgrounds. Yet diversity is one of the stated goals of almost all of higher education. What happens to instructors who present controversial ideas, or who espouse opinions or ideologies which students may see as incorrect or threatening at any particular moment in time? What if students decide that only white males, or only African-American females, are good instructors? If the evaluations are used uncritically, who is justified, and on what basis, in telling the students they are wrong?
To suggest that students will separate out and set aside their biases unrelated to instruction on some anonymous and subjective measure of instructional proficiency is extraordinarily naïve.

Does it matter?


Even with these problems, the evaluations are now strongly entrenched within the educational establishment. Much of the thinking of the defenders of the SET system, such as the NEA, relies on older findings. Finkelstein (1995), writing in The NEA 1995 Almanac of Higher Education, which summarizes this literature base, states that the evaluation instruments are "highly valid" in at least three respects: they measure what they purport to measure, they are positively related to student learning and academic achievement, and they are highly correlated with colleague ratings. Indeed, Finkelstein cites several sources to say that if bias exists, the effects are relatively small, with patterns that are well recognized and controllable. Some commentators, such as Machina (1987), dismiss any argument about what is being measured and seemingly take a pure consumer orientation. He states that when students underrate a faculty member, that means there is a breakdown in the teaching process somewhere, and it does not mean that the evaluation is an inaccurate measure of the thing it is really geared to measure [p. 22]. Ironically, as pointed out earlier, a common theme in the literature is the lack of student input in the design of the instruments, or any student input in their interpretation (Clayson 1993).

Summary

1. The validity of the evaluation process has been hard to ascertain and is fundamentally complicated by the lack of standardized definitions.
2. Advocates, such as Herbert Marsh, defend the evaluations and their usefulness, and generally interpret findings as being positive on issues of validity.
3. If the purpose of the evaluations is to identify teachers from whom students learn, then recent research indicates that the instruments are invalid. They do not discriminate between instructors whose students are learning and those whose students are not.
4. The evaluations are generally seen as having face and convergent validity. They have been consistently criticized for lacking content, discriminant, and divergent validity. Research on predictive validity has shown mixed results, depending to a large extent on what the instrument is utilized to predict.
5. Nomologically, what the evaluations seem most consistently to measure is likeability.
6. While SET systems have managerial advantages, there are concerns that the system may be detrimental to other educational benefits.

Summary


Does using an extensive system of evaluation improve instruction? A defender of the evaluation process (Cranton 2001) stated, "The fact that student ratings are generally reliable and valid is an outcome of at least three factors: (1) The people who create forms agree, more or less, on what should be included; (2) Students agree, more or less, on what good teaching is within a specific context; (3) Individual differences among students are usually statistically removed" [p. 15]. Even though these statements are carefully worded, their assumptions have not proven to be accurate. Although educators have debated for centuries what good teaching is, there seems to be no concrete definition of what constitutes good instruction. Students do, to a certain extent, agree on what they perceive to be good teaching, but that often breaks down into a questionable perception of learning (Clayson 2005, 2009; Kennedy, Lawton & Plumlee 2002) and/or a perception of the instructor's personality. At the same time, calls to statistically control for individual differences rather than group averages have until recently gone largely unheeded (Clayson 2004; Greenwald & Gillmore 1997).

Stake (2000), a law professor at Indiana University, expressed a strong negative opinion: "Almost anything that can be done to undermine the administrative practice of getting students to evaluate teaching ought to be done. One of my major concerns is that the process of asking students their opinions undermines the trust and faith they need to place in the teacher. Instead of saying, 'Here is a great scholar and teacher; learn from her what you can,' the administration of evaluation forms says to students, 'We hired these teachers, but we are not sure they can teach or have taught you enough. Please tell us whether we guessed right.' As my father likes to say, 'The overexamined life is not worth living either.' In this case, asking students for their opinions focuses the attention of students on the acting and special effects, rather than the message. I think students need to have trust in teachers to learn much from them. The evaluation forms undermine that trust. I also believe that student evaluations can strongly influence the behavior of teachers, and for the worse. I changed my teaching dramatically because I was told by my Dean at the time that I had to keep the customers satisfied if I wanted to get tenure. (And I have not changed back since getting tenure.) I would not contend that the changes I made improved my teaching."

The procedure for gathering the evaluation information has also been criticized for creating distrust between the students and faculty (Gray & Bermann 2003). Armstrong (1998) of the Wharton School listed four concerns he has with the evaluation system. 1) The current system is aimed at changing the behavior of instructors, not students. This could be seen as a second-order relationship in which a change in the instructor's behavior will result in a change in the learning of the students; however, we have little evidence that this actually takes place. 2) Since the evaluations are highly skewed, instructors might begin to tailor their classes to the least common denominator, essentially dumbing down their classes. There is evidence that this has taken place. 3) The evaluations may reduce experimentation by teachers.


This writer once spent almost a decade creating and refining an innovative and nontraditional senior capstone course, which received rave reviews from other faculty, recent graduates, and older alumni. A small group of students in each class considered the course to be unfamiliar and threatening. Because of the highly skewed nature of the evaluations, this small negative group was able to lower the instructor's percentile rankings to the point at which merit pay was threatened. Without administrative support and protection, the instructor finally dropped the course. 4) The evaluation system signals to the students that the responsibility for learning lies with the instructors and administrators, not with them. Armstrong (1998) concludes, "I expect that teacher ratings will reduce teachers' interest in helping people learn, while reducing student responsibility" [p. 1224].

More recently, another critic (Moore 2009) outlined six problems with SET. 1) They are easy to manipulate. 2) They have dubious validity as a measure of student learning. 3) They can be abused by administrators. 4) They can create perverse incentives; faculty can lower standards, or even bring chocolate and other snacks to class. 5) They give unqualified students too much power over faculty. 6) They create stress for faculty being evaluated.

The duh factor

One of the frustrations of studying an area such as SET is the necessity of researching the obvious. When Valen Johnson's (2003) book was reviewed in the Wall Street Journal, the writer, upon reading that students give higher evaluations to teachers who give them higher grades, responded with an editorial "duh." The same response was made by an interested student: "His [Johnson's] exhaustive quantitative study, which I highly recommend to statistics majors and chronic insomniacs like myself, gives bales of empirical support to a number of 'duh' observations" (Gillum 2004). Many readers, for example, appeared to be surprised when it was discovered that students would give purposeful misinformation on the evaluations (Clayson & Haley 2011). The same institutions that require instructors to be present during an examination because of the fear of students cheating were, and are, accepting SET information at face value. It seems illogical that criteria similar to those utilized for testing or for peer review of research are not applied to instruments which many times are utilized to establish merit pay and tenure.

The SET system also creates an interesting lapse in logic that even the peer-review process of major publications does not eliminate. Herbert Marsh, for example, has carefully and thoughtfully produced SET research that has had a strong impact on the field. Yet even he has fallen into a tautological error when reporting SET findings. In one study published in the leading journal of his field, he found that the evaluations are very consistent over long time periods and concluded that "the results provide a serious challenge for existing programs that assume that SET feedback alone is sufficient to improve teaching effectiveness" (Marsh 2007, p. 775). What he actually found was that the scales produced by the SET instruments did not change appreciably over time. The entire debate is about whether those scales have anything to do with teaching effectiveness. If a researcher of Marsh's stature is making these errors, one can only imagine those made by casual users of the system.


This writer once had a conversation with an administrator who stated that a certain faculty member was not teaching well. The evidence was a low average on a SET instrument. When asked what the SET was measuring, the administrator replied, "I don't have a clue." This elicited the next logical question: so we don't know how Professor X is teaching? No, the administrator replied; he wasn't teaching well because his SET was low. Evaluations are so automatically and mechanically applied that this very bright person remained unaware of the contradiction and tautological error that was being made.

Another logical and procedural lapse in the usage of SET comes when the evaluations are challenged in cases involving tenure and promotion. Instead of the institution demonstrating the validity of its instruments, many times the faculty member must show evidence that the instruments are invalid. In almost all other areas of HR interaction, it is the responsibility of the institution utilizing a measuring instrument to demonstrate that it has validity. Even this evidence would not be sufficient if the instrument contained items that were not admissible in a dispute. Suppose, for example, that a company found that a test was 100 percent accurate in predicting a worker's future success by asking a potential employee if their grandmother smoked and if the applicant liked ham sandwiches. The results of that instrument would not be admissible even if it offered almost perfect predictability. This analogy is not extreme; keep in mind what it means to find that SET instruments lack discriminant and divergent validity.

The defenders of the present system have not been deterred. They insist that the evaluation process is valid and useful. Some, like Theall and Franklin (2001), have even found it necessary to postulate why others stubbornly refuse to embrace the procedure. They have suggested that perhaps: 1) the data generated by the evaluations have been misused on a regular basis; 2) negative opinions about the evaluations are related to ignorance; and 3) the notion of someone else determining the quality of one's work is threatening. Finally, the authors suggest that if instructors receive negative evaluations they may develop negative attitudes toward the student ratings, a situation, they contend, that can lead to pathological behavior and serious psychological problems. They do not acknowledge that a negative orientation toward the evaluation process may be based on empirical evidence, logic, and/or years of practical experience.

Given the prestige of certain researchers and the careful analyses of the SET process that created these reputations, it once was reasonable to defend the evaluations, but given the evidence from the past 20 years, that no longer appears to be justified. In fact, why SET is still so vigorously defended has become the question that remains to be investigated. One explanation is the considerable vested interest in the evaluations. Some researchers consult in this area and have even created evaluations for general use. They may feel a vested interest in their continued utilization. In a related fashion, many administrators benefit from the way the SET process is currently structured, as do a large percentage of faculty. It should not have to be said, but 50 percent of all faculty fall into the top half of SET scores, and a full 20 percent are in the top 20 percent. These people benefit from the current system and find reasons to perpetuate its usage.
There are also philosophical issues, inherent in the perceived need for faculty evaluation and even in the need to have a major input from students, that cannot be easily dismissed.


Nevertheless, after over 20 years of research, the writer knows of no persuasive published findings showing that the adoption of the evaluation system has improved instruction, that students are learning more, that employers find better prepared workers, that society has improved, or that faculty or students are happier because of the student evaluation of instruction. This may be seen as an unfair standard to apply to the process, but when evaluations are used by almost all schools, consume considerable resources, and constitute a major (and in some cases, the only) measure of faculty instructional competency, more than just theory and good faith is warranted.

______________________

In the Appendix, the reader can find a compilation of sources offering advice on the usage of SET. Along with the references found below, an interested reader may wish to view an excellent summary of references compiled in an Indiana University-Purdue University document: http://ctl.iupui.edu/common/uploads/library/CTL/CTL289206.pdf


References
Abrami, P., dApollonia, S., & Rosenfield, S. (2007) The Dimensionality of Student Ratings of Instruction: What We Know and What We Do Not. The Scholarship of Teaching and Learning in Higher Education: An Evidence-Based Perspective, (R.P. Perry & J.C. Smart, Eds.), Section II, 385-456. Al-Issa, A., & Sulieman, H. (2007) Student Evaluations of Teaching: Perceptions and Biasing Factors. Quality Assurance in Education, 15(3), 302-317. Ambady, N., & Rosenthal, R. (1993) Half a Minute: Predicting Teacher Evaluations from Thin Slices of Nonverbal Behavior and Physical Attractiveness. Journal of Personality and Social Psychology, 64, 431441. Armstrong, J. S. (1998) Are Student Ratings of Instruction Useful? American Psychologist, 53, 12231224. Attiyeh, R., & Lumsden, K.G. (1972) Some Modern Myths in Teaching Economics: The U. K. Experience. American Economic Review, 62, 429-433. Bacon, D. R., & Novotny, J. (2002) Exploring Achievement Striving as a Moderator of the Grading Leniency Effect. Journal of Marketing Education, 24(2), 4-14. Baird, J.S. (1987) "Perceived Learning in Relation to Student Evaluation of University Instruction." Journal of Educational Psychology, 79(1), 90-1. Basow, S.A., & Spielberg, N.T. (1987) Student Evaluations of College Professors: Are Female and Male Professors Rated Differently? Journal of Educational Psychology, 79(3), 308-314. Basow, S.A. (1998) Student Evaluations: The Role of Gender Bias and Teaching Roles. In L.H. Collins, J.C. Chrisler, and K. Quina (Eds.), Career Strategies for Women in Academe: Arming Athena, (PP. 135156). Thousand Oaks, CA: Sage. Basow, S.A. (2000) Best and Worst Professors: Gender Patterns in Students Choices. Sex Roles: A Journal of Research, (Sept.) [located at FindArticles.com]. Becker, W.E., & Watts, M. (1999). How departments of economics evaluate teaching. American Economic Review, 89, 344349. Beleche, T., Fairris, D., & Marks, M. (2010) Do Course Evaluations Truly Reflect Student Learning? Evidence from an Objectively Graded Post-test. Unpublished paper presented at the All-California Labor Economics Conference, UCSB, Sept. Retrieved December 2010 from www.econ.ucsb.edu/conferences/aclec10/docs/marks.doc Bennett, S.K. (1982) Student Perception of and Expectation for Male and Female Instructors: Evidence Rating to the Question of Gender Bias in Teaching Evaluations. Journal of Educational Psychology, 74(2), 170-170. Benz, S., & Blatt, S.J. (1996) Meaning Underlying Student Ratings of Faculty. The Review of Higher Education, 19(4), 411-433. Bernard, M.E., Keefauver, L.W., Elsworth, G., & Naylor, F.D. (1981) Sex-role Behavior and Gender in Teacher-student Evaluations. Journal of Educational Psychology, 73(5), 681-696.


Birnbaum, M.H. (2000) A Survey of Faculty Opinions Concerning Student Evaluation of Teaching. http://psych.fullerton.edu/mbirnbaum/faculty3.htm Bharadwaj, S., Futrell, C.M., & Kantak, D.M. (1993) Using Student Evaluations to Improve Learning. Marketing Education Review, 3(Summer), 16-21. Boice, R. (1992) Countering Common Misbeliefs about Student Evaluation of Teaching. ADE Bulletin 101, Spring, 2-8. Braskamp, L.A., & Ory, J.C. (1994) Assessing Faculty Work: Enhancing Individual and Institutional Performances. San Francisco: Jossey-Bass. Carrell, S.E., & West, J.E. (2010). Does Professor Quality Matter? Evidence from Random Assignment of Students to Professors. Journal of Political Economy, 118(3), 409-432. Cashin, W.E. (1995) Student Ratings of Teaching: The Research Revisited. Idea Paper No. 32. Publication of the Center for Faculty Evaluation & Development, Division of Continuing Education, Kansas State University. Centra, J.A. (1993) Reflective Faculty Evaluation: Enhancing Teaching and Determining Faculty Effectiveness. San Francisco: Jossey-Bass.

Chonko, L. B. (2004). If it walks like a duck: Concerns about quackery in marketing education. Journal of Marketing Education, 26, 4-16. Clayson, D.E. (1989) Halo effects in student evaluations of faculty: A question of validity. Marketing: Positioning for the 1990s. Proceedings of the Annual Meeting of the Southern Marketing Association, 271275. Clayson, D.E. (1992) Student Evaluation of Faculty by Faculty Competencies by Gender. Marketing: Perspectives for the 1990s. Proceedings of the Annual Meeting of the Southern Marketing Association, 160-163. Clayson, D.E. (1993) Student teaching evaluations in marketing: A review and critique of the Journal of Marketing Education. Marketing and Education: Partners in Progress. Proceedings of the Atlantic Marketing Association, 142-147. Clayson, D.E. (1994) Contrasting Results of Three Methodological Approaches on the Interpretation of a Student Evaluation of Instruction. Midwest Marketing Association 1994 Proceedings, Proceeding of the Midwest Marketing Association, 209-214. Clayson, D.E. (1999) Students Evaluation of Teaching Effectiveness: Some Implication of Stability. Journal of Marketing Education, 21(1), 69-75. Clayson D.E. (2001) Academic Rigor and the Student Teacher Evaluation Process: Student Perceptions. Riding the Wave of Innovation in Marketing Education. Marketing Educators Association Conference Proceedings, (Stuart Van Auken and Regina P. Schlee, eds.), Madison, Wisc: Omnipress Press, 19-22. Clayson, D.E. (2001) The Problem of Within-class Reliability in Student Teacher Evaluations: Prevalence and Problems. Working Paper. Clayson, D.E. (2002) Reciprocity in Student Evaluations of Faculty: Do They Give You What You Give Them. Working Paper.

Clayson, D.E. (2004) A Test of the Reciprocity Effect in the Student Evaluation of Instructors in Marketing Classes. Marketing Education Review, 14(2), 11-21. Clayson, D.E. (2005) Performance Overconfidence: Metacognitive Effects or Misplaced Student Experience. Journal of Marketing Education, 27(2), 11-21. Clayson, D.E. (2005) Within-Class Variability in Student-Teacher Evaluations: Example and Problems. Decision Sciences Journal of Innovative Education, 3(1), 109-124. Clayson, D.E. (2007) Conceptual and Statistical Problems of Using Between-Class Data in Educational Research. Journal of Marketing Education, 29(1), 34-38. Clayson, D.E. (2008) A New Concept of Validity: Evaluation of Teaching, and the Production of Loincloths. Reaching New Heights in Marketing Education: Marketing Educators Association 2008 Conference Proceedings, (Robert A. Lupton & Barbara L. Gross, Eds.), Salt Lake City, 7-8. (Published by Omnipress: Madison, Wisc.). Clayson, D.E. (2009) Student Evaluation of Teaching: Are They Related to What Students Learn? A Meta-Analysis and Review of the Literature. Journal of Marketing Education, 31(1), 16-30. Clayson, D.E., & Frost, T.F. (1997) An Empirical Study of the Influence of Performance and Grades on Students' Evaluation of Instruction. Psychological Reports, 81, 507-512. Clayson, D.E., & Glynn, K.A. (1993) Student Perceptions of Marketing Faculty: The Effects of Gender, Attractiveness, and Age. Improving the Quality of Marketing Education: Breaking Old Paradigms. Proceedings of the Western Marketing Educators Association, 39-42. Clayson, D.E., & Haley, D.A. (1990) Student Evaluations in Marketing: What is Actually Being Measured? Journal of Marketing Education, 12(Fall), 9-17. Clayson, D.E., & Haley, D.A. (2005) Marketing Models in Education: Students as Customers, Products, or Partners. Marketing Education Review, 15(1), 1-10. Clayson, D.E., & Haley, D.A. (2011). Are Students Telling us the Truth? A Critical Look at the Student Evaluation of Teaching. Marketing Education Review, 21(2), 103-114. Clayson, D.E., & Sheffet, M.J. (2006). Personality and the Student Evaluation of Teaching. Journal of Marketing Education, 28(2), 149-160. Clayson, D.E., Frost, T.F., & Sheffet, M.J. (2006) Grades and the Student Evaluation of Instruction: A Test of the Reciprocity Effect. Academy of Management Learning & Education, 5(1), 52-65. Cohen, P.A. (1980) Effectiveness of Student-Rating Feedback for Improving College Instruction: A Meta-analysis. Research in Higher Education, 13(4), 321-341. Cohen, P.A. (1981) Student Ratings of Instruction and Student Achievement: A Meta-analysis of Multisection Validity Studies. Review of Educational Research, 51, 281-309. Comm, C.L., & Manthaisel, D.F.X. (1998) Evaluating Teaching Effectiveness in America's Business Schools: Implications for Service Marketers. Journal of Professional Service Marketing, 16(2), 163-70. Cranton, P. (2001) Interpretive and Critical Evaluation. New Directions for Teaching and Learning: Fresh Approaches to the Evaluation of Teaching (Christopher Knapper and Patricia Cranton, Eds.), 88(Winter), 11-18.


Cronbach, L., & Meehl, P. (1955). Construct validity in psychological tests, Psychological Bulletin, 52(4), 281-302. Crumbley, D.L. (1995) On the Dysfunctional Atmosphere of Higher Education: Games Professors Play. Accounting Perspectives, 1(Spring), 67-77. Crumbley, D. L., & Fliedner, E. (2002) Accounting Administrators Perception of Student Evaluation of Teaching (SET) Information. Quality Assurance in Education, 10(4), 213-222. Dowell, D.B., & Neal, J.A. (1982) "A Selective Review of the Validity of Student Ratings of Teaching." Journal of Higher Education, 53(1), 51-62. Edmundson, M. (1997) On the uses of a liberal education: Essay I. As lite entertainment for bored college students. Harpers Magazine, (September), 39-49. Elmore, P. B., & LaPointe, K.A. (1975) Effect of Teachers Sex, Student Sex, and Teacher Warmth on the Evaluation of College Instructors. Journal of Educational Psychology, 67(3), 368-374. Erdle, S., Murray,H.G., & Ruston, J.P. (1985) Personality, Classroom Behavior and Student Ratings of College Teaching Effectiveness: A path analysis. Journal of Educational Psychology, 77(4), 394-407. Feldman, K.A. (1983) The Seniority and Instructional Experience of College Teachers as Related to the Evaluations They Receive from Their Students. Research in Higher Education, 5, 243-288. Feldman, K.A. (1986) The Perceived Instructional Effectiveness of College Teachers as Related to Their Personality and Attitudinal Characteristics. Research in Higher Education, 28, 291-344. Feldman, K.A. (1987) Research Productivity and Scholarly Accomplishment of College Teachers as Related to Their Instructional Effectiveness: A Review and Exploration. Research in Higher Education, 26, 227-298. Feldman, K.A. (1989) "Instructional Effectiveness of College Teachers as Judged by Teachers Themselves, Current and Former students, Colleagues, Administrators, and External Observers." Research in Higher Education, 30, 137-94. Feldman, K.A. (1989) The Association between Student Ratings of Specific Instructional Dimensions and Student Achievement: Refining and Extending the Synthesis of Data from Multisection Validity Studies. Research in Higher Education. 30, 583-645. Feldman, K.A. (1993) College Students Views of Male and Female College Teachers: Part II Evidence from Students Evaluations of Their Classroom Teachers. Research in Higher Education, 34, 151-211. Feldman, K.A. (1997) Identifying Exemplary Teachers and Teaching: Evidence from Student Ratings. In R. P. Perry & J. C. Smart (Eds.), Effective Teaching in Higher Education: Research and Practice (pp. 368395). New York: Agathon. Felton, J., Koper, P.T., Mitchell, J., & Stinson, M. (2004) Web-based Student Evaluations of Professors: The Relations between Perceived Quality, Easiness and Sexiness, Assessment & Evaluation in Higher Education, 29(1), 91-108. Finkelstein, M.J. (1995) College Faculty as Teachers. In The NEA 1995 Almanac of Higher Education. Washington, DC: National Education Association, 33-47.

Follman, J. (1984) Pedagogue Paragon and Pariah 20% of the Time: Implications for Teacher Merit Pay. American Psychologist, September, 1069-1070. Foote, D.A., Harmon, S.K., & Mayo, D.T. (2003) The Impacts of Instructional Style and Gender Role Attitude on Students' Evaluation of Faculty. Marketing Education Review, 13(2), 9-19. Germain, M., & Scandura, T.A. (2005) Grade Inflation and Student Individual Differences as Systematic Bias in Faculty Evaluations. Journal of Instructional Psychology, 32(1), 58-67. Gillum, M. (2004). Grade inflation in humanities a dangerous trend. The Chronicle (Duke University), Opinion, March 22. Retrieved January 29, 2010 from http://dukechronicle.com/article/ grade-inflationhumanities-dangerous-trend Gray, M., & Bermann, B.R. (2003) Student Teaching Evaluations: Inaccurate, Demeaning, Misused. Academe, (Sept.-Oct.), 44-46. Gaski, J.F. (1987) On Construct Validity of Measures of College Teaching Effectiveness. Journal of Educational Psychology, 79, 326-330. Gillmore, G.M., & Greenwald, A.G. (1999) Using Statistical Adjustment to Reduce Biases in Student Ratings. American Psychologist, 54(7), 518-19. (Original data published: Greenwald, Anthony G. 1991. American Psychologist, 52, 1182-86.) Gillmore, G.M., Kane, M.T., & Naccarato, R.W. (1978) The Generalization of Student Ratings of Instruction: Estimation of the Teacher and Course Components. Journal of Educational Measurement, 15(1), 1-13. Goldberg, G., & Callahan, J. (1991) Objectivity of Student Evaluations of Instructors. Journal of Education for Business, (July/August), 377-78. Goldman, L. (1985) The Betrayal of the Gatekeepers: Grade Inflation. Journal of General Education, 37, 97-121. Greenwald, A.G., & Gillmore, G.M. (1997) Grading Leniency is a Removable Contaminant of Student Ratings. American Psychologist, 52(11), 1209-217. Gremler, D.D., & McCollough, M.A. (2002) Student Satisfaction Guarantees: An Empirical Examination of Attitudes, Antecedents, and Consequences. Journal of Marketing Education, 24(2), 150-160. Grimes, P.W., Millea, M., & Woodriff, T.W. (2004) Who's to Blame? Locus of Control and Student Evaluation of Teaching. Journal of Economic Education, 35(2), 129-147. Hake, R.R. (1998) Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses. American Journal of Physics, 66(1), 64-74. Halloun, I.A., & Hestenes, D. (1985) The initial knowledge state of college physics students. American Journal of Physics, 53(11), 1053-1055. Hamermesh, D.S., & Parker, A.M. (2004) Beauty in the Classroom: Professors' Pulchritude and Putative Pedagogical Productivity. Found at www.eco.utexas.edu/faculty/hamermesh/ Haskell, R.E. (1997) Academic freedom, tenure, and student evaluation of faculty: Galloping polls in the 21st century. Education Policy Analysis Archives, 5(6). Retrieved Jan. 10, 2007, from http://epaa.asu.edu/epaa/v5n6.html.


Hattie, J., & Marsh. H.W. (1996) The Relationship Between Research and Teaching: A Meta-Analysis. Review of Educational Research, 66(4), 507-542. Hewett, L., Chastain, G., & Thurber, S. (1988) Course Evaluations: Are Students Ratings Dictated by First Impressions? Paper presented at the Rocky Mountain Psychological Association, Snowbird, Utah. Hills, S. B., Naegle, N., & Bartkus, K. R. (2009) How Important are Items on a Student Evaluaton? A Study of Item Salience. Journal of Education for Business,(May/June), 297-303. Howard, G.S., & Maxwell, S.E. (1980) Correlation Between Student Satisfaction and Grades: A Case of Mistaken Causation? Journal of Educational Psychology, 72, 810-820. Howard, G.S, Conway, C.G., & Maxwell, S.E. (1985). Construct Validity of Measures of College Teaching Effectiveness. Journal of Educational Psychology, 77, 187-196. Jackson, C. (1994) How Personality Profiling can Change Your Life. Physics World, 7(4), 101-103. Johnson, V. E. (2003). Grade Inflation: A Crisis in College Education. New York: Springer. Kaplan, M., Mets, L.A., & Cook, C.E. (2000) Questions Frequently Asked about Student Ratings Forms: Summary of Research Findings. http://www.crit.umich.edu/crit.faq.html Kelley, C.A., Conant, J. S., & Smart, D.T. (1991). Master teaching revisited: Pursuing excellence from the students perspective. Journal of Marketing Education, 13, 1-10. Kelley-Woessner, A., & Woessner, M.C. (2006) My professor is a partisan hack: How perceptions of a professors political views affect student course evaluations. PS: Political Science & Politic, July, 495 501. Kember, D., & Leung, D.Y.P. (2008) Establishing the Validity and Reliability of Course Evaluation Questionnaires. Assessment & Evaluation in Higher Education, 33(4), 341-353. Kennedy, E.J., Lawton, L., & Plumlee, E.L. (2002) Bliss Ignorance: The Problem of Unrecognized Incompetence and Academic Performance. Journal of Marketing Education, 24(3), 243-252. Kierstead, D., DAgostino, P.,& Dill, H. (1988) Sex Role Stereotyping of College Professors: Bias in Students Rating of Instructors. Journal of Educational Psychology, 80(3), 342-344. Langbein, L.I. (1994) The Validity of Student Evaluations of Teaching. PS: Political Science in Institutional Politics, Sept., 545-552. Lantos, G. P. (1997) Motivating Students: The Attitude of the Professor. Marketing Education Review, 7(2), 27-38. Lundsten, N,L. (1986) "Student Evaluations in a Business Administration Curriculum: A Marketing Viewpoint." AMA Developments in Marketing Science, 9, 169-73. Machina, K. (1987) Evaluating Student Evaluations. Academe, May-June, 19-22. Magner, D. K. (1997) Report says Standards used to Evaluate Research should also be used for Teaching and Service. The Chronicle of Higher Education, 44(2), A18-A19. Marks, R. B. (2000) Determinants of Student Evaluations of Global Measures of Instructor and Course Value. Journal of Marketing Education, 22(2), 108-119.

Marlin, J.W., & Niss, J.F. (1980) End-of-Course Evaluations as Indicators of Student Learning and Instructor Effectiveness. The Journal of Economic Education, Spring, 16-27.
Marsh, H.W. (2007) Do University Teachers Become More Effective With Experience? A Multilevel Growth Model of Students' Evaluations of Teaching Over 13 Years. Journal of Educational Psychology, 99(4), 775-790.
Marsh, H.W., & Dunkin, M. (1992) Students' Evaluations of University Teaching: A Multidimensional Perspective. In J.C. Smart (Ed.) Higher Education: Handbook of Theory and Research (Vol. 8, pp. 143-233). New York: Agathon.
Marsh, H.W., Hau, K., Chung, C., & Siu, T.L. (1997) Students' Evaluations of University Teaching: Chinese Version of the Students' Evaluations of Educational Quality Instrument. Journal of Educational Psychology, 89(3), 568-72.
Marsh, H.W., & Hattie, J. (2002) The Relation Between Research Productivity and Teaching Effectiveness. The Journal of Higher Education, 73(5), 603-641.
Marsh, H.W., & Hocevar, D. (1991) Students' Evaluations of Teaching Effectiveness: The Stability of Mean Ratings of the Same Teachers over a 13-year Period. Teaching and Teacher Education, 7(4), 303-314.
Marsh, H.W., & Roche, L.A. (1997) Making Students' Evaluations of Teaching Effectiveness Effective. American Psychologist, 52(11), 1187-97.
Marsh, H. W., & Roche, L.A. (1999) Rely upon SET Research. American Psychologist, 54(7), 517-518.
Marsh, H. W., & Roche, L.A. (2000) Effects of Grading Leniency and Low Workload on Students' Evaluations of Teaching: Popular Myth, Bias, Validity, or Innocent Bystanders? Journal of Educational Psychology, 92(1), 202-228.
Mason, P., Steagall, J., & Fabritius, M. (1995) Student Evaluations of Faculty: A New Procedure for Using Aggregate Measures. Economics of Education Review, 12, 403-416.
Merritt, D.J. (2007) Bias, the brain, and student evaluations of teaching. ExpressO Preprint Series, Paper 1939. The Berkeley Electronic Press. Retrieved March 3, 2007 from http://law.bepress.com/expresso/eps/1939.
Miron, D.G. (1976) Students' Evaluation of Instructors' Self-evaluation of University Instruction. Higher Education, 17, 175-181.
Moore, P. (2009). Why We Should Measure Student Learning: A Glossary of Collegiate Corruption. In Flinn, R.E. & Crumbley, D.L. (Eds.) Measure Learning Rather than Satisfaction in Higher Education. Sarasota, Florida: American Accounting Association.
Moore, P., & Flinn, R.E. (2009). The limitations of measuring student learning. In Flinn, R.E. & Crumbley, D.L. (Eds.) Measure Learning Rather than Satisfaction in Higher Education. Sarasota, Florida: American Accounting Association.
Moore, S., & Kuol, N. (2007). Retrospective Insights on Teaching: Exploring Teaching Excellence Through the Eyes of the Alumni. Journal of Further and Higher Education, 31(2), 133-163.
Murray, H.G., Rushton, J.P., & Paunonen, S.V. (1990) Teacher Personality Traits and Student Instructional Ratings in Six Types of University Courses. Journal of Educational Psychology, 82(2), 250-261.


Naftulin, D.H., Ware, J.E., & Donnelly, F.A. (1973) The Doctor Fox Lecture: A Paradigm of Educational Seduction. Journal of Medical Education, 48, 630-35.
Oliver-Hoyo, M. (2008) Two Groups in the Same Class: Different Grades. Journal of College Science Teaching, 38(1), 37-39.
Onwuegbuzie, A. J., Daniel, L.G., & Collins, K.M.T. (2009) A Meta-Validation Model for Assessing the Score-Validity of Student Teaching Evaluations. Quality and Quantity, 43, 197-209.
Orsini, J. L. (1988) Halo Effects in Student Evaluations of Faculty: A Case Application. Journal of Marketing Education, 10(Summer), 38-45.
Ortinau, D.J., & Bush, R.P. (1987) The Propensity of College Students to Modify Course Expectations and Its Impact on Course Performance Information. Journal of Marketing Education, 9(Spring), 42-52.
Ory, J.C., & Ryan, K. (2001) How do student ratings measure up to a new validity framework? New Directions in Institutional Research, 109, 27-44.
Overall, J.U., & Marsh, H.W. (1980) Students' Evaluations of Instruction: A Longitudinal Study of Their Stability. Journal of Educational Psychology, 72, 321-325.
Paswan, A.K., & Young, J.A. (2002) Student Evaluation of Instructors: A Nomological Investigation Using Structural Modeling. Journal of Marketing Education, 24, 193-202.
Powell, R.W. (1977) Grades, Learning, and Student Evaluation of Instructors. Research in Higher Education, 7, 193-205.
Redding, R.E. (1998) Students' Evaluations of Teaching Fuel Grade Inflation. American Psychologist, 53(11), 1227-228.
Reynolds, D. V. (1977) Faculty Forum. Teaching of Psychology, 4(2), 82-83.
Rodin, M., & Rodin, B. (1972) Student Evaluation of Teachers. Science, 177(Sept. 29), 1164-166.
Ryan, J.J., Anderson, J.A., & Birchler, A.B. (1980) Student Evaluation: The Faculty Responds. Research in Higher Education, 12(4), 317-33.
Sauber, M.H., & Ludlow, R.R. (1988) Student Evaluation Stability in Marketing: The Importance of Early Class Meetings. The Journal of Midwest Marketing, 3, 41-49.
Seiver, D.A. (1983) Evaluations and Grades: A Simultaneous Framework. The Journal of Economic Education, Summer, 220-223.
Seldin, P. (1993) The Use and Abuse of Student Ratings of Professors. The Chronicle of Higher Education, 21(July), A40.
Seldin, P. (1999) Changing Practices in Evaluating Teaching: A Practical Guide to Improving Faculty Performance and Promotion/Tenure Decisions. Bolton, MA: Anker Publishing Co., Inc.
Schwab, D. P. (1976) Manual for the Course Evaluation Instrument. Madison: University of Wisconsin, School of Business.
Sherman, B.R., & Blackburn, R.T. (1975) Personal Characteristics and Teaching Effectiveness of College Faculty. Journal of Educational Psychology, 67(1), 124-131.

Simpson, P. M., & Siguaw, J.A. (2000) Student Evaluations of Teaching: An Exploratory Study of the Faculty Response. Journal of Marketing Education, 22(3), 199-213.
Sinclair, L., & Kunda, Z. (2000) Motivated stereotyping of women: She's fine if she praised me but incompetent if she criticized me. Personality and Social Psychology Bulletin, 26(11), 1329-1342.
Sixbury, G.R., & Cashin, W.E. (1995) IDEA Technical Report No. 9: Descriptions of database for the IDEA Diagnostic Form. Publication of the Center for Faculty Evaluation & Development, Division of Continuing Education, Kansas State University.
Smith, B. P. (2007) Student Ratings of Teaching Effectiveness: An Analysis of End-of-Course Faculty Evaluations. College Student Journal, 41(4), 788-800.
Sproule, R. (2000) Student Evaluation of Teaching: A Methodological Critique of Conventional Practices. Education Policy Analysis Archives, 8(50). http://epaa.asu.edu/epaa/v8n50.html
Sproule, R. (2002) The Under-Determination of Instructor Performance by Data from the Student Evaluation of Teaching. Economics of Education Review, 21(3), 287-295.
Stake, J. E. (1997) Response to Haskell: Academic Freedom, Tenure, and Student Evaluation of Faculty. Education Policy Analysis Archives, 5(8). Found at http://epaa.asu.edu/epaa/v5n8.html
Stanfel, L.E. (1995) Measuring the Accuracy of Student Evaluations of Teaching. Journal of Instructional Psychology, 22(2), 117-125.
Stumpf, S.A., & Freedman, R.D. (1979) Expected Grade Covariation with Student Ratings of Instruction: Individual versus Class Effects. Journal of Educational Psychology, 71(3), 293-302.
Sullivan, A.M., & Skanes, G.R. (1974) Validity of Student Evaluations of Teaching and the Characteristics of Successful Instructors. Journal of Educational Psychology, 66(4), 584-590.
Tang, T.L., & Tang, T.L. (1987) A Correlation Study of Students' Evaluations of Faculty Performance and Their Self-Ratings in an Instructional Setting. College Student Journal, 21(Spring), 90-97.
Theall, M., & Franklin, J. (2001) Looking for Bias in All the Wrong Places: A Search for Truth or a Witch Hunt in Student Ratings of Instruction? New Directions for Institutional Research, 27(5), 45-56.
Widmeyer, W.N., & Loy, J.M. (1988) When You're Hot, You're Hot! Warm-cold Effects in First Impressions of Persons and Teaching Effectiveness. Journal of Educational Psychology, 80, 118-21.
Wilson, R. (1998) New Research Casts Doubt on Value of Student Evaluations of Professors. The Chronicle of Higher Education, 44(19), A12-A14.
Wilhelm, W. B. (2004) The relative influence of published teaching evaluations and other instructor attributes on course choice. Journal of Marketing Education, 26, 17-30.
Youmans, R. J., & Jee, B.D. (2010) Fudging the Numbers: Distributing Chocolate Influences Student Evaluations of an Undergraduate Course. Teaching of Psychology, 34(4), 245-247.
Yunker, P. J., & Yunker, J. (2003) Are Student Evaluations of Teaching Valid? Evidence From an Analytical Business Core Course. Journal of Education for Business, 78(6), 313-317.


Appendix

Researchers' Advice on the Utilization of Student Evaluations of Teaching
_______________________________________________________________________

Caution from Supporters of SET

The following researchers are all public defenders of the SET system.

"There is little evidence that teachers become either more or less effective [as measured by SET] with added experience" (Marsh 2007, p. 786). The same author has maintained that teachers need expert advice to become better teachers (Marsh & Roche 1997).

Marsh argues that "the least appropriate use of student ratings is as a summative judgment for purposes that might incur pecuniary or personal penalty" (Johnson 2000, p. 426).

Centra (1994), who believes that SET actually reflects learning, does not believe that they should be used alone to evaluate teaching.

"Despite the generally supportive research findings, student ratings should be used cautiously, and there should be other forms of systematic input about teaching effectiveness, particularly when they are used for tenure/promotion decisions" (Marsh 1984, p. 749).

A very early source in marketing, which advocated the use of SET to improve the effectiveness of the institution, still maintained that "there are inherent dangers in using this information for administrative purposes" (Lill 1979, p. 252).

________________________________________________________

Sources Recommending Eliminating SET

A number of sources warn about the utilization of SET.

"Since many U.S. colleges and universities use student evaluations as a measure of teaching quality for academic promotion and tenure decisions, this finding draws into question the value and accuracy of this practice" (Carrell & West 2010, p. 430).

"Too many powerful people do not want a precise instrument (or set of instruments) for measuring student learning and evaluating teaching because such an instrument (or instruments) would make too many people second-raters, third-raters, or worse" (Moore & Flinn 2009, p. 110).

Attempts to create valid SET represent not only the wrong-headed institutionalization of pseudoscience, but also the successful but wrong-headed transformation of the pedagogical environment into one that is more consumer- and hence less learning-orientated (Sproule & Valsan 2009, p. 143).

"When transformed into numbers, student ratings become misconstrued as quantified measures of teaching excellence that institutionally serve to provide an illusion of objectivity and penalize faculty who adhere to an alternative critical model of pedagogy. Through the use of rating forms, academic control and professional authority are transferred from faculty to students, who, as discriminating consumers, are granted the power to further their own interests and, consequently, shape the nature and form of higher education that serves them" (Titus 2008, p. 397).

"To help reduce the demand pressures for higher grades, student evaluations should be banned from college campuses" (Pressman 2007, p. 99).

"Our findings continue to call into question the use, fairness, and validity of evaluating instruction in college and university classes with measures of students' opinion about teaching" (Mohanty et al. 2005, p. 147).

"In view of the present assessment, it would seem appropriate and reasonable for all constituencies of the university community to come clean by acknowledging publicly and unequivocally that the SET data are contaminated with non-trivial, and incalculable, systemic errors, and that the presence of these errors render the FEC decision rule invalid, unreliable, and otherwise hopelessly flawed" (Sproule 2002, p. 292).

"Instead of asking instructors to improve teaching evaluations, schools should be asking themselves whether they should be asking instructors to make the courses more or less demanding, interactive, or structured and organized" (Paswan & Young 2002, p. 200).

Since SET is widely thought not to offer improvement of teaching, Becker and Watts ask, "How can something that has little or no information value for the agent have great information value for the principal?" (Becker & Watts 1999, p. 347).

Surveys in the 1990s showed that 75 per cent of academics judged SET as unreliable and imprecise measures of performance, yet almost 100 per cent of schools use the instruments (Reckers 1995).

"I cannot think that the habit of evaluating one's teacher can encourage a young person to long for the truth, to aspire to achievement, to emulate heroes, to become just, or to do good. To have one's opinions trusted utterly, to deliver them anonymously, to have no check on their truth, and no responsibility for their effect on the lives of others are not good for a young person's moral character. To have one's opinions taken as knowledge, accepted without question, inquiry, or conversation is not an experience that encourages self-knowledge" (Platt 1993, pp. 33-34).


__________________________________________________

SET Should Never be Utilized as a Sole Measure of Teaching

"Even when student evaluations are used, the statistical data obtained are often misused" (Madu & Kuei 1993, p. 329).

"The results reported here should cause academic administrators to question the role of student teacher evaluation of professors, especially if the goal is to compare professors on individual teaching items" (Madden, Dillon & Leak 2010, p. 272).

"...it would not be prudent to rely solely on course evaluations as a means of gauging student learning" (Marks, Fairris & Beleche 2010, p. 19).

"This [lack of relationship between SET and learning] would imply that on balance, universal and group-weighted SET results should not be utilized, or they should be interpreted with great care" (Clayson 2009, p. 27).

"...we suggest that unless more positive evidence regarding the psychometric properties of STEs is provided, any decisions stemming from STEs should be made very tentatively and alongside other indices of instructional effectiveness such as statements of the instructor's teaching philosophies, duties, and short- and long-term goals and objectives; self-evaluations undertaken by the instructor; evaluations by peers and administrators; unsolicited written comments made by students; samples of students' work; and records of student achievement after leaving the course and/or institution" (Onwuegbuzie et al. 2009, p. 207).

"The framework presented here suggests that in the case of the SET process in its conventional form, its value is questionable as the sole measure of classroom performance since the quality, richness and diversity of what happens in the typical classroom cannot be captured by the SET process alone" (Pounder 2007, p. 189).

"Student evaluations can be helpful when collected and maintained by individual professors for their own information. But they should not be used for personnel purposes" (Fischer 2006, p. 6 of 11).

"Student evaluations of instruction appear to follow a seriously flawed paradigm. At the very least, they should be closely monitored both by faculty and by administrators when they are used as indicators of teaching quality" (Clayson & Sheffet 2006, p. 159).

After reminding us that SET is here to stay, the authors state, "Committees too often rely heavily on students' evaluations to make inferences about the teacher's instructional effectiveness" (Algozzine et al. 2004, p. 138).


"Student evaluations of teaching are essential but not sufficient" (Laverie 2002, p. 106).

After showing that grade inflation occurred after the adoption of a SET, Eiszler (2002) warns, "Taken together these factors describe a pattern of over reliance on and over interpretation of student ratings of instruction" (Eiszler 2002, p. 499).

Aleamoni (1999) likes the evaluations, but cautions that faculty need outside consultants, that published reports are in danger of misrepresenting the findings, and that administrators must be very careful about using them for punitive purposes.

References for Appendix
Aleamoni, L.M. (1999). Student rating myths versus research facts from 1924 to 1998. Journal of Personnel Evaluation in Education, 13, 153-166.
Algozzine, B., Gretes, J., Flowers, C., Howley, L., Beattie, J., Spooner, F., Mohanty, G., & Bray, M. (2004). Student evaluation of college teaching: A practice in search of principles. College Teaching, 52, 134-141.
Becker, W.E., & Watts, M. (1999). How departments of economics evaluate teaching. American Economic Review, 89, 344-349.
Carrell, S.E., & West, J.E. (2010). Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118, 409-432.
Centra, J.A. (1994). The use of the teaching portfolio and student evaluations for summative evaluation. Journal of Higher Education, 65, 555-570.
Clayson, D.E. (2009). Student evaluations of teaching: Are they related to what students learn? Journal of Marketing Education, 31, 16-30.
Clayson, D.E., & Sheffet, M.J. (2006). Personality and the student evaluation of teaching. Journal of Marketing Education, 28, 149-160.
Eiszler, C. F. (2002). College students' evaluations of teaching and grade inflation. Research in Higher Education, 43, 483-501.
Fischer, J.D. (2006). Implications of recent research on student evaluations of teaching. The Montana Professor, 17. Retrieved December 22, 2009 from http://mtprof.msun.edu
Johnson, R. (2000). The authority of the student evaluation questionnaire. Teaching in Higher Education, 5, 419-434.
Laverie, D.A. (2002). Improving teaching through improving evaluation: A guide to course portfolios. Journal of Marketing Education, 25, 104-113.
Lill, D. J. (1979). The development of a standardized student evaluation form. Journal of the Academy of Marketing Science, 7, 242-254.
Madden, T.J., Dillon, W.R., & Leak, R. L. (2010). Students' evaluation of teaching: Concerns of item diagnosticity. Journal of Marketing Education, 32, 264-274.


Madu, C. N., & Kuei, C. (1993). Dimensions of quality teaching in higher education. Total Quality Management, 4, 325-338.
Marks, M., Fairris, D., & Beleche, T. (2010). Do course evaluations reflect student learning? Evidence from a pre-test/post-test setting. Retrieved December 17, 2010 from http://faculty.ucr.edu/~mmarks/Papers/marks2010course.pdf
Marsh, H. W. (1984). Students' evaluation of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707-754.
Marsh, H.W. (2007). Do university teachers become more effective with experience? A multilevel growth model of students' evaluations of teaching over 13 years. Journal of Educational Psychology, 99, 775-790.
Marsh, H.W., & Roche, L.A. (1997). Making students' evaluations of teaching effectiveness effective. American Psychologist, 52, 1187-1197.
Mohanty, G., Gretes, J., Flowers, C., Algozzine, B., & Spooner, F. (2005). Multi-method evaluation of instruction in engineering classes. Journal of Personnel Evaluation in Education, 18, 139-151.
Moore, P., & Flinn, R.E. (2009). The limitations of measuring student learning. In Flinn, R.E. & Crumbley, D.L. (Eds.) Measure Learning Rather than Satisfaction in Higher Education. Sarasota, Florida: American Accounting Association.
Onwuegbuzie, A. J., Daniel, L.G., & Collins, K.M.T. (2009). A meta-validation model for assessing the score-validity of student teaching evaluations. Quality and Quantity, 43, 197-209.
Paswan, A. K., & Young, J. A. (2002). Student evaluation of instructor: A nomological investigation using structural equation modeling. Journal of Marketing Education, 24, 193-202.
Platt, M. (1993). What student evaluations teach. Perspectives on Political Science, 22, 29-40.
Pounder, J.S. (2007). Is student evaluation of teaching worthwhile? An analytical framework for answering the question. Quality Assurance in Education, 15, 178-191.
Pressman, S. (2007). The economics of grade inflation. Challenge, 50, 93-102.
Reckers, P. M. J. (1995). Know thy customer. In Change in Accounting Education: A Research Blueprint (C.P. Baril, Ed.). Federation of Schools of Accountancy: St. Louis.
Sproule, R., & Valsan, C. (2009). The student evaluation of teaching: Its failure as a research program, and as an administrative guide. Economic Inferences, 11, 125-150.
Sproule, R. (2002). The underdetermination of instructor performance by data from the student evaluation of teaching. Economics of Education Review, 21, 287-295.
Titus, J. J. (2008). Student ratings in a consumerist academy: Leveraging pedagogical control and authority. Sociological Perspectives, 51, 397-422.
