
Of What Value are Student Evaluations?

© 2003 Edward B. Nuhfer, Center for Teaching and Learning, Idaho State University

This review paper is periodically updated to serve as a reference. It allows the reader to understand the meaning and limits of student evaluations as gleaned from the major literature, and addresses the general question, "To what degree can student evaluations be used to distinguish what constitutes 'good' teaching?" A recent web site concerning student evaluations (http://www.cedanet.com/fe_page.htm#Assess%20Anchor) states: "Student ratings of teaching serve as an important component of many faculty evaluation systems. Either by design or default, institutions often place great weight on student rating data in making decisions that impact faculty rewards, career progress and professional growth. It is critical that student rating forms be designed and constructed in such a way as to provide valid and reliable information for these purposes." This shows the importance that evaluators ascribe to student evaluations. Some users fail to treat them as simply an "important component" and instead use student evaluations as the defining measure of success.

If we subscribe to an extreme alternative hypothesis that "good teaching is too complex to describe or evaluate," this implies that none of our own memories about who the good teachers in our lives were has any value, beyond perhaps providing some warm satisfaction within our imaginations. If true, then claims to be able to use evaluations to learn about, to improve, or to change teaching are ridiculous. After all, how can one claim to be learning, improving, or changing something that one can't evaluate? If we can't improve teaching, then this conclusion challenges entrenched traditions of practice. Why mandate a traumatic seven years of "probation" before making a tenure decision that includes "teaching criteria" if successful teaching can't be recognized? Why should salaries rise with experience, if teaching experience cannot be demonstrably linked with better performance?
Clearly, the implications of an inability to measure or modify teaching are displeasing at levels ranging from personal to institutional. Yet, an uncritical refutation of the hypothesis, as encouraged in far too many contemporary discussions, is equally damaging. The premise that student evaluations are reproducible and useful for evaluation has, since the 1980s, been accepted as generally valid. Workers who have surveyed the growing literature on the subject invariably weight their summaries in favor of the usefulness of student evaluations (Cashin, 1988 and 1995; Cohen, 1981; d'Apollonia and Abrami, 1997; Dunkin and Barnes, 1986; Greenwald, 1997; Theall, Abrami, and Mets, 2001). There is a very general trend for highly rated teachers to be those whose students achieve well (McKeachie, 1986). Seldin (1993) notes that "...hundreds of studies have determined that student ratings are generally both reliable (yielding similar results consistently) and valid (measuring what the instrument is supposed to measure)."

A key to deeper understanding of statements such as those of Seldin lies in recognizing what such an instrument actually measures. This often is pre-empted by unwarranted assumptions about "what the instrument is supposed to measure." Some examples of common rash reasoning follow: (1) "Student evaluations measure successful teaching." (2) "Student evaluations have a 'valid correlation' with students' learning, so student evaluations can be used as measures of students' learning." (3) "When a relationship is established as 'reliable' and 'valid' from good research on large populations, I can safely apply this relationship to judge any individual whom I evaluate." All three statements are untrue, but student evaluations are nevertheless misused as though such statements were ironclad facts. Further, true believers (who too often seem to have a stake in selling institutions a workshop or an evaluation form) proclaim that student evaluations cannot be manipulated or subverted. Anyone who believes such claims needs to read the first part of Generation X Goes to College by Peter Sacks. This part is an autobiography of the author's tenure-track experience in an unnamed community college in the Northwest. Sacks, an accomplished journalist but not a very accomplished teacher, soon finds himself in trouble with student evaluations. Sacks exploits affective factors to deliberately obtain higher evaluations, and describes in detail how he did it in Part 1, "The Sandbox Experiment." Sacks obtains higher evaluations through deliberate pandering, not through promotion of any learning outcomes. For years, he manages to deceive not only students but also peers and administrators, and he eventually gets tenure based on the higher student evaluations.
This is a brutal case study that many could find offensive, but it proves clearly that (1) student evaluations can indeed be manipulated, and (2) faculty peer reviewers and administrators who should know better than to place such blind faith in student evaluations sometimes do not. Unless one uses and understands the terms with their formal meanings, it is easy to confuse "reliable" and "valid" with "highly predictive," "precise" and even "accurate." Deans and review committees probably misuse student evaluations more often than they use them correctly, and much of the criticism of student evaluations probably arises more from misuse than from the nature of the evaluation tools themselves (McKeachie, 1997). Objective advocates of student evaluations invariably note that the relationships are not precise and simple, that student evaluation programs are run from very well to very poorly, and that results from ratings can be useful or misused (Seldin, 1993). In charged emotional arguments with ramifications for bruised egos (i.e., "We've always done it this way;" "The dean designed the evaluation form;" or even "Student evaluations are embarrassing to faculty and hence we should get rid of them"), it is perilously easy to promote actions for reasons other than actual value or merit. To put the bottom line near the top: good teaching can be defined and evaluated with the aid of student evaluations, but not by such evaluations alone. Behind all the rhetoric about reliability and validity, a basic fact is being ignored: student evaluations are not clean assessments of any specific work being done. Instead, they are ratings derived from students' overall feelings, which arise from an inseparable mix of learning, pedagogical approaches, communication skills and affective factors that may or may not be important to student learning. As Sacks demonstrated, a professor can emphasize a particular practice that will change student ratings but not necessarily affect learning outcomes very much.

Let's begin by recognizing two very different kinds of student evaluations. These are "summative" (those used to evaluate professors for rank, salary and tenure purposes) and "formative" (those that diagnose in ways that allow professors to improve their teaching). Summative evaluations, given at the end of a course, are direct measures of student satisfaction. "Satisfaction" is the sum of complex factors that include learning, teaching traits, and affective personal reactions that are products both of what happens in a class and of what an individual has brought with him or her to the class in the form of bias and motivation. Formative evaluations, given during the ongoing course, usually about mid-term, ask detailed questions that provide a profile of the pedagogy and strategy being employed. There is plenty of evidence to show that the functions of the two kinds of evaluation must be clearly separated. It is maddening when writers of papers and books about "student evaluations" or "student ratings" fail to specify whether they are talking about summative or formative tools. The thorough compilation by Theall, Abrami, and Mets (2001) is damaged by lack of such specificity, because when one talks of the utility of evaluations to help to improve teaching, one cannot be talking about summative evaluations. But the average reader won't know this, and false expectations that arise from such confused presentations may lead to a lot of grief and damage in application. Typical summative questions are "Overall, how do you rate this instructor's teaching ability compared to all other college instructors you have now and have had in the past?"
"Overall, how do you rate this course compared to all other college courses you have now and have had in the past?" (Scriven, 1997, recommends against the types of competitive wording such as "compared to all other college instructors"") and "How do you rate this course as a learning experience?" Such questions are also called "global" because they solicit a general overview of the experience. These are the kind of evaluations being talked about in the quote from the web page in the very first paragraph of this paper. In application, no matter how many questions appear on the evaluation form, the summative evaluation process too often boils down to actually using results from just one or two global questions. The worst practice involves calculating numerical average scores on a question (sometimes to several decimal places) and using these scores to sort professors into categories for purposes of awarding (or withholding) salary, tenure or rank. The attraction for using the global question alone in such a way is mainly laziness"it makes for an "easy" evaluation. Such practice is also inept evaluation. Formative evaluation involves responses on a Likert scale from "strongly agree" to "strongly disagree " to statements such as: "Discusses recent developments in the field;" "Uses examples and illustrations;" "Is well prepared;" "States objectives of each class session;" "Encourages class discussion/participation;" "Gives personal help to

students having difficulty in the course;" "Is enthusiastic;" and "Instructor gave helpful suggestions on how to make small-group work more beneficial." These items reveal specific teaching practices and the degree to which they are used in a course. The items established based on research (like that of Hildebrand et. al., 1971; or Feldman, 1986) reveal the importance of a particular trait or practice as used to promote students" success. An instructor can use the information provided to add new practices or emphasize particular ones of his/her choice. As such, formative tools yield the information required about how to improve. Merely evaluating professors never constitutes a program to improve teaching. "Primitive" is perhaps too kind a word for any administration that embraces an evaluation system tied to rewards and punishments but fails to provide a complementary program required to foster improvement. Some brief history of formative and summative uses The following was provided via email by Dr. Michael Theall, (now at Youngstown State University), who has written extensively on student evaluations (see Theall & Franklin, 1990; Theall, Abrami, and Mets, 2001). "The earliest distinction between formative and summative uses was by Mike Scriven, who coined the terms in his (1967) "Methodology of evaluation" in Taylor, Gagne, & Scriven's "Perspectives of curriculum evaluation". The earliest studies were by H. Remmers (e.g., "Experimental data on the Purdue Rating Scale for Instructors in 1927) and they were concerned with exploring student opinions as one way to find out more about teaching/learning for "self-improvement of instruction" and for psychometric reasons (i.e. to validate the scale). In 1928, Remmers investigated "student marks and student attitude toward instructors". This time, the psychometric properties of ratings were more the focus, (perhaps due to increasing summative use and resulting validity questions?). 
By 1949, Remmers was referring (in 'Are student ratings of their instructors related to their grades') to students' opinions of the teacher as one of the 'Two criteria by which teachers are often evaluated...' In the 1949 study, Remmers concluded that 'There is warrant for ascribing validity to student ratings not merely as measures of student attitude toward instructors...but also as measured by what students learn of the content of the course.' The timing of the administration of the instrument isn't mentioned in the 1927 study, for example, but in the 1949 study, the evaluations were done at the close of the term. So it looks like: 1) the earliest intent was formative; 2) summative uses developed fairly quickly; 3) psychometric properties were first a measurement issue and then a matter of establishing ratings validity due to summative use; and 4) specifics of the evaluation process gradually evolved from end-of-term administration to other timing and process changes."

Searching for value via the numbers

Statistics have been used extensively, in ways both fair and foul, to give credibility to the use of student evaluations for annual review of faculty. A requisite to using the literature of student evaluations effectively is to recognize the quantitative implications of statistical "validity" as it applies to (a) large populations and (b) individuals. The fact that trends and correlations are statistically "significant" in studies of large populations is not synonymous with their being "reliable" ways through which to evaluate individuals. In social science research (emphasis mine), correlation coefficients above 0.20 are somewhat useful, and those above 0.50 are very useful but are rare when studying complex phenomena (Cashin, 1988). Reviews of the actual results from large numbers of teacher evaluations show that global questions, in particular, correlate very highly with one another (Cashin, 1995). The correlations between global questions commonly reach higher than r = 0.8 (see Figure 1). For example, a professor who is rated highly on one global question that has to do with his/her overall rating as a good professor will likely also get a high rating on an overall question about the quality of his/her course. Global questions carry a great deal of redundancy by measuring the same general feelings about satisfaction. This redundancy should not be forgotten when discussing the actual merits of the correlation coefficient or the high factor loadings in a factor analysis (see for example Marsh, 1983). Cohen (1981) utilized students' scores on an external exam as a measure of student learning and compared them with ratings given by students on their evaluation questionnaire.
The highest correlations were with ratings of teacher skill ("explains clearly;" r = 0.50), teacher structure ("uses class time well;" r = 0.47), rating of own achievement ("rating of how much I learned;" r = 0.47), "overall rating of course" (r = 0.47) and "overall rating of instructor effectiveness" (r = 0.44). "Teacher rapport" produced a correlation of 0.31. If students' learning could simply be related to instructor characteristics, then the correlation coefficients would be higher, so it is reasonable to conclude that factors other than mere student learning are included within summative ratings. Cohen's research is valuable for two reasons: (a) it demonstrates that students generally know when they are learning, and (b) it demonstrates that the relationships between student evaluation ratings and measures of actual learning are not nearly high enough to allow one to be used to predict the other (see Figure 1). The validity of statement "b" is best shown by numerous studies on small populations (see Dunkin and Barnes, 1986) that show results as disparate as positive and negative relationships between teachers' ratings by students and students' actual learning. An individual instructor's global rating by students is no predictor that his or her students will outperform those of an instructor who received less favorable ratings. Statistics are misused when they lead to deception, such as passing off the measure of one thing (student satisfaction) as a measure of something else (student learning). There are far better ways, such as tests and knowledge surveys (Nuhfer, 1993; 1995; Nuhfer and Knipp, 2003), to measure student learning directly. To take a tool that does not address content mastery, and then to use statistical correlation to argue that the tool "really is a measure of student learning," ignores the obvious and falls somewhere between junk science and unethical practice. More effort than merited has gone into arguing for student evaluations as a measure of student learning. This is not because the relationships are unknown; rather, it is because assessment of student learning, not faculty ratings, is the major outcome now demanded by evaluators and accreditors, and schools have generally been more engaged in rating faculty than in assessing student learning. Admission that student evaluations are no measure of student learning is now doubly embarrassing to those who argued that the practice of rating teachers was a means to improve students' learning. There is an immense difference between "Rate this course as a learning experience" and "Rate your ability to quantitatively explain the distinction between permeability and hydraulic conductivity." The former question is not a rating of any specific learning objective and begs for affective influence; the latter is a measure of specific content knowledge devoid of affective general feelings. If student learning is as important as student satisfaction, then we should use direct learning measures even more often than we use global satisfaction ratings to deduce "good teaching." Figure 1 displays scattergrams associated with correlations within the ranges of those presented in Tables 1, 2 and 3. One must be aware that research on large populations is not the same exercise as evaluating an individual or assisting an individual to improve. Observe the line-fit to the data pairs and the resulting correlation coefficients. Recognize that an individual could be any one of the points in the data set, and it becomes easy to see how often an individual can be seriously over-rated or under-rated by reliance on valid relationships established on large populations.
The ability to evaluate individual performance from relationships established between two variables across a large population does not follow merely because a relationship is "valid," "reliable" or "very useful." That ability arises only if the relationship carries a high degree of predictability, which requires a correlation coefficient of roughly 0.9 or above. Not a single relationship established between summative ratings and outcomes or practices meets that qualification.
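The predictability point can be made concrete with a small simulation (an illustrative sketch, not part of the original paper; it assumes ratings and learning outcomes are bivariate-normal with correlation r, and the function name is our own). It estimates how often, when two instructors are compared, the one with the higher rating also has the higher true learning outcome:

```python
import math
import random

def concordance(r, trials=200_000, seed=42):
    """Estimate the probability that the instructor with the higher rating
    also has the higher true-learning score, for bivariate-normal pairs
    with correlation r."""
    rng = random.Random(seed)
    hits = 0
    k = math.sqrt(1 - r * r)
    for _ in range(trials):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)  # true learning outcomes
        y1 = r * x1 + k * rng.gauss(0, 1)          # observed ratings
        y2 = r * x2 + k * rng.gauss(0, 1)
        if (y1 > y2) == (x1 > x2):
            hits += 1
    return hits / trials

# Closed form for the bivariate normal: P = 1/2 + arcsin(r) / pi
for r in (0.2, 0.47, 0.9):
    print(r, round(concordance(r), 2), round(0.5 + math.asin(r) / math.pi, 2))
```

At r values near 0.4 to 0.5, the best the summative literature reports, the higher-rated instructor is the genuinely better one only about two times in three; even r = 0.9 yields agreement only about 86% of the time, which is why that threshold marks the floor for individual prediction.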

Figure 1. Scattergrams showing correlations typical of student ratings (global ratings of overall satisfaction) with other important parameters. The research that established the relationships shown in the accompanying tables was done on much larger populations than are shown by the points used to create these graphs. These scattergrams display "best-fit" lines to the data and show the degree of prediction of Y from X that can be expected at varied correlations. A "significant correlation" does not mean "high degree of predictability." Perfect predictability (r = 1) would place all points on the best-fit line. From these graphs, it is obvious why a correlation established on a large population cannot be applied reliably to judge an individual. This is why multiple means of assessment are required; student evaluations are never in themselves sufficient to judge individuals.

So, if we can"t really use student evaluations as a reliable basis to judge "good teaching" or "students learning" in an individual"s classes, why should we give these evaluations at all? To see why, let"s use another way to look at valid trends of low predictability. Davis and Thomas (1989) cited a 30-month study on a heart drug that led to a 93% survival rate in contrast to a 90.5% survival rate for recipients of a placebo. At the level of the individual, it would be hazardous to promise extended longevity as result of taking the drug. In a similar vein, a measure that yields a "significant" correlation coefficient of 0.4 between student learning and summative evaluation scores over a large population of faculty (Cohen, 1981; d"Appolonia, and Abrami, 1997) is not

something that one can reliably apply to an individual faculty member in a rank-salarytenure decision. Yet in the case of the drug, the tiny 2.5% difference translated into the drug"s saving over 20,000 lives in the U.S. in one year"s use! In the case of the teaching evaluations, a correlation of r = 0.4 documents that while large numbers of individual faculty will have valid experiences that violate these very general trends, somewhat larger numbers will have experiences that fit it. Over a large population of faculty, we would be more often correct than not in concluding that highly rated teachers are performing well. Administrators in charge of leading whole institutions to improve thus are justified in their arguing for including student evaluation as a part of annual review"on the basis that the evaluations will somewhat more often reflect successful teaching than not. However it should also be clear to these same administrators that one must seek and use more data than mere student evaluations in order to truly insure that more good than harm is achieved. There is one final caveat to note in gaining meaning from the use of published numbers. The studies from which the most accepted correlations are cited are the best available and they provide a best-case scenario. What made these studies the best is the fact that researchers doing meta-analyses were judicious in throwing out those studies that were deemed to be flawed for psychometric or other reasons. That the most-accepted studies sanitized available data in this way is abundantly clear from reading Theall, Abrami, and Mets (2001). The problem is that evaluation forms created by the "unwashed masses" across college campuses are not constructed with the care and sophistication that equates with those of the studies admitted into meta analyses. 
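The arithmetic behind the drug analogy shows how a difference too small to predict individual outcomes still matters enormously at population scale. This sketch uses the figures quoted above; the implied patient count is our back-calculation and is not stated in the source:

```python
# Survival rates quoted from the Davis and Thomas (1989) example above.
drug_survival = 0.93
placebo_survival = 0.905

# Absolute risk reduction: extra survivors per patient treated.
arr = drug_survival - placebo_survival
print(f"extra survivors per patient treated: {arr:.3f}")

# If the drug saved over 20,000 lives in one year, back-calculate how many
# patients must have been treated (a hypothetical inference, not in the source).
lives_saved = 20_000
implied_patients = lives_saved / arr
print(f"implied patients treated: {implied_patients:,.0f}")
```

At the individual level the drug shifts a patient's survival odds by only 2.5 percentage points, yet spread across roughly 800,000 treated patients that margin is 20,000 lives. The same logic justifies using r = 0.4 evaluation data institutionally while forbidding its use to judge one person.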
If one argues that research confirms that a correlation of r = 0.4 exists between student learning and a locally drafted summative item, the argument is weak, because one is comparing a homemade form to forms produced under research conditions; the correlation may apply to good surveys, but perhaps not to the survey in question. The correlation that may best represent what is happening across college campuses is the one that would have been obtained by meta-analysis had the researchers included the flawed studies.

Faculty Development vs. Educational Research

Educational research shows that, within the lecture system of teaching, it helps to maintain interest and convey the desired content if the teacher is highly expressive, considerate towards students, and breaks up long lectures with active-learning exercises. These traits can be learned by teachers and promoted through faculty development. Educational research also shows that cooperative learning and discussion teaching do not rely nearly so much on teacher expressiveness as on the ability to organize and facilitate learning. Results show very high levels of student satisfaction and student achievement in courses taught through these alternative methods. These methods too can be learned by teachers and promoted by faculty developers. But, while educational research contributes to faculty development, it is not faculty development. It is important to note that faculty developers work with individuals in order to help individuals meet their needs; educational researchers study large populations in order to discover relationships. Sometimes individual faculty objections to student evaluations are based on individual experiences that nevertheless become labeled as "anecdotal," "misbeliefs" or even "myths." Some researchers (e.g., Boice, 1990) suggest that "countering" is the proper response to such objections. "Countering" might be well suited to arguing in a technical session for one's beliefs about trends in general populations, but "countering," in the context of faculty development, denies the reality of individuals, and it is a humiliating response to faculty seeking help. If a misunderstanding of research or poor interpersonal skills leads to heavy-handed "countering," this cavalier approach will have the undesired effect of scorning individuals' actual hardships and experiences. The skills needed to be a great researcher in education or in psychology are not the same as those required to be a faculty developer. The ability to help an individual become a stronger teacher is far more important in development than the ability to argue.

Teasing Apart the Components of Student Evaluations

What, why and who do we want to evaluate?

The mere exercise of having students complete evaluation forms is no guarantee that an evaluation took place or that anything beneficial resulted. One requisite for successful student evaluation is a good evaluation tool or survey questionnaire. Some questions on surveys are constructed without much thought or basis. (Example: Check yes or no, "I respect this teacher." Dilemma for the respondent: "Is the teacher such a horrid human that he or she doesn't merit respect, or am I supposed to admit here that I am incapable of respecting an authority figure?") This question existed on a university survey for many years, but the responses were never used, and most likely were never seriously looked at. A prime rule here is: "Don't gather any information that you don't intend to actually use." It is better to produce evaluative tools that directly address attainment of specific educational goals (Nuhfer and Knipp, 2003; McKeachie, 1997) than to continue to rely on tools that produce relative ratings of professors based on an unspecific mix of outcomes and feelings. As faculty, we would never permit ourselves or others to grade students on the basis of how we "feel" about them, but we have come perilously close to accepting that very basis for evaluating professors. The tyranny imposed by inept use of evaluations, along with its tragic outcomes, has been described in the book Generation X Goes to College. The story told there is no isolated instance. Evaluations follow expectations, and there is now an increasingly dominant view among students that the goal of college is more about obtaining a credential (degree) than about becoming educated. In recent months (May-June 2003), discussion of this topic on the POD list was particularly impassioned regarding the issue of satisfying students' needs to be educated versus catering to their wants in order to "please the customer."
Chemist Mike Chejlava at Lafayette College notes that such "forces lead students to believe that they must get the RIGHT answer the FIRST time ...(and) any faculty member who gives work that they cannot master the first time is trying to keep them from their goals by setting standards too high." In her dissertation, "Bridging the Gap Between What is Praised and What is Practiced: Supporting the Work of Change as Anatomy & Physiology Instructors Introduce Instructional Strategies to Promote Student Active Learning in Undergraduate Classrooms," Thorn (2003, and personal communication) revealed that all instructors in her study received lower student evaluations while attempting to emphasize critical thinking. One, called to task by her dean, was ordered to stop that emphasis because of low satisfaction ratings. Weimer (2002) is refreshingly candid in revealing that effective learner-centered practices will not receive initial appreciation from students, and one can anticipate that this resistance will express itself through lowered global evaluations of the faculty member who introduces them. Peter Sacks entered the "Generation X" college with the original intent of educating students but, like the examples above, learned that it was safer to please students than to educate them. Student evaluations were the lever that pressured Sacks to change his goals, and it is likely that faculty such as those in Thorn's (2003) study face similar pressure. The damage done both to education and to individuals through inept use of evaluation is considerable. The greatest reason that professors should embrace assessment (looking at the work that is done and the student learning that results) is to extricate themselves from the morass of having their livelihoods depend upon how others "feel" about them.

Administer the tool with care

If we simply send a student assistant into the class who passes out the forms and says little more than "Fill 'em out!" we are not administering any paper evaluation with care. Because university-wide surveys may have questions that simply do not apply to particular classes, students really do need to be cautioned to leave blank any questions that they believe may not apply to a particular class.
Pitfalls arise in questions such as: "Is the professor accessible to students outside of class?" Research shows that only about 10% of the students in many classes ever go to the professor's office for help. Students who have never been to the professor's office simply do not know whether the professor is actually available. In rating on a scale from 1 (poor) to 5 (outstanding), most students who are not cautioned to leave responses blank unless they have first-hand knowledge of the question are prone to circle a "3" as their own expression of "I don't know" or "I really don't care much about this question." Of course, when the responses of the 90% of students who don't have first-hand information are tabulated, their responses overwhelm those of the 10% who are furnishing solid information. In this way, a professor who has kept all of his or her office hours, and has perhaps even given out a home telephone number and encouraged students to call, will receive the same ratings as the person who abrogates all responsibility for being accessible. Students need to be told that if they do not have first-hand knowledge about a particular item, then they should leave the response blank.

What factors other than actual student learning influence evaluations?

Cashin (1988, 1995) provided a concise summary of the research on student evaluations. It shows that professors who teach classes where students are motivated (such as classes taken by choice or in one's own major) have a major advantage in their ratings over other teachers. Those who teach large classes are generally at some slight disadvantage in their ratings, and over large populations, ratings by students are generally consistent with those of alumni, colleagues and administrators, but not at the level needed to make accurate predictions about individuals within the population. Those who are productive in research are generally rated more highly in the classes they teach, but the relationship is so slight that research productivity is useless as a predictor of student satisfaction with teaching. The correlation could perhaps be a function of the energy and enthusiasm of more productive faculty. The higher professorial ranks receive slightly better student satisfaction ratings, but the relationship is so weak that rank cannot be used as any predictor. The relationship between grade expectations and student satisfaction is weak.
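The distortion described earlier, in which uninformed respondents default to the midpoint, is easy to quantify. The sketch below uses the hypothetical 90%/10% split from the office-accessibility example; the function name and exact numbers are illustrative, not from the paper:

```python
# Sketch of how "3 = I don't know" responses swamp informed ones on a 1-5 item.

def item_mean(informed_score, n_students=100, informed_frac=0.10, default=3):
    """Mean rating when only a fraction of respondents have first-hand
    knowledge and everyone else circles the midpoint as 'I don't know'."""
    n_informed = int(n_students * informed_frac)
    n_uninformed = n_students - n_informed
    total = informed_score * n_informed + default * n_uninformed
    return total / n_students

accessible = item_mean(5)    # informed students rate accessibility 5
inaccessible = item_mean(1)  # informed students rate accessibility 1
print(accessible, inaccessible)  # 3.2 vs. 2.8
```

The diligent professor averages 3.2 and the wholly inaccessible one averages 2.8: a gap small enough to vanish in ordinary rating noise, even though the two behaviors being rated are opposites.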

Table 1. Relationships with Student Evaluations: Correlations between Various Influences and Ratings of Overall Teaching Effectiveness (NR = Not Related at any Significance)

Influence                    Correlation with Overall Rating
Sex of Instructor            NR
Sex of Student               NR
Level of Student             NR
Rank of Professor            0.10
Research productivity        0.12
Student's GPA                NR
Age of Student               NR
Age of Professor             NR
Time of day                  NR
Class size                   -0.18
Student Motivation           0.39
Expected Grades              0.12
Course Level                 0.07
Colleagues' Ratings          0.48 to 0.69
Administrators' Ratings      0.47 to 0.62
Alumni Ratings               0.40 to 0.75

Table 1, Relationships of various factors to student ratings, comes from various studies cited in Cashin, 1988, with exception of relationship to sex of instructor, which comes from Feldman, 1992, and Centra and Gaubatz, 1998. The results are all outcomes based upon studies of large populations. Student moti vation (willingness to participate actively in the learning process) has the greatest positive influence on student satisfaction of any instructional factor shown. Student ratings are also consistent with those of faculty colleagues and administrators, and the ratings remain consistent as student become alumni. Of interest is the fact that, in practice, administrators and colleagues spend little to no time in classroom from which these ratings are derived. Their only means of obtaining information is either hearsay from the students from the class, or from seeing the results of the student evaluations. Thus, colleagues" and administrators" ratings woul d appear to be redundant expressions of the student evaluations themselves. The only alternative explanation to valid insights gained without direct experience is "divine enlightenment."
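One way to read Table 1 is to square each coefficient (assuming these are Pearson correlations) to estimate the share of rating variance a factor explains. Even student motivation, the strongest instructional factor shown, explains only about 15% of the variation. A minimal sketch:

```python
# Share of rating variance explained (r squared) for Table 1 factors.
factors = {
    "Student Motivation": 0.39,
    "Class Size": -0.18,
    "Research Productivity": 0.12,
    "Expected Grades": 0.12,
    "Rank of Professor": 0.10,
    "Course Level": 0.07,
}

for name, r in factors.items():
    print(f"{name:22s} r = {r:+.2f}  variance explained = {r * r:.1%}")
# Student Motivation     r = +0.39  variance explained = 15.2%
# Class Size             r = -0.18  variance explained = 3.2%
# ...
```

This arithmetic is why the text cautions that none of these correlates, taken alone, can predict an individual professor's ratings.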

Feldman (1998; see Table 2) teased out the formative components that lead to summative ratings of satisfaction and the components that lead to measurably increased learning. There is similarity in the rankings (Table 2), but there are enough differences that one could develop a teaching approach designed to increase ratings of satisfaction that differs from an approach that optimizes learning. We have a further complication from recent studies indicating that the "most effective" teaching practices vary across ethnic groups (Sanders and Wiseman, 1998).

"Top 5" Instructional Dimensions Based on Different Indicators

(Modified from Feldman, 1998)

Instructional Dimension

Importance Importance % Variation Shown by Shown by Explained Correlation Rank (and rank) with Student Achievement with Overall Evaluations

Teacher's preparation; organization of the course Clarity and understandableness Perceived outcome or impact of instruction Teacher's stimulation of interest in the course and its subject matter Teacher's encouragement of questions, discussion, and openness to opinions of others Intellectual challenge and encouragement of independent thought (by

30 " 35% 25 " 30% 15 " 20% 10 " 15% 10 " 15%

.57 (1) .56 (2) .46 (3) .38 (4) .36 (5)

(6) (2) (3) (1) (11) (4)

5 " 10% .25 (13)

teacher & course) Teacher's sensitivity to, and concern with class level and progress

5 " 10% .30 (10)


Table 2. Instructional dimensions compared with their ranks of importance in producing satisfaction and producing learning. These reveal similarity but not congruence. The trait most important to develop in order to produce highest levels of student learning is attention to course organization and preparation. This importance was also confirmed in the National Study for Student Learning (Pascarella, 2001). Yet, this is only the sixth most important practice for producing high ratings of satisfaction.

Cashin (1988) notes that within the more than 1,300 articles written on the topic through 1988, a consensus existed (and still does in 2001) to support some conclusions. These and others gleaned from later literature are summarized here.

1) No single source of data, including student-rating data, provides sufficient information to make a valid judgment about teaching effectiveness (see Tables 1, 2, 3, 4, and Figure 1).

2) No single student-rating item or set of related items will be useful for all purposes. A well-designed form will be multidimensional and will preferably rate twenty or more facets of teaching. To make appropriate use of these forms, all facets must be considered; simple averaging of all the items is not appropriate consideration.

3) Although the instructor, not the course, is the primary determinant of student ratings (based upon Marsh, 1982, from a database of 1,364 courses), there is a difference in consistency between ratings for the same instructor in the same course and for the same instructor teaching different courses. Cashin (1988) recommends ratings from two or more courses every term for at least two years if ratings are to be used for any purpose beyond instructor feedback.

4) Instructor variables significantly related to positive student ratings include expressiveness and delivery of actual content (Marsh and Ware, 1982), but the magnitude of their effect on global ratings varies, in particular depending on whether one has mostly motivated or unmotivated students in class.

5) Instructor variables not significantly related to student ratings are sex of the instructor (Basow and Silberg, 1987), age, rank, teaching experience (Feldman, 1983), personality traits perceived by the faculty member (with the exceptions of self-esteem and enthusiasm; Feldman, 1986), and research productivity (Feldman, 1987).
6) Student variables positively related to evaluations include prior motivation or desire to take the course (Marsh, 1984), anticipated grades (Howard and Maxwell, 1982), and perceived challenge in terms of workload and difficulty (Cashin and Slawson, 1977). Greenwald (1997) believes that grading leniency is a more powerful influence than these studies have admitted. These trends indicate some prejudice based on expectations, so someone with a bad reputation among students may, after making a change for the better, still be partly judged for a time on the basis of that past reputation.

7) Other variables weakly related to instructor ratings include course level (positive relationship; Braskamp and others, 1984) and class size (negative relationship; Cashin and Slawson, 1977; Smith and Glass, 1980).

8) Academic field is a significant factor; some fields seem inherently more highly rated by students. The relationships are complex and variable (Cashin, Noma, and Hanna, 1977).

9) McKeachie (1994) provides additional summaries, with citations of studies, to show that student evaluations are not particularly changed by time (i.e., evaluations by students agree with evaluations by the same students later as alumni). They are not influenced by faculty personality and popularity outside the classroom. Such traits should not be confused with expressiveness; expressiveness is documented as the most important of all influences on raising student evaluations of lecture-based classes. Personality, popularity, expressiveness and other traits are sometimes lumped, and sometimes not, under "rapport," and this leads to confusion when "rapport" is deemed important or unimportant by different researchers. The effects of retention on evaluations have scarcely been studied, but it is possible that results can be manipulated if unsatisfied students are "encouraged" to drop, leaving the most satisfied, motivated achievers at the end of the term to complete the evaluation forms. Retention data are seldom gathered, but perhaps they should be, because of the obvious value they have and the biases retention could introduce.

Our Personalities Affect Our Student Evaluations

Feldman (1986) showed that some aspects of professors' personalities do affect students' ratings of overall teaching effectiveness (Table 3). In fact, many of the personality traits that students see as important correlate more highly with a global question on overall teaching effectiveness than the global question usually does with actual measures of student learning. The most striking aspect of Table 3 is the demonstration of how we tend not to see ourselves as others see us! Of the personality traits, the only two that peers, students, and teachers agree upon as being of significant importance are enthusiasm and self-esteem, and students and peers give these much more importance than we tend to give them in ourselves.

Overall Teaching Effectiveness and Personal Attributes of Professors (After Feldman, 1986)



    Trait                   By Self   By Peers   By Students
    Self Esteem              0.38      0.51      not rated
    Energy (enthusiasm)      0.27      0.62       0.51
    Warmth                   0.15      0.55      -0.02
    Cautiousness            -0.09      0.56       0.53
    Leadership               0.07      0.57       0.47
    Sensitivity              0.07      0.42      -0.49
    Flexibility              0.05      0.31       0.36
    Emotional Stability     -0.02      0.01       0.05
    Friendliness             0.04      0.50      -0.26
    Neuroticism             -0.04      0.48       0.47
    Responsible/orderly      0.06      0.46       0.54
    Brightness              -0.05      0.49      -0.35
    Independence            -0.12      0.25       0.22
    Aggressiveness           0.23      0.08       0.02

Table 3 (after Feldman, 1986) reveals that professors' personal traits do affect evaluations by peers and students. It further reveals that professors are apt to underestimate or misjudge the importance of the effects of these traits on others.

Any university leader who truly recognizes the significance of this research will do everything possible to ensure that the self-esteem and enthusiasm of faculty are nurtured. Argumentative "countering" will not improve these traits, and an evaluation system that is used primarily to punish faculty rather than to strengthen them will likely damage teaching by destroying self-esteem and enthusiasm on an institution-wide scale.

Even Our Subject Affects Our Ratings

Erdle and Murray (1986) showed that certain behaviors in class were seen as important to student satisfaction, and that the relative importance of these behaviors varied between disciplines (Table 4). The overall number registered by the average of responses to a global question about overall satisfaction is affected by different professors' personalities and by the nature of their subject. Erdle and Murray's work indicates that the nature of what we are trying to teach influences the traits we should seek to address to obtain improvement.

Correlations between Ratings of Overall Teaching Effectiveness and Teaching Behavior Factors (after Erdle and Murray, 1986)

    Perceived importance to teaching, by students of:

    Behavior              Humanities   Social Science   Physical/Life Science
    Rapport                  0.43          0.70              0.59
    Interest                 0.50          0.71              0.37
    Disclosure               0.30          0.65              0.25
    Organization             0.51          0.56              0.47
    Interaction              0.48          0.51              0.34
    Course Pacing            0.53          0.45              0.62
    Speech Clarity           0.53          0.45              0.62
    Expressiveness           0.58          0.59              0.51
    Emphasis                 0.61          0.58              0.51
    Mannerisms              -0.53         -0.42             -0.28
    Use of Graphics          0.22          0.35              0.37
    Vocabulary               0.16          0.35              0.37
    Presentation Rate        0.23          0.14              0.31
    Media Use                0.30          0.23              0.11

Table 4. Relationship of teaching traits to summative global student ratings. How students value various traits varies with the subject being taught. Affective factors such as rapport and expressiveness exert a powerful influence, but so do traits related to learning such as organization, clarity, and emphasis of important points.
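Table 4's point, that the behaviors mattering most differ by discipline, can be checked mechanically. The sketch below hard-codes a subset of the table's rows and reports the strongest positive correlate for each discipline:

```python
# A subset of Table 4 (after Erdle and Murray, 1986):
# behavior -> (humanities, social science, physical/life science)
table4 = {
    "Rapport":        (0.43, 0.70, 0.59),
    "Interest":       (0.50, 0.71, 0.37),
    "Organization":   (0.51, 0.56, 0.47),
    "Course Pacing":  (0.53, 0.45, 0.62),
    "Expressiveness": (0.58, 0.59, 0.51),
    "Emphasis":       (0.61, 0.58, 0.51),
}
disciplines = ("Humanities", "Social Science", "Physical/Life Science")

for i, discipline in enumerate(disciplines):
    best = max(table4, key=lambda behavior: table4[behavior][i])
    print(f"{discipline}: {best} (r = {table4[best][i]:.2f})")
# Humanities: Emphasis (r = 0.61)
# Social Science: Interest (r = 0.71)
# Physical/Life Science: Course Pacing (r = 0.62)
```

An instructor seeking improvement would therefore weight different behaviors depending on the field, which is exactly Erdle and Murray's point.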

A Famous Heresy

Perhaps the most heretical of all studies concerning the validity of student evaluations was the famed "Dr. Fox experiment" (Naftulin, Ware and Donnelly, 1973), in which a hired actor posed as "Doctor Fox" and lectured to three groups of educators in a manner that was highly expressive but low in content. The groups, consisting of professors, professionals and administrators, gave satisfactory content marks to the actor, thus demonstrating a tremendously disturbing fact: even those who are above average in intelligence, trained in critical thinking, and well educated cannot always tell when a lecture has substantive educational value! If these professionals could not make this distinction, how could average undergraduates make it? Could they be expected to know whether a professor was providing substantive content, or whether the content was current, or would they rate their professors more on expressiveness (or worse, entertainment value) rather than on content value? One implication from the study was that student ratings were not valid criteria for evaluating actual teaching effectiveness by lecture. The implication was argued on similar data both pro (Ware and Williams, 1975) and con (Marsh and Ware, 1982). Marsh and Ware (1982) used factor analysis to divide "evaluation" into several dimensions and showed that, of the two most important influences, expressiveness (number 1 in importance) was registered primarily through the rating of "Instructor Enthusiasm," whereas content coverage (number 2) was expressed through "Instructor Knowledge." The more comprehensive of the later studies (Perry, Abrami and Leventhal, 1979; Abrami, Leventhal and Perry, 1982; see Dunkin and Barnes, 1986) respectively replicated the Dr. Fox experiment and analyzed data from their own and from 11 other studies. They found that the effect of expressiveness alone on overall student ratings was "significant and reasonably large," whereas the effect of content alone was "inconsistent and generally much smaller." On overall student achievement, however, content became significant and expressiveness became insignificant. Those who argue that student ratings are reliable measures of students' learning tend to avoid referencing such studies, or they dismiss them as poorly done. Instead, such studies yield very useful insights. Day-to-day practices in the classroom cannot be set up as carefully controlled experiments. The fact that unexpected (unappreciated?) results occurred in such studies indicates that similar results will come during the practice of teaching, and that we should expect them. The Dr. Fox study shows the need to inform students, at the time any survey is given, about the pitfalls of paper surveys and how such things as general feelings can heavily influence survey results. One defense is to instruct students to answer each survey item based only on the content it asks for and not on general feelings. The study demonstrated that a populace predisposed to complete rating forms on the basis of general feelings, rather than thoughtful and focused responses to specific items, will subvert any meaningful evaluation. Advocating evaluation of faculty while also educating the student populace about the pitfalls of such surveys is essential. The outcome of the Dr. Fox study would perhaps have been different had the students been reminded to look seriously at tangible content. Likewise, Peter Sacks might not have gotten his tenure had students

been cautioned not to react blindly through feelings when they wrote their evaluations. Sacks' (1996) case should be far more upsetting to "true believers" in the infallibility of student evaluations than the much-maligned Dr. Fox study, because Sacks was able to enact his deception for years and achieve tenure. Both cases confirm that if one plans to manipulate or subvert the outcomes of student evaluations, one can indeed do it.

How Can We Use Evaluations More Effectively?

Dunkin and Barnes (1986), Murray (1985), Stevens and Aleamoni (1985), and Cashin (1988) all show that follow-up consultation on mid-term (formative) evaluation is a very important factor leading to improvement. Based upon a synthesis of 22 studies, Cohen (1980) showed that instructors with no student evaluations rated in the 50th percentile at the end of a term; those who obtained student-evaluation feedback rated in the 58th percentile; and those who received feedback with follow-up consultation rated in the 74th percentile. It is a demonstrated fact that follow-up consultation is very effective in raising student ratings, and that simply doing a final evaluation and giving the results to faculty, as is the common practice, is less effective. Follow-up consultations for individuals are best done in a neutral and supportive environment. Departments' colleagues and chairpersons are annually involved in rank, salary, and tenure decisions and are thus undeniably evaluative. Departments are thus sometimes, but not always, neutral or suitable places in which to deal positively with sensitive issues, and there are few issues more sensitive to professors than teaching. Global questions may evoke mainly satisfaction ratings, but satisfaction is not trivial. If students aren't satisfied with an experience that we believe is grand, then we owe it to ourselves to know the reasons for their dissatisfaction, if for no other reason than that it is a greater pleasure to be in a classroom with happy students.
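Cohen's (1980) percentile figures can be translated into standardized effect sizes if one assumes end-of-term ratings are roughly normally distributed. The assumption and the conversion below are ours for illustration; Cohen reports percentiles, not standard-deviation units.

```python
from statistics import NormalDist

# Convert Cohen's (1980) end-of-term percentile standings into
# standard-deviation units, assuming roughly normal ratings.
z = NormalDist()  # standard normal distribution

conditions = {
    "no evaluation feedback": 50,
    "mid-term feedback alone": 58,
    "feedback plus consultation": 74,
}

for label, percentile in conditions.items():
    effect = z.inv_cdf(percentile / 100)  # distance above the median, in SDs
    print(f"{label}: {effect:+.2f} standard deviations")
# no evaluation feedback: +0.00 standard deviations
# mid-term feedback alone: +0.20 standard deviations
# feedback plus consultation: +0.64 standard deviations
```

Seen this way, adding follow-up consultation roughly triples the gain obtained from feedback alone.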
The fact that student evaluations are often the only dialogue about teaching with our students is a sad commentary, in general, on the nature of communication within higher education. That infrequent student evaluations are considered capable of replacing legitimate open dialogue is probably one of the most limiting concepts subscribed to by students, faculty and administrators. Evaluations should be the result of interaction and dialogue throughout a course, not merely an exercise at the end. Those of us who have used our evaluations as a means to establish dialogue through student management teams (Nuhfer and others, 1990-2002) have found one of the most exciting and immediately rewarding ways to use student evaluations: as the starting point through which to improve our teaching and to involve students in responsibility for the quality of the classroom community.

Summary

Research reveals that what takes place in the classroom is complex. Students rate professors heavily based upon the personality traits they present and the practices that they use. Most of the above-cited research was done in college classes composed predominantly of Caucasian students, and many of the classes studied were in universities with

established admissions standards. The effects of the multi-racial classrooms and more open admissions standards that now characterize classes have scarcely been studied, but they will likely only increase this complexity. A global question such as "Overall, how do you rate this instructor's teaching ability compared to all other college instructors you have now and have had in the past?" has attraction for personnel decisions because of its ease of use and the dubious credibility gained from its being highly correlated (r > 0.8) with other (redundant) global questions. However, global questions are not highly correlated with student learning (r = about 0.4 with actual student achievement). No global question, or group of them, has ever been demonstrated in itself to reliably define "effective teaching." Confidence in global questions as reliable measures of successful teaching is unwarranted. The question quoted above should raise skepticism about the value of comparing, on an equal basis and across varied disciplines, one instructor among the four a freshman has experienced with one instructor among the sixty a senior has experienced. Research by Cashin, Noma and Hanna (1977) shows that some irreverence about simplistic evaluation exercises is warranted. A summative evaluation represented by a single number invites convenient abuse. Using numerical ratings from a chosen global question to produce a rank order of faculty under the guise of assessing their relative success as teachers is a get-out-of-work gimmick for sorting people; it is not a suitable assessment of teacher effectiveness. Peter Seldin (verbal communication, 1992, U-WI UTIC Symposium at Madison, WI) cited the grossest abuse of student evaluations: a dean who had the results of a global question calculated to three decimal places, ranked his faculty in numerical order, and posted the list. One may ask, in light of the dangers and uncertainties, "Why bother with summative student evaluations?"
First, evaluations are valid measures of student satisfaction. If educators had earlier called the questionnaires "student satisfaction surveys" rather than "teacher evaluations," we would likely be further along in openly embracing our students' comments and getting the benefits from them, rather than being stigmatized by them. The admonishment to use such evaluations only in conjunction with other measures is not an argument to throw out summative evaluations; rather, student evaluations should be valued as one facet of evaluation. The need for formative evaluations is clearer. The research shows that we are not good judges of ourselves, and we need the communication from our students to help define ways to serve them better. Our chances for improvement are markedly increased if we make use of consultation to help us establish a plan for improvement.

Conclusions

1) Student evaluations are valid means of assessing student satisfaction. Ratings of student satisfaction are not equivalent to measures of learning. Individuals cannot be judged as to their effectiveness as teachers based on student ratings alone. Student

satisfaction is important and must be addressed. At the same time, student evaluations should never be the sole criterion for rating any professor's teaching effectiveness.

2) Summative student evaluations do not look directly or cleanly at the work being done. They are mixtures of affective feelings and learning. Formative evaluations look directly at the pedagogical work being done; they reveal the practices being used and the degree to which each is being used. To learn whether a teacher is improving, it is better to look for increased employment of practices that the research has proven to be valuable than to look merely for increases in summative ratings.

3) Correlations established on large populations, even those relationships that have been proven "valid" and "reliable," cannot be safely applied as a tool to judge individuals. The validity and reliability exist only for populations large enough to produce them. Because validity and reliability are not discernible from studies of small populations, it is even more perilous to project such correlations onto individuals.

4) The most important factor in producing student learning is the degree to which the professor prepares and organizes his or her course. This is only the sixth most important factor in producing high summative ratings. To use student ratings to evaluate individuals without some direct measure of student learning omits the most important outcome of the educational process.

5) Successful evaluation programs demand careful design. Helpful characteristics include:
a. a well-designed evaluation tool
b. a student populace that is educated about the pitfalls of such surveys
c. campus-wide awareness of the meaning, validity, and effects of evaluations
d. a mechanism that prevents abuse of evaluation results
e. a mechanism that supports attempts by faculty to improve.

6) Self-esteem and enthusiasm are important traits for successful teaching. A university that aspires to excel in teaching can't afford either policies or administrators that damage these traits in faculty. Inept evaluation will damage faculty morale institution-wide.

7) In recognition of the complexity of the student-teacher relationships that occur in our classes, regular communication with our students about teaching is essential for continuous improvement. Quality of teaching cannot be "inspected in" by final summative evaluations.

8) Professors and administrators need to become familiar with the body of literature that addresses learning, teaching, evaluation and assessment.

9) Knowing one's student audience and having a clear set of learning objectives are the professor's best defenses against producing substandard results (in both satisfaction and learning).

10) No professor needs to be stuck for life with poor student evaluations. Means exist through which a professor can (1) become a more effective teacher and (2) raise his or her student evaluations without pandering. The problem is primarily one of finding the critical areas to focus upon, given the complexity of the teaching environment. Consultation with developers, other faculty, or student management teams can help in getting this focus.

11) The case described in Generation X Goes to College is real. Any method of evaluation can be subverted or corrupted if one deliberately aims to do so.

References Cited

Abrami, P. C., Leventhal, L., and Perry, R. P., 1982, Can feedback from student ratings help improve college teaching?: 5th International Conference on Improving University Teaching, London.

Basow, S. A., and Silberg, N. T., 1987, Student evaluations of college professors: are female and male professors rated differently?: Jour. Educ. Psychology, v. 79, pp. 308-314.

Boice, R., 1990, Countering common misbeliefs about student evaluations of teaching: Teaching Excellence, v. 2, n. 2.

Braskamp, L. A., Brandenberg, D. C., and Ory, J. C., 1984, Evaluating teaching effectiveness: a practical guide: Sage Pub., Beverly Hills, CA.

Cashin, W. E., 1988, Student ratings of teaching: a summary of the research: Kansas State Univ. Center for Faculty Evaluation and Development, Idea Paper n. 20.

Cashin, W. E., 1990, Student ratings of teaching: recommendations for use: Kansas State Univ. Center for Faculty Evaluation and Development, Idea Paper n. 22.

Cashin, W. E., 1995, Student ratings of teaching: the research revisited: Kansas State Univ. Center for Faculty Evaluation and Development, Idea Paper n. 32.

Cashin, W. E., Noma, A., and Hanna, G. S., 1977, Comparative data by academic field: IDEA Technical Report n. 4, Kansas State Univ. Center for Faculty Evaluation and Development.

Cashin, W. E., and Slawson, H. M., 1977, Description of data base 1976-1977: IDEA Technical Report n. 2, Kansas State Univ. Center for Faculty Evaluation and Development.

Centra, J. A., and Gaubatz, N. B., 1998, Is there gender bias in student ratings of instruction?: Journal of Higher Education, v. 70, pp. 17-33.

Cohen, P. A., 1980, Effectiveness of student-rating feedback for improving college instruction: Research in Higher Education, v. 13, pp. 321-341.

Cohen, P. A., 1981, Student ratings of instruction and student achievement: a meta-analysis of multisection validity studies: Review of Educ. Res., v. 51, pp. 281-309.

d'Appolonia, S., and Abrami, P. C., 1997, Navigating student ratings of instruction: American Psychologist, v. 52, pp. 1198-1208.

Davis, G. A., and Thomas, M. A., 1989, Effective Schools and Effective Teachers: Boston, MA, Allyn and Bacon, 214 p.

Dunkin, M. J., and Barnes, J., 1986, Research on teaching in higher education: in Handbook of Research on Teaching, M. C. Wittrock, ed., pp. 754-777.

Erdle, S., and Murray, H. G., 1986, Interfaculty differences in classroom teaching behaviors and their relationship to student instructional ratings: Research in Higher Education, v. 24, n. 2, pp. 115-127.

Feldman, K. A., 1983, Seniority and experience of college teachers as related to evaluations they receive from students: Research in Higher Educ., v. 18, pp. 3-124.

Feldman, K. A., 1986, The perceived instructional effectiveness of college teachers as related to their personality and attitudinal characteristics: a review and synthesis: Research in Higher Educ., v. 24, pp. 129-213.

Feldman, K. A., 1987, Research productivity and scholarly accomplishment of college teachers as related to their instructional effectiveness: Research in Higher Educ., v. 26, pp. 227-298.

Feldman, K. A., 1992, College students' views of male and female college teachers: Part I, Evidence from the social laboratory and experiments: Research in Higher Educ., v. 33, pp. 317-375.
Feldman, K. A., 1998, Identifying exemplary teachers and teaching: evidence from student ratings: in Teaching and Learning in the College Classroom, 2nd ed., K. A. Feldman and M. B. Paulsen, eds., Needham Heights, MA, Simon & Schuster, pp. 391-414.

Greenwald, A. G., 1997, Validity concerns and usefulness of student ratings of instruction: American Psychologist, v. 52, pp. 1182-1186.

Hildebrand, M., Wilson, R. C., and Dienst, E. R., 1971, Evaluating university teaching: Handbook published by Center for Research and Development in Higher Education, Univ. of California, Berkeley.

Howard, G. S., and Maxwell, S. E., 1982, Do grades contaminate student evaluations of instruction?: Research in Higher Educ., v. 16, pp. 175-188.

Marsh, H. W., 1982, The use of path analysis to estimate teacher and course effects in students' ratings of instructional effectiveness: Applied Psychological Measurement, v. 6, pp. 47-59.

Marsh, H. W., 1983, Multidimensional ratings of teaching effectiveness by students from different academic settings and their relation to student/course/instructor characteristics: Jour. Educational Psychology, v. 75, pp. 150-166.

Marsh, H. W., 1984, Students' evaluations of university teaching: dimensionality, reliability, validity, potential biases and utility: Jour. Educ. Psyc., v. 76, pp. 707-754.

Marsh, H. W., and Ware, J. E., 1982, Effects of expressiveness, content coverage, and incentive on multidimensional student rating scales: new interpretations of the Dr. Fox effect: Jour. Educ. Psyc., v. 74, pp. 126-134.

McKeachie, W. J., and others, 1994, Teaching Tips: A Guidebook for the Beginning College Teacher (9th ed.): Lexington, MA, D. C. Heath, 444 p.

McKeachie, W. J., 1997, Student ratings: the validity of use: American Psychologist, v. 52, pp. 1218-1225.

McKeachie, W. J., and Kaplan, M., 1996, Persistent problems in evaluating college teaching: AAHE Bulletin, February 1996, pp. 5-8.

Murray, H. G., 1985, Classroom behaviors related to college teaching effectiveness: in Using Research to Improve Teaching, J. G. Donald and A. M. Sullivan, eds., San Francisco, Jossey-Bass.

Nuhfer, E. B., 1993, Bottom-line disclosure and assessment: Teaching Professor, v. 7, n. 7, p. 8.

Nuhfer, E. B., 1996, The place of formative evaluations in assessment and ways to reap their benefits: Jour. Geoscience Education, v. 44, n. 4, pp. 385-394.

Nuhfer, E. B., and Knipp, D., 2002, The knowledge survey: a tool for all reasons: To Improve the Academy, v. 21, in press.

Naftulin, D. H., Ware, J. E., and Donnelly, F. A., 1973, The Doctor Fox lecture: a paradigm of educational seduction: Jour. Medical Educ., v. 48, pp. 630-635.

Nuhfer, E. B., and others, 1992, Involve your students in improving their teaching and learning community: 12th Annual Lilly Conference on College Teaching: The Greening of the Future: Oxford, Ohio, pp. 347-350.

Nuhfer, E. B., and others, 1990-2002, A Handbook for Student Management Teams: Center for Teaching and Learning, Idaho State University, 60 p.

Pascarella, E. T., 2001, Cognitive growth in college: Change, v. 33, n. 1, pp. 21-27.

Perry, R. P., Abrami, P. C., and Leventhal, L., 1979, Educational seduction: the effect of instructor expressiveness and lecture content on student ratings and achievement: Jour. Educ. Psyc., v. 71, pp. 107-116.

Sacks, P., 1996, Generation X Goes to College: Chicago, IL, Open Court Pub., 208 p.

Sanders, J. A., and Wiseman, R. L., 1998, The effects of verbal and nonverbal teacher immediacy on perceived cognitive, affective, and behavioral learning in the multicultural classroom: in Teaching and Learning in the College Classroom, 2nd ed., K. A. Feldman and M. B. Paulsen, eds., Needham Heights, MA, Simon & Schuster, pp. 455-466.

Scriven, M., 1997, Student ratings offer useful input to teacher evaluations: ERIC/AE Digest, http://ericae.net/db/digs/ed398240.htm.

Seldin, P., 1993, The use and abuse of student ratings of professors: The Chronicle of Higher Education, v. 39, n. 46, p. A40.

Smith, M., and Glass, G., 1980, Meta-analysis of research on class size and its relationship to attitudes and instruction: Amer. Educ. Research Jour., v. 17, pp. 419-433.

Stevens, J. J., and Aleamoni, L. M., 1985, The use of evaluative feedback for instructional improvement: a longitudinal perspective: Instructional Science, v. 13, pp. 285-304.

Theall, M., and Franklin, J., eds., 1990, Student Ratings of Instruction: Issues for Improving Practice: San Francisco, Jossey-Bass, New Directions for Teaching and Learning, n. 43, 135 p.

Theall, M., Abrami, P. C., and Mets, L. M., eds., 2001, The Student Ratings Debate: Are They Valid? How Can We Best Use Them?: San Francisco, Jossey-Bass, New Directions for Institutional Research, n. 109.

Thorn, P. M., 2003, Bridging the gap between what is praised and what is practiced: supporting the work of change as anatomy & physiology instructors introduce active learning into their undergraduate classrooms: Austin, Texas, University of Texas, PhD dissertation, 384 p.

Ware, J. E., and Williams, R. G., 1975, The Dr. Fox effect: a study of lecture effectiveness and ratings of instruction: Jour. Medical Educ., v. 50, pp. 149-156.

Weimer, M., 2002, Learner-Centered Teaching: Five Key Changes to Practice: San Francisco, Jossey-Bass, 258 p.