
Research in Higher Education, Vol. 48, No. 3, May 2007 (© 2007) DOI: 10.1007/s11162-006-9028-1

THE VALIDITY OF HIGHER-ORDER QUESTIONS AS A PROCESS INDICATOR OF EDUCATIONAL QUALITY


Robert D. Renaud* and Harry G. Murray**

One way to assess the quality of education in post-secondary institutions is through the use of performance indicators. Studies that have compared currently popular process indicators (e.g., library size, percentage of faculty with PhD) found that, after controlling for incoming student ability, these process indicators tend to be weakly associated with student outcomes (Pascarella and Terenzini, 2005). In addition, while much research has found that students increase their critical thinking skills as a result of attending college, little is known about what goes on during the college experience that contributes to this. The purpose of this research was to examine the validity of higher-order questions on tests and assignments as a process indicator by comparing it with gains in critical thinking skills among college students as an outcome indicator. The present research consisted of three studies that used different designs, samples, and instruments. Overall, it was found that frequency of higher-order questions can be a valid process indicator as it is related to gains in students' critical thinking skills.

KEY WORDS: critical thinking; higher-order; educational quality; process indicators; performance indicators; college; university.

INTRODUCTION
While we have long known that students learn and develop in a wide variety of ways as a result of attending college, we are only beginning to get a clearer idea of what is going on within colleges that contributes to these gains (see review by Pascarella and Terenzini, 2005).

*Department of Educational Administration, Foundations, and Psychology, University of Manitoba, Winnipeg, Manitoba, Canada. **Department of Psychology, University of Western Ontario, London, Ontario, Canada. Address correspondence to: Robert D. Renaud, Department of Educational Administration, Foundations, and Psychology, University of Manitoba, Winnipeg, Manitoba, Canada; E-mail: renaudr@ms.umanitoba.ca


More specifically, Pascarella and Terenzini point out that the degree to which a student progresses in a particular area of learning is determined to a considerable degree by his or her amount of involvement. As well, we are finding out much more about what is happening within colleges to foster greater levels of student involvement. The implications stemming from the progress in this area apply to several perspectives. Perhaps the most direct application of these findings would be with the individual instructor, who can decide which instructional approaches and classroom activities would best contribute towards the intended student outcomes. On a slightly broader level, having a clearer understanding of what furthers student achievement would enable administrators to make more appropriate policy decisions. With respect to the issues concerning the assessment of educational quality, with quality in higher education broadly defined as the amount the students have learned and developed as a result of their enrollment, evaluators have the opportunity to develop more valid performance indicators.

Performance indicators are often divided into input, process, and output indicators (e.g., Borden and Bottrill, 1994). Astin (1970) outlines a conceptual model of college impact that represents these categories. Student inputs are the characteristics a new student brings into college (e.g., aspirations, aptitude). Moreover, student inputs can be in the form of either variable measures that change with time (e.g., cognitive development), or static personal attributes (e.g., race, sex). The process indicators (or college environment, as referred to by Astin) refer to those aspects of the institution capable of affecting the student (e.g., teaching quality, physical characteristics). Student outputs refer to those aspects of student learning and development that the college influences or attempts to influence. The input-process-output model is a fairly common approach in program evaluation research.

Of particular importance in the present research is the relation between the processes and the outcomes. One way of judging the validity of a process variable is to compare it with an outcome variable (Scheirer, 1994). Scheirer points out that if the processes have any influence on the outcomes (as intended), then these two factors should be associated. In other words, high levels of output should be preceded by high levels of the process variables and vice-versa. The important point here is that, without measuring the empirical relation between both the input and process variables and the outcomes, the degree to which one can conclude that the program is effective (i.e., schools are fostering student learning) is difficult to confirm.


One main reason for focusing on the relation between institutional process indicators and student outcomes is in response to the rapidly growing popularity of college and university annual rankings, which focus primarily on inputs and processes while little (if any) attention is paid to outcomes (e.g., Maclean's, U.S. News and World Report). Perhaps the most controversial aspect of these rankings is that they are based largely on the unsupported assumption that the indicators used to rank institutions are in fact related to student outcomes or, in other words, accurately reflect the level of quality within an institution (Astin and Panos, 1969; Ball and Halwachi, 1987; Conrad and Blackburn, 1985; Nedwick and Neal, 1994; Solmon, 1973, 1975; Tan, 1986).

Another serious limitation with college rankings is the redundancy among the indicators, in that many of the indicators are highly related to one another. For example, Astin (1971) reported that admission selectivity is highly correlated with an institution's prestige, faculty-student ratio, library size, faculty salaries, endowment, research funds, academic competitiveness among students, and political orientation. In a later study, Astin and Henson (1977) found that selectivity correlated strongly with several traditional performance indicators including tuition, average faculty salary, educational and general expenditures per student, value of endowment per student, percentage of graduate students, and educational and general expenditures. Therefore, it is hardly surprising to see that a school that is highly selective will also score highly on most if not all other indicators.

A third concern with institutional rankings is that a particular performance indicator used to compare schools (e.g., faculty-student ratio) may not be based on consistent criteria (Conrad and Blackburn, 1985; Solmon, 1973, 1975). For example, what exactly defines "faculty" in determining the faculty-student ratio could be tenured faculty only at one institution and anyone who teaches a course (including graduate students and part-time instructors) at another institution. Although there are other limitations with respect to rankings and performance indicators (see Bruneau and Savage, 2002), the three limitations described above appear to be the most relevant and substantial.

In summary, a central question with respect to measuring educational quality is, how much do institutional characteristics as reflected in performance indicators actually contribute to student learning and development? In their extensive review of how college contributes toward a wide range of student outcomes, Pascarella and Terenzini (2005) conclude that the relationship between the institutional characteristics typically used in annual rankings and student outcomes is generally weak and inconsistent. In other words, among colleges that are similar in terms of incoming student ability, there seems to be little relation between structural features such as library size and output measures like cognitive skills.


One possible explanation suggested by Nordvall and Braxton (1996) is that current indicators are inappropriate because they are far too removed from the level of actual classrooms and courses, and thus, it is difficult to formulate feasible educational policies based on them. Second, there is the problem of misuse of indicators. Several sources (e.g., Hossler, 2000; McGuire, 1995; Webster, 1992) suggest that some of the performance indicator information provided by institutions is either incomplete, inaccurate or, in an effort to reflect the most positive image, deliberately false. Thus, even if a set of meaningful and valid indicators is established, they will be only as informative to the degree they are properly measured and interpreted.

Given these concerns with the currently popular indicators, one process indicator that may be more valid is the use of higher-order questions in classes. The definition of a higher-order question in this study corresponds to the top four levels within Bloom's (1956) taxonomy of educational objectives within the cognitive domain, namely, application, analysis, synthesis, and evaluation. Briefly, an acceptable response at a particular level assumes that one can exhibit the cognitive processes at all of the lower levels. For example, being asked to design a study to determine how much student learning is caused by teacher enthusiasm would represent a synthesis-level item. This would require a student to know about each aspect of the study such as research design and data collection (knowledge); know what each aspect of the study means, such as why an ANOVA would be an appropriate statistic (comprehension); apply these abstract concepts to a particular situation (application); and tie each of the separate concepts together such that each component (e.g., sample size, how enthusiasm and learning are measured) becomes an integral part of the newly created product (synthesis). Perhaps the clearest distinction between lower- and higher-order questions, as noted by Bloom, is that while lower-order questions are designed to elicit existing answers (e.g., from the textbook, directly from the lecture), higher-order questions require novel answers in that they cannot simply be recalled.

It appears that the earliest consistent empirical support for the use of higher-order questions comes from several studies carried out by Braxton and Nordvall. The frequency of use of higher-order questions was found to be positively related to year level of course (Braxton, 1993), selectivity of the institution (Braxton, 1993; Braxton and Nordvall, 1985), whether a course is required or optional (Braxton and Nordvall, 1985), and quality of the graduate department from which a faculty member earned his or her degree (Braxton and Nordvall, 1988). In sum, these studies provide clear validity evidence for the corresponding inputs and processes.


There are three main advantages in focusing on higher-order questions as a process indicator. First, this variable is relatively easy to measure. Like most other performance indicators, frequency of higher-order questions is (1) a quantitative variable that can be measured quite reliably and objectively; (2) available from a large number of schools and many classes within each school; and (3) measurable in a non-intrusive way that does not interfere with regular classroom or administrative operations. The second main advantage is that, unlike many other performance indicators, frequency of higher-order questions is something that occurs at a classroom or course level and that instructors can control directly in their classrooms. Finally, higher-order questions can provide additional and, compared with students' ratings, more objective data on the quality of the course (Nordvall and Braxton, 1996).

Among the many possible outcome variables of interest, the current research examined gains in critical thinking skill because it would logically be expected to be enhanced by increased exposure to higher-order questions and, as most educators would agree, is one of the most important goals of the educational process (Facione, 1990; Halpern, 1996, 2001). In comparing the processes involved in higher-order thinking (application, analysis, synthesis, and evaluation) with commonly defined aspects of critical thinking, it appears that critical thinking comprises a significant portion of higher-order thinking (Ennis, 1985; Facione, 1990; Ferguson, 1986; Halpern, 1998; King, 1995; Paul, 1993; Tsui, 1999). For example, two elements commonly included in the definition of critical thinking, namely the ability to identify assumptions and the ability to evaluate evidence (e.g., Ennis, 1985; Furedy and Furedy, 1985; Pascarella and Terenzini, 2005; Watson and Glaser, 1980), are also listed as major components within the analysis level as described by Bloom (1956).

Aside from the conceptual overlap between higher-order thinking and critical thinking, there are several other reasons why it would be worthwhile to explore the relation between these two variables. Perhaps the biggest reason is that the suitability of focusing on a particular process like using higher-order questions in class would be best confirmed by determining its relation with a specific outcome. Second, among the many possible ways in which students learn and develop as a result of attending college, the one outcome that would be expected to be most directly influenced would be critical thinking. Finally, previous findings suggest that these two constructs are indeed clearly linked together.

From a broader perspective, the results of several recent studies support the conclusion that higher-order questions may have a positive impact on students' critical thinking skills.


Using a qualitative approach to obtain a detailed account of what instructional activities best contribute toward gains in critical thinking skills, Tsui (2002) concluded that writing was one general factor that was clearly influential. Although the Tsui (2002) study did not set out to focus on higher-order questions specifically, one could reasonably presume that the activities involved in the process of writing, as outlined by Tsui, included at least some degree of the type of thinking that would be needed to respond to a higher-order question. Similarly, in a series of studies by Williams and colleagues (Williams, Oliver, Allin, Winn, and Booher, 2003; Williams, Oliver, and Stockdale, 2004; Williams and Stockdale, 2003), students who scored better on exams with items requiring logical reasoning tended to show larger pre-test to post-test gains in critical thinking skills. As with the Tsui (2002) study, the results of the Williams studies provide indirect support for the effect of higher-order questions on critical thinking skills.

Previous studies that have explored the correlation between the frequency of higher-order questions and critical thinking have generally focused on higher-order questions either as those asked by the instructor during lectures, or as those listed on tests and assignments. Except for the findings of Logan (1976), who found a positive relation between the frequency of higher-order questions asked by the instructor while teaching and gains in critical thinking, most other studies (Foster, 1983; Smith, 1977, 1980; Winne, 1979) found little relation. One possible explanation for the weak association between asking higher-order questions during class and gains in critical thinking skills may have to do with the lack of variability in asking higher-order questions (e.g., Smith, 1977). In comparison, this link tended to be positive among studies in which students either created their own higher-order questions (Keeley, Ali, and Gebring, 1998; King, 1989, 1990, 1995) or answered higher-order questions provided by the instructor on assignments and exams (Gadzella, Ginther, and Bryant, 1996; Gareis, 1995; Willis, 1992).

The Current Research


Given that higher-order questioning is a potentially valuable process or performance indicator variable that has not been studied extensively in terms of how it influences critical thinking, the goal of this research was to attempt to confirm empirically that the frequency of higher-order questions occurring on tests and assignments is a valid process indicator of educational quality in that it correlates significantly and positively with gains in critical thinking skill. Three studies were conducted to test this hypothesis.


The first study was a split-plot experimental design with students from three sections of an educational psychology course. As such, there were three levels of the between-subjects factor (i.e., course sections), and two levels of the within-subjects independent variable controlled by the researcher (i.e., lower- and higher-order questions). Each student was given six assignments during the course, with a given assignment containing both low-level questions based on one chapter of the text and high-level questions based on another chapter. After each consecutive set of two assignments (representing four chapters of material from the text), students were given a test containing four critical thinking questions with each question focusing on one of the chapters covered in the preceding two assignments. Thus, there were three tests in total, each providing a within-subjects comparison of topics taught with and without higher-order questions. It was predicted that students would earn higher scores on the critical thinking questions from chapters where higher-order assignment questions were given than from chapters where lower-order assignment questions were given.

The second study was a true experiment that involved students enrolled in an introductory psychology course. The experiment consisted of a pre-test measure of critical thinking, answering review questions based on a passage from the text used in the course, and finally a post-test measure of critical thinking. The independent variable was the level of review questions (lower- or higher-order). It was predicted that the experimental group would show larger pre-test to post-test gains in critical thinking compared to the control group.

The last study was a correlational design that compared mean course gain in critical thinking with the proportion of higher-order questions used on assignments and tests across courses from different year levels and disciplines. Gains in critical thinking were measured with pre-test and post-test measures consisting of abbreviated versions of the WGCTA (Forms A and B). It was predicted that the frequency of use of higher-order questions on assignments and tests would be positively correlated with class mean gains in critical thinking.

STUDY 1
This study used a split-plot design with three class sections. Within each section, students were given three tests and, in preparation for each test, they were given lower- and higher-order assignment questions.


The purpose of this study was to determine if students would earn higher scores on critical thinking test items that pertain to higher-order assignment questions compared to critical thinking test items that pertain to lower-order assignment questions.

Method
Participants

Within a large university in central Canada, a total of 131 undergraduate students from three sections of an optional introductory educational psychology course participated in this study. The classes ran from September to December 1998 (n = 49), January to April 1999 (n = 53), and July to August 1999 (n = 29). All sections were taught by the first author. Although students were not randomly assigned to course sections, it is reasonable to expect that the groups were similar in terms of any possible confounding influence (e.g., motivation) because students typically chose a particular section based on factors unrelated to the variables of interest in this study (e.g., scheduling with other courses). Enrollment in this course typically consists of students who are 19-22 years of age, and approximately 80% of the students are female.

Participating students were not informed of any aspects of the study in order to preclude possible biases. However, ethical approval to conduct the study was obtained with the assurance that (1) the data were obtained entirely from students' grades on normal course assignments and tests, such that conducting the study did not interfere with any aspect of the course (i.e., content coverage, teaching, evaluation); and (2) every student was given the same set of tasks (i.e., assignments and tests) and was assessed according to the same standards.

For a student's data to be included for analysis, he or she had to meet all three of the following criteria. First, to help ensure that a student was exposed to the independent variable (i.e., lower- and higher-order questions on assignments) to an acceptable degree, a student had to have submitted at least five of the six assignments in the course. Second, data were retained only for students who obtained a total score of 41 or less out of 48 on test questions based on lower-order assignment questions, thus making it possible for a student's score on higher-order test questions to exceed that on lower-order questions. A ceiling effect was found for some students who obtained near-perfect scores on most of the test questions pertaining to lower-order assignment questions, such that these students' scores on test questions corresponding to higher-order assignment questions could, at most, be only marginally higher than their scores on lower-order test questions.


Third, to obtain a complete measure of the dependent variable, a student had to have written all three tests in the course. Forty-one students failed to meet one or more of these three criteria, leaving 90 students for the analyses.

Materials

The independent variable was the level of questions (lower- vs. higher-order) in each of six assignments over the entire course. In all sections, each assignment focused on two consecutive chapters of the textbook. Most students completed each assignment within 2-3 pages. The textbook used in each section was Educational Psychology, 6th edition, by Gage and Berliner (1998). In terms of grading for each section, each assignment was given equal weight, with all six assignments collectively accounting for 20% of the final grade in the course.

In the first two class sections, each assignment contained six lower-order questions based on one chapter and two higher-order questions based on the other. For example, Assignment 1 contained two higher-order questions based on Chapter 1 of the textbook, and six lower-order questions covering Chapter 2. In total, the six assignments covered 12 chapters of the textbook. Given that the selection of chapters from which higher- and lower-order assignment questions were drawn was random, and that higher-order questions were intended to be more difficult than the lower-order questions, it was possible that how well students did on a particular question was partially a function of the content of the chapter. To help deal with this, lower- and higher-order questions on assignments used in the second class section focused on chapters opposite to those used in the first class section. For example, Assignment 1 in the second class contained six lower-order questions from Chapter 1 and two higher-order questions covering Chapter 2.

For the third class section, the structure of the assignments was the same as for the first two sections, except for one main difference. Instead of students completing an assignment containing lower- and higher-order questions written by the instructor, students themselves were required to compose two lower-order multiple-choice and two lower-order short-answer essay questions based on one chapter in the text, and one higher-order multiple-choice and one higher-order essay question for another chapter.

In all three class sections, the dependent variable was performance on critical thinking questions in each of three non-cumulative term tests covering successive thirds of the course. Each critical thinking question was of short-answer essay format, with the answer expected to be roughly half a page in length.


Each term test contained four critical thinking questions, with each question focusing on one of the chapters covered in one of the two corresponding assignments. For example, the first test contained four critical thinking questions covering Chapters 1-4 of the textbook, which corresponded to the first two assignments. In terms of grading, the critical thinking questions accounted for about half of the marks on each test, with each test worth 20% of the final grade in the course.

Procedure

During the first class, each student was given a course outline that explained the format of the assignments and tests, along with the exact dates when assignments were due and tests were given. Students were strongly encouraged to complete assignments not only to maximize their final grades, but also as a study aid in preparing for the next test. In the first two class sections, each assignment was handed out in class approximately 1 week before it was due and was graded and returned to students roughly 1 week after the due date. In the third class section, the details of all six assignments were handed out to students during the first class. In addition to assigning a grade for each question on the assignment, written constructive feedback was provided for questions that were not written and/or answered adequately.
The majority of the grading in each section was done by a trained assistant who was unaware of this study. In each section, the assistant graded each entire assignment and half of the critical thinking questions in each test (i.e., two questions). The remainder of the grading was done by the researcher/instructor.

FIG. 1. Mean critical thinking scores based on lower- and higher-order assignment questions (Study 1, all classes).

Results
To assess the overall difference between lower- vs. higher-order questions on critical thinking and to identify under which conditions that difference would exist, a 3 × 2 × 3 split-plot analysis of variance was performed. There was one between-subjects factor, namely class sections (A) with three levels, and two within-subjects factors, namely level of questions (i.e., lower- vs. higher-order) (B) and testing occasions (C) with three levels. As outlined previously, each test consisted of four critical thinking questions, with two of those questions pertaining to chapters from which lower-order assignment questions were given, and the other two based on chapters from which higher-order assignment questions were given. The dependent variable was the total score on the two critical thinking questions on a test corresponding to either the lower- or higher-order assignment questions. Each critical thinking test question was worth eight marks. Therefore, scores could range from a low of 0 to a perfect score of 16.

In this study, the most relevant main effect was the within-subjects effect of level of questions on assignments. Because this factor consisted of two levels, the assumption of circularity was not applicable. Across all classes and testing occasions, the mean score on the critical thinking test questions based on higher-order assignment questions (M = 11.49, SD = 1.84) was significantly higher than the corresponding mean score based on lower-order assignment questions (M = 11.07, SD = 1.76), F(1, 87) = 4.36, p < .05, and the proportion of variance accounted for by this effect was .048.

A significant two-way interaction was found between level of questions and testing occasion, F(2, 174) = 5.41, p < .01. Figure 1 shows the means of test questions based on lower- and higher-order assignment questions collapsed across the three classes for each time period. The mean score on critical thinking questions based on higher-order questions (M = 12.34, SD = 2.01) was higher than the mean score on critical thinking questions based on lower-order questions (M = 10.98, SD = 2.41) on the first test, whereas the corresponding mean scores on both the second and third tests did not differ significantly.


A significant three-way interaction between level of questions, testing occasion, and class section was also found, F(4, 174) = 54.67, p < .01. Contrary to expectation, in Class 1 the mean critical thinking score based on lower-order questions (M = 10.78, SD = 2.81) was significantly higher than that representing higher-order assignment questions (M = 9.22, SD = 3.61) on the third test. In contrast, the mean critical thinking score based on higher-order questions on the third test in Class 2 (M = 13.18, SD = 1.84) was significantly higher than that representing lower-order questions (M = 10.53, SD = 3.49). Similarly, for Class 3, the mean critical thinking score based on higher-order questions (M = 12.06, SD = 1.98) was greater than that reflecting lower-order questions (M = 9.19, SD = 2.11) on the first test. In sum, while the findings with respect to the main effect for level of questions appear encouraging, as the overall mean critical thinking score based on higher-order questions was higher than that associated with lower-order questions, the simple main effect comparisons were complex and confusing, and failed to identify any patterns that might suggest under which particular conditions this effect is more likely to occur.
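For readers who want to see how an analysis of this general form can be set up in software, the sketch below fits a linear mixed model with a random intercept per student as an approximation of the split-plot ANOVA described above; it is not the authors' analysis, and the long-format data frame (student, section, question level, testing occasion, score) is simulated for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data: 90 students in 3 sections, scored on the two
# critical thinking question pairs (lower vs. higher order) on 3 tests.
rng = np.random.default_rng(0)
rows = []
for student in range(90):
    section = student % 3
    for level in ("lower", "higher"):
        for test in (1, 2, 3):
            score = rng.normal(11 + (0.4 if level == "higher" else 0.0), 1.8)
            rows.append({"student": student, "section": section,
                         "level": level, "test": test, "score": score})
df = pd.DataFrame(rows)

# Random intercept per student stands in for the within-subjects structure;
# the fixed effects mirror the section x level x test factorial design.
model = smf.mixedlm("score ~ C(section) * C(level) * C(test)",
                    data=df, groups=df["student"])
print(model.fit().summary())
```

With real data, the fixed-effect term for question level plays the role of the within-subjects main effect reported above.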

Discussion
It was predicted that students would earn higher scores on critical thinking test questions when the questions pertained to a chapter from which higher-order assignment questions were given than from a chapter from which lower-order assignment questions were given. The main feature of this study was that it provided a within-subjects comparison of topics taught with and without higher-order questions. As such, some possible confounds such as incoming knowledge of educational psychology or critical thinking ability were controlled for. Overall, the results offer mild statistical support for this hypothesis. However, the findings could be considered more notable for at least two reasons. Although the fact that level of assignment questions accounted for only 4.8% of the variance may not seem particularly noteworthy, when considered in an educational context, it could mean the difference of a letter grade (e.g., going from a C to a B) for some students. Furthermore, this finding is even more impressive when one considers the fact that the independent variable (i.e., level of assignment questions) had only two levels, which would tend to attenuate its relationship with a dependent variable that could take on many possible values (i.e., critical thinking test item) (Prentice and Miller, 1992).

One possible reason why this study did not find a stronger main effect for level of questions may stem from the significant interaction between level of questions and testing occasion.


Of the three tests, the only significant difference was found on the first test, in which the mean critical thinking score based on higher-order assignment questions was higher than that based on lower-order questions. In each class section, students were informed at the beginning of the course that each of the three tests was in exactly the same format and was non-cumulative. Therefore, some students may have modified their studying strategies after the first test so as to obscure the effects of the lower- vs. higher-order questions on subsequent assignments. For example, a student may have felt that all four questions on the first test were quite challenging in that they did not simply ask for information to be recalled. Expecting similar types of questions on the next two tests, that student may have studied each chapter more thoroughly (especially those chapters that corresponded to lower-order assignment questions) by thinking about its content in ways that were not covered explicitly in either the text or the lectures.

Secondly, in this course, the first test was given relatively early in comparison to when mid-term exams are typically given in other classes. In this situation, some students may not have had other immediate commitments (e.g., essays, tests in other classes) and therefore could have had more time to prepare for the first test. Conversely, the second and third tests occurred when it was more likely that students had other commitments, especially the third test, which took place during the final exam period.

A third explanation for the seemingly small main effect has to do with the main limitation of a field experiment. In this study, there was a clear manipulation of the independent variable of interest (i.e., level of questions on assignments), and the topics were randomly assigned to treatment levels. However, because this study was conducted in actual classes, the dependent variable may have been difficult to influence because of extraneous factors. Considering the conditions under which this research was carried out, it appears that the impact of lower- vs. higher-order questions on critical thinking is notable. In addition to the main limitation outlined above, the fact that this effect was even detectable during the short time span of the course (about 13 weeks) makes this finding even more impressive.

STUDY 2
The purpose of this study was to compare two groups in terms of their mean pre-test to post-test gains in student critical thinking skills in a true experiment. During the treatment phase, all students were given a short passage to read along with review questions to answer.


Students in the experimental group were given higher-order review questions, while students in the control group were given lower-order review questions.

Method
Participants

Within the same university as Study 1, the participants in this study consisted of 190 undergraduate students registered in a first-year introductory psychology course. Students who chose to participate in this study were given credit toward fulfilling the research participation requirement in this course. Before the experiment began, all participating students signed a consent form indicating that they were taking part in a study that compared study strategies. Participants were randomly assigned to either an experimental group or a control group with 96 and 94 students, respectively.

To minimize the degree to which the data used in the analyses were contaminated with outlying values, each subject's data had to meet the following criteria for inclusion. Separate analyses were performed for each of the general and subject-specific critical thinking tests. Looking at the general test, there were three criteria for inclusion. First, because the maximum score on both the general pre-test and post-test was 10 points, a subject's data were removed if he or she obtained a perfect score on the general pre-test, thus precluding pre-test to post-test gain. Second, during the treatment phase of the experiment, to help ensure that a subject had put forth a reasonable effort toward reading the passage and answering the review questions that pertained to the passage, each subject had to obtain a minimum score of three out of eight on the review questions. Although this criterion was applied to all subjects, it was intended more for those in the higher-order group to ensure that they had engaged in a sufficient level of higher-order thinking. Of the 26 subjects whose data were removed for failing to meet this criterion, 25 came from the higher-order group. Finally, as an indication that a subject had put as much effort into answering the questions on the general post-test as on the pre-test, and considering how a subject's scores could decline slightly from pre-test to post-test because of factors other than critical thinking ability (e.g., the different questions in each version), a subject's data were removed if his or her post-test score was four or more points below his or her pre-test score. Based on these three criteria, 157 subjects were retained for analyses on the general test with 66 and 91 subjects in the experimental and control conditions, respectively.


With respect to the course-specific psychology portion of the critical thinking test, there were two further criteria for inclusion in the data analysis. First, as with the general subtest, each subject had to obtain a minimum score of three out of eight on the review questions. Because the review exercise (i.e., treatment) was the same in both the general and psychology analyses, the number of subjects removed from each group was the same. Regarding the level of student motivation in completing the post-test, it is possible that a student who was less motivated may simply have selected answers to the multiple-choice questions at random. With each of the 15 items in the psychology subtest having four options, the probability of someone obtaining up to four out of 15 correct by picking answers at random is .69. Therefore, to help ensure that a post-test score was the result of a genuine effort rather than random selection, the second criterion for inclusion was that a subject's post-test score had to be at least five out of 15. Based on these two criteria, 95 subjects were retained for analyses on the psychology test with 37 and 58 subjects in the experimental and control conditions, respectively.
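As a quick check on the guessing rationale just described, the .69 figure corresponds to the binomial probability of getting at most four of 15 four-option items correct when every answer is a random guess. The short sketch below is my own illustration of that calculation, not part of the original study.

```python
from scipy.stats import binom

# Probability of at most 4 correct out of 15 four-option items by guessing alone.
p_guess = 0.25      # chance of a correct answer on any single item
n_items = 15
p_at_most_4 = binom.cdf(4, n_items, p_guess)
print(round(p_at_most_4, 2))  # ~0.69, matching the cutoff rationale in the text
```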

Materials

All subjects were given a pre-test and a parallel post-test measure of critical thinking, each consisting of both a general subtest and a course-specific subtest. Each critical thinking test consists of 25 multiple-choice questions intended to measure the degree to which a student can engage in a particular aspect of critical thinking. Each item is followed by four or five options representing varying degrees of correctness or applicability, from which students selected the most appropriate option. Scores on either test could range from a minimum of 0 to a maximum of 25. To measure general critical thinking ability, the first ten questions in the pre-test and post-test were adapted from the Watson-Glaser Critical Thinking Appraisal (WGCTA). These questions focus on everyday situations that most people would likely be familiar with. Consistent with the overall format of the WGCTA, the general items used in this study reflected each of the five main components of critical thinking covered by the WGCTA (i.e., inference, recognition of assumptions, deduction, interpretation, and evaluation of arguments), with two items reflecting each component. The remaining 15 questions focused on a selected passage from a chapter on personality theory from the introductory psychology textbook that was used in the course. The chapter on personality theory was chosen because it was not scheduled to be covered in class for at least a month after the completion of this study.
Between the pre- and post-test measures of critical thinking, all subjects were asked to read the passage on personality theory. This passage consisted of nine textbook pages, which most students read easily within 25 min. Along with the assigned reading, each subject was given a set of review questions that pertained to the passage.

Most subjects took approximately 40 min to read the passage and answer all of the questions. After each subject had finished answering his or her review questions based on the assigned reading, they were given Form B as the critical thinking post-test. After completing the post-test, each subject received a debriefing form outlining the purpose of this experiment in detail and a lottery ticket.

FIG. 2. Mean pre-test and post-test critical thinking test scores (General) (Study 2).

Results
On the pre-test, internal consistency estimates for the 10 items in the general subtest and the 15 items in the psychology subtest were .41 and .00, respectively. Those for the corresponding post-test subtests were .50 for the general subtest and .08 for the psychology subtest. These estimates for the general subtest suggest that the test items reflect the same construct at least to a moderate degree. On the other hand, the surprisingly low reliability estimates for the psychology subtests indicate that those test items fail to clearly represent a single construct. An item analysis was performed on each of the four subtests to determine if any particular items had a noticeably detrimental effect on internal consistency reliability. In each of the subtests, it was found that omitting as many as half of the items with the lowest item-total correlations made only a marginal improvement in reliability estimates. Moreover, omitting items had little impact on the degree to which the pre-test and post-test means differed between the control and experimental groups. Therefore, all items were retained for analysis.
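For readers who want to reproduce this kind of item analysis, a minimal sketch follows. The helper functions and the random placeholder responses are my own illustration (the study's item-level data are not reproduced here); the sketch simply shows how internal consistency (Cronbach's alpha) and corrected item-total correlations of the kind reported above can be computed.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: rows = respondents, columns = scored items (e.g., 0/1)."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

def corrected_item_total(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the total of the remaining items."""
    totals = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], totals - items[:, j])[0, 1]
                     for j in range(items.shape[1])])

# Placeholder data: 157 respondents answering 10 dichotomously scored items.
rng = np.random.default_rng(1)
responses = rng.integers(0, 2, size=(157, 10))
print(round(cronbach_alpha(responses), 2))
print(corrected_item_total(responses).round(2))
```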

FIG. 3. Mean pre-test and post-test critical thinking test scores (Psychology) (Study 2).

Mean scores on the review exercises were 7.7 out of eight for the lower-order question group and 4.4 out of eight for the higher-order question group. This indicates that while most students in the lower-order condition had little difficulty in correctly answering the review questions, students in the higher-order condition were not as successful. However, it should be pointed out that the higher-order review questions were clearly more difficult than those in the lower-order condition.

The mean score for each group on the general pre-test and post-test is shown in Fig. 2. The two groups showed virtually the same amount of gain, with the control group going from M = 5.95 (SD = 1.81) on the pre-test to M = 6.45 (SD = 1.91) on the post-test, while the experimental group went from M = 6.53 (SD = 1.32) to M = 7.02 (SD = 1.76) on the pre-test and post-test, respectively. Note that the higher mean score of the experimental group compared to the control group on the general pre-test was primarily due to the majority of omitted subjects, who tended to score relatively low on this test, coming from the experimental group, as explained earlier. Using the ANCOVA procedure with pre-test scores as the covariate, there was no significant difference between the groups' post-test scores, with group membership accounting for only 1% of the total variance.


The mean score for each group on the psychology pre-test and post-test is shown in Fig. 3. It may be noted that the group receiving higher-order review questions showed a larger gain from pre-test (M = 5.38, SD = 1.59) to post-test (M = 6.46, SD = 1.45) than the group receiving lower-order questions (M = 5.16, SD = 1.63, and M = 5.79, SD = 0.91, for pre-test and post-test, respectively). Using the ANCOVA procedure controlling for psychology pre-test scores, the mean post-test score of the group that received higher-order review questions was significantly higher than that of the group that received the lower-order review questions, F(1, 92) = 7.94, p < .01, and the effect of the treatment accounted for 7.9% of the total variance.
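An ANCOVA of this form can be expressed as a regression of post-test scores on the pre-test covariate plus a group term. The sketch below is illustrative only: the group sizes and rough means are taken from the text, but the scores themselves are simulated, so the printed F value will not reproduce the result reported above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n_exp, n_ctl = 37, 58  # retained subjects per condition, as reported in the text

# Simulated pre-test and post-test scores on the psychology subtest.
df = pd.DataFrame({
    "group": ["higher"] * n_exp + ["lower"] * n_ctl,
    "pretest": np.r_[rng.normal(5.4, 1.6, n_exp), rng.normal(5.2, 1.6, n_ctl)],
})
df["posttest"] = (0.4 * df["pretest"]
                  + np.where(df["group"] == "higher", 2.5, 1.7)
                  + rng.normal(0, 1.2, len(df)))

# ANCOVA: compare groups on post-test scores after adjusting for pre-test scores.
fit = smf.ols("posttest ~ pretest + C(group)", data=df).fit()
print(anova_lm(fit, typ=2))
```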

Discussion
The purpose of this study was to determine the effect of higher-order questions on gain in critical thinking skills. This was a true experiment in which all or nearly all possible confounds were controlled for. On the general test of critical thinking, there was virtually no difference (in both statistical and practical terms) between the lower- and higher-order groups in pre-test to post-test gain. However, for the course-specific test of critical thinking, the results showed that students who answered higher-order review questions showed a larger pre-test to post-test gain than students who answered lower-order review questions. In addition to being significant in both statistical and practical terms, this finding is noteworthy for another reason. Like intelligence, critical thinking skill is something for which an appreciable gain would reasonably be expected only after exposure to a variety of courses and experiences over an extended period. Therefore, given that one could expect only a small effect under the restricted conditions of this experiment, namely minimal exposure to the independent variable in such a short time frame (less than 2 hours), these results are encouraging.

An ongoing concern in controlled experiments like this is the degree to which students are putting forth a genuine effort to complete the tasks involved in the experiment. The students who took part in this study did so to help fulfill their research participation requirement in the introductory psychology course. This requirement is based solely on the number of research studies in which the student participates, and has nothing to do with the quality of the participation. As an incentive for the students in this study to try their best in completing the critical thinking tests and the review questions, each student was promised a lottery ticket if his or her scores were above a hypothetical standard on all three tasks (i.e., pre-test, review questions, post-test).


Admittedly, in this study, the effectiveness of this incentive is questionable for two reasons. First, although both groups showed a pre-test to post-test improvement on the course-specific psychology subtest, it was somewhat surprising that there was not a larger improvement, given that the critical thinking tests referred to a relatively short passage that students had read through for about 45 min. Second, it is possible that the lower scores obtained on the review questions by students in the higher-order condition (4.4 out of eight, on average) may suggest that these students were not engaged in higher-order thinking as much as was expected.

Another factor that may have affected scores on both the pre-test and post-test measures of critical thinking is the number of answers selected simply by guessing. All students were instructed to answer all of the multiple-choice questions and to take their best guess if they were not completely sure of the correct answer. In this situation, it is more difficult to know whether a correct answer was chosen because the student knew the answer or simply guessed. This would explain why many of the items, particularly those in the psychology subtests, had negative item-total correlations, which diminished their internal consistency. With the large amount of error in the observed scores on the psychology subtests, it is more difficult to accurately assess one's level of critical thinking ability. Therefore, although the findings of the current study are encouraging when the context is considered, the lack of a clearly effective incentive, along with the limitations of the critical thinking measures, indicates that these results should be regarded with appropriate caution.

One way to address the concern regarding the level of student motivation to try their best on each of the tasks in the experiment would be to provide a more valuable incentive. At the post-secondary level, one of the clearest, most immediate incentives is grades. Therefore, one possibility would be to include the experimental tasks as a small part of a course with a corresponding weight toward the final grade. To deal with the ethical issues involved, such an experiment could utilize a within-subjects design (as in Study 1) such that each student receives exactly the same materials from which he or she will be assigned a grade. For example, this study could take place during a class period with the pre-test and post-test critical thinking measures focusing only on relevant course material. The review questions could consist of both lower-order questions that pertain to one half of the passage (e.g., based on one chapter), and higher-order questions that pertain to the rest (e.g., based on another chapter in the text).


STUDY 3
This study was a correlational design that compared mean course pre-test to post-test gain in critical thinking with the proportion of higher-order questions used on assignments and tests across courses from different year levels and disciplines.

Method
Participants

The participants in this study were 781 students enrolled in 24 one-semester Year 2 to Year 4 courses in a variety of disciplines within the same university as in Study 1. As with Study 2, the most important criterion for inclusion was that each student had to have completed both the pre- and post-test critical thinking measures. In addition, to deal with the concern that some students may not have put forth a genuine effort to do their best on the critical thinking tests, a cutoff was established both for the pre-test score and for the decline from pre-test to post-test. For a student's data to be included for analysis, he or she had to have scored at least 18 out of 40 on the pre-test and to have dropped from pre-test to post-test by no more than three points. The only criterion for class mean data to be included was that the instructor had to provide a blank copy of each piece of work (e.g., assignments, tests) that was to be graded. Course materials were obtained from all classes except one. Out of 387 students who wrote both tests, 313 students from 23 classes met all of the above criteria for inclusion.

Materials

Gains in critical thinking were measured with abbreviated pre-test and post-test versions of the Watson-Glaser Critical Thinking Appraisal (WGCTA). For each of the two forms, Form A and Form B, roughly half of the items in their original format from each of the five WGCTA subtests were used. The predictor variable in this correlational study was the proportion of higher-order questions on tests and assignments in a course, as indicated by official course documents such as handouts, tests, and the course outline. The criterion variable was the mean pre-test to post-test gain in critical thinking skill.

Procedure

At least 4 weeks prior to the start of the semester, course instructors were contacted by e-mail to obtain permission to include their classes in the study.


Instructors who agreed to participate signed a consent form indicating their willingness to participate and their understanding of the study. In the first 2 weeks of each course, students were given the critical thinking pre-test during the first part of a regular class meeting. Before the test was administered, students were informed that (1) they were participating in a study that was examining the critical thinking skills of university students, (2) their participation was entirely optional, and (3) their scores would have no bearing on their grade in the course. To counter the possibility that pre-test to post-test gains might be influenced by one version of the test being more difficult than the other, half of the classes received Form A as the pre-test while the other half received Form B, and vice versa for the post-test. Almost all students were able to complete both the pre-test and the post-test within 20 min.

Throughout the semester, copies of course outlines, assignments, and tests from each course were obtained from each course instructor. After all course materials were received for a particular course, each question on each piece of work was coded in terms of whether it was a higher-order or a lower-order question. A higher-order question was judged as one which appeared to reflect one of the top four levels of Bloom's taxonomy of cognitive objectives, namely application, analysis, synthesis, or evaluation. A lower-order question, on the other hand, was one that reflected one of the two levels at the bottom of the taxonomy, namely knowledge and comprehension. All questions were coded by the first author as being either lower-order or higher-order.

While it was relatively straightforward to classify a particular question on a test or assignment, many courses required students to write an essay or give a presentation. The potential difficulty here is in determining what proportion of the entire project involved higher-order objectives. For example, a student could be faced with a question such as "Outline the pros and cons of drug testing in sports." On a test, one can reasonably assume that most if not all of the response to this question would require higher-order thinking. In comparison, a student who was asked to write an essay or give a presentation on "Drug testing in sports" could submit an essay with most of its content resulting from higher-order thinking. On the other hand, a student could give a presentation on this topic that included little more than sharing definitions and facts that are readily available. Therefore, because of the potentially wide range of the proportion of higher-order thinking that would go into essays and presentations, as a conservative estimate, each essay or presentation was classified as being half lower- and half higher-order.


To obtain an estimate of the proportion of coursework in each of the two Bloom categories, each test question and each essay or presentation was weighted according to how much of the final course grade it represented. For example, if a particular question on a test was worth 20% of the test, and the test was worth 30% of the grade in the course, then that question would receive a weight of .06. The proportion of higher-order content for a particular course was then obtained by calculating the sum of the weights of the higher-order questions.

To provide an indication of the accuracy with which the test questions were classified, each question in the course materials from five courses was independently classified by another expert rater to provide an estimate of inter-rater reliability. Because the purpose of this research required each question to be classified as either lower- or higher-order, inter-rater reliability was calculated as the proportion of questions in the sample of five courses that both raters assigned to the same category (lower- or higher-order). Based on a total of 215 questions from the five courses, both raters gave the same classification for 74.0% of the questions.

In the final 2 weeks of each course, students were given the critical thinking post-test during the first part of a regular class meeting. The post-test version was opposite to the version each class received at the start of the course. Within 2 weeks after the end of the course, a letter was mailed to each instructor who participated. This letter served two functions: to thank them for participating in the study, and to offer feedback in the form of mean scores on the pre-test and post-test.
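To make the weighting scheme concrete, here is a minimal sketch of the calculation described above. The list of questions is hypothetical; each entry gives a question's weight within its piece of work, that piece's weight toward the final grade, and its Bloom category, with essays and presentations counted as half higher-order as in the text.

```python
# (weight of question within its piece of work, weight of that piece toward
#  the final grade, True = higher-order, False = lower-order, None = essay/presentation)
questions = [
    (0.20, 0.30, True),   # the worked example from the text: .20 x .30 = .06
    (0.80, 0.30, False),
    (1.00, 0.20, None),   # essay: counted as half lower- and half higher-order
    (0.50, 0.50, True),
    (0.50, 0.50, False),
]

higher_order_proportion = 0.0
for within_piece, of_grade, is_higher in questions:
    weight = within_piece * of_grade
    share_higher = 0.5 if is_higher is None else float(is_higher)
    higher_order_proportion += weight * share_higher

print(round(higher_order_proportion, 3))  # proportion of higher-order content
```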

FIG. 4. Scatterplot of mean pre-test–post-test gain in critical thinking test scores and proportion of higher-order questions for each class (Study 3, n = 23).


Results
As the items in each version were adapted from the WGCTA, each version of the test used in this study contains the same five subtests with eight items in each subtest. In Form A, the internal consistency estimates ranged from .31 for the Deduction subtest to .78 for the Recognition of Assumptions subtest. In comparison, those for Form B ranged from .11 for the Evaluation of Arguments subtest to .82 for the Recognition of Assumptions subtest. These estimates indicate that while the items in the Recognition of Assumptions subtests (in both Form A and Form B) each focus on a particular construct quite well, the internal consistency of each of the remaining subtests is moderate at best.

In comparing the two versions, it appeared that Form B was slightly more difficult than Form A. Across the 12 classes that were given Form A as the pre-test and Form B as the post-test, the mean pre-test and post-test scores were 27.45 and 27.18, respectively, for a mean decline of .27. Across the remaining 11 classes that received the tests in the opposite order, the mean pre-test and post-test scores were 25.22 and 27.84, respectively, for a mean gain of 2.62. Because the value of interest with respect to the critical thinking scores was the mean gain for each class, to make the two groups of classes (i.e., those that received either Form A or Form B as the pre-test) more comparable, the difference between the two groups of classes in their mean gains was added as a constant to the mean gain (or decline) of each class that received Form A as the pre-test. Therefore, a constant of 2.89 (from −.27 to 2.62) was added to the mean of each class that received Form A as the pre-test.

Figure 4 shows the scatterplot of the mean gain in critical thinking scores and the proportion of higher-order questions for each class. The proportion of higher-order questions in each class ranged from .11 to .80 with a mean of .36. The mean gain in critical thinking scores ranged from 1.31 to 4.38 with an overall mean of 2.58. The correlation between mean gain in critical thinking scores and proportion of higher-order questions was r(22) = .42, p < .05, with frequency of use of higher-order questions accounting for 17.8% of the variation in critical thinking gain scores.
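The two computational steps just described can be summarized in a short sketch. The class-level data below are hypothetical (only the constant of 2.89 and the reported r = .42 come from the study), and the function and variable names are illustrative assumptions.

```python
# A minimal sketch (hypothetical data, not the study's) of the two steps described
# above: (1) add a constant to the mean gains of classes that took the easier
# pre-test form so the two groups of classes are comparable, and (2) correlate
# class mean gains with the proportion of higher-order questions.
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical class records: (pre-test form, mean gain, proportion higher-order)
classes = [
    ("A", -0.30, 0.15), ("A", 0.10, 0.30), ("A", -0.50, 0.20),
    ("B", 2.40, 0.45), ("B", 2.90, 0.60), ("B", 2.10, 0.35),
]

gains_a = [g for form, g, _ in classes if form == "A"]
gains_b = [g for form, g, _ in classes if form == "B"]
constant = mean(gains_b) - mean(gains_a)   # in the study, 2.62 - (-.27) = 2.89

adjusted = [(g + constant if form == "A" else g, p) for form, g, p in classes]
gains = [g for g, _ in adjusted]
props = [p for _, p in adjusted]

r = pearson_r(props, gains)
print(f"r = {r:.2f}, r^2 = {r * r:.3f}")   # the study reports r = .42, r^2 = .178
```

Adding a between-group constant in this way equates the two pre-test forms at the group level while leaving the ordering of classes within each group unchanged.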

Discussion
The purpose of this study was to determine the extent to which asking higher-order questions on tests and assignments in university courses is associated with gains in student critical thinking skills. The significant positive correlation found in this study was particularly encouraging
given that it was purely an observational study that spanned only about a 3-month period. An interesting feature of this study compared to the two other studies was that it used the same method of assessing the process indicator (i.e., proportion of higher-order questions) as might be used in an actual assessment of educational quality. In addition, this study attempted to classify assignment and test questions in a more accurate manner. Compared to studies such as that of Braxton and Nordvall (1985), where the proportion of higher-order questions was measured on one examination only, this study included all course materials and calculated the proportions based on each question's weight toward the final grade in the course. As pointed out earlier, there are two benefits of this approach. First, it is unobtrusive, as it neither interferes with teaching in the classroom nor requires much time from an instructor. Second, unlike many other process indicators such as library size, the proportion of higher-order questions is one that influences student study activities and is under the direct control of the instructor.

The most obvious limitation in this study was the lack of control of confounding variables such as the effects of other courses. Even within a particular class, it was unlikely that even a small proportion of its students were taking the same set of other courses. Admittedly, it is difficult to conclude how much a student's gain in critical thinking skills was influenced by the higher-order questions in a particular course compared to what that student had experienced in his or her other courses, especially when it was not known what other courses each student was taking. Another reason to regard these results with caution is that, as pointed out by Bloom (1956), each question was classified as either lower- or higher-order based only on the wording of the question. As such, this classification assumes that the answers to higher-order questions were neither covered in class nor could be taken directly from the text. In other words, each answer to a higher-order question was a novel response based on inference or deduction from information learned in class or from the text. A related limitation was that the raters who classified each question were not familiar with the content of many of the disciplines in this sample, which may have reduced the level of agreement between the two raters.

Finally, although these findings seem encouraging, perhaps an even greater effect would have been found if this study had been conducted over an entire academic year (i.e., two semesters) and had included first-year courses. It is plausible that first-year courses would show the smallest proportion of higher-order questions, which, when combined with courses at the other year levels, would increase the range of that variable. Unfortunately, most first-year courses at this university are too large
(over 100 students) to allow for efficient administration of the pre-test and post-test measures.

The main recommendation for this type of study would be to consider training raters in each discipline from which course materials would be obtained. For example, an experienced engineering instructor might be better able to classify questions from any engineering course than might a researcher from an unrelated discipline. In this sense, it would seem easier to train an instructor who is already knowledgeable in a particular discipline to classify questions according to Bloom's taxonomy than to have someone who knows Bloom's taxonomy quite well but is unfamiliar with the content of the discipline at hand.

GENERAL DISCUSSION
The present research consisted of three studies to determine the impact of higher-order questions as a process variable on student gain in critical thinking skills as an outcome variable. While it is difficult for a single study to avoid at least one serious methodological limitation, the three studies in the present research used different designs, samples, and instruments to address the main limitations associated with any single design. One study compared the amount of higher-order questions on tests and assignments in actual classes to pre-test–post-test gains in critical thinking. Of the other two studies, one compared groups of students given lower- vs. higher-order questions in actual classes, and the other was a true experiment conducted in a laboratory that related the level of review questions to pre-test–post-test gains while controlling for possible confounding variables.

The rationale behind the main research question in these studies was to validate the use of higher-order questions as a process indicator of educational quality. Previous studies have found little relation between typical process variables (e.g., library size, proportion of faculty with PhD) and student outcomes (see review by Pascarella and Terenzini, 2005). It appears that most of the process variables used in previous research in this area were chosen partly on the basis of expediency and partly on the basis of a presumed relation with student outcomes. In theory, many of those indicators seem reasonable. For example, it would not be unreasonable to believe that students will learn more when taught by more qualified faculty (i.e., those with a PhD). Perhaps one reason most process indicators fail to show a relation with student outcomes is that they are far too removed from what happens in actual courses and classrooms and, thus, are not directly linked with student learning. In other words, student learning is more likely to be
correlated with a process variable when that variable has a direct connection with student learning. Although we are beginning to learn more about what happens during the college experience that contributes to gains in critical thinking skills (see reviews by McMillan, 1987; Pascarella and Terenzini, 2005), the current studies provide a detailed, empirical look at one particular process variable, namely frequency of use of higher-order questions, that appears to have a direct impact on the development of critical thinking.

Overall, the findings of this research clearly indicate that students are more likely to improve their critical thinking skills when they have answered higher-order questions in their coursework. These findings are encouraging for at least two reasons. First, having reached the same conclusion using different methods and samples provides more convincing, converging evidence of the effect of higher-order questions on gains in critical thinking skills. This effect was found in an observational study and in two experiments, one conducted in the field and one in the laboratory. Second, these results are impressive when one considers the short time frame in which the studies were conducted. Pascarella and Terenzini (2005) point out that students typically show about a 19% improvement in critical thinking skills throughout their full university experience, from freshman to senior year. In the present research, an effect was found in two studies conducted over the course of one semester, and in a laboratory experiment that lasted less than 2 hours. Moreover, Halpern (2001) points out that it would seem unreasonable to expect a substantial gain over a short period (i.e., one semester or less).

Beyond the demonstrated validity of using higher-order questions as a performance indicator, several implications stem from this research. As mentioned earlier, the proportion of higher-order questions can be measured quite easily, objectively, and unobtrusively, and unlike many typical indicators, this variable is less likely to be distorted by either inconsistent operational definitions or falsification. Second, using higher-order questions may be a more acceptable indicator because, unlike more popular indicators such as library size or selectivity in admission, it is directly under an instructor's control. In other words, an instructor can usually choose what types of questions to use in assignments and tests based on the objectives of the course. Perhaps the most obvious issue that can be addressed by this type of research is the evaluation of educational quality. In addition to the typical comparisons that one could make, such as comparing psychology departments across several schools, this research could help to answer the long-debated question of whether the quality of education has improved or declined over the years. For example, looking at
higher-order questions, one could obtain course materials from courses taught 30 years ago and compare those with materials from the same or similar courses taught today. While many expert opinions have been offered (e.g., Bercuson, Bothwell, and Granatstein, 1984, 1997), very few studies (e.g., Stalker, 1986) have attempted an objective approach toward answering this question. Finally, as Braxton (1990, 1993) has pointed out, it appears that some faculty lack sufficient training in preparing higher-order questions. The clear benefits of using higher-order questions in coursework implied by the findings of these studies demonstrate the need to include training in this area as part of faculty development.

While the results of this research clearly support the use of higher-order questions as an indicator of educational quality, a couple of important qualifications are necessary. First, it must be emphasized that focusing on higher-order questions does not by any means give a complete picture of educational quality. As covered extensively in their review, Pascarella and Terenzini (2005) conclude that students learn and develop in many ways as a result of attending a post-secondary institution, and these outcomes are influenced by many factors that occur during this period. This research compared only one process with one outcome. Second, despite the findings of this research and the importance of critical thinking skills as an outcome variable, it would be a mistake to cast a negative light on a course that had little or no emphasis on higher-order questions. The proportion of higher-order questions in a particular course ought to be a function of the course objectives. For example, in an introductory course in a discipline such as biology, the objective may be to have students learn a large number of terms to serve as a foundation to draw upon in subsequent upper-year courses, in which they are expected to engage in higher-order thinking when responding to more challenging problems.

A clear limitation in each of the three studies in the present research was the multiple-choice format of the critical thinking tests. One could argue that a student would need to engage in critical thinking in order to differentiate between the most appropriate option and attractively worded distractors. However, it would certainly be preferable to use a critical thinking test that better reflected this process as it occurs in everyday life and allowed students to construct rather than select their answers. A related limitation has to do with the lack of a precise definition of critical thinking. As outlined earlier, critical thinking has been defined as a construct consisting of several aspects. It is possible that the proportion of questions representing each aspect of critical thinking could have differed from pre-test to post-test. This may have
reduced the degree to which the pre-test and post-test were comparable. This possible variation in the proportion of the types of critical thinking questions may also explain why one version was found to be more difficult than the other in a particular study (e.g., in Study 3, Form B was found to be more difficult than Form A). While the results of this research seem encouraging given its short time frame, clearer effects might have been found had it been conducted over a longer period. Dressel and Mayhew (1954) found that the biggest gains in critical thinking skills occur during the first 2 years of college.

A second limitation is that the findings in each of these studies may have been attenuated somewhat by the narrow observed variance in the amount of higher-order questions. Barnes (1983) and Smith (1977) note that college instructors have been found to rely very heavily on lower-order questions. One possible reason higher-order questions are used so infrequently is that faculty are not adequately trained to write them (Barnes, 1983; Braxton, 1990; Nordvall and Braxton, 1996). Another factor could be the academic disciplines examined. Nordvall and Braxton point out that lower-level objectives and lower-level questions may be more appropriate for disciplines such as mathematics and chemistry, which have a high level of consensus on theory and methods, while higher-level objectives are more likely to occur in disciplines such as history or philosophy, which encompass a variety of theoretical perspectives. Similarly, increased variance and statistical power could be obtained by including other academic processes, such as written instructions for term papers and assignments. However, the benefit of assessing the amount of higher-order questions more thoroughly across varied disciplines may pose another problem, as the accuracy with which someone classifies higher-order items may vary with his or her level of knowledge of that subject (Braxton and Nordvall, 1985).

Bloom (1956) outlines two main conditions for correctly classifying test or assignment questions. First, one needs to either know, or at least make assumptions about, the context in which the material was learned. For example, if students were given the test item "Outline three criticisms of Piaget's theory of cognitive development," the person classifying this item would have to judge whether the lecture and/or the text merely described Piaget's stages of cognitive development without explicitly covering any particular strengths or limitations, or whether these criticisms were clearly explained. The difference is that the former situation would lead one to label this as an item at the evaluation level, while in the latter situation it could be considered a knowledge-level item. While Bloom also suggests that classification accuracy can be enhanced by actually attempting to solve the problem, having judges
trained to look for particular phrases within a question can help in this regard. For example, King (1995) provided a list of phrases, such as "What is the best ___ and why?", that commonly appear in higher-order questions at various levels of Bloom's taxonomy (a minimal sketch of such a phrase-based screen is given at the end of this section). Therefore, assuming that either the answer to such a question has not been explicitly covered in the course or the student has not learned it before starting the course, judges ought to be fairly consistent in at least determining whether or not a question is higher-order.

In addition to the specific recommendations outlined in each study, one general recommendation would be to validate other process indicators that are directly linked with student outcomes. In a manner similar to the approach used in this research, future studies could use different designs varying in their levels of internal and external validity. One viable process variable is the amount of instructors' feedback to students. While the amount of feedback has been found to be positively related to admission selectivity (Ewell, 1989), it seems that it has not been explored much in terms of its relation to student outcomes. Another process variable that could be explored further is the level and type of student involvement in the classroom. In an observational study, Smith (1977) found that classroom participation and student–teacher interactions were positively related to gains in students' critical thinking skills.

The aim of this area of research is to determine which aspects of an institution one should examine in order to obtain the most valid assessment of the performance of that institution. As the need for post-secondary institutions to be accountable to students, parents, industry, and even investors increases, further research in this area is essential so that more informed decisions can be made regarding the assessment and comparison of schools in terms of the quality of the education they are providing.
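As promised above, here is a minimal sketch of a phrase-based screen in Python. It is not the procedure used in these studies, which relied on a trained rater applying Bloom's taxonomy; the cue list is illustrative, and only the "What is the best ___ and why?" stem from King (1995) and the "pros and cons" example come from the text.

```python
# Illustrative phrase-based screen for likely higher-order question stems, in the
# spirit of King (1995). This is a hypothetical sketch, not the study's procedure;
# the cue patterns below (other than the King example) are assumptions.
import re

HIGHER_ORDER_CUES = [
    r"what is the best .* and why",    # stem cited from King (1995) above
    r"\bpros and cons\b",              # e.g., the drug-testing example in the text
    r"\bcompare\b", r"\bcontrast\b",   # hypothetical additional cues
    r"\bevaluate\b", r"\bjustify\b", r"\bcritique\b",
    r"\bwhat would happen if\b",
]

def looks_higher_order(question: str) -> bool:
    """Flag a question stem whose wording suggests a higher-order item."""
    q = question.lower()
    return any(re.search(cue, q) for cue in HIGHER_ORDER_CUES)

# A flagged question would still need a discipline-knowledgeable rater, since the
# final classification depends on whether the answer was directly covered in the
# course (Bloom, 1956).
print(looks_higher_order("Outline the pros and cons of drug testing in sports."))  # True
print(looks_higher_order("Define operant conditioning."))                          # False
```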

NOTES
1. While some of the criteria for inclusion are more clearly fixed (e.g., having completed all tests), the particular cutoff scores for others may not be quite as straightforward to justify (e.g., specific test scores). Although the latter type of criteria could be considered somewhat arbitrary, each cutoff score can be at least partly justified in that it was set at an arguably reasonable level in an attempt to address some of the relevant validity issues (e.g., ceiling effect, lack of motivation). In each of the three studies, the pattern of results based on all cases with complete data was similar to that reported here.

REFERENCES
Astin, A. W. (1970). The methodology of research on college input, Part one. Sociology of Education 43: 223–254.
Astin, A. W. (1971). Open admissions and programs for the disadvantaged. Journal of Higher Education 42: 629–647.
Astin, A. W., and Henson, J. W. (1977). New measures of college selectivity. Research in Higher Education 6: 1–9.
Astin, A. W., and Panos, R. J. (1969). The Educational and Vocational Development of College Students, American Council on Education, Washington, D.C.
Ball, R., and Halwachi, J. (1987). Performance indicators in higher education. Higher Education 16: 393–405.
Barnes, C. (1983). Questioning in the college classrooms. In: Ellner, C., and Barnes, C. (eds.), Studies in College Teaching: Experimental Results, Theoretical Interpretations and New Perspectives, D. C. Heath, Lexington, MA.
Bercuson, D. J., Bothwell, R., and Granatstein, J. L. (1984). The Great Brain Robbery: Universities on the Road to Ruin, McClelland and Stewart Limited, Toronto, Canada.
Bercuson, D. J., Bothwell, R., and Granatstein, J. L. (1997). The Crisis in Canada's Universities: Petrified Campus, Random House of Canada, Toronto, Canada.
Bloom, B. S. (1956). Taxonomy of Educational Objectives: Cognitive Domain, McKay, New York.
Borden, V. M. H., and Bottrill, K. V. (1994). Performance indicators: History, definitions, and methods. In: Borden, V. M. H., and Banta, T. W. (eds.), Using Performance Indicators to Guide Strategic Decision Making, Jossey-Bass, San Francisco.
Braxton, J. M. (1990). Course-level academic processes as indicators of the quality of undergraduate education. Instructional Developments 1: 8–10.
Braxton, J. M. (1993). Selectivity and rigor in research universities. Journal of Higher Education 64: 657–675.
Braxton, J. M., and Nordvall, R. C. (1985). Selective liberal arts colleges: Higher quality as well as higher prestige? Journal of Higher Education 56: 536–554.
Braxton, J. M., and Nordvall, R. C. (1988). Quality of graduate department origin of faculty and its relationship to undergraduate course examination questions. Research in Higher Education 28: 145–159.
Bruneau, W., and Savage, D. C. (2002). Counting Out the Scholars: How Performance Indicators Undermine Universities and Colleges, Lorimer, Toronto.
Conrad, C. F., and Blackburn, R. T. (1985). Program quality in higher education: A review and critique of literature and research. In: Smart, J. C. (ed.), Higher Education: Handbook of Theory and Research (Vol. 1), Agathon, New York, pp. 283–308.
Dressel, P. L., and Mayhew, L. B. (1954). General Education: Explorations in Evaluation, American Council on Education, Washington, D.C.
Ennis, R. H. (1985). A logical basis for measuring critical thinking skills. Educational Leadership 43: 44–48.
Ewell, P. T. (1989). Institutional characteristics and faculty/administrator perceptions of outcomes: An exploratory analysis. Research in Higher Education 30: 113–136.
Facione, P. A. (ed.) (1990). Critical Thinking: A Statement of Expert Consensus for Purposes of Educational Assessment and Instruction, American Philosophical Association, ERIC ID 315 423.
Ferguson, N. B. L. (1986). Encouraging responsibility, active participation, and critical thinking in general psychology students. Teaching of Psychology 13: 217–218.
Foster, P. (1983). Verbal participation and outcomes in medical education: A study of third-year clinical-discussion groups. In: Ellner, C., and Barnes, C. (eds.), Studies in College Teaching: Experimental Results, Theoretical Interpretations and New Perspectives, D. C. Heath, Lexington, MA.
Furedy, C., and Furedy, J. (1985). Critical thinking: Toward research and dialogue. In: Donald, J., and Sullivan, A. (eds.), Using Research to Improve Teaching (New Directions for Teaching and Learning No. 23), Jossey-Bass, San Francisco.
Gadzella, B. M., Ginther, D. W., and Bryant, G. W. (1996). Teaching and learning critical thinking skills. Paper presented at the International Congress of Psychology (26th, Montreal, Quebec, August 1996).
Gage, N. L., and Berliner, D. C. (1998). Educational Psychology (6th Ed.), Houghton Mifflin Company, Boston.
Gareis, K. C. (1995). Critiquing articles cited in the introductory textbook: A writing assignment. Teaching of Psychology 22: 233–235.
Halpern, D. F. (1996). Thought and Knowledge: An Introduction to Critical Thinking (3rd Ed.), Erlbaum, Mahwah, NJ.
Halpern, D. F. (1998). Teaching critical thinking for transfer across domains. American Psychologist 53: 449–455.
Halpern, D. F. (2001). Assessing the effectiveness of critical thinking instruction. The Journal of General Education 50: 270–286.
Hossler, D. (2000). The problem with college rankings. About Campus 5: 20–24.
Keeley, S. M., Ali, R., and Gebing, T. (1998). Beyond the sponge model: Encouraging students' questioning skills in abnormal psychology. Teaching of Psychology 25: 270–274.
King, A. (1989). Effects of self-questioning training on college students' comprehension of lectures. Contemporary Educational Psychology 14: 1–16.
King, A. (1990). Enhancing peer interaction and learning in the classroom through reciprocal questioning. American Educational Research Journal 27: 664–687.
King, A. (1995). Inquiring minds really do want to know: Using questioning to teach critical thinking. Teaching of Psychology 22: 13–17.
Logan, G. H. (1976). Do sociologists teach students to think more critically? Teaching Sociology 4: 29–48.
McGuire, M. D. (1995). Validity issues for reputational studies. In: Walleri, R. D., and Moss, M. K. (eds.), Evaluating and Responding to College Guidebooks and Rankings, Jossey-Bass, San Francisco, pp. 45–59.
McMillan, J. H. (1987). Enhancing college students' critical thinking: A review of studies. Research in Higher Education 26: 3–29.
Nedwick, B. P., and Neal, J. E. (1994). Performance indicators and rational management tools: A comparative assessment of projects in North America and Europe. Research in Higher Education 35: 75–103.
Nordvall, R. C., and Braxton, J. M. (1996). An alternative definition of quality of undergraduate education: Toward usable knowledge for improvement. Journal of Higher Education 67: 483–497.
Pascarella, E. T., and Terenzini, P. T. (2005). How College Affects Students: A Third Decade of Research (Vol. 2), Jossey-Bass, San Francisco.
Paul, R. (1993). Critical Thinking: How to Prepare Students for a Rapidly Changing World (3rd Ed.), Sonoma State University Center for Critical Thinking and Moral Critique, Rohnert Park, CA.
Prentice, D. A., and Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin 112: 160–164.
Scheirer, M. (1994). Designing and using process evaluation. In: Wholey, J. S., Hatry, H. P., and Newcomer, K. E. (eds.), Handbook of Practical Program Evaluation, Jossey-Bass, San Francisco.
Smith, D. G. (1977). College classroom interactions and critical thinking. Journal of Educational Psychology 69: 180–190.
Smith, D. G. (1980). College instruction: Four empirical views. Instruction and outcomes in an undergraduate setting. Paper presented at the annual meeting of the American Educational Research Association, Boston, April 1980.
Solmon, L. C. (1973). The definition and impact of college quality. In: Solmon, L., and Taubman, P. (eds.), Does College Matter? Academic Press, New York, pp. 77–105.
Solmon, L. C. (1975). The definition of college quality and its impact on earnings. Explorations in Economic Research 2: 537–587.
Stalker, R. G. (1986). Is the quality of university education declining? Survey of faculty attitudes and longitudinal comparison of undergraduate honours theses. Unpublished honours thesis, University of Western Ontario, London, Canada.
Tan, D. L. (1986). The assessment of quality in higher education: A critical review of the literature and research. Research in Higher Education 24: 223–265.
Tsui, L. (1999). Courses and instruction affecting critical thinking. Research in Higher Education 40: 185–200.
Tsui, L. (2002). Fostering critical thinking through effective pedagogy: Evidence from four case studies. Journal of Higher Education 73: 740–763.
Watson, G. B., and Glaser, E. M. (1980). Watson–Glaser Critical Thinking Appraisal, The Psychological Corporation, San Antonio.
Webster, D. (1992). Are they any good? Rankings of undergraduate education in U.S. News and World Report and Money. Change 24: 19–31.
Williams, R. L., Oliver, R., Allin, J. L., Winn, B., and Booher, C. S. (2003). Psychological critical thinking as a course predictor and outcome variable. Teaching of Psychology 30: 220–223.
Williams, R. L., Oliver, R., and Stockdale, S. (2004). Psychological versus generic critical thinking as predictors and outcome measures in a large undergraduate human development course. Journal of General Education 53: 37–58.
Williams, R. L., and Stockdale, S. L. (2003). High performing students with low critical thinking skills. Journal of General Education 52: 199–225.
Willis, S. A. (1992). Integrating levels of critical thinking into writing assignments for introductory psychology students. Paper presented at the annual meeting of the American Psychological Association, Washington, DC, August 1992.
Winne, P. H. (1979). Experiments relating teachers' use of higher cognitive questions to student achievement. Review of Educational Research 49: 13–50.

Received September 6, 2005