A thesis submitted to the faculty of the Charles Schmidt College of Science in partial fulfillment of the requirements for the degree of Master of Science
ABSTRACT
Author: Angelica Hotiu
Title: The relationship between item difficulty and discrimination indices in multiple-choice tests in a physical science course
We have developed a method of quantifying multiple-choice test items in an introductory physical science course in terms of the various tasks required to solve the problem. We assign a numerical level of difficulty to each task so that any question can be assigned a degree of difficulty, which is the sum of the individual levels of difficulty associated with each step. Using the questions and results from the tests, we have investigated the relationship between the degree of difficulty of each question and the corresponding discrimination index. Our results indicate that as the degree of difficulty increases so does the capability of the item to discriminate between students with different abilities. There is a maximum degree of difficulty beyond which the discrimination starts to decrease. At that point, test items become too difficult. Thus, it should be possible in the future to design items that will provide optimum discrimination.
ACKNOWLEDGEMENTS
First of all I would like to express my sincere gratitude and appreciation to Dr. Robin Jordan for his effort, guidance, devotion and advice during the entire study and the preparation of the thesis.
Also many thanks are extended to all members of the faculty, staff and graduate students in the Department of Physics at FAU. I am also very grateful to Dr. Warner Miller, Chair of the Department of Physics, and the members of my thesis committee for all their advice and assistance.
Thanks also go to Dr. Fernando Medina for giving me the opportunity to study in the Physics Department at Florida Atlantic University.
Finally, I want to express my gratitude to my lovely family: to my parents, and especially to my mother for taking care of my son during my studies, and to my husband, Laurentiu, who supported and encouraged me during these studies. Thank you for all that you have done for me.
Contents
1. Introduction
2. Theory
2.1 Anatomy of multiple-choice questions
2.2 Bloom's taxonomy
2.3 The cognitive domain
2.4 The discrimination index
2.5 The degree of difficulty
3. Results of the Research
3.1 Description
3.2 Analysis of the questions for which D>0.5
4. Concluding remarks
5. References
Chapter 1 Introduction
The classroom test is one of the most important parts of the teaching and learning process. There are several different types of tests (those with short essay answers, multiple-choice answers, etc.), and the type used will depend on a number of factors such as the instructional objectives, the class size, the type of instruction, the type of subject matter and the type of feedback required by the instructor. However, the two most important characteristics of any achievement test are its content validity and reliability. A test's validity is determined by how well it samples the range of knowledge, skills, and abilities that students were supposed to acquire in the period covered by the test. The reliability of a test depends upon grading consistency and discrimination between students of differing performance levels.
There are two major types of multiple-choice tests: criterion-referenced tests (CRTs) and norm-referenced tests (NRTs). In criterion-referenced testing, the goal is usually to make a decision about whether or not an individual can demonstrate mastery in an area of content and competencies; examples include the written part of a driving test, and certification and licensure exams. In norm-referenced testing, the goal is usually to rank the entire set of individuals in order to make comparisons of their performances relative to one another. In this study, we will be analyzing students' performances on multiple-choice tests administered during a physical science course; such tests are NRTs.
Although multiple-choice tests are widely used, many instructors do not hold them in high regard; some believe, for example, that multiple-choice questions are really multiple-guess items, or that multiple-choice questions are only capable of testing factual information and so are ill suited for testing higher-order cognitive skills. However, it is now accepted that well-constructed multiple-choice items can test many of the same cognitive skills that essay tests do. Moreover, they can be used to diagnose student difficulties if the incorrect options are designed to reveal common misconceptions, and they can provide a more comprehensive sampling of the subject material because more questions can be asked. In addition, they are often more valid and reliable than essay tests because (a) they sample material more broadly; (b) discrimination between performance levels is easier to determine; and (c) scoring consistency is virtually guaranteed when carried out by machine.
The validity of multiple-choice tests depends upon a systematic selection of items with regard to both content and level of learning. Although most teachers try to select items that sample the range of content covered in class, they often fail to consider the level or degree of difficulty of the questions they use. Moreover, since it is easy to develop items that require only recognition or recall of information, instructors tend to rely heavily on those types of questions. Unfortunately, multiple-choice tests in the instructors' manuals that accompany textbooks are often composed exclusively of recognition or recall items.
Psychologists have elaborate systems for classifying different cognitive levels, but for most test planning purposes, a simple three-level scheme is sufficient to ensure that the range of knowledge, skills, and abilities is tested appropriately. The three categories are recall, application, and evaluation/synthesis, and they are derived from the six levels of Bloom's taxonomy of cognitive objectives [1.1]. At the lowest level, recall, students remember specific facts, terminology, principles, or theories, e.g., stating Newton's 2nd Law. At the median level, application, students use their knowledge to solve a problem or analyze a situation, e.g., using Newton's 2nd Law to determine the motion of an object. The highest level, evaluation and synthesis, requires students to derive hypotheses from data, or put the parts of a problem together, or exercise informed judgment. By analyzing the course material in terms of these three categories, multiple-choice tests can be constructed that sample both the range of content and the various cognitive levels at which the students must operate. Performing this analysis is an essential step in designing multiple-choice tests that have high validity and reliability.
The purpose of this study is not to provide a comprehensive guide for constructing multiple-choice items; there are several excellent articles available that provide such information [1.2, 1.3]. Our main aim is to investigate and quantify two of the most important factors in creating valid and discriminating multiple-choice tests, namely, the degree of difficulty and the discrimination index; we define these quantities below using the results of actual tests. We have been unable to find any previously published, quantitative data on such a study, except for a private communication from Hostetter and Haky, who made a similar study of multiple-choice test items in introductory General Chemistry [1.4]. Accordingly, we have analyzed the results of six multiple-choice tests (labeled 1A, 1B, 2A, 2B, 3A and 3B) given in a Physical Science class (PSC2121) at Florida Atlantic University in the Fall 2004 semester. The number of students that took each test was ~50. The numbers 1, 2 and 3 represent the number of the test during the semester (there were five tests in total), and A and B represent two different versions given to different groups of students but covering the same material and designed to be as similar as possible.
Physical science is a general science course for non-science majors, covering topics in physics, chemistry and earth science. However, in this study we restricted ourselves to questions on topics that were within the physics discipline; the subject material covered by the tests is shown in Table 1.1 and the number of students taking each test and the average scores are shown in Table 1.2. The tests were compiled by Dr. Robin Jordan, Physics Department, Florida Atlantic University.
Topics
Vector analysis; Resolution of vectors; Speed and velocity; Accelerated motion; A theory of motion; Galileo and the Experimental motion
Planetary motion; Ptolemy's system; The Copernican revolution; Gateway to the skies: Tycho Brahe; How planets move: Johannes Kepler; Galileo's discoveries with the telescope
Isaac Newton's marvelous year; The Principia; Newton's first law of motion. Inertia; Newton's 2nd law of motion. Force; Applications of Newton's 2nd law; Newton's 3rd law of motion. Action and reaction; The center-seeking force
Temperature measurement; Temperature scale; The lowest temperature; Kinetic theory and the molecular interpretation of temperature; Temperature and heat; Specific heat; Calorimetry; Change of state; Thermal expansion; Energy conservation; Mechanical equivalent of heat; The 1st law of thermodynamics; The 2nd law of thermodynamics
Wave motion and sound; Transverse waves; Longitudinal waves; Reflection of waves; Refraction of waves; Superposition of waves. Interference; Standing waves; Vibrating air columns
Light and other electromagnetic waves; The velocity of light; Electromagnetic waves; Electromagnetic spectrum, radio, TV, microwaves; Simple lenses; The optics of the eye
Amber phenomenon; Conductors, semiconductors, insulators; Forces between electric charges; Electric current; Electric circuits; Electric power and energy
The quantum theory of radiation and matter; Spectroscopy; The electron; X rays; Radioactivity; Planck's quantum hypothesis; Einstein's photoelectric equation
Table 1.1 The subject material (chapters and topics) covered by the tests
Test    Number of respondents    Scoring (%)
1A      52                       26.7, 86.7
1B      53                       16.7, 86.7
2A      52                       16.7, 90.0
2B      53                       19.4, 77.4
3A      48                       30.0, 86.7
3B      48                       26.7, 83.3

Table 1.2 Details of the tests used in this study. The tests were part of the PSC2121 course given in the Fall 2004 semester. * One question was omitted from the analysis due to a technical problem.
Each question on a multiple-choice test has a discrimination index that measures how well the question discriminates between students in the top 27% of the class on total test score and those in the lower 27% of the class on total test score. As we explain in more detail below, the discrimination index can range from +1 to -1; a value of +1 means that all of the high scorers answered the question correctly and all of the low scorers answered the question incorrectly. A value of 0 means that the same number of high scorers and low scorers obtained the correct answer and so the question does not discriminate between the two sub-groups of students. In this study, we analyzed the questions from all tests for which the discrimination index was >0.5. To determine the degree of difficulty, we identified the various tasks or operations, such as memorization and identification, application, unit conversion, algebraic manipulation, use of vectors, etc., required to answer each question [1.5]. We assigned a numerical level of difficulty to each task, based on the range of knowledge, skill, and ability required, so that any question involving a number of different steps has an overall degree of difficulty, which is the sum of the individual levels of difficulty associated with each of the required steps.
The results indicate a definite correlation between the degree of difficulty and the discrimination index. For example, as the degree of difficulty increases so does the discrimination index, which is not unexpected. However, there is a maximum degree of difficulty beyond which the discrimination index starts to fall off. At that point, the test items become too difficult for both the high scorers and the low scorers to answer, so that they no longer discriminate effectively. Clearly, there are two extremes: questions that are too easy, i.e., with a small difficulty value, and those that are too hard, i.e., with a high difficulty value. Such questions are not effective if the purpose of a test is to produce a spread of scores, reflecting differences in student achievement and abilities.
As part of our study, we have been able to identify the common tasks that are involved in the most discriminating questions. Our results suggest that for optimum discrimination, i.e., questions resulting in a discrimination index > 0.5, the degree of difficulty lies within a reasonably well-defined range for all the tests analyzed. So, in principle, by adopting our assigned levels of difficulty for each task or operation, one can actually design questions with the required level of difficulty and range of cognitive levels that will result in multiple-choice tests that truly discriminate between students of different abilities.
Our study is very similar to the analysis of multiple-choice test items in a General Chemistry I course, carried out by Hostetter and Haky [1.5]. Indeed, it was their study that prompted ours. Altogether, they used the results from approximately 300 students, a somewhat larger sampling group than in our study. Our results, based on an analysis of physics topics, indicated a similar correlation between the degree of difficulty and discrimination; namely, as the difficulty increased the average discrimination increased, but there was a critical level of difficulty beyond which the discrimination decreased.
Chapter 2 Theory
A standard multiple-choice item consists of two basic parts: a problem (the stem) and a list of suggested solutions (the alternatives).
Typically, multiple-choice items present the stem either as a complete question or as an incomplete statement, and the list of alternatives contains one correct or best alternative (the answer) and a number of incorrect or inferior alternatives (the distractors). For example:

Stem in complete form
What is the weight of an object?
a) The force with which it is attracted to the earth
b) The amount of matter that it contains
c) A measure of its inertia
d) The same quantity as its mass but expressed in different units

Incomplete statement
The weight of an object is:
a) The force with which it is attracted to the earth
b) The amount of matter that it contains
c) A measure of its inertia
d) The same quantity as its mass but expressed in different units
Students are directed to select either the correct answer or the best answer from the list of options provided. In the correct answer form, the answer is correct beyond question while the distractors are definitely incorrect. In the best answer version, more than one option may be appropriate in varying degrees. The purpose of the distractors is to appear as plausible solutions to the problem for those students who have not achieved the required learning examined by the question. On the other hand, the distractors will appear as implausible solutions for those students who have achieved the required learning; only the correct answer is plausible for those students. As we mentioned in the Introduction, multiple-choice items can be designed to test not only the lower levels of the learning process, i.e., recall, but also the higher-level skills of comprehension, application and analysis, all of which may be part of the required educational objectives of the class.
Starting in 1948, a committee of colleges, led by Benjamin Bloom, began the task of classifying educational goals and objectives. The intent was to develop a classification system for three domains, the cognitive, the affective, and the psychomotor:
Cognitive: mental skills (Knowledge)
Affective: growth in feelings or emotional areas (Attitude)
Psychomotor: manual or physical skills (Skills)
[Diagram: the learning process, divided into the cognitive, affective and psychomotor domains]
They completed their study on the cognitive domain in 1956 and the resulting classification system is now commonly referred to as Bloom's Taxonomy of the Cognitive Domain [2.1]. Work on the affective and psychomotor domains was completed in 1972-3 [2.2, 2.3]. The divisions between different classes of skills or behavior are not absolute and other systems or hierarchies have been devised in the educational and training world. However, Bloom's taxonomy is the most easily understood and is arguably the one most used today.
The major idea of the taxonomy of the cognitive domain is that what educators want students to know, i.e., the educational objectives, can be arranged in a hierarchy, starting from the simplest behavior or skill and ending with the most complex. As a result, it can also provide a useful structure within which to categorize and analyze test items. Instructors characteristically ask questions within particular skill levels; for example, Bloom found that over 95% of the test questions students encounter require them to think only at the lowest possible level, i.e., the recall of information. However, education research shows that students remember more, and can apply their knowledge more effectively, when they have learned to handle the topic at the higher levels of the taxonomy, where more complex skills are required [2.4, 2.5]. Clearly, students can "know" about a topic or subject at different levels. So, it is plain there must be a close link between the taxonomy and test questions, if the latter are constructed with the aim of checking the skill level of students, and discriminating between students of different abilities.
The cognitive domain involves knowledge and the development of intellectual skills. This includes the recall or recognition of specific facts, procedural patterns, and concepts that serve in the development of intellectual abilities and skills. There are six major categories, which are shown in Tables 2.1 to 2.3, starting from the simplest behavior to the most complex. The categories can be thought of as degrees or hierarchies of difficulties.
Competence: 1. Knowledge
Skills demonstrated: observation and recall of information; knowledge of dates, events, places; knowledge of major ideas; mastery of subject matter
Keywords: list, define, tell, describe, identify, show, label, collect, examine, tabulate, quote, name, who, when, where, etc.

Competence: 2. Comprehension
Skills demonstrated: interpret facts, compare, contrast; order, group, infer causes; predict consequences
Keywords: summarize, contrast, describe, predict, interpret, associate

Table 2.1 The lowest levels of intellectual behaviors within the cognitive domain identified by Bloom.
Competence: 3. Application
Keywords: complete, illustrate, show, solve, examine, modify, relate, change, classify, experiment, discover

Competence: 4. Analysis
Skills demonstrated: seeing patterns; organization of parts; recognition of hidden meanings; identification of components
Keywords: analyze, separate, order, explain, connect, classify, arrange, divide, compare, select, infer

Table 2.2 The median levels of intellectual behaviors within the cognitive domain identified by Bloom.
Competence: 5. Synthesis
Skills demonstrated: use old ideas to create new ones; generalize from given facts; relate knowledge from several areas
Keywords: rearrange, substitute, plan, create, design, invent, what if?, compose, formulate, prepare, generalize, rewrite

Competence: 6. Evaluation
Skills demonstrated: compare and discriminate between ideas; assess value of theories, presentations
Keywords: assess, decide, rank, grade, test, measure, recommend, convince, select, judge, explain

Table 2.3 The highest levels of intellectual behaviors within the cognitive domain identified by Bloom.
The discrimination index is a useful measure of item quality whenever the purpose of a test is to produce a spread of scores, reflecting differences in student achievement, so that distinctions may be made among the performances of respondents. It measures the extent to which item responses discriminate between individuals who have a higher overall score on a test and those who get a lower overall score. The discrimination index is determined automatically by the FAU computer-based test scoring and analysis system [2.6], in the following way. The distribution of students is treated as normal and so the students' scores are arranged into two subgroups [2.7]: the top 27%, the upper group (U), and the bottom 27%, the lower group (L).
The discrimination index for a particular question is defined in terms of the proportion of the students in the upper group who got it correct, $p_U$, and the proportion of the students in the lower group who got it correct, $p_L$:

$$D = p_U - p_L.$$
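As an illustration only, the definition of $D$ given above can be sketched in Python. This is not the FAU scoring system, whose internals are not described here; the function name `discrimination_index`, the rounding used to fix the size of the two groups, the arbitrary treatment of tied total scores, and the ten-student data set are all assumptions made for the sketch.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Return D = p_U - p_L for one test item.

    total_scores: total test score of each student.
    item_correct: 1 if that student answered this item correctly, else 0.
    fraction:     share of the class placed in each of the upper and lower groups.
    """
    n = len(total_scores)
    if n == 0 or n != len(item_correct):
        raise ValueError("need one total score and one item result per student")
    # Group size: an assumed rounding rule, with at least one student per group.
    k = max(1, round(fraction * n))
    # Rank students from highest to lowest total score (ties broken arbitrarily).
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = ranked[:k], ranked[-k:]
    p_u = sum(item_correct[i] for i in upper) / k
    p_l = sum(item_correct[i] for i in lower) / k
    return p_u - p_l


# Hypothetical class of ten students; only the three top scorers get the item right.
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
item = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
d_example = discrimination_index(scores, item)
```

With these made-up data every student in the upper group and no student in the lower group answers correctly, so the sketch returns the maximum value of the index.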
When $D < 0$ there is inverse discrimination, which is most likely caused by a mis-keyed item. Thus, discrimination indices $\approx 0$ are found on difficult items such that almost everyone gets them wrong and on items so easy that almost everyone gets them right.
For instructional purposes it is important to know the content areas and type of items that most students get right or wrong. As mentioned earlier, when multiple-choice tests are graded using the FAU computer-based test scoring and analysis system, values of the discrimination indices are obtained automatically [2.6].
In order to carry out this study, we need a quantitative measure of the difficulty of a question. The difficulty of a question is normally determined from the proportion of the total group selecting the correct answer to that question. The following formula may be used to calculate the difficulty factor (sometimes called the p-value):

$$p = \frac{c}{n} \times 100,$$

where $c$ is the number of students who selected the correct answer and $n$ is the total number of respondents. A value of $p = 100\%$ indicates that all the students selected the correct answer and so that item is very easy. A value of 0 indicates that none of the students selected the correct answer and so that item is very difficult. So, this ratio is one measure of how difficult the question was to answer.
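The formula above amounts to a one-line calculation. The following Python sketch shows it with basic input checks; the function name `difficulty_factor` and the sample counts are illustrative assumptions, not values taken from the tests.

```python
def difficulty_factor(num_correct, num_respondents):
    """Return the difficulty factor p = (c / n) * 100, in percent."""
    if num_respondents <= 0:
        raise ValueError("the number of respondents must be positive")
    if not 0 <= num_correct <= num_respondents:
        raise ValueError("the number of correct answers must lie between 0 and n")
    return 100.0 * num_correct / num_respondents


# Hypothetical item: 26 of 52 respondents chose the correct answer.
p_example = difficulty_factor(26, 52)
```

Note that a larger value of $p$ corresponds to an easier item, since $p$ counts the students who answered correctly.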
The implication is that if the purpose of the test is to test an individual's mastery of the material, i.e., as in a criterion-referenced test (CRT), $p$ values of ~90% may be expected. However, if the emphasis is to obtain a spread of scores between
for each question of a test, we observe a definite correlation between the two quantities, as shown in Figures 2.1 to 2.6 for tests 1, 2 and 3; similar behavior is observed for all tests. First, as $p$ increases, the discrimination index also increases, but at a $p$ value between ~40% and ~60%, the discrimination reaches a maximum. When $p \gtrsim 60\%$, the discrimination index decreases. It is generally claimed that items for which 40% to 60% of the group passes are preferred to those that are easier ($p > 60\%$) or more difficult ($p < 40\%$) [2.6]. In these particular cases, the numbers of items falling into the range $40\% < p < 60\%$ in tests 1A, 1B, 2A, 2B, 3A and 3B are 8/26, 10/26, 8/31, 10/31, 12/30 and 9/30, respectively.
Figure 2.1. The discrimination index versus the difficulty factor, $p$ (%), for Test 1A.
Figure 2.2. The discrimination index versus the difficulty factor, $p$ (%), for Test 1B.
Figure 2.3. The discrimination index versus the difficulty factor, $p$ (%), for Test 2A.
Figure 2.4. The discrimination index versus the difficulty factor, $p$ (%), for Test 2B.
Figure 2.5. The discrimination index versus the difficulty factor, $p$ (%), for Test 3A.
Figure 2.6. The discrimination index versus the difficulty factor, $p$ (%), for Test 3B.
Note that over the range $40\% < p < 60\%$, the discrimination index is $\gtrsim 0.5$. Therefore, we will take $D = 0.5$ as the desirable minimum value for a discriminating item.
The difficulty factor, as defined above, is a property of the obtained measurements. However, we require a definition that depends on the content of the question and reflects the difficulty and complexity of the tasks required to find a solution. Thus, we seek a quantitative and independent measurement of difficulty.
We have found that it is possible to assign a degree of difficulty to items on a multiple-choice test based on the knowledge and tasks required to solve the problem. Basically, all questions can be analyzed in terms of a combination of letters and numbers. The letters represent the tasks or actions that students must perform in order to obtain a complete solution to the problem; the numbers indicate the number of times each task or action is performed. In general terms, Bloom's taxonomy, described above, classifies the various tasks and actions, e.g., simple memorization (recall), unit conversion, solving a system of equations, etc., into a hierarchy. Using the classification system as a guide, we are able to assign a numerical level of difficulty to each of these tasks, as shown in Table 2.4.
Task                                                      Level of difficulty
Knowledge and recall (K), Identification (I)              1
Application (A)                                           2
Unit conversion (simple) ($C_3$), Simple equation (E)     3
Unit conversion ($C_4$), Vector analysis (V)              4
Solving an equation ($S_5$), Derivation (D)               5
Solving a system of equations ($S_6$)                     6

Table 2.4 Numerical level of difficulty associated with each task
In this way we are able to assign an overall degree of difficulty to each question on a test as the sum of the individual levels of difficulty encountered to obtain the answer. In more detail, the tasks in Table 2.4 are:
Knowledge (K) or recall: a task that simply involves memorization of a definition or a quantity that must be known in order to answer the question.
Identification (I): a task that requires identification of the process, laws or the equation that must be used in order to solve the problem.
Application (A): a task in which knowledge is applied to a problem.
Unit conversions ($C_3$ and $C_4$): tasks in which a unit conversion is done in completing the problem.
Simple equation (E): a task that involves simply inserting numbers into an equation to obtain a solution.
Vector analysis (V): a task in which vector addition or manipulation of vectors is required in order to solve the problem.
Derivation (D): a task that requires the derivation or proof of an algebraic expression.
Equation ($S_5$ and $S_6$): tasks that involve the manipulation of one or more equations before numbers can be input in order to obtain a result.
1. Example of 1K question (Test 1A, Q9):
Velocity is a rate of change of
a) Speed
b) Energy
c) Distance
d) Displacement

In order to answer this question correctly, the student should know the definition of velocity. The level of difficulty of this question is 1.
2. Example of KI question (Test 1A, Q19):
A skydiver jumps from an airplane. As her velocity of fall increases, neglecting air resistance, her acceleration
a) Increases
b) Is constant
c) Decreases

In order to answer this question correctly, the student needs to
Identify the type of motion for a skydiver (uniformly accelerated motion)
Know that acceleration is constant during the motion
The difficulty level of this question is 2.
3. Example of 2IAE question (Test 1B, Q18):
What is the speed of an object after 4 s, if it falls from rest with an acceleration of 32 ft/s$^2$?
a) 32 ft/s

In order to answer this question correctly, the student has to
Identify the type of motion (uniformly accelerated motion, free fall)
Apply the formula for the velocity in uniformly accelerated motion, $v = v_0 + at$
Identify that the initial speed is $v_0 = 0$
Solve the equation for $v$
The difficulty level of this question is $2 \times 1 + 2 + 3 = 7$.
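The additive scheme used in the three examples can be sketched as a small Python routine that turns a task code such as 2IAE into a degree of difficulty. The levels for K, I, A and E follow from the worked examples, and those for $C_3$, $C_4$, $S_5$ and $S_6$ from their subscripts; the levels used for V and D, the spelling of the codes as plain strings (`C3`, `S5`, and so on), and the function name `degree_of_difficulty` are assumptions of this sketch.

```python
import re

# Level of difficulty per task symbol. V = 4 and D = 5 are assumed here.
LEVELS = {"K": 1, "I": 1, "A": 2, "C3": 3, "E": 3,
          "C4": 4, "V": 4, "S5": 5, "D": 5, "S6": 6}

_TASK = r"(?:C3|C4|S5|S6|K|I|A|E|V|D)"
_TOKEN = re.compile(r"(\d*)(" + _TASK[3:-1] + r")")
_WHOLE = re.compile(r"(?:\d*" + _TASK + r")+")


def degree_of_difficulty(code):
    """Sum count * level over the tasks in a code such as '2IAE'.

    A number in front of a task symbol is the number of times the task
    is performed; a missing number means the task is performed once.
    """
    if not _WHOLE.fullmatch(code):
        raise ValueError(f"unrecognized task code: {code!r}")
    return sum(int(count or 1) * LEVELS[task]
               for count, task in _TOKEN.findall(code))


# The three worked examples: 1K, KI and 2IAE.
examples = {code: degree_of_difficulty(code) for code in ("1K", "KI", "2IAE")}
```

For the three examples this reproduces the values 1, 2 and 7 quoted in the text.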
The main aim of this study is to investigate any relationship between the level of difficulty of a particular question and the corresponding discrimination index, using the results of a total of six multiple-choice tests in a physical science course.
Chapter 3 Results of the Research
3.1 Description
As mentioned previously, the main aim of this study is to investigate the relationship between the degree of difficulty of a particular question and the corresponding discrimination index. The degree of difficulty is defined in Chapter 2 and can be described as a numerical quantity that depends on the content of the question and reflects the difficulty and complexity of the tasks and operations required to find a solution. In this study, we use a combination of letters and numbers to quantify a complete solution to a question; the letters represent the task(s) or action(s) that must be performed and the numbers represent the number of times each task or action is performed. As we described above, we have classified the tasks and actions into a hierarchy, using Bloom's taxonomy as a guide, and assigned a numerical degree of difficulty to each of the tasks. For example, the following question:
The speed limit in a school zone is 20 mi/h and it is strictly enforced. If you are driving at 30 km/h are you likely to get a ticket?
(a) Yes
(b) No
can be analyzed in the following way. In order to answer this question the student should
convert km/h to mi/h using the relationship $1\ \text{mi} = 1.61\ \text{km}$, i.e., $1\ \text{km} = (1/1.61)\ \text{mi} = 0.621\ \text{mi}$. This task is $C_3$, a simple unit conversion with a level of difficulty of 3.
solve the equation for $v$: $v = 30\ \text{km/h} = 30 \times 0.621\ \text{mi/h} = 18.6\ \text{mi/h}$. This task is E, a simple equation with a level of difficulty of 3.
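The two steps of this example can be checked with a few lines of Python. The conversion factor is the one quoted above; the variable and function names are illustrative, and the final sum simply adds the two levels of difficulty stated for the $C_3$ and E tasks.

```python
KM_PER_MILE = 1.61  # relationship used in the text: 1 mi = 1.61 km


def kmh_to_mph(speed_kmh):
    """Convert a speed from km/h to mi/h (the C3 step followed by the E step)."""
    return speed_kmh / KM_PER_MILE


speed_mph = kmh_to_mph(30.0)          # driving speed from the question
over_limit = speed_mph > 20.0         # school-zone limit of 20 mi/h
question_difficulty = 3 + 3           # level of C3 plus level of E
```

Since 30 km/h corresponds to roughly 18.6 mi/h, which is below the 20 mi/h limit, the sketch gives answer (b), and the two tasks together give the question a degree of difficulty of 6.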
The discrimination index measures the extent to which the question discriminates between individuals who fall into the top 27% of scorers on a test and those who fall into the bottom 27%. The index, as defined in Chapter 2, has a value in the range $-1 \le D \le +1$ and is determined automatically for each question on a test by the FAU computer-based test scoring service. For the purposes of this study, we claim that questions with values of $D > 0.5$ qualify as questions that are reasonable discriminators; hence, we concentrated only on such test items in our study.
Altogether, we analyzed the results of six multiple-choice tests (labeled 1A, 1B, 2A, 2B, 3A and 3B) given in a Physical Science class (PSC2121) at Florida Atlantic University in the Fall 2004 semester and selected only those items for which $D > 0.5$.
In Tables 3.1 to 3.6, we list the results of our analysis of the items on the six tests for which $D > 0.5$.
In Figures 3.1 to 3.9, we show plots of the degree of difficulty and the discrimination index for the individual tests. We have included a second-order polynomial fit to the data simply to act as a guide to the eye.
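For readers who wish to reproduce such a guide to the eye, the sketch below fits $D = a x^2 + b x + c$ to a set of (degree of difficulty, discrimination index) pairs by ordinary least squares, using only the Python standard library. The fitting routine is a generic one and is not claimed to be the procedure used to produce the figures; the data points are invented so that the result is easy to check and are not taken from the tests.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c)."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three (x, y) pairs")
    # Sums that appear in the normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]                 # sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sum of y*x^k
    # Augmented matrix for the unknowns (a, b, c).
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate data: x values are not distinct enough")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c


# Invented data lying exactly on a parabola that peaks at x = 11.
difficulty = [6, 8, 10, 12, 14, 16]
discrimination = [0.8 - 0.01 * (x - 11) ** 2 for x in difficulty]
a, b, c = fit_quadratic(difficulty, discrimination)
peak = -b / (2 * a)  # meaningful only when a < 0 (curve opens downward)
```

When the fitted curve opens downward, the position of its maximum, $-b/(2a)$, gives a rough estimate of the degree of difficulty at which discrimination is greatest.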
Despite the limited statistics, due to the relatively small number of respondents (~50) on each test, a trend does appear to emerge. The data for each test suggest that there is a correlation between the degree of difficulty and the discrimination index. Specifically, as the degree of difficulty increases the discrimination index initially also increases. However, there is an optimum degree of difficulty beyond which the discrimination begins to fall. (Such behavior was noted previously, in Chapter 2, for the difficulty factor, defined as

$$p = \frac{c}{n} \times 100,$$

where $c$ is the number of students who selected the correct answer and $n$ is the total number of respondents. But, as we argued in Chapter 2, the difficulty factor is a property of the obtained measurements and is not appropriate in our analysis, which is why we found it necessary to introduce a quantitative and independent degree of difficulty for each question, based on the content of the question and the difficulty and complexity of the tasks required to find a solution.)
We can understand such behavior by identifying the two extremes, namely, (a) questions that have a low degree of difficulty ($\lesssim 8$), i.e., questions that are too easy, and (b) questions with a high degree of difficulty ($\gtrsim 14$), i.e., questions that are too hard. Questions in these regimes are less effective in discriminating between students of different abilities because:
in case (a), more of the lower-scoring students are likely to answer the question correctly, so the test item is too easy for both the lower and higher scorers, resulting in less discrimination; and
in case (b), fewer of the higher-scoring students are likely to answer the question correctly, so the test item is too difficult for both the high scorers and the low scorers, and it no longer discriminates effectively.
In spite of the limited size of the data sets, we suggest that, for the tests that we have analyzed, the optimum discrimination likely occurs when the degree of difficulty lies in the range from ~9 to ~14. It might be tempting to compare the degrees of difficulty for optimum discrimination from one test to the next; clearly, if students are learning then we might expect the degree of difficulty for optimum discrimination to increase! However, the sample set is simply not adequate for reliable comparisons. These results are very similar to those obtained by Hostetter and Haky who analyzed
the results of a number of multiple-choice tests given in an introductory General Chemistry course [1.4].
A further outcome of this study is that, in principle, it is now possible to design multiple-choice items with a known degree of difficulty and, hence, discrimination.
Finally, in Figure 3.10 we show the correlation between the measured difficulty factors (p), as defined in Chapter 2, and our calculated degrees of difficulty for Test 1A. In (a) we have used the complete set of values; in (b), where there is more than one measured difficulty factor for a particular degree of difficulty, we have plotted the averaged value. The plots indicate a close relationship between the measured and calculated values. When the calculated degree of difficulty is very small, most of the students get the correct answer, so $p \approx 100\%$; when the calculated degree of difficulty is very large, most students fail to get the correct answer, so $p \approx 0$.
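The averaging step and the linear trend line can be sketched in Python as follows. The measured difficulty factors for Test 1A are not tabulated here, so the sample values are purely hypothetical, as are the function names; the coefficient of determination is computed as the squared correlation coefficient, which is its value for a simple straight-line fit.

```python
from collections import defaultdict


def average_by_degree(degrees, p_values):
    """Average the difficulty factors of all items sharing a degree of difficulty."""
    groups = defaultdict(list)
    for deg, p in zip(degrees, p_values):
        groups[deg].append(p)
    xs = sorted(groups)
    return xs, [sum(groups[x]) / len(groups[x]) for x in xs]


def linear_fit_r2(xs, ys):
    """Least-squares line y = slope * x + intercept and its R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical values: degree of difficulty and measured p (in %) for seven items.
degrees = [6, 8, 8, 9, 12, 13, 14]
p_measured = [80, 70, 60, 62, 40, 35, 30]
avg_degrees, p_avg = average_by_degree(degrees, p_measured)
slope, intercept, r2 = linear_fit_r2(avg_degrees, p_avg)
```

A negative slope corresponds to the behavior described above: the harder the calculated degree of difficulty, the smaller the fraction of students answering correctly.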
Question number   Question type   Degree of difficulty
2                 CS5             8
5                 2CE             9
12                K3AS6           13
13                K2VS5           14
17                2IKAE           8
18                IAE             6
25                KAVS5           12
Table 3.1. The results for Test 1A.
Question number   Question type   Degree of difficulty   Discrimination index
7                 2C3E            9                      0.58
9                 KAVI            8                      0.52
15                KAS5E           11                     0.65
17                IKAS5           9                      0.66
21                3KI3A           10                     0.66
22                AS5             7                      0.51
23                5KAI            8                      0.52
24                K2I2AS5         12                     0.64
25                KAIVS5          13                     0.58
Table 3.2. The results for Test 1B.
Question number   Question type   Degree of difficulty
2                 IKA             4
7                 2IKAS5          10
22                …               12
25                …               9
27                …               11
28                …               11
Table 3.3. The results for Test 2A.
Question number   Question type   Degree of difficulty
5                 2IAS5           9
16                2KE             5
22                KS5 2AI         11
23                2K              2
24                2KS5 3AI        14
Table 3.4. The results for Test 2B.
Question number   Question type   Degree of difficulty   Discrimination index
1                 IAE             6                      0.52
5                 2IKA            5                      0.50
8                 K2AS5           10                     0.61
9                 KAE             6                      0.57
12                KAS5            8                      0.60
14                …               14                     0.86
18                …               23                     0.60
20                …               15                     0.84
22                …               12                     0.77
24                …               16                     0.59
25                …               11                     0.70
26                …               12                     0.75
27                …               7                      0.66
28                …               6                      0.50
30                …               6                      0.57
Table 3.5. The results for Test 3A.
Question number   Question type   Degree of difficulty   Discrimination index
6                 4KA             6                      0.57
7                 K2AS5           10                     0.65
14                2I2AS5E         14                     0.75
20                K2A2S5          15                     0.57
21                …               9                      0.65
22                …               12                     0.65
24                …               11                     0.66
27                4KE             7                      0.66
28                4KA             6                      0.65
29                AE              5                      0.50
Table 3.6. The results for Test 3B.
Figure 3.1. The discrimination index versus the degree of difficulty for Test 1A.
Figure 3.2. The discrimination index versus the degree of difficulty for Test 1B.
Figure 3.3. The discrimination index versus the degree of difficulty for Tests 1A and 1B.
Figure 3.4. The discrimination index versus the degree of difficulty for Test 2A.
Figure 3.5. The discrimination index versus the degree of difficulty for Test 2B.
Figure 3.6. The discrimination index versus the degree of difficulty for Tests 2A and 2B.
Figure 3.7. The discrimination index versus the degree of difficulty for Test 3A.
Figure 3.8. The discrimination index versus the degree of difficulty for Test 3B.
Figure 3.9. The discrimination index versus the degree of difficulty for Tests 3A and 3B.
Figure 3.10. (a) The difficulty factor (p) and (b) the averaged difficulty factor (p_av) versus the calculated degree of difficulty for Test 1A. A linear trend line has been fitted to the data; in (a), $R^2 = 0.73$.
In this study we analyzed the questions and results of six multiple-choice tests in a physical science course at Florida Atlantic University in the Fall 2004 semester. Our main aim was to quantify two of the most important factors in creating valid and discriminating test items, namely, the degree of difficulty of each item and the corresponding discrimination index based on the results of actual tests, and to investigate the relationship between them. Following the analysis of the results of a test, each item can be assigned a discrimination index, which indicates how well it discriminates between the top-scoring and the bottom-scoring groups of students. In this study we confined our analysis to those questions, from all the tests, for which the discrimination index is greater than 0.5.
In order to associate a degree of difficulty with each item, we identified the various tasks or operations, such as memorization and identification, application, unit conversion, algebraic manipulation, use of vectors, etc., required to answer each question. We assigned a numeric level of difficulty to each task, based on the range of knowledge, skill, and ability required, so that any question involving a number of different steps has an overall degree of difficulty, which is the sum of the individual levels of difficulty associated with each of the required steps.
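The summation described above can be sketched in Python for the question-type codes that appear in the tables. The sketch uses only the levels quoted in this chapter (K = 1, A = 2, E = 3, V = 4, S5 = 5, S6 = 6) and reads a leading integer as the number of times a task is performed; that reading of the notation is an assumption, although it reproduces the tabulated degrees for entries such as K3AS6 (13) and K2VS5 (14). Other task codes, such as I and C, are defined in Chapter 2 and would have to be supplied by the caller. The function and variable names are hypothetical.

```python
import re

# Levels of difficulty quoted in the text; other task codes must be supplied by the caller.
LEVELS = {"K": 1, "A": 2, "E": 3, "V": 4, "S5": 5, "S6": 6}

# An optional repeat count followed by one task code.
_TOKEN = re.compile(r"(\d*)(S5|S6|[A-Z])")


def degree_of_difficulty(type_code, levels=LEVELS):
    """Sum count * level over the tasks in a type code such as 'K3AS6'."""
    total = 0
    position = 0
    while position < len(type_code):
        match = _TOKEN.match(type_code, position)
        if match is None:
            raise ValueError(f"cannot parse {type_code!r} at position {position}")
        count = int(match.group(1)) if match.group(1) else 1
        task = match.group(2)
        if task not in levels:
            raise KeyError(f"no level of difficulty known for task {task!r}")
        total += count * levels[task]
        position = match.end()
    return total


degree = degree_of_difficulty("K3AS6")  # K + 3A + S6
```

Passing an extended `levels` dictionary is how a user would include the remaining task codes once their levels are taken from Chapter 2.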
Our results indicate a definite correlation between the degree of difficulty and the discrimination index: as the degree of difficulty increases, so does the discrimination index. However, there is an optimum degree of difficulty, in the range ~9 to ~12, beyond which the discrimination index starts to fall. At that point, the test items become too difficult for both the high scorers and the low scorers to answer, so the items no longer discriminate effectively. Clearly, there are two extremes: questions that are too easy, i.e., with a low degree of difficulty, and those that are too hard, i.e., with a high degree of difficulty. Such questions are not effective in discriminating between students of different abilities.
By adopting our assigned levels of difficulty for each task or operation, one can design questions with the required level of difficulty and range of cognitive levels, resulting in multiple-choice tests that truly discriminate between students of different abilities. For example, the results of our study indicate that the most discriminating questions, i.e., those with D > 0.6, have a degree of difficulty in the interval 9 to 14. Using this result, we can set up the inequality

$$ 9 \le a \times K + b \times A + c \times E + d \times V + e \times S_5 + f \times S_6 \le 14, $$

where K, A, E, V, S5 and S6, etc., are the various tasks and operations required to solve a problem, as defined in Chapter 2, and a, b, c, d, e, f represent the number of times each action is performed. We found that it was possible to assign a numerical level of difficulty to each of these tasks, e.g., K = 1, A = 2, E = 3, V = 4, S5 = 5 and S6 = 6, so that the inequality becomes

$$ 9 \le a + 2b + 3c + 4d + 5e + 6f \le 14. $$

Although there are many possible solutions to this inequality, there are, however, limits. So, in principle, we can use this inequality to develop items for multiple-choice tests in a physical science course where the requirement is to obtain optimum discrimination between students who have mastered the course material and those who have not. However, the design of items with optimum discrimination and the verification under test conditions is beyond the scope of this study; we suggest it might form the basis of further research.
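To make the set of possible solutions concrete, the following Python sketch enumerates every combination of non-negative integers (a, b, c, d, e, f) that satisfies the inequality with the levels K = 1, A = 2, E = 3, V = 4, S5 = 5 and S6 = 6. It is only an illustration: it covers just these six tasks (not the further tasks implied by "etc."), and the function and variable names are hypothetical.

```python
from itertools import product

WEIGHTS = (1, 2, 3, 4, 5, 6)  # levels of difficulty for K, A, E, V, S5, S6


def task_combinations(lower=9, upper=14, weights=WEIGHTS):
    """Yield count tuples whose weighted sum lies between lower and upper inclusive."""
    # No task can be performed more than upper // weight times.
    ranges = [range(upper // w + 1) for w in weights]
    for counts in product(*ranges):
        total = sum(n * w for n, w in zip(counts, weights))
        if lower <= total <= upper:
            yield counts


solutions = list(task_combinations())
# Example: one K, one A, one V and one S5 give 1 + 2 + 4 + 5 = 12.
example_is_valid = (1, 1, 0, 1, 1, 0) in solutions
```

Each tuple in `solutions` is a candidate recipe for an item, for instance (1, 1, 0, 1, 1, 0) corresponds to the type code KAVS5 with a degree of difficulty of 12; whether such an item actually discriminates well would still need to be checked under test conditions.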
References
[1.1] B.S. Bloom (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.
[1.2] Victoria Clegg and William Cashin (1986). Improving Multiple-Choice Tests, IDEA Paper No. 16, available from http://www.idea.ksu.edu/papers/
[1.3] Improving Multiple Choice Questions (1990), available from http://ctl.unc.edu/fyc8.html.
[1.4] Laura Hostetter and J.E. Haky, private communication. Also: L. Hostetter and J.E. Haky, A classification scheme for preparing effective multiple-choice questions based on item response theory, Florida Academy of Sciences, Annual Meeting, University of South Florida, March 2005.
[1.5] Note that our definition of the degree of difficulty is different from that used by the FAU Testing and Evaluation Center.
[2.1] B.S. Bloom (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc. There is a considerable amount of information available about Bloom's taxonomy on the internet; see, for example: http://www.nwlink.com/~donclark/hrd/bloom.html, http://www.coun.uvic.ca/learn/program/hndouts/bloom.html, http://www.valdosta.edu/~whuitt/psy702/cogsys/bloom.html.
[2.2] D.R. Krathwohl, B.S. Bloom, and B.B. Masia (1973). Taxonomy of Educational Objectives, the Classification of Educational Goals. Handbook II: Affective Domain. New York: David McKay Co., Inc.
[2.3] E.J. Simpson (1972). The Classification of Educational Objectives in the Psychomotor Domain. Washington, DC: Gryphon House.
[2.4] J.D. Bransford, A.L. Brown and R.R. Cocking (eds) (2000). How People Learn: expanded edition. Washington, D.C.: National Academy Press.
[2.5] M. Suzanne Donovan and John D. Bransford (eds) (2005). How Students Learn. Washington, D.C.: The National Academies Press.
[2.6] Handout entitled Computer based test scoring and analysis, available from the Florida Atlantic University Testing and Evaluation Center.
[2.7] T.L. Kelley, The selection of upper and lower groups for the validation of test items, J. Ed. Psych., 30, 17-24 (1939).