
THE RELATIONSHIP BETWEEN ITEM DIFFICULTY AND DISCRIMINATION INDICES IN MULTIPLE-CHOICE TESTS IN A PHYSICAL SCIENCE COURSE by Angelica Hotiu

A thesis submitted to the faculty of the Charles Schmidt College of Science in partial fulfillment of the requirements for the degree of Master of Science

Florida Atlantic University Boca Raton, Florida December 2006

ABSTRACT

Author: Angelica Hotiu
Title: The relationship between item difficulty and discrimination indices in multiple-choice tests in a physical science course
Institution: Florida Atlantic University
Thesis advisor: Dr. Robin Jordan
Degree: Master of Science
Year: 2006

We have developed a method of quantifying multiple-choice test items in an introductory physical science course in terms of the various tasks required to solve the problem. We assign a numerical level of difficulty to each task so that any question can be assigned a degree of difficulty, which is the sum of the individual levels of difficulty associated with each of its steps. Using the questions and results from the tests, we have investigated the relationship between the degree of difficulty of each question and the corresponding discrimination index. Our results indicate that as the degree of difficulty increases, so does the capability of the item to discriminate between students with different abilities. However, there is a maximum degree of difficulty beyond which the discrimination starts to decrease; at that point, test items become too difficult. Thus, it should be possible in future to design items that will provide optimum discrimination.

ACKNOWLEDGEMENTS

First of all I would like to express my sincere gratitude and appreciation to Dr. Robin Jordan for his effort, guidance, devotion and advice during the entire study and the preparation of the thesis.

Also many thanks are extended to all members of the faculty, staff and graduate students in the Department of Physics at FAU. I am also very grateful to Dr. Warner Miller, Chair of the Department of Physics, and the members of my thesis committee for all their advice and assistance.

Thanks also go to Dr. Fernando Medina for giving me the opportunity to study in the Department of Physics at Florida Atlantic University.

Finally, I want to express my gratitude to my lovely family: to my parents, and especially to my mother, who took care of my son during my studies, and to my husband, Laurentiu, who supported and encouraged me during these studies. Thank you for all that you have done for me.

Contents

1. Introduction
2. Theory
   2.1 Anatomy of multiple-choice questions
   2.2 Bloom's taxonomy
   2.3 The cognitive domain
   2.4 The discrimination index
   2.5 The degree of difficulty
3. Results of the Research
   3.1 Description
   3.2 Analysis of the questions for which D > 0.5
4. Concluding remarks
5. References

Chapter 1 Introduction

The classroom test is one of the most important parts of the teaching and learning process. There are several different types of tests (those with short essay answers, multiple-choice answers, etc.), and the type used will depend on a number of factors, such as the instructional objectives, the class size, the type of instruction, the type of subject matter and the type of feedback required by the instructor. However, the two most important characteristics of any achievement test are its content validity and reliability. A test's validity is determined by how well it samples the range of knowledge, skills, and abilities that students were supposed to acquire in the period covered by the test. The reliability of a test depends upon grading consistency and discrimination between students of differing performance levels.

There are two major types of multiple-choice tests: criterion-referenced tests (CRTs) and norm-referenced tests (NRTs). In criterion-referenced testing, the goal is usually to make a decision about whether or not an individual can demonstrate mastery in an area of content and competencies; examples include the written part of a driving test, and certification and licensure exams. In norm-referenced testing, the goal is usually to rank the entire set of individuals in order to make comparisons of their performances relative to one another. In this study, we will be analyzing students' performances on multiple-choice tests administered during a physical science course; such tests are NRTs.

Although multiple-choice tests are widely used, many instructors do not hold them in high regard; some believe, for example, that multiple-choice questions are really multiple-guess items, or that multiple-choice questions are only capable of testing factual information and so are ill suited for testing higher-order cognitive skills. However, it is now accepted that well-constructed multiple-choice items can test many of the same cognitive skills that essay tests do. Moreover, they can be used to diagnose student difficulties if the incorrect options are designed to reveal common misconceptions, and they can provide a more comprehensive sampling of the subject material because more questions can be asked. In addition, they are often more valid and reliable than essay tests because (a) they sample material more broadly; (b) discrimination between performance levels is easier to determine; and (c) scoring consistency is virtually guaranteed when carried out by machine.

The validity of multiple-choice tests depends upon a systematic selection of items with regard to both content and level of learning. Although most teachers try to select items that sample the range of content covered in class, they often fail to consider the level or degree of difficulty of the questions they use. Moreover, since it is easy to develop items that require only recognition or recall of information, instructors tend to rely heavily on those types of questions. Unfortunately, the multiple-choice tests in the instructor's manuals that accompany textbooks are often composed exclusively of recognition or recall items.

Psychologists have elaborate systems for classifying different cognitive levels, but for most test planning purposes, a simple three-level scheme is sufficient to ensure that the range of knowledge, skills, and abilities is tested appropriately. The three categories are recall, application, and evaluation/synthesis, and they are derived from the six levels of Bloom's taxonomy of cognitive objectives [1.1]. At the lowest level, recall, students remember specific facts, terminology, principles, or theories, e.g., stating Newton's 2nd Law. At the median level, application, students use their knowledge to solve a problem or analyze a situation, e.g., using Newton's 2nd Law to determine the motion of an object. The highest level, evaluation and synthesis, requires students to derive hypotheses from data, put the parts of a problem together, or exercise informed judgment. By analyzing the course material in terms of these three categories, multiple-choice tests can be constructed that sample both the range of content and the various cognitive levels at which the students must operate. Performing this analysis is an essential step in designing multiple-choice tests that have high validity and reliability.

The purpose of this study is not to provide a comprehensive guide for constructing multiple-choice items; there are several excellent articles available that provide such information [1.2, 1.3]. Our main aim is to investigate and quantify two of the most important factors in creating valid and discriminating multiple-choice tests, namely, the degree of difficulty and the discrimination index; we define these quantities below using the results of actual tests. We have been unable to find any previously published, quantitative data on such a study, except for a private communication from Hostetter and Haky, who made a similar study of multiple-choice test items in introductory General Chemistry [1.4]. Accordingly, we have analyzed the results of six multiple-choice tests (labeled 1A, 1B, 2A, 2B, 3A and 3B) given in a Physical Science class (PSC2121) at Florida Atlantic University in the Fall 2004 semester. The number of students taking each test was ~50. The numbers 1, 2 and 3 represent the number of the test during the semester (there were five tests in total), and A and B represent two different versions given to different groups of students but covering the same material and designed to be as similar as possible.

Physical science is a general science course for non-science majors, covering topics in physics, chemistry and earth science. However, in this study we restricted ourselves to questions on topics that were within the physics discipline; the subject material covered by the tests is shown in Table 1.1, and the number of students taking each test and the average scores are shown in Table 1.2. The tests were compiled by Dr. Robin Jordan, Physics Department, Florida Atlantic University.

Chapter: Physical science and measurement
Topics: Why standardization?; The metric system; SI units

Chapter: Description of motion
Topics: Vector analysis; Resolution of vectors; Speed and velocity; Accelerated motion; A theory of motion; Galileo and the experimental method

Chapter: Planetary motion
Topics: Ptolemy's system; The Copernican revolution; Gateway to the skies: Tycho Brahe; How planets move: Johannes Kepler; Galileo's discoveries with the telescope

Chapter: Laws of motion and gravitation
Topics: Isaac Newton's marvelous year; The Principia; Newton's first law of motion: inertia; Newton's 2nd law of motion: force; Applications of Newton's 2nd law; Newton's 3rd law of motion: action and reaction; The center-seeking force

Chapter: Heat: a form of energy
Topics: Temperature measurement; Temperature scales; The lowest temperature; Kinetic theory and the molecular interpretation of temperature; Temperature and heat; Specific heat; Calorimetry; Change of state; Thermal expansion; Energy conservation; Mechanical equivalent of heat; The 1st law of thermodynamics; The 2nd law of thermodynamics

Chapter: Wave motion and sound
Topics: Transverse waves; Longitudinal waves; Reflection of waves; Refraction of waves; Superposition of waves: interference; Standing waves; Vibrating air columns

Chapter: Light and other electromagnetic waves
Topics: The velocity of light; Electromagnetic waves; Electromagnetic spectrum: radio, TV, microwaves; Simple lenses; The optics of the eye

Chapter: Electricity and magnetism
Topics: The amber phenomenon; Conductors, semiconductors, insulators; Forces between electric charges; Electric current; Electric circuits; Electric power and energy

Chapter: The quantum theory of radiation and matter
Topics: Spectroscopy; The electron; X-rays; Radioactivity; Planck's quantum hypothesis; Einstein's photoelectric equation

Table 1.1 The subject material (chapters and topics) covered by the tests.


Test   Number of questions   Number of respondents   Scoring range (%)   Average score (%)
1A     30                    52                      26.7 - 86.7         56.9
1B     30                    53                      16.7 - 86.7         53.7
2A     30*                   52                      16.7 - 90.0         55.2
2B     31                    53                      19.4 - 77.4         52.6
3A     30                    48                      30.0 - 86.7         55.6
3B     30                    48                      26.7 - 83.3         57.0

Table 1.2 Details of the tests used in this study. The tests were part of the PSC2121 course given in the Fall 2004 semester. * One question was omitted from the analysis due to a technical problem.

Each question on a multiple-choice test has a discrimination index that determines how well the question discriminates between students in the top 27% of the class on total test score and those in the bottom 27% of the class on total test score. As we explain in more detail below, the discrimination index can range from +1 to -1; a value of +1 means that all of the high scorers answered the question correctly and all of the low scorers answered the question incorrectly. A value of 0 means that the same number of high scorers and low scorers obtained the correct answer, so the question does not discriminate between the two sub-groups of students. In this study, we analyzed the questions from all tests for which the discrimination index was >0.5.

To determine the degree of difficulty, we identified the various tasks or operations, such as memorization and identification, application, unit conversion, algebraic manipulation, use of vectors, etc., required to answer each question [1.5]. We assigned a numerical level of difficulty to each task, based on the range of knowledge, skill, and ability required, so that any question involving a number of different steps has an overall degree of difficulty, which is the sum of the individual levels of difficulty associated with each of the required steps.

The results indicate a definite correlation between the degree of difficulty and the discrimination index. For example, as the degree of difficulty increases so does the discrimination index, which is not unexpected. However, there is a maximum degree of difficulty beyond which the discrimination index starts to fall off. At that point, the test items become too difficult for both the high scorers and the low scorers to answer, so that they no longer discriminate effectively. Clearly, there are two extremes: questions that are too easy, i.e., with a small difficulty value, and those that are too hard, i.e., with a high difficulty value. Such questions are not effective if the purpose of a test is to produce a spread of scores, reflecting differences in student achievement and abilities.

As part of our study, we have been able to identify the common tasks that are involved in the most discriminating questions. Our results suggest that for optimum discrimination, i.e., questions resulting in a discrimination index > 0.5, the degree of difficulty lies within a reasonably well-defined range for all the tests analyzed. So, in principle, by adopting our assigned levels of difficulty for each task or operation, one can actually design questions with the required level of difficulty and range of cognitive levels that will result in multiple-choice tests that truly discriminate between students of different abilities.

Our study is very similar to the analysis of multiple-choice test items in a General Chemistry I course carried out by Hostetter and Haky [1.4]. Indeed, it was their study that prompted ours. Altogether, they used the results from approximately 300 students, a somewhat larger sampling group than ours. Our results, based on an analysis of physics topics, indicated a similar correlation between the degree of difficulty and discrimination; namely, as the difficulty increased the average discrimination increased, but there was a critical level of difficulty beyond which the discrimination decreased.


Chapter 2 Theory

2.1 Anatomy of multiple-choice questions

A standard multiple-choice item consists of two basic parts:
- a problem (the stem), and
- a list of suggested solutions (the alternatives).

Typically, multiple-choice items present the stem either as a complete question or as an incomplete statement, and the list of alternatives contains one correct or best alternative (the answer) and a number of incorrect or inferior alternatives (the distractors). For example:

Stem as a complete question:
What is the weight of an object?
(a) The force with which it is attracted to the earth
(b) The amount of matter that it contains
(c) A measure of its inertia
(d) The same quantity as its mass but expressed in different units

Stem as an incomplete statement:
The weight of an object is:
(a) The force with which it is attracted to the earth
(b) The amount of matter that it contains
(c) A measure of its inertia
(d) The same quantity as its mass but expressed in different units


Students are directed to select either the correct answer or the best answer from the list of options provided. In the correct answer form, the answer is correct beyond question while the distractors are definitely incorrect. In the best answer version, more than one option may be appropriate in varying degrees. The purpose of the distractors is to appear as plausible solutions to the problem for those students who have not achieved the required learning examined by the question. On the other hand, the distractors will appear as implausible solutions to those students who have achieved the required learning; only the correct answer is plausible for those students. As we mentioned in the Introduction, multiple-choice items can be designed to test not only the lower levels of the learning process, i.e., recall, but also the higher-level skills of comprehension, application, and analysis, all of which may be part of the required educational objectives of the class.

2.2 Bloom's taxonomy

Starting in 1948, a committee of college examiners, led by Benjamin Bloom, began the task of classifying educational goals and objectives. The intent was to develop a classification system for three domains:
- Cognitive: mental skills (knowledge)
- Affective: growth in feelings or emotional areas (attitude)
- Psychomotor: manual or physical skills (skills)



They completed their study on the cognitive domain in 1956 and the resulting classification system is now commonly referred to as Bloom's Taxonomy of the Cognitive Domain [2.1]. Work on the affective and psychomotor domains was completed in 1972-3 [2.2, 2.3]. The divisions between different classes of skills or behavior are not absolute, and other systems or hierarchies have been devised in the educational and training world. However, Bloom's taxonomy is the most easily understood and is arguably the one most used today.

The major idea of the taxonomy of the cognitive domain is that what educators want students to know, i.e., the educational objectives, can be arranged in a hierarchy, starting from the simplest behavior or skill to the most complex. As a result, it can also provide a useful structure within which to categorize and analyze test items. Instructors characteristically ask questions within particular skill levels; for example, Bloom found that over 95% of the test questions students encounter require them to think only at the lowest possible level, i.e., the recall of information. However, education research shows that students remember more, and can apply their knowledge more effectively, when they have learned to handle the topic at the higher levels of the taxonomy, where more complex skills are required [2.4, 2.5]. Clearly, students can "know" about a topic or subject at different levels. So, it is plain there must be a close link between the taxonomy and test questions, if the latter are constructed with the aim of checking the skill level of students, and discriminating between students of different abilities.

2.3 The cognitive domain

The cognitive domain involves knowledge and the development of intellectual skills. This includes the recall or recognition of specific facts, procedural patterns, and concepts that serve in the development of intellectual abilities and skills. There are six major categories, shown in Tables 2.1 to 2.3, ordered from the simplest behavior to the most complex. The categories can be thought of as degrees of difficulty arranged in a hierarchy.


Competence: 1. Knowledge
Skills demonstrated: observation and recall of information; knowledge of dates, events, places; knowledge of major ideas; mastery of subject matter.
Keywords: list, define, tell, describe, identify, show, label, collect, examine, tabulate, quote, name, who, when, where, etc.

Competence: 2. Comprehension
Skills demonstrated: understanding information; grasp meaning; translate knowledge into new context; interpret facts, compare, contrast; order, group, infer causes; predict consequences.
Keywords: summarize, contrast, describe, predict, interpret, associate, distinguish, estimate, differentiate, discuss, extend.

Table 2.1 The lowest levels of intellectual behaviors within the cognitive domain identified by Bloom.

Competence: 3. Application
Skills demonstrated: use information; use methods, concepts, theories in new situations; solve problems using required skills or knowledge.
Keywords: apply, demonstrate, calculate, complete, illustrate, show, solve, examine, modify, relate, change, classify, experiment, discover.

Competence: 4. Analysis
Skills demonstrated: seeing patterns; organization of parts; recognition of hidden meanings; identification of components.
Keywords: analyze, separate, order, explain, connect, classify, arrange, divide, compare, select, infer.

Table 2.2 The median levels of intellectual behaviors within the cognitive domain identified by Bloom.

Competence: 5. Synthesis
Skills demonstrated: use old ideas to create new ones; generalize from given facts; relate knowledge from several areas; predict, draw conclusions.
Keywords: combine, integrate, modify, rearrange, substitute, plan, create, design, invent, what if?, compose, formulate, prepare, generalize, rewrite.

Competence: 6. Evaluation
Skills demonstrated: compare and discriminate between ideas; assess value of theories, presentations; make choices based on reasoned argument; verify value of evidence; recognize subjectivity.
Keywords: assess, decide, rank, grade, test, measure, recommend, convince, select, judge, explain, discriminate, support, conclude, compare, summarize.

Table 2.3 The highest levels of intellectual behaviors within the cognitive domain identified by Bloom.

2.4 The discrimination index

The discrimination index is a useful measure of item quality whenever the purpose of a test is to produce a spread of scores, reflecting differences in student achievement, so that distinctions may be made among the performances of respondents. It measures the extent to which item responses discriminate between individuals who have a higher overall score on a test and those who get a lower overall score. The discrimination index is determined automatically by the FAU computer-based test scoring and analysis system [2.6], in the following way. The distribution of students is treated as normal and the students' scores are arranged into two sub-groups [2.7]:
- the top 27%: the upper group (U), and
- the bottom 27%: the lower group (L).

The discrimination index for a particular question is defined in terms of the proportion of the students in the top group who got it correct, p_U, and the proportion of the students in the bottom group who got it correct, p_L:

D = p_U - p_L.

Note that -1 ≤ D ≤ +1. When D = 0, i.e., p_U = p_L, there is no discrimination; when D = +1, i.e., p_U = 1 and p_L = 0, there is perfect discrimination; and when D = -1, there is inverse discrimination, which is most likely caused by a mis-keyed item. Thus, discrimination indices close to 0 are found on items so difficult that almost everyone gets them wrong and on items so easy that almost everyone gets them right.

For instructional purposes it is important to know the content areas and types of items that most students get right or wrong. As mentioned earlier, when multiple-choice tests are graded using the FAU computer-based test scoring and analysis system, values of the discrimination indices are obtained automatically [2.6].
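To make the calculation concrete, the short sketch below is a minimal Python illustration (with a hypothetical discrimination_index helper and made-up scores, not the FAU scoring system's actual code) of splitting respondents into upper and lower 27% groups by total score and forming D = p_U - p_L for a single item.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Return D = p_U - p_L for one item.

    total_scores: one total test score per student.
    item_correct: 1 if that student answered this item correctly, else 0,
                  listed in the same student order as total_scores.
    fraction:     size of the upper and lower groups (0.27 follows ref. [2.7]).
    """
    n = len(total_scores)
    k = max(1, round(fraction * n))                      # students per group
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper, lower = order[:k], order[-k:]                 # top 27% and bottom 27%
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

# Made-up data: 10 students ranked by total score, and their result on one item.
scores  = [28, 26, 25, 24, 20, 18, 15, 12, 10, 8]
correct = [ 1,  1,  0,  1,  1,  0,  1,  0,  1, 0]
print(round(discrimination_index(scores, correct), 2))   # 0.33 for this data
```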

2.5 The degree of difficulty

In order to carry out this study, we need a quantitative measure of the difficulty of a question. The difficulty of a question is normally determined from the proportion of the total group selecting the correct answer to that question. The following formula may be used to calculate the difficulty factor (sometimes called the p-value):

p = (c / n) × 100,

where c is the number of students who selected the correct answer and n is the total number of respondents. A value of p = 100% indicates that all the students selected the correct answer, and so that item is very easy. A value of 0 indicates that none of the students selected the correct answer, and so that item is very difficult. So, this ratio is one measure of how difficult the question was to answer.
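As an aside, the difficulty factor itself is a one-line calculation; the fragment below (illustrative only, with invented numbers) simply applies the formula above.

```python
def difficulty_factor(num_correct, num_respondents):
    """p = (c / n) x 100: the percentage of respondents choosing the correct answer."""
    return 100.0 * num_correct / num_respondents

# Hypothetical item: 29 of 52 respondents chose the correct answer.
print(round(difficulty_factor(29, 52), 1))   # 55.8, i.e. a mid-range item
```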


The implication is that if the purpose of the test is to assess an individual's mastery of the material, i.e., as in a criterion-referenced test (CRT), p values of ~90% may be expected. However, if the emphasis is to obtain a spread of scores between individuals, as is the case in a norm-referenced test (NRT), then p values over a broad range can be expected, with the greatest spread if all test items have a difficulty of 50%. If we plot the difficulty, p, against the corresponding discrimination index for each question of a test, we observe a definite correlation between the two quantities, as shown in Figures 2.1 to 2.6 for tests 1, 2 and 3; similar behavior is observed for all tests. First, as p increases, the discrimination index also increases, but at a p value between ~40% and ~60% the discrimination reaches a maximum. When p > ~60%, the discrimination index decreases. It is generally claimed that items for which 40% to 60% of the group passes are preferred to those that are easier (p > 60%) or more difficult (p < 40%) [2.6]. In these particular cases, the numbers of items falling into the range 40% < p < 60% in tests 1A, 1B, 2A, 2B, 3A and 3B are 8/26, 10/26, 8/31, 10/31, 12/30 and 9/30, respectively.

Figure 2.1. The discrimination index versus the difficulty factor, p, for Test 1A.

Figure 2.2. The discrimination index versus the difficulty factor, p, for Test 1B.

Figure 2.3. The discrimination index versus the difficulty factor, p, for Test 2A.

Figure 2.4. The discrimination index versus the difficulty factor, p, for Test 2B.

Figure 2.5. The discrimination index versus the difficulty factor, p, for Test 3A.

Figure 2.6. The discrimination index versus the difficulty factor, p, for Test 3B.

Note that over the range 40% < p < 60%, the discrimination index is > ~0.5. Therefore, we will take D = 0.5 as the desirable minimum value for a discriminating item.

The difficulty factor, as defined above, is a property of the obtained measurements. However, we require a definition that depends on the content of the question and reflects the difficulty and complexity of the tasks required to find a solution. Thus, we seek a quantitative and independent measure of difficulty.

We have found that it is possible to assign a degree of difficulty to items on a multiple-choice test based on the knowledge and tasks required to solve the problem. Basically, all questions can be analyzed in terms of a combination of letters and numbers. The letters represent the tasks or actions that students must perform in order to obtain a complete solution to the problem; the numbers indicate the number of times each task or action is performed. In general terms, Bloom's taxonomy, described above, classifies the various tasks and actions, e.g., simple memorization (recall), unit conversion, solving a system of equations, etc., into a hierarchy. Using the classification system as a guide, we are able to assign a numerical level of difficulty to each of these tasks, as shown in Table 2.4.


Task                                          Level of difficulty
Knowledge and recall (K)                      1
Identification (I)                            1
Application (A)                               2
Unit conversion, simple (C3)                  3
Simple equation (E)                           3
Unit conversion (C4)                          4
Vector analysis (V)                           4
Solving an equation (S5); derivation (D)      5
Solving a system of equations (S6)            6

Table 2.4 Numerical level of difficulty associated with each task.


In this way we are able to assign an overall degree of difficulty to each question on a test as the sum of the individual levels of difficulty encountered to obtain the answer. In more detail, the tasks in Table 2.4 are:

- Knowledge (K) or recall: a task that simply requires memorization of a definition or of a quantity that must be known in order to answer the question.
- Identification (I): a task that requires identification of the process, law, or equation that must be used in order to solve the problem.
- Application (A): a task in which knowledge is applied to a problem.
- Unit conversions (C3 and C4): tasks in which a unit conversion is performed in completing the problem.
- Simple equation (E): a task that involves simply inserting numbers into an equation to obtain a solution.
- Vector analysis (V): a task in which vector addition or manipulation of vectors is required in order to solve the problem.
- Derivation (D): a task that requires the derivation or proof of an algebraic expression.
- Solving equations (S5 and S6): tasks that involve the manipulation of one or more equations before numbers can be inserted in order to obtain a result.

We provide three examples below.


1. Example of a 1K question (Test 1A, Q9): Velocity is the rate of change of
(a) Speed
(b) Energy
(c) Distance
(d) Displacement

In order to answer this question correctly, the student should know the definition of velocity. The level of difficulty of this question is 1.

2. Example of a KI question (Test 1A, Q19): A skydiver jumps from an airplane. As her velocity of fall increases, neglecting air resistance, her acceleration
(a) Increases
(b) Is constant
(c) Decreases

In order to answer this question correctly, the student needs to:
- identify the type of motion for a skydiver (uniformly accelerated motion), and
- know that the acceleration is constant during the motion.
The difficulty level of this question is 2.


3. Example of a 2IAE question (Test 1B, Q18): What is the speed of an object after 4 s if it falls from rest with an acceleration of 32 ft/s²?
(a) 32 ft/s
(b) 128 ft/s
(c) 256 ft/s
(d) 384 ft/s

In order to answer this question correctly, the student has to:
- identify the type of motion (uniformly accelerated motion: free fall),
- apply the formula for the velocity in uniformly accelerated motion, v = v_0 + at,
- identify that the initial speed is v_0 = 0, and
- solve the equation for v.
The difficulty level of this question is 2 × 1 + 2 + 3 = 7.
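The compact type strings used in these examples (1K, KI, 2IAE, and so on) lend themselves to a simple mechanical tally. The sketch below is our own Python illustration of that bookkeeping (it is not part of the original test analysis); it assumes the task levels of Table 2.4 and that a bare C denotes the simple conversion C3.

```python
import re

# Levels of difficulty from Table 2.4 (bare "C" assumed to mean C3).
TASK_LEVELS = {"K": 1, "I": 1, "A": 2, "C": 3, "C3": 3, "E": 3,
               "C4": 4, "V": 4, "S5": 5, "D": 5, "S6": 6}

def degree_of_difficulty(type_string):
    """Sum the task levels encoded in a type string such as '2IAE' or 'K2VS5'.

    A leading integer repeats the task that follows it; C and S may carry a
    one-digit subscript (C3, C4, S5, S6) that is part of the task code itself.
    """
    total = 0
    for count, task in re.findall(r"(\d*)(C\d|S\d|[A-Z])", type_string.replace(" ", "")):
        total += int(count or 1) * TASK_LEVELS[task]
    return total

# The three worked examples above:
print(degree_of_difficulty("1K"))    # 1  (Test 1A, Q9)
print(degree_of_difficulty("KI"))    # 2  (Test 1A, Q19)
print(degree_of_difficulty("2IAE"))  # 7  (Test 1B, Q18)
```

Applied to the type strings listed in Tables 3.1 to 3.6, the same tally reproduces the degrees of difficulty quoted there.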

The main aim of this study is to investigate any relationship between the level of difficulty of a particular question and the corresponding discrimination index, using the results of a total of six multiple-choice tests in a physical science course.


Chapter 3 Results of the Research

3.1 Description

As mentioned previously, the main aim of this study is to investigate the relationship between the degree of difficulty of a particular question and the corresponding discrimination index. The degree of difficulty is defined in Chapter 2 and can be described as a numerical quantity that depends on the content of the question and reflects the difficulty and complexity of the tasks and operations required to find a solution. In this study, we use a combination of letters and numbers to quantify a complete solution to a question; the letters represent the task(s) or action(s) that must be performed and the numbers represent the number of times each task or action is performed. As we described above, we have classified the tasks and actions into a hierarchy, using Bloom's taxonomy as a guide, and assigned a numerical degree of difficulty to each of the tasks. For example, the following question:

The speed limit in a school zone is 20 mi/h and it is strictly enforced. If you are driving at 30 km/h, are you likely to get a ticket?
(a) Yes
(b) No


can be analyzed in the following way. In order to answer this question the student should:
- convert km/h to mi/h using the relationship 1 mi = 1.61 km, i.e., 1 km = (1/1.61) mi = 0.621 mi. This task is C3, a simple unit conversion, with a level of difficulty of 3;
- solve the equation for v: v = 30 km/h = 30 × 0.621 = 18.6 mi/h. This task is E, a simple equation, with a level of difficulty of 3; and
- identify that v < 20 mi/h. This task is I, identification, with a level of difficulty of 1.

Thus, the level of difficulty of this question is 3 + 3 + 1 = 7.

The discrimination index measures the extent to which the question discriminates between individuals who fall into the top 27% of scorers on a test and those who fall into the bottom 27%. The index, as defined in Chapter 2, takes values -1 ≤ D ≤ +1 and is determined automatically for each question on a test by the FAU computer-based test scoring service. For the purposes of this study, we claim that questions with values of D > 0.5 qualify as reasonable discriminators; hence, we concentrated only on such test items in our study.

Altogether, we analyzed the results of six multiple-choice tests (labeled 1A, 1B, 2A, 2B, 3A and 3B) given in a Physical Science class (PSC2121) at Florida Atlantic University in the Fall 2004 semester, and selected only those items for which D > 0.5.


3.2 Analysis of the questions for which D>0.5

In Tables 3.1 to 3.6, we list the results of our analysis of the questions with D > 0.5 on each of the six tests.

In Figures 3.1 to 3.9, we show plots of the discrimination index versus the degree of difficulty for the individual tests. We have included a second-order polynomial fit to the data simply to act as a guide to the eye.

Despite the limited statistics, due to the relatively small number of respondents (~50) on each test, a trend does appear to emerge. The data for each test suggest that there is a correlation between the degree of difficulty and the discrimination index. Specifically, as the degree of difficulty increases the discrimination index initially also increases. However, there is an optimum degree of difficulty beyond which the discrimination begins to fall. (Such behavior was noted previously, in Chapter 2, when the difficulty factor, defined as p = (c/n) × 100, where c is the number of students who selected the correct answer and n is the total number of respondents, was plotted against the discrimination index. But, as we argued in Chapter 2, the difficulty factor is a property of the obtained measurements and is not appropriate in our analysis, which is why we found it necessary to introduce a quantitative and independent degree of difficulty for each question, based on the content of a question and the difficulty and complexity of the tasks required to find a solution.)


We can understand such behavior by identifying the two extremes, namely, (a) questions that have a low degree of difficulty (< ~8), i.e., questions that are too easy, and (b) questions with a high degree of difficulty (> ~14), i.e., questions that are too hard. Questions in these regimes are less effective in discriminating between students of different abilities because:
- in case (a), more of the lower scoring students are likely to answer the question correctly, so the test item is too easy for both the lower and higher scorers, resulting in less discrimination; and
- in case (b), fewer of the higher scoring students are likely to answer the question correctly, so the test item is too difficult for both the high scorers and the low scorers, and so it no longer discriminates effectively.

In spite of the limited size of the data sets, we suggest that, for the tests that we have analyzed, the optimum discrimination likely occurs when the degree of difficulty lies in the range from ~9 to ~14. It might be tempting to compare the degrees of difficulty for optimum discrimination from one test to the next; clearly, if students are learning then we might expect the degree of difficulty for optimum discrimination to increase! However, the sample set is simply not adequate for reliable comparisons. These results are very similar to those obtained by Hostetter and Haky, who analyzed the results of a number of multiple-choice tests given in an introductory General Chemistry course [1.4].

A further outcome of this study is that, in principle, it is now possible to design multiple-choice items with a known degree of difficulty and, hence, discrimination.

Finally, in Figure 3.10 we show the correlation between the measured difficulty factors (p), as defined in Chapter 2, and our calculated degrees of difficulty for Test 1A. In (a) we have used the complete set of values; in (b), where there is more than one measured difficulty factor for a particular degree of difficulty, we have plotted the averaged value. The plots indicate a close relationship between the measured and calculated values. When the calculated degree of difficulty is very small, most of the students get the correct answer, so p → 100%; when the calculated degree of difficulty is very large, most students fail to get the correct answer, so p → 0.
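The R² values quoted in the caption of Figure 3.10 are those of an ordinary least-squares trend line. A sketch of that calculation is shown below; the (degree of difficulty, p) pairs are invented stand-ins, not the actual Test 1A data.

```python
import numpy as np

# Invented (degree of difficulty, difficulty factor p in %) pairs; the real
# Test 1A values are those plotted in Figure 3.10.
difficulty = np.array([6.0, 8.0, 8.0, 9.0, 12.0, 13.0, 14.0])
p_percent  = np.array([78.0, 64.0, 60.0, 55.0, 46.0, 38.0, 33.0])

slope, intercept = np.polyfit(difficulty, p_percent, 1)   # linear trend line
p_fit = slope * difficulty + intercept

ss_res = np.sum((p_percent - p_fit) ** 2)                 # residual sum of squares
ss_tot = np.sum((p_percent - p_percent.mean()) ** 2)      # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

print(f"slope = {slope:.1f} % per difficulty level, R^2 = {r_squared:.2f}")
```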


Question number   Question type   Degree of difficulty   Discrimination index
2                 CS5             8                      0.61
5                 2CE             9                      0.67
12                K3AS6           13                     0.58
13                K2VS5           14                     0.61
17                2IKAE           8                      0.67
18                IAE             6                      0.58
25                KAVS5           12                     0.81

Table 3.1. The results for Test 1A.


Question number   Question type   Degree of difficulty   Discrimination index
7                 2C3E            9                      0.58
9                 KAVI            8                      0.52
15                KAS5E           11                     0.65
17                IKAS5           9                      0.66
21                3KI3A           10                     0.66
22                AS5             7                      0.51
23                5KAI            8                      0.52
24                K2I2AS5         12                     0.64
25                KAIVS5          13                     0.58

Table 3.2. The results for Test 1B.


Question number   Question type   Degree of difficulty   Discrimination index
2                 IKA             4                      0.55
7                 2IKAS5          10                     0.70
22                5K3AI           12                     0.69
25                KACE            9                      0.60
27                2I2K2AE         11                     0.84
28                2I2K2AE         11                     0.70

Table 3.3. The results for Test 2A.


Question number   Question type   Degree of difficulty   Discrimination index
5                 2IAS5           9                      0.80
16                2KE             5                      0.66
22                KS5 2AI         11                     0.53
23                2K              2                      0.55
24                2KS5 3AI        14                     0.50

Table 3.4. The results for Test 2B.


Question number   Question type   Degree of difficulty   Discrimination index
1                 IAE             6                      0.52
5                 2IKA            5                      0.50
8                 K2AS5           10                     0.61
9                 KAE             6                      0.57
12                KAS5            8                      0.60
14                2I2AS5E         14                     0.86
18                3I3AES5S6       23                     0.60
20                K2A2S5          15                     0.84
22                IDC3E           12                     0.77
24                K2AES5C3        16                     0.59
25                2K2AS5          11                     0.70
26                6K3A            12                     0.75
27                4KE             7                      0.66
28                4KA             6                      0.50
30                KAE             6                      0.57

Table 3.5. The results for Test 3A.


Question number   Question type   Degree of difficulty   Discrimination index
6                 4KA             6                      0.57
7                 K2AS5           10                     0.65
14                2I2AS5E         14                     0.75
20                K2A2S5          15                     0.57
21                KAS5I           9                      0.65
22                IDC3E           12                     0.65
24                2K2AS5          11                     0.66
27                4KE             7                      0.66
28                4KA             6                      0.65
29                AE              5                      0.50

Table 3.6. The results for Test 3B.


Figure 3.1. The discrimination index versus the degree of difficulty for Test 1A.

Figure 3.2. The discrimination index versus the degree of difficulty for Test 1B.


Figure 3.3. The discrimination index versus the degree of difficulty for Tests 1A and 1B.

Figure 3.4. The discrimination index versus the degree of difficulty for Test 2A.


Figure 3.5. The discrimination index versus the degree of difficulty for Test 2B.

Figure 3.6. The discrimination index versus the degree of difficulty for Tests 2A and 2B.


Figure 3.7. The discrimination index versus the degree of difficulty for Test 3A.

Figure 3.8. The discrimination index versus the degree of difficulty for Test 3B.


Figure 3.9. The discrimination index versus the degree of difficulty for Tests 3A and 3B.


Figure 3.10. (a) The difficulty factor (p) and (b) the averaged difficulty factor (p_av) versus the calculated degree of difficulty for Test 1A. A linear trend line has been fitted to the data; in (a) R² = 0.73 and in (b) R² = 0.85.

Chapter 4 Concluding remarks

In this study we analyzed the questions and results of a total of six multiple-choice tests in a physical science course at Florida Atlantic University in the Fall 2004 semester. Our main aim was to quantify two of the most important factors in creating valid and discriminating test items, namely, the degree of difficulty of each item and the corresponding discrimination index, based on the results of actual tests, and to investigate the relationship between them. Following the analysis of the results of a test, each item can be assigned a discrimination index, which determines how well it discriminates between the top scoring group of students and the bottom scoring group. In this study we confined our analysis to the questions from all tests for which the discrimination index is > 0.5.

In order to associate a degree of difficulty with each item, we identified the various tasks or operations, such as memorization and identification, application, unit conversion, algebraic manipulation, use of vectors, etc., required to answer each question. We assigned a numeric level of difficulty to each task, based on the range of knowledge, skill, and ability required, so that any question involving a number of different steps has an overall degree of difficulty, which is the sum of the individual levels of difficulty associated with each of the required steps.


Our results indicate a definite correlation between the degree of difficulty and the discrimination index. For example, as the degree of difficulty increases so does the discrimination index. However, there is an optimum degree of difficulty, in the range ~9 to ~12, beyond which the discrimination index starts to fall. At that point, the test items become too difficult for both the high scorers and the low scorers to answer, so the items no longer discriminate effectively. Clearly, there are two extremes: questions that are too easy, i.e., with a low degree of difficulty, and those that are too hard, i.e., with a high degree of difficulty. Such questions are not effective in discriminating between students of different abilities.

By adopting our assigned levels of difficulty for each task or operation, one can actually design questions with the required level of difficulty and range of cognitive levels that will result in multiple-choice tests that truly discriminate between students of different abilities. For example, the results of our study indicate that the most discriminating questions, i.e., those with D > 0.6, have a degree of difficulty in the interval 9 to 14. Using this result, we can set up an inequality

9 ≤ aK + bA + cE + dV + eS5 + fS6 ≤ 14,

where K, A, E, V, S5 and S6, etc., are the various tasks and operations required to solve a problem, as defined in Chapter 2, and a, b, c, d, e, f represent the number of times each task or action is performed. We found that it was possible to assign a numerical level of difficulty to each of these tasks, e.g., K = 1, A = 2, E = 3, V = 4, S5 = 5, S6 = 6, based on a hierarchy of the skills required. Therefore, the inequality becomes

9 ≤ a + 2b + 3c + 4d + 5e + 6f ≤ 14.

Although there are many possible solutions to this inequality, there are, nonetheless, limits. So, in principle, we can use the inequality to develop items for multiple-choice tests in a physical science course where the requirement is to obtain optimum discrimination between students who have mastered the course material and those who have not. However, the design of items with optimum discrimination and their verification under test conditions is beyond the scope of this study; we suggest it might form the basis of further research.
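To give a sense of how constraining the inequality actually is, the sketch below (an illustration we have added, using the levels K = 1, A = 2, E = 3, V = 4, S5 = 5, S6 = 6 and ignoring any pedagogical limits on how tasks may sensibly be combined) enumerates the non-negative integer combinations (a, b, c, d, e, f) that satisfy it.

```python
from itertools import product

LEVELS = (1, 2, 3, 4, 5, 6)   # levels for K, A, E, V, S5, S6 respectively
LOW, HIGH = 9, 14

# Each count is bounded by HIGH // level; beyond that the sum already exceeds 14.
ranges = [range(HIGH // level + 1) for level in LEVELS]

solutions = [counts for counts in product(*ranges)
             if LOW <= sum(c * level for c, level in zip(counts, LEVELS)) <= HIGH]

print(len(solutions), "admissible combinations; for example:", solutions[:3])
```

The enumeration confirms that the design space, while large, is finite and easily searched, so candidate questions can be screened against the target difficulty range before they are ever administered.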


References
[1.1] B. S. Bloom (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co., Inc.
[1.2] Victoria Clegg and William Cashin (1986). Improving Multiple-Choice Tests, IDEA Paper No. 16; available from http://www.idea.ksu.edu/papers/
[1.3] Improving Multiple Choice Questions (1990); available from http://ctl.unc.edu/fyc8.html
[1.4] Laura Hostetter and J. E. Haky, private communication. Also: A classification scheme for preparing effective multiple-choice questions based on item response theory, L. Hostetter and J. E. Haky, Florida Academy of Sciences Annual Meeting, University of South Florida, March 2005.
[1.5] Note that our definition of the degree of difficulty is different from that used by the FAU Testing and Evaluation Center.
[2.1] B. S. Bloom (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co., Inc. There is a considerable amount of information about Bloom's taxonomy available on the internet; see, for example: http://www.nwlink.com/~donclark/hrd/bloom.html, http://www.coun.uvic.ca/learn/program/hndouts/bloom.html, http://www.valdosta.edu/~whuitt/psy702/cogsys/bloom.html
[2.2] D. R. Krathwohl, B. S. Bloom, and B. M. Bertram (1973). Taxonomy of Educational Objectives, the Classification of Educational Goals. Handbook II: Affective Domain. New York: David McKay Co., Inc.
[2.3] E. J. Simpson (1972). The Classification of Educational Objectives in the Psychomotor Domain. Washington, DC: Gryphon House.
[2.4] J. D. Bransford, A. L. Brown and R. R. Cocking (eds.) (2000). How People Learn: Expanded Edition. Washington, D.C.: National Academy Press.
[2.5] M. Suzanne Donovan and John D. Bransford (eds.) (2005). How Students Learn. Washington, D.C.: The National Academies Press.
[2.6] Computer Based Test Scoring and Analysis, handout available from the Florida Atlantic University Testing and Evaluation Center.
[2.7] T. L. Kelley (1939). The selection of upper and lower groups for the validation of test items. J. Ed. Psych., 30, 17-24.

