
Development of Mathematics Diagnostic Test for DORSHS Second Year High School Students Using Item Response Theory

Jeremias C. Ceniza Donnell C. Cereno

Abstract

The study developed a diagnostic test designed to measure learning in Intermediate Algebra among second year high school students of Davao Oriental Regional Science High School (DORSHS). Validity, discrimination indices, difficulty indices and reliability were established for the test. The construction of the test followed the Quantitative Research Methods in Educational Planning module. Content validity was handled by 3 experts in the content area of secondary mathematics. Primary data for item analysis were extracted through 2 test tryouts: one with 59 third year students for grammatical checking and compatibility benchmarking, and another with 78 second year students for discrimination and difficulty indices and reliability. Analyses were carried out using an IRT modeling software called ConQuest, a spreadsheet, and the statistical software SPSS. The study used the IRT two-parameter logistic model. The test was found to be valid and highly reliable. Experts guaranteed that the instrument can gauge all content in the learning competencies of Mathematics II and that nothing was left untargeted. The reliability coefficient of the test is 0.84. The difficulty indices of the items ranged from -1.37 to 1.35, and the discrimination indices were within 0.23 to 0.51. This standardized test is a tool that identifies the zone of proximal development and the mastered and least learned content areas in Mathematics II among DORSHS second year students.

The Davao Oriental Regional Science High School (DORSHS) of Region XI offers additional subjects in Math, Science and English to its students as early as first year. For the past five years of its existence, the DORSHS National Achievement Test (NAT) results in Mathematics have been far below the 75% Mean Percentage Score (MPS) target. The researcher became interested in creating a diagnostic test for DORSHS second year students that is valid and reliable, since the school does not have one yet. This instrument would be used to detect the strengths and weaknesses of second year students before they undergo a review program for the NAT. In general, the instrument is intended to help DORSHS during the NAT review to improve results in Mathematics.

The study aimed to construct a standardized diagnostic test in Mathematics for second year students in Davao Oriental Regional Science High School (DORSHS). Specifically, it sought to answer the following questions:

1. Is the diagnostic test valid?
2. To what extent do the test items illustrate difficulty?
3. To what degree does the test exhibit discrimination level?
4. How reliable is the entire test according to the IRT model?

Conceptual Framework

Figure 1. The Conceptual Paradigm of the Study

The paradigm links the following components: the diagnostic test (content areas and learning areas in Mathematics II, test objectives, test construction, grammar structure); test development (content validity and IRT analysis, covering item analysis with difficulty and discrimination indices, test reliability through KR-20, and assembly of the final test); and the diagnostic test results (students' zone of proximal development, least learned and mastered competencies, and settings of item thinking level).

Item Response Theory (IRT)

After the popular (or classical) measurement models for constructing tests and interpreting test scores had served their purpose well for quite a long time, a new test theory was developed over the past forty years that is conceptually more powerful than classical test theory. Based upon items rather than test scores, the new approach is known as item response theory (Baker, 2001).

In contrast to the limitations of classical test models, item response theory has many desirable features. These include (a) item characteristics that are not group-dependent, (b) scores describing examinee proficiency that are not test-dependent, (c) a model that is expressed at the item level rather than at the test level, (d) a model that does not require strictly parallel tests for assessing reliability, and (e) a model that provides a measure of precision for each ability score (Hambleton, Swaminathan & Rogers, 1991). With this advancement in educational and psychological measurement, it is now safe to administer test tryouts to different groups of various abilities before giving the test to the target examinees. Kim, Cohen, & Park (1995) further illustrated that IRT allows researchers to conduct rigorous tests of measurement equivalence across experimental groups. IRT methods can distinguish item bias from true differences in the attribute measured, which classical test theory (CTT) cannot.
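The "measure of precision for each ability score" in point (e) is usually expressed through the item and test information functions. A minimal sketch in Python, assuming the two-parameter logistic form used later in this study (a standard IRT result, not a computation reported in this paper):

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a**2 * P * (1 - P).
    Summing I over items gives the test information; its inverse square
    root is the standard error of the ability estimate at that theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-3.0, 3.0, 7)
print(item_information(theta, a=0.5, b=0.0))           # peaks at theta = b
print(1.0 / np.sqrt(item_information(0.0, 0.5, 0.0)))  # SE from this single item
```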

Point biserial, or the point-biserial correlation coefficient (rpb), is one common metric used to assess item quality. The "pt bis", as it is sometimes called, is the correlation between an item score (1/0) and the total score on a test. Positive values are desirable and indicate that the item is good at differentiating between high-ability and low-ability examinees (Bontempo, 2009).
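As a rough illustration, the point-biserial can be computed directly from a scored response matrix. The sketch below uses plain NumPy and made-up data, not the ConQuest output reported later in this study:

```python
import numpy as np

def point_biserial(item_scores, total_scores):
    """Correlation between a dichotomous item score (1/0) and the total score.
    A corrected variant would first subtract the item from the total to avoid
    inflating the value on short tests."""
    item_scores = np.asarray(item_scores, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    # Pearson correlation of a 0/1 variable with the total score is
    # algebraically the point-biserial coefficient.
    return np.corrcoef(item_scores, total_scores)[0, 1]

# Hypothetical scored responses: rows are examinees, columns are items.
responses = np.array([[1, 1, 0, 1],
                      [1, 0, 0, 0],
                      [1, 1, 1, 1],
                      [0, 0, 0, 1],
                      [1, 1, 1, 0]])
totals = responses.sum(axis=1)
for j in range(responses.shape[1]):
    print(f"item {j + 1}: pt bis = {point_biserial(responses[:, j], totals):.2f}")
```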

Zone of Proximal Development, or ZPD, as established by the Russian psychologist Lev Vygotsky, refers to the distance between what a child can do with assistance and what the child can accomplish without assistance (Vygotsky, 1978). In other words, it is the learning of the students on a certain competency at a 50% mastery level. Under the IRT perspective, test items that the child can answer correctly with a 50% chance are the items within the child's zone of proximal development. The child, at this point, has a 50-50 chance of achieving the learning task without the teacher's assistance.
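Read through the two-parameter model described under Methods, this means an item lies in an examinee's ZPD when its difficulty is close to the examinee's ability, since that is where the probability of success is about 0.5. A minimal sketch with hypothetical item parameters; the probability band is chosen only for illustration:

```python
import math

def prob_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def zpd_items(theta, items, low=0.4, high=0.6):
    """Items the examinee would answer correctly with roughly 50% probability."""
    return [name for name, (a, b) in items.items()
            if low <= prob_correct(theta, a, b) <= high]

# Hypothetical item parameters (discrimination a, difficulty b).
items = {"item_A": (0.5, -0.1), "item_B": (0.3, 1.1), "item_C": (0.4, 1.3)}
print(zpd_items(theta=1.2, items=items))  # -> ['item_B', 'item_C']
```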

Assessment

The main purpose of assessment is to improve the learning outcomes of students. In assessing students' learning, a systematic process is followed: the diagnostic test, which is an assessment before a learning program; the formative test, which is an assessment during a learning program; and the summative test, which refers to assessment after a learning program (Davies, Arbuckle & Bonneau, 2005).

Conducting a Diagnostic Test

The main purpose of a diagnostic test is to pinpoint specific strengths and weaknesses of the learner in accordance with his or her grade level requirements. Such tests are scored using true test score criteria, meaning they are not averaged or normed (Educational Diagnostic Prescriptive Services, 2009). Izard (2005) notes that scores in a criterion-referenced test are interpreted as the individual performance of each student in the group, on what he or she can or cannot do, rather than by comparing the results with other groups of students.

Stages in Test Construction

Izard (2005), in his module on Quantitative Research Methods in Educational Planning, laid out an overview of the test construction steps. His module describes the different stages of developing a test that could be used by teachers in the classroom and even nationwide as an achievement test. The methods revealed important details on how a test should be constructed so that it yields results that are valid, fair and reliable.

Validity, Reliability and Usability

Whatever the test is, it should possess the qualities of a good measuring instrument. The qualities of a good measuring instrument are validity, reliability, and usability (Calmorin, 2004).

Validity is the extent to which a test measures what it claims to measure. One type of validity is content validity. It refers to the extent to which the test reflects the content represented in curriculum statements and the skills implied by that content.

Reliability is another important characteristic of a good test. It refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly; that is, each time the test is administered to a subject, the results should be approximately the same (Cherry, 2005). A reliability coefficient within the range of 0.81 to 1.00 indicates high reliability; 0.61 to 0.80 signifies moderate reliability; 0.41 to 0.60, fair reliability; 0.10 to 0.40, slight reliability; and less than 0.10, virtually no reliability (Shrout, 1998).
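Shrout's (1998) ranges can be written as a simple lookup. This is only a convenience sketch for reading reliability coefficients, not part of the study's analysis:

```python
def shrout_label(r):
    """Verbal label for a reliability coefficient, following Shrout (1998)."""
    if r >= 0.81:
        return "high"
    if r >= 0.61:
        return "moderate"
    if r >= 0.41:
        return "fair"
    if r >= 0.10:
        return "slight"
    return "virtually none"

print(shrout_label(0.84))  # the KR-20 obtained in this study -> "high"
```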

Methods

Research Locale and Duration

Table 1. Timescale and Resource Requirement for the Test Development

Stage | Time | Resources
I. Planning the Test
a. Developing Test Item Objectives | 1 week | Basic Education Curriculum (BEC) list of competencies
b. Constructing Test Grid or Table of Specification | 1 week | DepEd Memo on the Official Number of School Days
II. Preparing the Test
a. Content Analysis | 1 week | Learning Competencies, textbooks
b. Item Writing and Preparation of Answers Key | 1 month | Guidelines on test construction, Mathematics textbooks
III. Review and Testing Process
a. Item Review I | 2 weeks | Test construction team, Research adviser
b. Planning of Item Scoring | 2 days | Answers key, Spreadsheet software
c. Production of Trial Tests I | 1 day | Word-processing
d. Trial Testing I | 1 day | DORSHS 3rd Year students
e. Item Review II | 3 days | Researcher
f. Production of Trial Tests II | 1 day | Word-processing
g. Trial Testing II | 1 day | DORSHS 2nd Year students
h. Item Analysis | 2 weeks | IRT software, Encoder
IV. Assembly of Final Test
a. Revising Test Items | 5 days | IRT concepts and interpretations
b. Finalization of the Test | 2 days | Data Findings and Results
c. IRT Model Fitting | 1 day | IRT ConQuest Modeling Software
d. Identification of the Zone of Proximal Development | 1 day | IRT ConQuest Modeling Software, Learning Competencies
e. Test Difficulty Settings | 1 day | IRT Results, Test Grid

Table 1 presents the timescale and resource requirements for test development, patterned after the Quantitative Research Methods for Educational Planning Module 6 (Izard, 2005). The study was conducted on the Davao Oriental Regional Science High School (DORSHS) campus in Mati City during the month of February of School Year 2010-2011.

Statistical Treatment

This study used an IRT model in describing the data. Item analysis and test reliability (as explained by Brannick, 2006; Baker, 2001; Hambleton, Swaminathan, & Rogers, 1991) were computed through computer software and interpreted as discussed below:

Item Analysis. Item analysis under the IRT two-parameter model was the last part of the item review; it verified how each item performed in the final test tryout. The researcher, together with an IRT expert, did the following:

i. Processing test responses through the IRT model

In determining the difficulty and discrimination indices of the test items, software known as ConQuest: Generalised Item Response Modelling Software was used to construct the Item Characteristic Curve (ICC) of each test item. Here, the responses of all test takers for each item were recorded in a spreadsheet, converted into text format through SPSS, and then run through the ConQuest software. The software then generated the summary of statistical results for each item and the corresponding ICC, as shown in Figure 2.
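The exact spreadsheet layout, SPSS export and ConQuest control files used in the study are not reproduced here, but the data-preparation step can be sketched as follows: raw multiple-choice responses are scored 1/0 against a key and written out as a fixed-width text file of the kind IRT programs typically read. The file names, key and column widths below are illustrative assumptions:

```python
import csv

def score_and_export(csv_path, out_path, key, id_width=5):
    """Read raw responses (examinee ID in column 1, chosen letters after it),
    score them 1/0 against the answer key, and write a fixed-width text file
    (ID right-justified, then one 0/1 digit per item)."""
    with open(csv_path, newline="") as src, open(out_path, "w") as dst:
        for row in csv.reader(src):
            examinee_id, answers = row[0], row[1:]
            scored = "".join("1" if a.strip().upper() == k else "0"
                             for a, k in zip(answers, key))
            dst.write(examinee_id.rjust(id_width) + scored + "\n")

# Hypothetical call: a 60-letter key and a spreadsheet export of responses.
# score_and_export("responses.csv", "responses.dat", key="ABCD" * 15)
```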

Figure 2. Sample Item Characteristic Curve (ICC)

The ICC, as shown in Figure 2, was the basis of the item estimations for the two parameters: item difficulty and item discrimination. The horizontal axis (θ) is the latent ability of the examinees, while the vertical axis P(θ) is the probability that an examinee chooses the correct answer. The ability (θ) at which P(θ) equals 0.5 is the value of the difficulty parameter. The estimated value describing the steepness of the curve is the value of the item discrimination parameter.
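The two-parameter logistic form behind such a curve can be written as P(θ) = 1 / (1 + exp(-a(θ - b))). A minimal sketch with hypothetical parameters; the plotting and estimation details of ConQuest itself are not reproduced:

```python
import numpy as np

def icc(theta, a, b):
    """Two-parameter logistic ICC: P(theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# At theta = b the probability is exactly 0.5, and the slope there is a / 4,
# which is why a is read as the discrimination (steepness) parameter.
theta = np.linspace(-4.0, 4.0, 9)
print(icc(theta, a=0.5, b=0.0))            # rises from near 0 to near 1
print(icc(np.array([0.0]), a=0.5, b=0.0))  # -> [0.5] at theta = b
```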

ii. Difficulty and Item Discrimination

With the item difficulty and item discrimination parameter values given by the ICC, each item of the test was interpreted as follows:

Labels for item discrimination parameter (a) values

Verbal label | Range of values
Negative | less than zero
Zero | 0
Low | 0.01 to 0.20
Moderate | 0.21 to 0.60
High | above 0.60

The discrimination parameters are sometimes called slope parameters. A jumpy curve means that the expected test score responds to true ability unevenly. A flat curve means that the expected score is not very sensitive to differences in true ability. A steeper S-curve (as in Figure 2) means that the expected score is more sensitive to differences in ability; in other words, the test discriminates, or distinguishes, better between persons of different ability, which explains the term discrimination parameter. Thus, the discrimination parameter describes how well an item can differentiate between examinees having abilities below the item location and those having abilities above the item location.

Labels for item difficulty parameter (b) values

Verbal label | Range of values
Very easy | less than -2.00
Easy | -2.00 to -0.50
Average | -0.49 to 0.49
Difficult | 0.50 to 2.00
Very difficult | greater than 2.00

The item difficulty parameter (b) value tells us how easy or how difficult an item is. Under item response theory, an item's difficulty is the point on the ability scale where the probability of a correct response is 0.5. One can find the value of b on the common ability axis at the point for which the predicted probability equals 0.5.

Test reliability. In classical test theory, a separate analysis is needed to estimate reliability. In IRT, there is local reliability, that is, an amount of information at each point of the underlying continuum. Under the IRT model, each item of the test contributes information. For the set of parameters associated with each term in a model, ConQuest computes a separation reliability index, which is an index of the equality of the parameters. In the case of dichotomous data such as the test conducted, the Coefficient Alpha given by ConQuest is equal to KR-20 (Wu, Adams, et al., 2007).
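The same coefficient can be checked outside ConQuest directly from the scored 0/1 response matrix. A minimal sketch of KR-20, which coincides with Cronbach's alpha for dichotomous items, using made-up data:

```python
import numpy as np

def kr20(scores):
    """Kuder-Richardson Formula 20 for a 0/1 matrix (rows = examinees,
    columns = items); equals Cronbach's alpha when all items are dichotomous."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    q = 1.0 - p
    total_var = scores.sum(axis=1).var(ddof=0)   # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Hypothetical 0/1 data for five examinees on four items.
data = [[1, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]]
print(round(kr20(data), 2))
```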

Revising Test Items. The selection of items for inclusion in the final version of the test was determined through the verbal interpretation of each item, as illustrated by Table 2 below.

Table 2. Decision Table for Difficulty and Discrimination Indices

Difficulty Level | Discrimination Level | Decision
Easy | Low | Revise
Easy | Moderate | Retain
Easy | High | Retain
Average | Low | Revise
Average | Moderate | Retain
Average | High | Retain
Difficult | Low | Revise
Difficult | Moderate | Retain
Difficult | High | Retain

An item was rejected if any of the following was observed: (a) its discrimination parameter value was negative or zero, and/or (b) its difficulty parameter value was very easy or very difficult.
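Taken together, Table 2 and these rejection criteria reduce to a single rule. A minimal sketch using the parameter cut-offs defined earlier in this section, checked against three items that appear later in Table 3:

```python
def item_action(difficulty, discrimination):
    """Retain / Revise / Reject decision following Table 2 and the rejection
    criteria: reject on negative or zero discrimination or on a very easy or
    very difficult item; revise on low discrimination; otherwise retain."""
    if discrimination <= 0.0 or abs(difficulty) > 2.0:
        return "Reject"
    if discrimination <= 0.20:        # low discrimination
        return "Revise"
    return "Retain"                   # moderate or high discrimination

print(item_action(-1.08, -0.01))  # item 1  -> Reject
print(item_action(-1.41,  0.20))  # item 3  -> Revise
print(item_action(-0.12,  0.46))  # item 5  -> Retain
```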

To preserve the validity of the entire test and to maintain the total number of items, all rejected items were automatically replaced. The processes of revision and replacement were subjected to IRT item analyses while still conforming to the Test Grid.

Revised and replacement items were re-administered to the target examinees, the second year students, and their test responses were re-run through the IRT software for final verification of whether these items would be retained. Retesting of selected items was done as necessary in preparation for the finalization of the test.

Evaluators of the Constructed Test

A group of experts and experienced persons in the field of test construction was in charge of the development of the test. This group served as consultants in connection with the content validity of the test. Another group, composed of students, took the test for item reviews and test reliability. These two teams were called the test evaluators.

Test Construction Team. The first set of evaluators was the group of persons who took charge of examining the content validity of the test. This group was made up of experts on test development or individuals with experience or training in testing. More importantly, these persons were familiar with the content areas in secondary mathematics and were at least graduates of Master of Arts in Mathematics or Master of Science Teaching in Mathematics.

Try-Out Test Examinees. The second set of evaluators was the group of students in Davao Oriental Regional Science High School (DORSHS). Specifically, the members involved all third year (first group) and all second year (second group) students of the said school. The administration of the test to the first group was vital in making the test structurally ready, while the second group's responses were used to determine the difficulty and discrimination indices of the test items and the reliability coefficient of the overall test.

Validation of the Diagnostic Test

Validation of the diagnostic test did not utilize a statistical analysis. It relied on matching the test items to the objectives and presenting the whole test to the group of experts in the content areas of secondary mathematics for item review. The team guaranteed that the instrument had strong content validity, in which each item represented at least one topic actually taken up with the students rather than asking unrelated questions.

Test Administration

The administration of the test tryouts was done after securing permission from the heads of the offices concerned. The conduct of the test was officially approved by the OIC - Schools Division Superintendent. Likewise, the principal of DORSHS posed no objection to proceeding with the testing process. As an ethical consideration, examinees involved in this study who were minors were given proper information about the research.

The first tryout was administered to 59 third year students of DORSHS. The purpose of administering this test was to determine the structural readiness of the test and to assure its compatibility with the examinees' thinking level. The next tryout was administered to 78 second year high school students of DORSHS for interpretation of test results; specifically, the test was conducted for item analysis purposes.

The researcher introduced to the test takers certain guidelines in taking the test. In answering the test, examinees were told to use the answer sheets provided. They were instructed to shade on the answer sheet the letter that corresponds to the best answer for every test question. They were also told to mark an X on the previous answer if they decided to change their answer, and that they must have only one answer for each item; otherwise, the item would be marked wrong.

Results and Discussion

Test Validation

Validation of the test was carried out through revisions of some items as suggested by the test construction team, in line with the following test attributes: grammar structure, proper usage of punctuation, principles of item construction, setting of item difficulty, typographical precision and, more importantly, the content criterion.

Difficulty and Discrimination Indices of Test Items

Table 3 shows the result of item analysis from the final tryout using IRT through a computer program called ConQuest: Generalized Item Response Modeling Software.

Table 3. Difficulty and Discrimination Indices Result of the Final Tryout

Item Number | Difficulty Index | Difficulty Level | Discrimination Index | Discrimination Level | Action
1 | -1.08 | Easy | -0.01 | Negative | Reject
2 | -1.08 | Easy | 0.34 | Moderate | Retain
3 | -1.41 | Easy | 0.20 | Low | Revise
4 | 0.58 | Difficult | 0.08 | Low | Revise
5 | -0.12 | Average | 0.46 | Moderate | Retain
6 | -0.18 | Average | 0.35 | Moderate | Retain
7 | 0.58 | Difficult | 0.30 | Moderate | Retain
8 | -1.29 | Easy | 0.34 | Moderate | Retain
9 | 0.57 | Difficult | 0.23 | Moderate | Retain
10 | -0.58 | Easy | 0.27 | Moderate | Retain
11 | -0.95 | Easy | 0.36 | Moderate | Retain
12 | 0.52 | Difficult | 0.13 | Low | Revise
13 | -2.53 | Very Easy | 0.14 | Low | Reject
14 | -0.24 | Average | 0.46 | Moderate | Retain
15 | 0.66 | Difficult | 0.46 | Moderate | Retain
16 | 1.11 | Difficult | 0.24 | Moderate | Retain
17 | 0.83 | Difficult | 0.32 | Moderate | Retain
18 | 0.51 | Difficult | 0.33 | Moderate | Retain
19 | -0.29 | Average | 0.33 | Moderate | Retain
20 | -0.29 | Average | 0.34 | Moderate | Retain
21 | 0.27 | Average | 0.34 | Moderate | Retain
22 | 0.33 | Average | 0.50 | Moderate | Retain
23 | -0.18 | Average | 0.27 | Moderate | Retain
24 | 0.45 | Average | 0.30 | Moderate | Retain
25 | 1.35 | Difficult | 0.33 | Moderate | Retain
26 | -0.70 | Easy | 0.34 | Moderate | Retain
27 | -0.76 | Easy | 0.49 | Moderate | Retain
28 | 0.16 | Average | 0.23 | Moderate | Retain
29 | 0.57 | Difficult | 0.34 | Moderate | Retain
30 | -0.18 | Average | 0.30 | Moderate | Retain
31 | -0.82 | Easy | 0.41 | Moderate | Retain
32 | -1.37 | Easy | 0.37 | Moderate | Retain
33 | -0.12 | Average | 0.47 | Moderate | Retain
34 | 0.33 | Average | 0.17 | Low | Revise
35 | 0.10 | Average | 0.39 | Moderate | Retain
36 | -0.70 | Easy | 0.28 | Moderate | Retain
37 | -0.18 | Average | 0.48 | Moderate | Retain
38 | 0.45 | Average | 0.17 | Low | Revise
39 | -0.95 | Easy | 0.34 | Moderate | Retain
40 | -2.12 | Very Easy | 0.14 | Low | Reject
41 | -1.37 | Easy | 0.14 | Low | Revise
42 | -0.01 | Average | 0.35 | Moderate | Retain
43 | -1.22 | Easy | 0.35 | Moderate | Retain
44 | 0.44 | Average | 0.33 | Moderate | Retain
45 | -0.46 | Average | 0.38 | Moderate | Retain
46 | 0.57 | Difficult | 0.44 | Moderate | Retain
47 | 1.27 | Difficult | 0.38 | Moderate | Retain
48 | 0.51 | Difficult | 0.13 | Low | Revise
49 | -0.20 | Average | 0.31 | Moderate | Retain
50 | -0.07 | Average | 0.41 | Moderate | Retain
51 | 1.19 | Difficult | 0.01 | Low | Revise
52 | 1.27 | Difficult | -0.02 | Negative | Reject
53 | 2.19 | Very Difficult | 0.30 | Moderate | Reject
54 | 0.45 | Average | 0.28 | Moderate | Retain
55 | 0.45 | Average | 0.51 | Moderate | Retain
56 | -0.76 | Easy | 0.39 | Moderate | Retain
57 | 0.88 | Difficult | 0.46 | Moderate | Retain
58 | -0.12 | Average | 0.23 | Moderate | Retain
59 | 2.90 | Very Difficult | 0.42 | Moderate | Reject
60 | -0.18 | Average | 0.37 | Moderate | Retain

The IRT discrimination and difficulty parameters for each item suggested that item numbers 1 (negative discrimination), 13 (very easy), 40 (very easy), 52 (negative discrimination), 53 (very difficult) and 59 (very difficult) be rejected. The item numbers subject to revision according to Table 3 were 3, 4, 12, 34, 38, 41, 48 and 51.

Revisions and Replacements of Weak Items

Data generated by ConQuest facilitated the revisions or replacements of some items. Items with negative or low discrimination, such as numbers 1, 3, 4, 12, 34, 38, 41, 48 and 52, were treated with the aid of the point-biserial statistic reported in the IRT output. The following table shows the IRT Generalized Item Analysis Result for item number 1. Other items subject to revision had attributes similar to those in this table.

Table 4. IRT Generalized Item Analysis Result for Item No. 1


item:1 (1)
Cases for this item: 78    Discrimination: -0.01
Item Threshold(s): -1.08   Weighted MNSQ: 1.14
Item Delta(s): -1.08
-----------------------------------------------------------------------
Label   Score   Count   % of tot   Pt Bis   t (p)          PV1Avg:1   PV1 SD:1
-----------------------------------------------------------------------
1       1.00    56      71.79      -0.01    -0.06 (.952)   -0.00      0.72
2       0.00     2       2.56      -0.02    -0.21 (.833)   -0.31      0.05
3       0.00     2       2.56      -0.00    -0.00 (.000)    0.11      1.29
4       0.00    18      23.08       0.02     0.14 (.885)    0.02      0.64
=======================================================================

As observed in Table 4, item number 1, like the other weak items, had a low or negative discrimination. It was noticed that the point biserial of the correct answer was negative, or if not, very close to zero, while those of the wrong options were positive, where ideally they should be negative. This simply means that the wrong options were attractive as correct answers even to examinees with high ability. These options were reviewed and then replaced to improve the item and ease the test takers. Other items were also improved by restructuring the manner of questioning to lessen confusion in answering.
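This kind of distractor check can be reproduced outside ConQuest by correlating each option choice with the total score; for a healthy item the keyed option comes out positive and the distractors negative. The raw choices and totals below are made up for illustration:

```python
import numpy as np

def option_point_biserials(choices, totals, options=("A", "B", "C", "D")):
    """Point-biserial of each option (chosen = 1, not chosen = 0) against
    the total test score, one value per option."""
    choices = np.asarray(choices)
    totals = np.asarray(totals, dtype=float)
    return {opt: float(np.corrcoef((choices == opt).astype(float), totals)[0, 1])
            for opt in options}

choices = ["A", "A", "D", "A", "B", "D", "A", "C"]   # hypothetical item choices
totals = [52, 48, 55, 30, 22, 41, 35, 18]            # hypothetical total scores
print(option_point_biserials(choices, totals))
```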

The IRT Generalized Item Analysis Result also revealed that the Delta values for item numbers 40, 53 and 59 were -2.12, 2.19 and 2.90, respectively, corresponding to very easy and very difficult difficulty levels. Thus, the level of questioning on these items was rephrased to fit the students' level.

Finalization of the Test

Table 5 shows the item difficulty and discrimination results after retesting of the revised and replacement items.

Table 5. Retesting Results on Difficulty and Discrimination Indices

Item Number | Difficulty Index | Difficulty Level | Discrimination Index | Discrimination Level | Action
1 | -0.76 | Easy | 0.39 | Moderate | Retain
3 | 1.27 | Difficult | 0.38 | Moderate | Retain
4 | 1.35 | Difficult | 0.33 | Moderate | Retain
12 | 0.51 | Difficult | 0.33 | Moderate | Retain
13 | -1.37 | Easy | 0.37 | Moderate | Retain
34 | -0.18 | Average | 0.27 | Moderate | Retain
38 | 1.27 | Difficult | 0.38 | Moderate | Retain
40 | -0.70 | Easy | 0.34 | Moderate | Retain
41 | 0.45 | Average | 0.30 | Moderate | Retain
48 | 0.57 | Difficult | 0.23 | Moderate | Retain
51 | 0.83 | Difficult | 0.32 | Moderate | Retain
52 | 0.88 | Difficult | 0.46 | Moderate | Retain
53 | 0.57 | Difficult | 0.44 | Moderate | Retain
59 | 0.57 | Difficult | 0.23 | Moderate | Retain

Test Reliability

As observed in the summary results in Table 6 below, the Coefficient Alpha is 0.84. This is the Kuder-Richardson Formula 20 (KR-20) reliability coefficient (Wu et al., 2007).

Table 6. Summary Statistics from Item Analysis Results


The following results are scaled to assume that a single response was provided for each item.

N | 78
Mean | 28.77
Standard Deviation | 8.56
Variance | 73.30
Skewness | 0.93
Kurtosis | 1.05
Coefficient Alpha | 0.84

Zone of Proximal Development

The following table translates the items into their corresponding learning areas in Mathematics II in which the students have shown 50% mastery, which is known as the Zone of Proximal Development.

Table 7. Students' Achievement on Different Learning Areas in Mathematics II

Students' Group | Mathematics Learning Areas within the Group's Zone of Proximal Development

Upper Top Group (UTG)
- use systems of linear equations to solve problems
- represent the solution set of a system of linear inequalities by graphing
- simplify complex rational algebraic expressions
- solve problems involving expressions with exponents
- describe an arithmetic sequence by giving the formula for the nth term
- solve problems involving geometric mean

Upper Middle Group (UMG)
- represent the solution set of a system of linear inequalities by graphing
- find the solution set of a quadratic equation
- identify rational algebraic expressions
- perform operations on rational algebraic expressions
- solve rational equations and check for extraneous solutions
- solve problems involving rational algebraic expressions
- solve equations involving variations
- demonstrate understanding of expressions
- rewrite algebraic expressions with zero and negative exponents

Lower Middle Group (LMG)
- name two rational numbers where x lies in between
- simplify expressions containing rational exponents using laws of exponents
- solve problems involving radical equations
- solve problems involving arithmetic means
- solve problems involving arithmetic sequences
- describe a geometric sequence given the first few terms
- derive the formula for the sum of the terms of a geometric sequence
- derive the formula for an infinite geometric series
- solve problems involving geometric sequence
- find the solution set of a quadratic equation
- perform operations on radical expressions
- define a system of linear equations in two variables
- translate certain situations in real life to linear inequalities
- draw the graph of a linear inequality in two variables
- solve rational equations which can be reduced to quadratic equations
- use quadratic equations to solve problems
- simplify rational algebraic expressions
- identify variation relationships in real life
- represent variation relationships as equations
- evaluate numerical expressions involving integral exponents
- identify expressions which are perfect squares or perfect cubes
- find the square root or cube root of expressions
- rewrite expressions with rational exponents as radical expressions
- simplify the radical expression
- solve radical equations
- list the next few terms of a sequence given several consecutive terms
- derive a mathematical rule for generating the sequence
- given few terms of an arithmetic sequence, find the common difference
- given two terms of an arithmetic sequence, find the specified nth term
- derive the formula for the sum of the n terms of an arithmetic sequence
- describe a geometric sequence given the first few terms
- find the sum of the terms of a geometric sequence
- define a system of linear inequalities
- describe an arithmetic sequence by giving the first few terms

Lower Bottom Group (LBG)
- solve systems of linear equations in two variables
- distinguish a quadratic equation from a linear equation
- translate verbal expressions into rational algebraic expressions
- rationalize a fraction whose denominator contains square roots
- define the sum of an arithmetic sequence

Table 7 is a mapping of students' scores to their zone of proximal development (ZPD), the learning competencies on which they had a 50% mastery level. Thus, the competencies above their ZPD were the least learned learning areas of the group, while those below their ZPD were the competencies the group had mastered.
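One way to express this reading of Table 7: given a group ability estimate and each item's difficulty, items near the ability mark the ZPD, clearly easier items mark mastered content, and clearly harder items mark least learned content. The band width and the ability value below are illustrative assumptions, not the study's actual cut-offs:

```python
def classify_for_group(theta, item_difficulties, band=0.5):
    """Split items into mastered / ZPD / least learned for a group with
    ability estimate theta, using each item's difficulty b."""
    result = {"mastered": [], "zpd": [], "least_learned": []}
    for item, b in item_difficulties.items():
        if b < theta - band:
            result["mastered"].append(item)
        elif b > theta + band:
            result["least_learned"].append(item)
        else:
            result["zpd"].append(item)
    return result

# Hypothetical group ability of 0.0 and a few difficulties from Table 3.
print(classify_for_group(0.0, {"item_2": -1.08, "item_21": 0.27, "item_25": 1.35}))
```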

Settings on Thinking Hierarchy

Table 8 shows the results of the students' thinking level measured by each item, as set by intuition and by IRT calculation. It was revealed in the table that out of 60 items, only 23 had equal settings of difficulty level between the researcher's intuition and the test response results.

Table 8. Item Difficulty Level (Intuition versus IRT Perspective)

Item Number | Level of Difficulty by Researcher's Intuition | Level of Difficulty by Test Responses
1 | 1st level | 1st level
2 | 2nd level | 1st level
3 | 3rd level | 1st level
4 | 2nd level | 3rd level
5 | 1st level | 2nd level
6 | 2nd level | 2nd level
7 | 3rd level | 3rd level
8 | 1st level | 1st level
9 | 2nd level | 3rd level
10 | 2nd level | 1st level
11 | 3rd level | 1st level
12 | 1st level | 3rd level
13 | 1st level | 1st level
14 | 2nd level | 2nd level
15 | 2nd level | 3rd level
16 | 3rd level | 3rd level
17 | 3rd level | 3rd level
18 | 3rd level | 3rd level
19 | 1st level | 2nd level
20 | 2nd level | 2nd level
21 | 3rd level | 2nd level
22 | 1st level | 2nd level
23 | 2nd level | 2nd level
24 | 2nd level | 2nd level
25 | 3rd level | 3rd level
26 | 1st level | 1st level
27 | 2nd level | 1st level
28 | 2nd level | 2nd level
29 | 2nd level | 3rd level
30 | 2nd level | 2nd level
31 | 3rd level | 1st level
32 | 2nd level | 1st level
33 | 2nd level | 2nd level
34 | 3rd level | 2nd level
35 | 3rd level | 2nd level
36 | 2nd level | 1st level
37 | 3rd level | 2nd level
38 | 1st level | 2nd level
39 | 2nd level | 1st level
40 | 2nd level | 1st level
41 | 3rd level | 1st level
42 | 3rd level | 2nd level
43 | 3rd level | 1st level
44 | 3rd level | 2nd level
45 | 2nd level | 2nd level
46 | 2nd level | 3rd level
47 | 3rd level | 3rd level
48 | 3rd level | 3rd level
49 | 3rd level | 2nd level
50 | 1st level | 2nd level
51 | 3rd level | 3rd level
52 | 3rd level | 3rd level
53 | 2nd level | 3rd level
54 | 2nd level | 2nd level
55 | 2nd level | 2nd level
56 | 2nd level | 1st level
57 | 2nd level | 3rd level
58 | 1st level | 2nd level
59 | 2nd level | 3rd level
60 | 2nd level | 2nd level

Conclusions and Recommendations

Conclusion

In view of the findings of this study, the following conclusions were drawn:

1. The results of this study reflect that the diagnostic test developed is valid. It is an instrument that can measure the desired trait of second year students in Mathematics II.
2. The final output of the test showed that the overall test difficulty indices were within the range of -1.37 to 1.35. The diagnostic test therefore has an average level of difficulty.
3. The discrimination indices were within the range of 0.23 to 0.51. Thus, the test is a moderately discriminating instrument.
4. With the KR-20 coefficient at 0.84, the developed diagnostic test, following Shrout (1998), indicates high reliability.
5. Based on the preceding findings, the diagnostic test developed in this study is valid, highly reliable and fair. Hence, the test is standardized and can now be used in assessing Mathematics learning of second year students in DORSHS.
6. The test identified the least learned competencies, the zone of proximal development, and the mastered competencies of DORSHS second year high school students in Mathematics II.
7. The study revealed a considerable difference between intuitive item difficulty settings and the actual item difficulty results from the students.

Recommendations

Based on the findings and conclusions of this study, the following are the general recommendations:

1. The standardized diagnostic test can already be used in assessing Mathematics learning of second year students in DORSHS prior to any Achievement Test review program.
2. Other studies, such as determining the coherence of the test with the second year National Achievement Test (NAT), can be made.
3. A further validation of the test using IRT with second year students of other schools, or with another batch of second year students in DORSHS, is highly encouraged.
4. The test and its findings can also be utilized for any comparative study on item analysis between the Classical Test Method and IRT.
5. The result of this study can be utilized for any possible research on students' assessment.

References

Baker, F. (2001). The Basics of Item Response Theory (2nd ed.). United States of America: ERIC Clearinghouse on Assessment and Evaluation.

Bontempo, B. D. (2009). Measurement Art: The Point-Biserial Correlation Coefficient. Retrieved June 30, 2009, from http://www.mountainmeasurement.com/blog/?p=148

Brannick, M. (2006). Concepts from IRT that Move Beyond Classical Test Theory. Multiple Regression and Research Methods. Retrieved November 22, 2009, from http://luna.cas.usf.edu/~mbrannic/files/pmet/irt.htm

Calmorin, L. P. (2004). Educational Research Measurement and Evaluation (3rd ed.). Manila, Philippines: National Book Store, Inc.

Cherry, K. (2009). Reliability - What Is Reliability. Retrieved November 20, 2009, from http://psychology.about.com/od/researchmethods/

Davies, A., Arbuckle, M., & Bonneau, D. (2005). Assessment for Learning: Planning for Professional Development. Retrieved October 14, 2009, from http://electronicportfolios.org/afl/Assessment4learning.pdf

Educational Diagnostic Prescriptive Services (2009). Educational Diagnostic Prescriptive. Retrieved October 20, 2009, from http://homeschoolcreations.blogspot.com/2009/09/educational

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. United States of America: SAGE Publications, Inc.

Izard, J. (2005). Quantitative Research Methods in Educational Planning: Overview of Test Construction. Paris, France: International Institute for Educational Planning/UNESCO.

Kim, S. H., Cohen, A. S., & Park, T. H. (1995). Detection of Differential Item Functioning in Multiple Groups. Journal of Educational Measurement, 32, 261-276.

Shrout, P. E. (1998). Measurement Reliability and Agreement in Psychiatry. Statistical Methods in Medical Research. United States of America: SAGE Publications, Inc.

Vygotsky, L. S. (1978). Mind and Society: The Development of Higher Psychological Processes. Cambridge, MA: Harvard University Press.
