
Construction and Development of a Test Instrument

Carlo Magno
Ateneo De Manila University

Abstract
This study investigated the psychometric properties of a one-unit test in geography for grade
three students and analyzed its items. The skills and contents of the test were based on the
topics covered in the first quarter, as indicated in the syllabus. A table of specifications was
constructed to frame the items into three cognitive skills: knowledge, comprehension, and
application. The test has a total of 40 items in 10 different test types. The items were reviewed
by a social studies teacher and an academic coordinator. Split-half reliability was computed,
and a correlation of .3 was obtained. The test types were intercorrelated, and the resulting
coefficients ranged from low to high. The item analysis showed that most of the items turned
out to be easy and that most are good items.

Introduction
The purpose of this study is to construct a one-unit geography test for grade three
students and to analyze its items. The test measures grade three students' achievement in
Philippine Geography for the first quarter, and it served as a quarterly test. Once
standardized through validation and reliability studies, the test would be used as a future
achievement test in Philippine Geography.
There is a need to construct and standardize an achievement test in Philippine
Geography since none is yet available locally.
The test is in the Filipino language because of the nature of the subject. The subject covers
the following topics: (1) Kapuluan ng Pilipinas (The Philippine Archipelago); (2) Malalaki at
Maliliit na Pulo ng Bansa (Large and Small Islands of the Country); (3) Mapa at Uri ng Mapa
(Maps and Types of Maps); (4) Mga Direksyon (Directions); (5) Anyong Lupa at Anyong Tubig
(Landforms and Bodies of Water); (6) Simbolong Ginagamit sa Mapa (Symbols Used on Maps);
(7) Panahon at Klima (Weather and Climate); (8) Mga Salik na may Kinalaman sa Klima
(Factors Related to Climate); (9) Mga Pangunahing Hanapbuhay sa Bansa (Main Livelihoods in
the Country); (10) Pag-aangkop sa Kapaligiran (Adapting to the Environment). The topics were
based on the lessons provided by the Elementary Learning Competence from the Department of
Education.
The test aims for the students to: (1) identify the important concepts and definitions; (2)
comprehend and explain the reasons for given situations and phenomena; (3) use and analyze
different kinds of maps to identify important symbols and show familiarity with places.

Method

Search for Skills and Content Domain


The skills and contents of the test were identified based on the topics covered by grade
three students in the first quarter. The test is intended to be administered as the first quarter
exam. The skills intended for the first quarter's topics include identifying concepts and terms,
comprehending explanations, applying principles to situations, using and analyzing maps,
synthesizing different explanations for a particular event, and evaluating the truthfulness and
validity of reasons and statements through inference.
In constructing the test, a table of specifications was first drawn up to plan the
distribution of items across topics and the objectives to be attained by the students.

Table 1. Table of Specifications for a Unit in Philippine Geography for Grade 3

Columns, in order:
Nilalaman (Content)
Knowledge: Natutukoy ang mahahalagang konsepto at kahulugan (identifies the important concepts and definitions)
Comprehension: Nauunawaan ang mga dahilan sa mahahalagang kapaliwanagan sa bawat sitwasyon (understands the reasons behind the important explanations for each situation)
Application: Nagagamit at nasusuri ang mapa sa pagtukoy ng mga mahahalagang pananda (uses and analyzes maps to identify important markers)
Total Number of Items

Nilalaman                                Items by skill    Total Number of Items
Kapuluang Pilipinas                      4                 4
Malalaki at maliliit na pulo ng bansa    4                 4
Mapa at uri ng mapa                      4                 4
Mga direksyon                            6                 6
Anyong lupa at anyong tubig              5                 5
Simbolong ginagamit sa mapa              4                 4
Panahon at klima                         2  3              5
Mga salik na may kinalaman sa klima      2                 2
Mga pangunahing hanapbuhay ng bansa      3                 3
Pag-aangkop sa kapaligiran               3                 3
Total Number of Items                    11  16  13        40
Percentage                               27.5%  40%  32.5% 100%

Table of Specifications
The Table of Specifications contains 10 topics that together make up a unit on Philippine
Geography. Of the items, 27.5% were placed at the knowledge level, 40% at the
comprehension level, and 32.5% at the application level. The largest share of items was
concentrated on comprehension, since the main purpose is for the students to understand
the unit on Philippine Geography, which is the foundation knowledge for the lessons of the
entire school year. Mastering this base knowledge will help students explain and
give reasons in the lessons that follow. A substantial share of the items was also placed at
the application level, since the students need to learn in practice how to use maps and how
they can benefit from the maps and figures of the unit. Fewer items were placed at the knowledge
level, since there is little need for the students to recall and memorize concepts and terms. The
main goal of this unit is to gain the ability to explain geographical principles of Philippine
geography and their relation to our culture.

Item Writing
Forty items were constructed based on the Table of Specifications (see Table 1). A 40-item
test is appropriate for grade three students, since it is neither too long nor too short for their
capacity. In determining the number of items to place on the test, the students' attention span
and the time frame for testing were also considered. In the quarterly test, the test for each
subject is given a time limit of one hour.
The items were based mainly on what the students gained from classroom discussion,
reflection on the topic, work exercises, group work, activities in school, and the
book.
The items were divided into 10 parts. Test I contains four True or False items. Test II
contains 5 matching-type items. Test III contains 2 multiple-choice items whose stems are
based on a figure. Test IV contains 4 items based on 2 situations. Test V contains 4
multiple-choice items answered using a physical map. Test VI is another multiple-choice test
that concentrates on the use of different types of maps. Test VII is a 6-item short-answer test
in which the students supply the direction asked for in each question, based on a map. Test
VIII is a 5-item interpretive exercise in which a situation is given, inferences are listed for
each situation, and the students choose the best inference for the given situation. Test IX is a
three-item multiple-choice test in which the students answer based on a figure of a Philippine
map and a given weather condition. Test X is a three-point essay question scored according
to (a) correctness of the answer (1.5 pts); (b) explanation (1 pt); and (c) following
instructions (0.5 pt). Two raters evaluated the answers to the essay question.

Content Validation
The test was content validated and reviewed by a Social Studies teacher from Ateneo
de Davao. The suggestions were considered and the test was revised accordingly. Before
the final draft was prepared for administration, the test was also checked by the Academic
Coordinator of the school where it would be administered, both for the appropriateness of the
items for grade three students and for typographical errors. For the content validation, the
list of topics covered and the table of specifications were provided so that the reviewers could
determine whether the items adequately covered the topics studied.

Test Administration
Respondents. There were 88 grade 3 students in three sections who took the test as their
Quarter Examination. Of the 88 students, the top 40 were included in the sample. The upper
and lower groups each consist of 11 respondents (27%), whose scores were subjected to item
analysis for difficulty and discrimination.
Procedure. The grade 3 Sibika at Kultura teacher directly instructed the two other
teachers who would administer the test in the two other sections. Care was taken to keep the
testing conditions constant and to control other factors that could affect the students'
performance on the test. The test was administered simultaneously to the three classes in the
morning, as the first test taken that day. The students were given one hour. Some students
finished ahead of time and were advised to review their work. When the bell rang, the teacher
instructed the students to pass their papers forward. All the test papers were gathered and
checked. After a week, the students were informed of their results, and the top 40 students
included in the study sample were informed of the teachers' interest in their test. A letter
was sent to the parents to inform them of the purpose of the research and of the students'
scores; the parents replied positively.
Data Analysis. The scores were tabulated and encoded to make the computation of the
results easier. The split-half method was used to estimate the internal consistency of the
scores. The odd and the even items were separated and correlated using the Pearson
product-moment correlation coefficient (r). The upper and lower groups were formed from the
highest 27% and the lowest 27% of the 40 respondents. The item analysis consisted of
computing each item's difficulty and discrimination. Each item was then given a remark,
based on the standards for difficulty and discrimination, on whether it is a good item or not.
The Coefficient of Concordance was used to estimate the inter-rater reliability of the essay
part of the test, which was scored by two judges using set criteria.
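The split-half procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the computation actually used in the study: it assumes each examinee's answers are stored as a list of 0/1 item scores, and the function names and the tiny data set are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_r(responses):
    """Correlate odd-item totals with even-item totals.

    `responses` holds one list of 0/1 item scores per examinee
    (an assumed layout, for illustration only).
    """
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return pearson_r(odd_totals, even_totals)


# Hypothetical responses of four examinees to a four-item test.
demo = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
half_test_r = split_half_r(demo)
```

The resulting coefficient describes the consistency of a half-length test only, so it is normally stepped up with the Spearman-Brown formula before it is interpreted.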

Result and Discussion


Reliability
The test's reliability was estimated through the split-half method by correlating the
odd-numbered and even-numbered items. The resulting internal consistency is 0.3, which
indicates a low but definite correlation among the items. The low correlation between the odd-
and even-numbered items can be attributed to the different topic contents within the 40-item
test. It would have been more appropriate to construct a large pool of items for the 10 content
topics or factors that the test covers, but 40 items is the school's usual standard for a
quarterly test, and the test was administered as a quarterly test with its usability in mind.
A split-half correlation also reflects the reliability of only half of the test, which partly
explains the low value of the coefficient. The split-half coefficient was therefore transformed
into a Spearman-Brown coefficient. The resulting Spearman-Brown coefficient is 0.46, which
means that the items have a moderate relationship.
Also, as a rule of thumb there should be at least 30 pairs of scores to be correlated,
but in this case only 18 were correlated. The last item was not included because it has
no partner item to be correlated with; the remaining items were essay type and were subjected
to a different analysis. The low coefficient of internal consistency can also be attributed to the
various types of test used, which introduce variation and differences in the
performance of the respondents. In other words, the respondents may respond and perform
differently on each type of test.
The test cannot be judged on its overall homogeneity, since it
contains several topics and several response formats, and respondents perform
differently on different types of test. The test has 10 types measuring different skills such as
identifying the important concepts and definitions, comprehending and explaining the
reasons for given situations and phenomena, and using and analyzing different kinds of maps to
identify important symbols and show familiarity with places.
The difficulty, however, is that the content domains included in the test are all part of one
general topic, Philippine geography. To examine the consistency among the 9 different
contents, a correlation matrix was computed.
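As a quick check of the step-up described above, the Spearman-Brown formula for a test of doubled length is r_full = 2r / (1 + r), where r is the half-test correlation. The sketch below applies it to the reported value of 0.3; the function name is an illustrative choice, not something taken from the study.

```python
def spearman_brown(r_half):
    """Step up a half-test correlation to an estimate of full-test reliability."""
    return 2 * r_half / (1 + r_half)


# Reported split-half correlation of 0.3 gives roughly 0.46.
full_test_reliability = spearman_brown(0.3)
print(round(full_test_reliability, 2))
```

With r = 0.3 the formula gives 0.6 / 1.3, about 0.46, which agrees with the coefficient reported in the text.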

Table 2. Intercorrelation among the Nine Contents of the Test

        I       II      III     IV      V       VI      VII     VIII    IX
I       --
II      -0.13   --
III     0.98*   1       --
IV      0.18    -0.81*  -0.48*  --
V       -0.21   -0.42   0.47*   -0.19   --
VI      0.19    0.58*   0.47*   0.6     -0.65*  --
VII     -0.73   0.28    0.41*   -0.56*  0.73*   -0.24   --
VIII    0.07    -0.19   -0.47*  0.96*   0.08    -0.8    -0.25   --
IX      0.85*   -0.58*  0.48*   0.15    0.97*   -0.52*  -0.52*  0.28    --

There is a high relationship between Test I and Test IX: the higher the scores on
identification of concepts, the higher the scores on comprehension of the weather map. A high
relationship also exists between Test V and Test IX: the higher the scores on the interpretation
of a physical map, the higher the scores on interpretation of the weather map. There is also a
high relationship between Test IV and Test VIII: the higher the scores on inference about the
Philippine islands, the higher the scores on comprehension of weather. In general, the
intercorrelations among the contents are crude estimates, because each test type has few
items and the number of items is not equal across test types. The pairing in the computation
was based on the minimum number of items in each test type.

Item Difficulty and Discrimination Index


To evaluate the quality of each item in the test, an item analysis was done by
determining each item's difficulty and discrimination index. The proportion of examinees
answering each item correctly was evaluated according to the scale below.

Difficulty Index    Remark
.76 or higher       Easy Item
.25 to .75          Average Item
.24 or lower        Difficult Item

Source: Lamberte, B. (1998). Determining the Scientific Usefulness of Classroom Achievement
Test. Cutting Edge Seminar. De La Salle University.
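The item statistics can be illustrated with a short sketch. It assumes, consistent with the values listed in Table 3, that PH and PL are the proportions of the high and low groups answering an item correctly, that the difficulty index is their average, and that the discrimination index is their difference. The function names are hypothetical, and only the difficulty scale above is encoded, since the cut-offs behind the discrimination remarks are not stated in the text.

```python
def item_statistics(high_correct, low_correct, group_size):
    """Return (PH, PL, difficulty index, discrimination index) for one item."""
    p_high = high_correct / group_size
    p_low = low_correct / group_size
    difficulty = (p_high + p_low) / 2   # average proportion correct
    discrimination = p_high - p_low     # gap between high and low groups
    return p_high, p_low, difficulty, discrimination


def difficulty_remark(difficulty):
    """Apply the difficulty scale: .76+ easy, .25 to .75 average, .24 or lower difficult."""
    if difficulty >= 0.76:
        return "Easy Item"
    if difficulty >= 0.25:
        return "Average Item"
    return "Difficult Item"


# Upper and lower groups: 27% of the 40 respondents, rounded to 11 each.
group_size = round(0.27 * 40)

# Item 1 as listed in Table 3: 11 correct in the high group, 7 in the low group.
ph, pl, diff, disc = item_statistics(11, 7, group_size)
print(round(pl, 3), round(diff, 3), round(disc, 3), difficulty_remark(diff))
```

For item 1 this gives PL of about 0.636, a difficulty index of about 0.818, and a discrimination index of about 0.364, matching the row for that item in Table 3.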

Table 3 gives each item's difficulty index and discrimination index. The difficulty indices
show that 67.6% of the items are easy and 32.43% are of average difficulty.
Since the test was constructed for grade three students, the teacher pitched it at the level of
the students' capacity and ability. It may also mean that the students had gained mastery of
the subject matter, so that most of them were able to answer the items correctly. It should be
noted that the easiness or difficulty of an item is determined by the proportion of students
who answered it correctly. In this case, most of the respondents answered correctly, which is
why most of the items turned out to be easy. In general, the test can be described as fairly
easy, since most of the items had a difficulty index of .76 or above.
Table 3 also gives the discrimination index of each item. About 27% of the items
are considered poor. These items were rejected because the low group answered them
correctly almost as often as the high group. The pattern is evident in the poor items, namely
items 2, 4, 9, 13, 15, 30, 31, 32, 33, and 34. Very few items are marginal and subject to
improvement: only 8% (3 items) are remarked as marginal, since the proportions correct in
the low and high groups are almost the same. This means that both the high and the low
groups can answer these items about equally well. Another 21.6% (8 items) are reasonably
good items, since there is an adequate gap between the high and low groups. Finally, 16.21%
of the items are good items and 24.3% are very good items; for these items there is a wide gap
between the high group and the low group.

Interrater Reliability
The coefficient of concordance was used to determine the degree of agreement between
the two raters who judged the essay part of the test. The essay measures the
students' knowledge of how farmers adapt in farming. The criteria used for rating the
essay were: (a) at least 2 answers are correct (1.5 pts); (b) the answer was explained (1 pt); and
(c) the instruction on answering was followed (0.5 pt). A value of W = 0.74 was computed,
indicating fairly close, though not perfect, concordance between the raters. This means that
the two raters showed a small variation in rating the answers to the essay. The small error
variance can be attributed to the different dispositions of the two raters. The first rater was
the actual teacher of the subject, while the second rater was also an Araling Panlipunan
teacher but one teaching at a higher level. They differed in how they viewed the answers even
though they had discussed the rating procedure at the start.
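For readers who want to retrace the concordance figure, the following is a minimal sketch of Kendall's coefficient of concordance without a correction for ties: W = 12S / (m^2 (n^3 - n)), where m is the number of raters, n the number of cases, and S the sum of squared deviations of the rank sums from their mean. The helper names are hypothetical. Plugging in the sum of squared deviations listed in Table 4 (15984.8, with m = 2 and n = 40) gives approximately the reported value.

```python
def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def w_from_s(s, m, n):
    """Kendall's W from the sum of squared rank-sum deviations (no tie correction)."""
    return 12 * s / (m ** 2 * (n ** 3 - n))


def kendalls_w(ratings):
    """Kendall's W for a list of m raters, each giving scores to the same n cases."""
    m, n = len(ratings), len(ratings[0])
    rank_lists = [average_ranks(r) for r in ratings]
    rank_sums = [sum(case_ranks) for case_ranks in zip(*rank_lists)]
    mean_sum = sum(rank_sums) / n
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return w_from_s(s, m, n)


# Using the sum of squared deviations reported in Table 4.
print(round(w_from_s(15984.8, m=2, n=40), 2))
```

W ranges from 0 (no agreement) to 1 (identical rankings), so two raters who rank every essay in the same order would obtain W = 1 from `kendalls_w`.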

Conclusion
The low internal consistency obtained is due to the different subject contents in the test and
to the fact that each test type measures different skills. These two factors affected the internal
consistency of the test. It is difficult to make the test entirely uniform, since the subject
contents are required as minimum learning competencies by the Department of Education,
and the listed subject contents are the planned focus for the first quarter in the school's
subject matter budgeting. An intercorrelation matrix was computed to observe the
relationships among the test types. It was found that the higher the scores on the
interpretation of a physical map, the higher the scores on the interpretation of the weather
map, and that the higher the scores on inference about the Philippine islands, the higher the
scores on comprehension of topics about weather. A high correlation coefficient was found
between these types. The results may not be very accurate, however, since the test types
compared in the matrix do not have equal numbers of items and only the minimum number of
items was included in the analysis. It is recommended that each test type be given an equal
number of items so that the correlation analysis yields more accurate results. Agreement
between the two raters on the essay type was not perfect (W = 0.74), since they had different
perceptions in giving points for the answers. The item difficulty indices showed that most of
the items are easy, which suggests that the students have gained mastery of the subject
matter. The discrimination indices showed that the items are spread across the levels of
discriminating power: poor (27%), marginal (8%), reasonably good (22%), good (16%), and
very good (24%).

Table 3. Item Difficulty and Discrimination Index

Item No.  Total  High Group  Low Group  PH     PL     Difficulty Index  Remark        Item Discrimination  Remark
1         32     11          7          1      0.636  0.818             Easy Item     0.364                Good item
2         26     7           6          0.636  0.545  0.591             Average Item  0.091                Poor item
3         34     11          7          1      0.636  0.818             Easy Item     0.364                Good item
4         38     11          10         1      0.909  0.955             Easy Item     0.909                Poor item
5         36     11          8          1      0.727  0.864             Easy Item     0.273                Reasonably Good item
6         34     11          5          1      0.455  0.727             Average Item  0.545                Very Good item
7         33     10          8          0.909  0.727  0.818             Easy Item     0.182                Marginal item
8         34     11          8          1      0.909  0.864             Easy Item     0.273                Reasonably Good item
9         39     11          10         1      0.634  0.955             Easy Item     0.091                Poor item
10        24     9           4          0.818  0.456  0.591             Average Item  0.455                Very Good item
11        23     9           5          0.818  0.273  0.636             Average Item  0.364                Good item
12        22     10          3          0.818  0.818  0.545             Average Item  0.545                Very Good item
13        36     11          9          0.909  0.727  0.864             Easy Item     0.091                Poor item
14        34     11          8          1      1      0.864             Easy Item     0.273                Marginal item
15        39     10          11         1      1      1                 Easy Item     0                    Poor item
16        28     10          5          0.909  0.455  0.682             Average Item  0.455                Very Good item
17        28     11          5          0.909  0.455  0.682             Average Item  0.455                Very Good item
18        34     10          7          1      0.636  0.818             Easy Item     0.364                Good item
19        24     11          5          0.909  0.455  0.682             Average Item  0.455                Very Good item
20        37     11          8          1      0.727  0.864             Easy Item     0.273                Reasonably Good item
21        29     11          5          1      0.455  0.727             Average Item  0.545                Very Good item
22        26     11          5          1      0.455  0.727             Average Item  0.545                Very Good item
23        33     11          7          1      0.636  0.818             Easy Item     0.364                Good item
24        37     11          8          1      0.727  0.864             Easy Item     0.273                Reasonably Good item
25        37     7           8          1      0.818  0.864             Easy Item     0.273                Reasonably Good item
26        24     11          9          1      0.364  0.909             Easy Item     0.182                Marginal item
27        37     11          4          0.636  0.818  0.5               Average Item  0.273                Reasonably Good item
28        35     11          9          1      0.636  0.909             Easy Item     0.182                Marginal item
29        39     11          7          1      0.909  0.818             Easy Item     0.364                Good item
30        40     11          10         1      1      0.955             Easy Item     0.091                Poor item
31        40     11          11         1      1      1                 Easy Item     0                    Poor item
32        40     11          11         1      1      1                 Easy Item     0                    Poor item
33        40     11          11         1      1      1                 Easy Item     0                    Poor item
34        40     11          11         1      1      1                 Easy Item     0                    Poor item
35        27     11          3          1      0.273  0.636             Easy Item     0.727                Marginal item
36        24     9           4          0.818  0.364  0.591             Average Item  0.455                Very Good item
37        36     11          7          1      0.636  0.818             Easy Item     0.364                Good item

Table 4. Coefficient of Concordance
Case R1 Rank R2 Rank Sum D D²
1 3 12 3 11.5 23.5 -16.5 272.25
2 3 12 3 11.5 23.5 -16.5 272.25
3 3 12 3 11.5 23.5 -16.5 272.25
4 3 12 3 11 23 -17 289
5 3 12 3 11 23 -17 289
6 3 12 3 11 23 -17 289
7 3 12 3 11 23 -17 289
8 3 12 3 11 23 -17 289
9 3 12 3 11 23 -17 289
10 3 12 3 11 23 -17 289
11 3 12 3 11 23 -17 289
12 3 12 3 11 23 -17 289
13 3 12 3 11 23 -17 289
14 3 12 3 11 23 -17 289
15 3 12 3 11 23 -17 289
16 3 12 3 11 23 -17 289
17 3 12 3 11 23 -17 289
18 3 12 3 11 23 -17 289
19 3 12 3 11 23 -17 289
20 3 12 3 11 23 -17 289
21 3 12 2.5 24 36 -4 16
22 3 12 2.5 24 36 -4 16
23 3 12 2.5 24 36 -4 16
24 3 12 2.5 24 36 -4 16
25 2.5 26 3 11 37 -3 9
26 2.5 26 2.5 24 50 10 100
27 2.5 26 3 11 37 -3 9
28 2 29 1.5 28 57 17 289
29 2 29 1.5 28 57 17 289
30 2 29 1 33.5 62.5 22.5 506.25
31 1.5 33.5 1 33.5 67 27 729
32 1.5 33.5 1 33.5 67 27 729
33 1.5 33.5 1 33.5 67 27 729
34 1.5 33.5 2 30.5 64 24 576
35 1.5 33.5 2 30.5 64 24 576
36 1.5 33.5 1.5 28 61.5 21.5 462.25
37 1 37 0 38 75 35 1225
38 0.5 38 0 38 76 36 1296
39 0 39.5 0.5 36 75.5 35.5 1260.25
40 0 39.5 0 38 77.5 37.5 1406.25
Total of rank sums: 1600.5; sum of D²: 15984.8
Mean of rank sums: 40.025; W = .74977

APPENDIX C
Table 4. Coefficient of Concordance
Case R1 Rank R2 Rank D D²
1 3 10.5 3 11.5 -1 1
2 3 10.5 3 11.5 -1 1
3 3 10.5 3 11.5 -1 1
4 3 10.5 3 11.5 -1 1
5 3 10.5 3 11.5 -1 1
6 3 10.5 3 11.5 -1 1
7 3 10.5 3 11.5 -1 1
8 3 10.5 3 11.5 -1 1
9 3 10.5 3 11.5 -1 1
10 3 10.5 3 11.5 -1 1
11 3 10.5 3 11.5 -1 1
12 3 10.5 3 11.5 -1 1
13 3 10.5 3 11.5 -1 1
14 3 10.5 3 11.5 -1 1
15 3 10.5 3 11.5 -1 1
16 3 10.5 3 11.5 -1 1
17 3 10.5 3 11.5 -1 1
18 3 10.5 3 11.5 -1 1
19 3 10.5 3 11.5 -1 1
20 3 10.5 3 11.5 -1 1
21 3 10.5 2.5 25 -14.5 210.25
22 3 10.5 2.5 25 -14.5 210.25
23 3 10.5 2.5 25 -14.5 210.25
24 3 10.5 2.5 25 -14.5 210.25
25 2.5 22 3 11.5 10.5 110.25
26 2.5 22 2.5 25 -3 9
27 2.5 22 3 11.5 10.5 110.25
28 2 25 1.5 31 -6 36
29 2 25 1.5 31 -6 36
30 2 25 1 34.5 -9.5 90.25
31 1.5 29.5 1 34.5 -5 25
32 1.5 29.5 1 34.5 -5 25
33 1.5 29.5 1 34.5 -5 25
34 1.5 29.5 2 28.5 1 1
35 1.5 29.5 2 28.5 1 1
36 1.5 29.5 1.5 31 -1.5 2.25
37 1 33 0 39 -6 36
38 0.5 34 0 39 -5 25
39 0 35.5 0.5 37 -1.5 2.25
40 0 35.5 0 39 -3.5 12.25