
Assessing and Evaluating Learning
Outline:
Introduction
Definition of assessment and evaluation
Aim of student evaluation
Steps in student evaluation
The basic principles of assessment/evaluation
Regulation of learning by the teacher
Types of evaluation
Qualities of a test
Characteristics of measurement instrument
Advantages and disadvantages of different types of tests
Assessing and Evaluating Learning
Definition of assessment:
Assessment is the process of gathering information on student learning.

Definition of evaluation:
Evaluation is the process of analyzing, reflecting upon, and summarizing
assessment information, and making judgments and/or decisions based on
the information collected.
Aim of student evaluation
Incentive to learn
Feedback to student
Modification of learning activities
Selection of students
Success or failure
Feedback to teacher
Protection of society
Types of evaluation

1- Formative evaluations:
It is an ongoing classroom process that keeps students and educators informed of students' progress toward program learning objectives.
The main purpose of formative evaluation is to improve instruction and student learning.
2- Summative evaluations
It occurs most often at the end of a unit.

The teacher uses summative evaluation to determine what has been learned
over a period of time, to summarize student progress, and to report to
students, parents and educators on progress relative to curriculum
objectives.

3- Diagnostic evaluation
It usually occurs at the beginning of the school year or before a new unit.
It identifies students who lack prerequisite knowledge, understanding or
skills.
Diagnostic testing also identifies student interests.
Diagnostic evaluation provides information essential to teachers in designing
appropriate programs for all students.
Steps in student evaluation
The criteria of the educational objectives

Development and use of measuring instruments

Interpretation of measurement data

Formulation of judgment and taking of appropriate action


Principles of Evaluation
Evaluation should be
1. Based on clearly stated objectives
2. Comprehensive
3. Cooperative
4. Used Judiciously
5. Continuous and integral part of the teaching-learning process
Qualities of a Good Measuring Instrument
Validity: the extent to which the instrument measures what it is intended to measure.
Reliability: the consistency with which an instrument measures a given variable.
Objectivity: the extent to which independent and competent examiners agree on what constitutes a good answer for each of the elements of a measuring instrument.
Practicability: the overall simplicity of the use of a test, both for the test constructor and for students.
Qualities of a test
Directly related to educational objectives
Realistic & practical
Concerned with important & useful matters
Comprehensive but brief
Precise & clear
QUALITIES OF A GOOD MEASURING INSTRUMENT
Validity means the degree to which a test or measuring instrument measures what it intends to measure. The validity of a measuring instrument has to do with its soundness: what the test measures, its effectiveness, and how well it can be applied.
For instance, to judge the validity of a performance test, it is necessary to consider what kind of performance the test is supposed to measure and how well it does so.
VALIDITY
Denotes the extent to which an instrument is measuring what it is supposed to measure.
Criterion-Related Validity
A method for assessing the validity of an instrument by comparing its scores with another criterion known already to be a measure of the same trait or skill. Criterion-related validity is usually expressed as a correlation between the test in question and the criterion measure. The correlation coefficient is referred to as a validity coefficient.
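For illustration, a minimal Python sketch (hypothetical data) of how such a validity coefficient could be obtained:

```python
# A minimal sketch (hypothetical data): a validity coefficient is simply the
# Pearson correlation between scores on the test in question and scores on a
# criterion measure already known to assess the same trait or skill.
from statistics import correlation   # Pearson's r (Python 3.10+)

# Illustrative scores for ten students: the new test vs. an accepted criterion.
new_test  = [55, 62, 70, 48, 81, 90, 66, 73, 58, 85]
criterion = [52, 60, 74, 50, 79, 88, 63, 70, 61, 84]

validity_coefficient = correlation(new_test, criterion)
print(round(validity_coefficient, 2))  # a high value supports criterion-related validity
```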
Types of Validity
Content Validity. Content validity means the extent to which the content or topic of the test is truly representative of the course. It involves, essentially, the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured. It is very important that the behaviour domain to be tested be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions. The domain under consideration should be fully described in advance rather than defined after the test has been prepared.
CONTENT VALIDITY
Content validity is described by the relevance of a test to different types of criteria, such as thorough judgment and systematic examination of relevant course syllabi and textbooks, pooled judgment of subject matter experts, statements of behavioural objectives, and analysis of teacher-made test questions, among others. Thus content validity depends on the relevance of the individuals' responses to the behaviour area under consideration, rather than on the apparent relevance of item content.
Content validity
Content validity is commonly used in evaluating achievement tests. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. The Taxonomy of Educational Objectives by Bloom is of great help in listing the objectives to be covered in an achievement test.
Content validity is particularly appropriate for criterion-referenced measures. It is also applicable to certain occupational tests designed to select and classify employees. But content validity is inappropriate for aptitude and personality tests.
CONTENT VALIDITY
Whether the individual items of a test represent what you actually want to assess.
ILLUSTRATION
For instance, a teacher wishes to validate a test in Mathematics. He requests experts in Mathematics to judge whether the test items or questions measure the knowledge, skills, and values they are supposed to measure. Another way of testing validity is for the teacher to check whether the test items or questions represent the knowledge, skills and values suggested in the Mathematics course content.
Good and Scates (1972) suggested the following evidence on the validity of a test or questionnaire:
1. Is the question on the subject? Yes ___ No ___
2. Is the question perfectly clear and unambiguous? Yes ___ No ___
3. Does the question get at something stable which is typical of the individual or of the situation? Yes ___ No ___
4. Does the question pull? Yes ___ No ___
5. Do the responses show a reasonable range of variation? Yes ___ No ___
6. Is the information obtained consistent? Yes ___ No ___
7. Is the item sufficiently inclusive? Yes ___ No ___
8. Is there a possibility of using an external criterion to evaluate the test/questionnaire? Yes ___ No ___
CONCURRENT VALIDITY
Concurrent validity is the degree to which the test agrees or correlates with a criterion set up as an acceptable measure. The criterion is always available at the time of testing. It is applicable to tests employed for the diagnosis of existing status rather than for the prediction of future outcomes.
CONCURRENT VALIDITY
The extent to which a procedure correlates with the current behavior of subjects.
ILLUSTRATION
For example, a teacher wishes to validate a Science achievement test he has constructed. He administers the test to a group of Science students. The results of the test are correlated with an acceptable Science test that has previously been proven valid. If the correlation is high, the Science test he has constructed is valid.
PREDICTIVE VALIDITY
Predictive validity is determined by showing how well predictions made from the test are confirmed by evidence gathered at some subsequent time. The criterion measure for this type of validity is important because the future outcome of the subject is being predicted.
PREDICTIVE VALIDITY
The extent to which a procedure allows accurate predictions about a subject's future behavior.
ILLUSTRATION
For instance, the teacher wants to estimate how well a student may be able to do in graduate school courses on the basis of how well he has done on tests he took in his undergraduate courses. The criterion measures against which the test scores are validated become available only after a long interval of time.
CONSTRUCT VALIDITY
Construct validity of a test is the extent to which the test measures a theoretical trait. This involves such tests as those of understanding, appreciation and interpretation of data. Examples are intelligence and mechanical aptitude tests.
CONSTRUCT VALIDITY
The extent to which a test measures a
theoretical construct or attribute.

CONSTRUCT
Abstract concepts such as intelligence,
self-concept, motivation, aggression and
creativity that can be observed by some
type of instrument.
ILLUSTRATION
For example, a teacher wishes to establish the validity of an IQ test using the Culture Fair Intelligence Test. He hypothesizes that students with high IQ also have high achievement and those with low IQ, low achievement. He therefore administers both the Culture Fair Intelligence Test and an achievement test to groups of students. If students with high IQ have high scores in the achievement test and those with low IQ have low scores in the achievement test, the test is valid.
A test's construct validity is often assessed by its convergent and discriminant validity.
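For illustration, a minimal Python sketch of such a convergent/discriminant check (all scores and measure names below are hypothetical):

```python
# A minimal sketch (hypothetical data and measure names): construct validity is
# supported when the new instrument correlates strongly with a measure of a
# related construct (convergent) and weakly with an unrelated one (discriminant).
from statistics import correlation   # Pearson's r (Python 3.10+)

new_iq_test      = [95, 110, 102, 88, 120, 131, 99, 115]
achievement_test = [70, 82, 78, 65, 90, 96, 74, 85]    # related construct
shoe_size        = [42, 40, 42, 40, 40, 42, 42, 40]    # unrelated variable

convergent   = correlation(new_iq_test, achievement_test)  # expect high
discriminant = correlation(new_iq_test, shoe_size)         # expect near zero

print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```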
FACTORS AFFECTING VALIDITY
1. Test-related factors
2. The criterion to which you
compare your instrument may
not be well enough established
3. Intervening events
4. Reliability
RELIABILITY
Reliability means the extent to which a test is dependable, self-consistent and stable. In other words, the test agrees with itself. It is concerned with the consistency of responses from moment to moment: if a person takes the same test twice, the test yields the same results. However, a reliable test may not always be valid.
RELIABILITY
The consistency of measurements

A RELIABLE TEST
Produces similar scores across various
conditions and situations, including
different evaluators and testing
environments.
How do we account for an individual
who does not get exactly the same
test score every time he or she takes
the test?
1. Test-taker's temporary psychological or
physical state
2. Environmental factors
3. Test form
4. Multiple raters
RELIABILITY COEFFICIENTS
The statistic for expressing reliability.
Expresses the degree of consistency in the measurement of test scores.
Denoted by the letter r with two identical subscripts (rxx).
RELIABILITY
For instance, Student C took a Chemistry test twice. His answer to item 5, "What is the neutral pH?", was 6.0. In the second administration of the same test and question, his answer was still 6.0; thus, his response is reliable but not valid. It is reliable because of the consistency of his responses (6.0), but it is not valid because the answer is incorrect: the correct answer is pH 7.0. Hence, a reliable test may not always be valid.

METHODS IN TESTING THE RELIABILITY OF A GOOD MEASURING INSTRUMENT
TEST-RETEST METHOD. The same measuring instrument is administered twice to the same group of students and the correlation coefficient is determined. The limitations of this method are: (1) when the time interval is short, the respondents may recall their previous responses, and this tends to make the correlation coefficient high; (2) when the time interval is long, such factors as unlearning and forgetting, among others, may occur and may result in a low correlation for the measuring instrument; and (3) regardless of the time interval separating the two administrations, other varying environmental conditions such as noise, temperature, lighting, and other factors may affect the correlation coefficient of the measuring instrument.
TEST-RETEST RELIABILITY
Suggests that subjects tend to obtain the same score when tested at different times.
The Spearman Rank Correlation Coefficient
Spearman rho is a statistical tool used to measure the relationship between paired ranks assigned to individual scores on two variables, X and Y. Thus, it is used to correlate the scores in a test-retest method.
Spearman rho formula:
rs = 1 − (6ΣD²) / (N³ − N)
where
rs = Spearman rho
ΣD² = sum of the squared differences between ranks
N = total number of cases
Example: Spearman rho computation for the first (X) and second (Y) administrations of an achievement test in English (artificial data)

Student    X     Y     Rx      Ry      D       D²
1          90    70     2.0     7.5   -5.5    30.25
2          43    31    13.0    12.5    0.5     0.25
3          84    79     6.5     3.0    3.5    12.25
4          86    70     4.5     7.5   -3.0     9.00
5          55    43    11.0    10.5    0.5     0.25
6          77    70     8.5     7.5    1.0     1.00
7          84    75     6.5     4.5    2.0     4.00
8          91    88     1.0     1.0    0.0     0.00
9          40    31    14.0    12.5    1.5     2.25
10         75    70    10.0     7.5    2.5     6.25
11         86    80     4.5     2.0    2.5     6.25
12         89    75     3.0     4.5   -1.5     2.25
13         48    30    12.0    14.0   -2.0     4.00
14         77    43     8.5    10.5   -2.0     4.00
TOTAL                                 ΣD² =   82.00
SPEARMAN RHO VALUE
rs = 1 − (6ΣD²) / (N³ − N)
   = 1 − 6(82) / (14³ − 14)
   = 1 − 492 / (2744 − 14)
   = 1 − 492 / 2730
   = 1 − 0.18021978
   = 0.82 (high relationship)
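The same computation can be scripted. The sketch below applies the simplified formula above, giving tied scores the average of their ranks, and reuses the X and Y scores from the worked example (function names are illustrative):

```python
# A minimal sketch of the simplified Spearman rho formula used above,
# rs = 1 - 6*sum(D^2) / (N^3 - N), with tied scores given the average of
# their ranks. The lists x and y are the first- and second-administration
# scores from the worked example (artificial data).

def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest, averaging tied ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2 + 1            # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def spearman_rho(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - (6 * d_squared) / (n ** 3 - n)

x = [90, 43, 84, 86, 55, 77, 84, 91, 40, 75, 86, 89, 48, 77]  # first administration
y = [70, 31, 79, 70, 43, 70, 75, 88, 31, 70, 80, 75, 30, 43]  # second administration
print(round(spearman_rho(x, y), 2))  # 0.82 (high relationship)
```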
PARALLEL-FORMS METHOD
Parallel-forms method. Parallel or equivalent forms of a test may be administered to the group of students, and the paired observations correlated. In estimating reliability by the administration of parallel or equivalent forms of a test, criteria of parallelism are required (Ferguson and Takane, 1989). The two forms of the test must be constructed so that the content, type of item, difficulty, instructions for administration, and many other features, are similar but not identical.
ALTERNATE FORMS
RELIABILITY
Also known as equivalent forms reliability
or parallel forms reliability
Obtained by administering two equivalent
tests to the same group of examinees
Items are matched for difficulty on each
test
It is necessary that the time frame between
giving the two forms be as short as
possible
Split-Half Reliability
Sometimes referred to as internal consistency.
Indicates that subjects' scores on some trials consistently match their scores on other trials.
For instance, a test is administered to students as a pilot sample to test the reliability coefficient of the odd and even items.

Students   X (odd)   Y (even)   Rx      Ry      D       D²
1          23        30          9       7.5     1.5    2.25
2          25        24          7.5     9.5    -2.0    4.00
3          27        30          6       7.5    -1.5    2.25
4          35        40          5       5       0.0    0.00
5          48        55          3       2.5     0.5    0.25
6          21        24         10       9.5     0.5    0.25
7          25        35          7.5     6.0     1.5    2.25
8          50        51          2       4.0    -2.0    4.00
9          38        60          4       1       3.0    9.00
10         55        55          1.0     2.5    -1.5    2.25
Total ΣD² = 26.50
rht = 1 − 6(26.50) / (10³ − 10) = .84 (Spearman rho between the odd-half and even-half scores)
rwt = 2rht / (1 + rht) = .91 (very high reliability; whole-test estimate from the Spearman-Brown formula, sketched below)
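A minimal Python sketch of this step-up from the half-test coefficient to the whole-test estimate (the Spearman-Brown formula assumed here is the standard one; 0.84 is the half-test value from the example above):

```python
# A minimal sketch of the step from the half-test coefficient to the whole-test
# estimate. The step-up assumed here is the standard Spearman-Brown formula
# rwt = 2*rht / (1 + rht); the value 0.84 is the half-test rho from the example above.

def spearman_brown(r_half):
    """Estimate whole-test reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)

r_half_test  = 0.84                    # correlation of odd-item vs. even-item scores
r_whole_test = spearman_brown(r_half_test)
print(round(r_whole_test, 2))          # 0.91 -> very high reliability
```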


Internal Consistency Method
This method is used with psychological tests which consist of dichotomously scored items: the examinee either passes or fails an item. A rating of 1 is assigned for a pass and 0 (zero) for a failure. The coefficient is obtained with the Kuder-Richardson Formula 20 (KR-20), which is a measure of the internal consistency or homogeneity of a measuring instrument.
The formula is
rxx = [N / (N − 1)] × [(SD² − Σpiqi) / SD²]
where N = number of items, pi = proportion of examinees passing item i, qi = 1 − pi, SD² = Σ(X − X̄)² / (n − 1) is the variance of the total scores, and X̄ = ΣX / n.
COMPUTATION OF KUDER-RICHARDSON FORMULA 20

Item   Students 1-14 (1 = pass, 0 = fail)       f     pi     qi     piqi
1      1 1 1 1 1 1 1 1 1 1 1 1 0 0             12    .86    .14    .1204
2      1 1 1 1 1 1 1 1 1 1 1 1 0 0             12    .86    .14    .1204
3      1 1 1 1 1 1 1 1 1 1 1 0 0 0             11    .79    .21    .1659
4      1 1 1 1 1 1 1 1 1 1 0 0 0 0             10    .71    .29    .2059
5      1 1 1 1 1 1 1 1 1 1 0 0 0 0             10    .71    .29    .2059
6      1 1 1 1 1 1 1 1 1 1 0 0 0 0             10    .71    .29    .2059
7      1 1 1 1 1 1 1 1 1 0 0 0 0 0              9    .64    .36    .2304
8      0 1 1 1 1 1 1 1 1 0 1 0 0 0              8    .57    .43    .2451
9      0 1 1 1 1 1 1 1 1 0 0 0 0 0              8    .57    .43    .2451
10     0 1 0 0 1 1 0 1 0 0 0 0 0 0              4    .29    .71    .2059
rxx = .79 (high relationship)
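A minimal Python sketch of the KR-20 computation (the 0/1 score matrix below is hypothetical and is not the table above):

```python
# A minimal sketch of KR-20 on a small hypothetical 0/1 score matrix
# (rows = examinees, columns = items; 1 = pass, 0 = fail). It applies
# rxx = (k / (k - 1)) * (1 - sum(p*q) / SD^2), with SD^2 the variance of
# the examinees' total scores (n - 1 denominator, as in the formula above).

def kr20(score_matrix):
    n = len(score_matrix)                  # number of examinees
    k = len(score_matrix[0])               # number of items
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n
    sd_squared = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
    sum_pq = 0.0
    for item in range(k):
        p = sum(row[item] for row in score_matrix) / n   # proportion passing the item
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / sd_squared)

scores = [            # hypothetical responses of six examinees to five items
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(round(kr20(scores), 2))   # about 0.90 for this illustrative matrix
```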
INTERRATER RELIABILITY
Involves having two raters independently
observe and record specified behaviors,
such as hitting, crying, yelling, and getting
out of the seat, during the same time period

TARGET BEHAVIOR
A specific behavior the observer is
looking to record
Interpretation of Correlation Coefficient Values
An r from +0.00 to +0.20 denotes negligible correlation
An r from +0.21 to +0.40 denotes low correlation
An r from +0.41 to +0.70 denotes marked or moderate correlation
An r from +0.71 to +0.90 denotes high correlation
An r from +0.91 to +0.99 denotes very high correlation
An r of +1.00 denotes perfect correlation
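For illustration, a small Python helper that applies these verbal labels (the thresholds are copied from the list above; the function name is illustrative):

```python
# A small helper that applies the verbal labels above (thresholds copied
# from the list; the absolute value of r is used, so a negative r is treated alike).

def interpret_r(r):
    r = abs(r)
    if r <= 0.20:
        return "negligible correlation"
    if r <= 0.40:
        return "low correlation"
    if r <= 0.70:
        return "marked or moderate correlation"
    if r <= 0.90:
        return "high correlation"
    if r < 1.00:
        return "very high correlation"
    return "perfect correlation"

print(interpret_r(0.82))   # high correlation
```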
OBTAINED SCORE
The score you get when you administer a test
Consists of two parts: the true score and the
error score

STANDARD ERROR of
MEASUREMENT (SEM)
Gives the margin of error that you should expect in an individual test score because of the imperfect reliability of the test.
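A minimal sketch, assuming the standard formula SEM = SD × √(1 − rxx); the SD and reliability values below are illustrative:

```python
# A minimal sketch, assuming the standard formula SEM = SD * sqrt(1 - rxx),
# where SD is the standard deviation of the test scores and rxx the
# reliability coefficient. The values below are illustrative only.
import math

def standard_error_of_measurement(sd, reliability):
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=10, reliability=0.91)
print(round(sem, 1))   # 3.0 -> an obtained score of 75 suggests a true score of roughly 75 +/- 3
```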
Evaluating the Reliability Coefficients

The test manual should indicate why a


certain type of reliability coefficient was
reported.
The manual should indicate the conditions
under which the data were obtained
The manual should indicate the important
characteristics of the group used in
gathering reliability information
FACTORS AFFECTING
RELIABILITY
1. Test length
2. Test-retest interval
3. Variability of scores
4. Guessing
5. Variation within the test situation
Test reliability can be improved by the following factors:
INCREASED NUMBER OF TEST ITEMS (see the sketch below).
HETEROGENEITY OF THE LEARNER GROUP.
MODERATE ITEM DIFFICULTY.
OBJECTIVE SCORING.
LIMITED TIME.
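As an illustration of the first factor, the standard Spearman-Brown prophecy formula estimates how reliability changes when the number of items is multiplied; the values in the sketch are illustrative:

```python
# An illustration of the first factor above: the standard Spearman-Brown
# prophecy formula estimates the reliability of a test whose length is
# multiplied by a factor k, r_new = k*r / (1 + (k - 1)*r). Values are illustrative.

def spearman_brown_prophecy(r, k):
    """Predicted reliability when the number of items is multiplied by k."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test whose current reliability is 0.70:
print(round(spearman_brown_prophecy(0.70, 2), 2))   # 0.82
```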
USABILITY
Usability means the degree to which the measuring instrument can be
satisfactorily used by teachers, researchers, supervisors and school managers
without undue expenditure of time, money, and effort. In other words,
usability means practicality.
Factors that determine usability
Ease of administration
Ease of scoring
Construction of the test in objective type
Answer keys are adequately prepared
Scoring directions are fully understood
Ease of interpretation and application
Low cost
Proper mechanical makeup
Advantages and disadvantages of different types of tests

1- Oral examinations:

Advantages
1. Provide direct personal contact with candidates.
2. Provide opportunity to take mitigating circumstances into
account.
3. Provide flexibility in moving from candidate's strong points to
weak areas.
4. Require the candidate to formulate his own replies without
cues.
5. Provide opportunity to question the candidate about how he
arrived at an answer.
6. Provide opportunity for simultaneous assessment by two
examiners.
1- Oral examinations
Disadvantages
1. Lack standardization.
2. Lack objectivity and reproducibility of results.
3. Permit favoritism and possible abuse of the
personal contact.
4. Suffer from undue influence of irrelevant factors.
5. Suffer from shortage of trained examiners to
administer the examination.
6. Are excessively costly in terms of professional time
in relation to the limited value of the information
they yield.
2- Practical examinations
Advantages
1. Provide opportunity to test, in a realistic setting, skills
involving all the senses while the examiner observes and
checks performance.
2. Provide opportunity to confront the candidate with
problems he has not met before both in the laboratory
and at the bedside, to test his investigative ability as
opposed to his ability to apply ready-made "recipes".
3. Provide opportunity to observe and test attitudes and
responsiveness to a complex situation (videotape
recording).
4. Provide opportunity to test the ability to communicate
under Pressure, to discriminate between important and
trivial issues, to arrange the data in a final form.
2- Practical examinations
Disadvantages
1. Lack standardized conditions in laboratory
experiments using animals, in surveys in the
community or in bedside examinations with patients
of varying degrees of cooperativeness.
2. Lack objectivity and suffer from intrusion of
irrelevant factors.
3. Are of limited feasibility for large groups.
4. Entail difficulties in arranging for examiners to
observe candidates demonstrating the skills to be
tested.
3- Essay examinations
Advantages
1. Provide the candidate with the opportunity to
demonstrate his knowledge and his ability to
organize ideas and express them effectively

Disadvantages
1. Limit severely the area of the student's total work
that can be sampled.
2. Lack objectivity.
3. Provide little useful feedback.
4. Take a long time to score
4- Multiple-choice questions

Advantages
1. Ensure objectivity, reliability and validity; preparation of
questions with colleagues provides constructive criticism.
2. Increase significantly the range and variety of facts that can
be sampled in a given time.
3. Provide precise and unambiguous measurement of the
higher intellectual processes.
4. Provide detailed feedback for both students and teachers.
5. Are easy and rapid to score.
4- Multiple-choice questions
Disadvantages
1. Take a long time to construct in order to avoid arbitrary and ambiguous
questions.
2. Also require careful preparation to avoid preponderance of questions testing
only recall.
3. Provide cues that do not exist in practice.
4. Are "costly" where number of students is small.