
TOPIC 1

OVERVIEW OF ASSESSMENT:
CONTEXT, ISSUES AND TRENDS


1.0 SYNOPSIS

Topic 1 provides you with some meanings of test, measurement, evaluation
and assessment, some basic historical development in language
assessment, and the changing trends of language assessment in the
Malaysian context.

1.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

1. define and explain the important terms of test, measurement,
evaluation, and assessment;

2. examine the historical development in Language Assessment;


3. describe the changing trends in Language Assessment in the
Malaysian context and discuss the contributing factors.


1.2 FRAMEWORK OF TOPICS

Overview of Assessment: Context, Issues & Trends
   - Definitions
   - Differences of various tests
   - Purposes


CONTENT

SESSION ONE (3 hours)

1.3 INTRODUCTION

Assessment and examinations are viewed as highly important in most Asian
countries such as Malaysia. Language tests and assessment have also
become a prevalent part of our education system. Often, public examination
results are taken as important national measures of school accountability.
While schools are ranked and classified according to their students'
performance in major public examinations, scores from language tests are
used to infer individuals' language ability and to inform decisions we make
about those individuals.
In this topic, let us discuss the concept of measurement and its
numerous definitions. We will also look into the historical development of
language assessment and the changing trends of language assessment
in our country.


1.4 DEFINITION OF TERMS: test, measurement, evaluation, and assessment

1.4.1 Test

The four terms above are frequently used interchangeably in academic
discussions. A test is a subset of assessment intended to measure a
test-taker's language proficiency, knowledge, performance or skills. Testing
is one type of assessment technique. It is a systematically prepared
procedure that happens at a point in time when a test-taker gathers all
his or her abilities to achieve ultimate performance, knowing that his or
her responses are being evaluated and measured. A test is first a method
of measuring a test-taker's ability, knowledge or performance in a given
area; and second, it must measure.
Bachman (1990), who was also quoted by Brown, defined a test as a
process of quantifying a test-taker's performance according to explicit
procedures or rules.

1.4.2 Assessment

Assessment is often a misunderstood term. Assessment is
"a comprehensive process of planning, collecting, analysing, reporting, and
using information on students over time" (Gottlieb, 2006, p. 86). Mousavi
(2009) is of the opinion that assessment is appraising or estimating the level
or magnitude of some attribute of a person. Assessment is an important
aspect in the fields of language testing and educational measurement and,
perhaps, the most challenging part of it. It is an ongoing process in
educational practice, which involves a multitude of methodological
techniques. It can consist of tests, projects, portfolios, anecdotal information
and student self-reflection. Assessment may be formal or informal,
subconscious or conscious, and incidental or intended on the part of the
appraiser.


1.4.3 Evaluation

Evaluation is another confusing term. Many are confused between
evaluation and testing. Evaluation does not necessarily entail testing. In
reality, evaluation is involved when the results of a test (or other assessment
procedure) are used for decision-making (Bachman, 1990, pp. 22-23).
Evaluation involves the interpretation of information. If a teacher simply
records numbers or makes check marks on a chart, it does not constitute
evaluation. When a tester or marker evaluates, s/he values the results in
such a way that the worth of the performance is conveyed to the test-taker.
This is usually done with some reference to the consequences, either good
or bad, of the performance. This is commonly practised in applied linguistics
research, where the focus is often on describing processes, individuals, and
groups, and the relationships among language use, the language use
situation, and language ability.

Test scores are an example of measurement, and conveying the
meaning of those scores is evaluation. However, evaluation can occur
without measurement. For example, if a teacher appraises a student's
correct oral response with words like "Excellent insight, Lilly!", it is evaluation.

1.4.4 Measurement

Measurement is the assigning of numbers to certain attributes of
objects, events, or people according to a rule-governed system. For our
purposes of language testing, we will limit the discussion to unobservable
abilities or attributes, sometimes referred to as traits, such as grammatical
knowledge, strategic competence or language aptitude. Similar to other
types of assessment, measurement must be conducted according to explicit
rules and procedures as spelled out in test specifications, criteria, and
procedures for scoring. Measurement could be interpreted as the process of
quantifying the observed performance of classroom learners. Bachman
(1990) cautioned us to distinguish between quantitative and qualitative
descriptions. Simply put, the former involves assigning numbers (including
rankings and letter grades) to observed performance, while the latter
consists of written descriptions, oral feedback, and non-quantifiable reports.


The relationships among test, measurement,
assessment, and their uses are illustrated in
Figure 1.



Figure 1: The relationship between tests, measurement and assessment.
(Source: Bachman, 1990)

2.0 Historical development in language assessment

From the mid-1960s through the 1970s, language testing practice, as
reflected in large-scale institutional language testing and in most language
testing textbooks of the time, was informed essentially by a theoretical view
of language ability as consisting of skills (listening, speaking, reading and
writing) and components (e.g. grammar, vocabulary, pronunciation),
and by an approach to test design that focused on testing isolated
discrete points of language, while the primary concern was with
psychometric reliability (e.g. Lado, 1961; Carroll, 1968). Language
testing research was dominated largely by the hypothesis that language
proficiency consisted of a single unitary trait, and by a quantitative,
statistical research methodology (Oller, 1979).

The 1980s saw other areas of expansion in language testing, most
importantly, perhaps, in the influence of second language acquisition
(SLA) research, which spurred language testers to investigate not only
the effects of a wide variety of factors such as field
independence/dependence (e.g. Stansfield and Hansen, 1983; Hansen,
1984; Chapelle, 1988), academic discipline and background knowledge
(e.g. Erickson and Molloy, 1983; Alderson and Urquhart, 1985; Hale, 1988)
and discourse domains (Douglas and Selinker, 1985) on language test
performance, but also the strategies involved in the process of
test-taking itself (e.g. Grotjahn, 1986; Cohen, 1987).

If the 1980s saw a broadening of the issues and concerns of language
testing into other areas of applied linguistics, the 1990s saw a continuation
of this trend. In this decade the field also witnessed expansions in a number
of areas:
a) research methodology;
b) practical advances;
c) factors that affect performance on language tests;
d) authentic, or performance, assessments; and
e) concerns with the ethics of language testing and
professionalising the field.
The beginning of the new millennium is another exciting time for
anyone interested in language testing and assessment research. Current
developments in the fields of applied linguistics, language learning and
pedagogy, technological innovation, and educational measurement have
opened up some rich new research avenues.

3.0 Changing trends in Language Assessment: Malaysian context

History has clearly shown that teaching and assessment should be
intertwined in education. Assessment and examinations are viewed as highly
important in Malaysia. One does not need to look very far to see how
important testing and assessment have become in our education system.
Often, public examination results are taken as important national measures
of school accountability. Schools are ranked and classified according to their
students' performance in major public examinations. Just as assessment
impacts student learning and motivation, it also influences the nature of
instruction in the classroom. There has been considerable recent literature
that has promoted assessment as something that is integrated with
instruction, and not an activity that merely audits learning (Shepard, 2000).
When assessment is integrated with instruction, it informs teachers about
what activities and assignments will be most useful, what level of teaching
is most appropriate, and how summative assessments provide diagnostic
information.

With this in mind, we have to look at the changing trends in
assessment, particularly language assessment, in this country, which has
been carried out mainly through the examination system until recent
years. Starting from the year 1845, written tests in schools were introduced
for a number of subjects. This trend in assessment continued with the intent
to gauge the effectiveness of the teaching-learning process. In Malaysia,
the development of formal evaluation and testing in education began after
Independence. Public examinations have long been the only measurement
of students' achievement. Figure 2 shows the five stages/phases of
development of the examination system in our country. The stages are as
follows:
Pre-Independence
Razak Report
Rahman Talib Report
Cabinet Report
Malaysia Education Blueprint (2013-2025)
On 3rd May 1956, the Examination Unit (later known as the Examination
Syndicate) in the Ministry of Education (MOE) was formed on the
recommendation of the Razak Report (1956). The main objective of the
Malaysia Examination Syndicate (MES) was to fulfil one of the Razak
Report's recommendations, which was to establish a common examination
system for all the schools in the country.

In line with the ongoing transformation of the national educational
system, the current scenario is gradually changing. A new evaluation system
known as School-Based Assessment (SBA) was introduced in 2002 as a
move away from traditional teaching, to keep abreast with changing trends
of assessment and to gauge the competence of students by taking into
consideration both academic and extracurricular achievements.

According to the Malaysian Ministry of Education (MOE), the new
assessment system aims to promote a combination of centralised and
school-based assessment. The Malaysian Teacher Education Division (TED) is
entrusted by the Ministry of Education to formulate policies and guidelines to
prepare teachers for the new implementation of assessment. As emphasised
in the innovation of the student assessment, continuous school-based
assessment is administered at all grades and all levels. Additionally,
students sit for common public examinations at the end of each level. It is
also a fact that the role of teachers in the new assessment system is vital.
Teachers will be empowered to assess their students.

The Malaysia Education Blueprint was launched in September this
year, and with it, a three-wave initiative to revamp the education system over
the next 12 years. One of its main focuses is to overhaul the national
curriculum and examination system, widely seen as heavily content-based
and un-holistic. It is a timely move, given our poor results at the 2009
Programme for International Student Assessment (PISA) tests. Based on the
2009 assessment, Malaysia lags far behind regional peers like Singapore,
Japan, South Korea, and Hong Kong in every category.
Poor performance in PISA is normally linked to students not being
able to demonstrate higher-order thinking skills. To remedy this, the Ministry
of Education has started to implement numerous changes to the
examination system. Two out of the three nationwide examinations that we
currently administer to primary and secondary students have gradually
seen major changes. Generally, the policies are ideal and impressive, but
there are still a few questions on feasibility that have been raised by
concerned parties. Figure 2 below shows the development of educational
evaluation in Malaysia since pre-independence until today.

Pre-Independence
   Examinations were conducted according to the needs of schools or based
   on overseas examinations such as the Overseas School Certificate.

Implementation of the Razak Report (1956)
   The Razak Report gave birth to the National Education Policy and the
   creation of the Examination Syndicate (LP). LP conducted examinations
   such as the Cambridge and Malayan Secondary School Entrance
   Examination (MSSEE), and Lower Certificate of Education (LCE)
   Examination.

Implementation of the Rahman Talib Report (1960)
   The Rahman Talib Report recommended the following actions:
   1. Extend schooling age to 15 years old.
   2. Automatic promotion to higher classes.
   3. Multi-stream education (Aneka Jurusan).
   The following changes in examination were made:
   - The entry of elective subjects in LCE and SRP.
   - The introduction of the Standard 5 Evaluation Examination.
   - The introduction of Malaysia's Vocational Education Examination.
   - The introduction of the Standard 3 Diagnostic

Implementation of the Cabinet Report (1979)
   The implementation of the Cabinet Report resulted in the evolution of
   the education system to its present state, especially with KBSR and
   KBSM. Adjustments were made in examinations to fulfil the new
   curriculum's needs and to ensure they are in line with the National
   Education Philosophy.

Implementation of the Malaysia Education Blueprint (2013-2025)
   The emphasis is on School-Based Assessment (SBA). It was first
   introduced in 2002. It is a new system of assessment and is one of the
   new areas where teachers are directly involved. The national
   examinations and school-based assessments are being revamped in
   stages, whereby by 2016 at least 40% of questions in Ujian Penilaian
   Sekolah Rendah (UPSR) and 50% in Sijil Pelajaran Malaysia (SPM) are
   higher-order thinking skills questions.

Figure 2: The development of educational evaluation in Malaysia
Source: Malaysia Examination Board (MES)
http://apps.emoe.gov.my/1pm/maklumatam.htm

By and large, the role of MES is to complement and complete the
implementation of the national education policy. Among its achievements
are:

- Implementation of Malay Language as the National Language (1960)
- Pioneering the use of computer in the country (1967)
- Taking over the work of the Cambridge Examination Syndicate
- Putting in place an examination system to meet national needs
- Recognition of Examination certificates
- Implementation of the Open Certificate Syndicate

Figure 3: The achievements of Malaysia Examination Syndicate (MES)
Source: Malaysia Examination Board (MES)
http://apps.emoe.gov.my/1pm/maklumatam.htm


Exercise
Describe the stages involved in the development of
educational evaluation in Malaysia.

Read more:
http://www.nst.com.my/nation/general/schoolbased-assessment-plan-may-need-tweaking1.166386


Tutorial question
Examine the contributing factors to the changing trends of language
assessment.
Create and present findings using graphic organisers.


TOPIC 2

ROLE AND PURPOSES OF ASSESSMENT IN
TEACHING AND LEARNING

2.0 SYNOPSIS

Topic 2 gives you an insight into the reasons/purposes of assessment. It
also looks at the different types of assessment and the classification of
tests according to their purpose.


2.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

4. explain the reasons/purposes of assessment;
5. distinguish the differences between assessment of learning
and assessment for learning;

6. name and differentiate the different test types.


2.2 FRAMEWORK OF TOPICS

Role and Purposes of Assessment in Teaching and Learning
   - Reasons / Purposes of Assessment
   - Assessment of Learning / Assessment for Learning
   - Types of Tests: Proficiency, Achievement, Diagnostic, Aptitude,
     and Placement Tests


CONTENT

SESSION TWO (3 hours)

2.3 Reasons/Purpose of Assessment

Critical to educators is the use of assessment to both inform and guide
instruction. Using a wide variety of assessment tools allows a teacher to
determine which instructional strategies are effective and which need to be
modified. In this way, assessment can be used to improve classroom
practice, plan curriculum, and research one's own teaching practice. Of
course, assessment will always be used to provide information to children,
parents, and administrators. In the past, this information was primarily
expressed by a "grade". Increasingly, this information is being seen as a
vehicle to empower students to be self-reflective learners who monitor and
evaluate their own progress as they develop the capacity to be self-directed
learners. In addition to informing instruction and developing learners with the
ability to guide their own instruction, assessment data can be used by a
school district to measure student achievement, examine the opportunity for
children to learn, and provide the basis for the evaluation of the district's
language programmes.
Assessment instruments, whether formal tests or informal
assessments, serve multiple purposes. Commercially designed and
administered tests may be used for measuring proficiency, placing students
into one of several levels of a course, or diagnosing students' strengths and
weaknesses according to specific linguistic categories, among other
purposes. Classroom-based teacher-made tests might be used to diagnose
difficulty or measure achievement in a given unit of a course. Specifying the
purpose of an assessment instrument and stating its objectives is an
essential first step in choosing, designing, revising, or adapting the
procedure an educator will finally use.
We need to rethink the role of assessment in effective schools, where
effective means maximising learning for the most students. What uses of
assessment are most likely to maximise student learning and well-being?
How best can we use assessment in the service of student learning and
well-being? We have a traditional answer to these questions. Our traditional
answer says that to maximise student learning we need to develop rigorous
standardised tests given once a year to all students at approximately the
same time. Then, the results are used for accountability, identifying schools
for additional assistance, and certifying the extent to which individual
students are meeting competency.

Let us take a closer look at the two types of assessment below, i.e.
Assessment of Learning and Assessment for Learning.


2.4 Assessment of Learning

Assessment of learning is the use of a task or an activity to
measure, record, and report on a student's level of achievement with
regard to specific learning expectations.
This traditional way of using assessment in the service of student
learning is assessment of learning - assessments that take place at a point
in time for the purpose of summarising the current status of student
achievement. This type of assessment is also known as summative
assessment.
This summative assessment, the logic goes, will provide the focus to
improve student achievement, give everyone the information they need to
improve student achievement, and apply the pressure needed to motivate
teachers to work harder to teach and learn.

2.5 Assessment for learning

Now compare this to assessment for learning. Assessment for
learning is roughly equivalent to formative assessment - assessment
intended to promote further improvement of student learning during the
learning process.
Assessment for learning is more commonly known as formative and
diagnostic assessments. Assessment for learning is the use of a task or an
activity for the purpose of determining student progress during a unit or
block of instruction. Teachers are now afforded the chance to adjust
classroom instruction based upon the needs of the students. Similarly,
students are provided valuable feedback on their own learning.



Formative assessment is not a new idea to us as educators. However, during
the past several years there has been literally an explosion of applications
linked to sound research. In this evolving conception, formative assessment
is more than testing frequently, although frequent information is important.
Formative assessment also involves actually adjusting teaching to take
account of these frequent assessment results. Nonetheless, formative
assessment is even more than using information to plan next steps.
Formative assessment seems to be most effective when students are
involved in their own assessment and goal setting.

2.6 Types of tests

The most common use of language tests is to identify strengths and
weaknesses in students' abilities. For example, through testing we can
discover that a student has excellent oral abilities but a relatively low level of
reading comprehension. Information gleaned from tests also assists us in
deciding who should be allowed to participate in a particular course or
programme area. Another common use of tests is to provide information
about the effectiveness of programmes of instruction.
Henning (1987) identifies six kinds of information that tests provide about
students. They are:

o Diagnosis and feedback
o Screening and selection
o Placement
o Program evaluation
o Providing research criteria
o Assessment of attitudes and socio-psychological differences

Alderson, Clapham and Wall (1995) have a different classification
scheme. They sort tests into these broad categories: proficiency,
achievement, diagnostic, progress, and placement. Brown (2010), however,
categorised tests according to their purpose, namely achievement tests,
diagnostic tests, placement tests, proficiency tests, and aptitude tests.

Proficiency Tests

Proficiency tests are not based on a particular curriculum or language
programme. They are designed to assess the overall language ability of
students at varying levels. They may also tell us how capable a person is
in a particular language skill area. Their purpose is to describe what
students are capable of doing in a language.

Proficiency tests are usually developed by external bodies such as
examination boards like Educational Testing Service (ETS) or Cambridge
ESOL. Some proficiency tests have been standardised for international use,
such as the American TOEFL test, which is used to measure the English
language proficiency of foreign college students who wish to study in North
American universities, or the British-Australian IELTS test, designed for those
who wish to study in the United Kingdom or Australia (Davies et al., 1999).

Achievement Tests

Achievement tests are similar to progress tests in that their purpose is
to see what a student has learned with regard to stated course outcomes.
However, they are usually administered at the mid- and end-point of the
semester or academic year. The content of achievement tests is generally
based on the specific course content or on the course objectives.
Achievement tests are often cumulative, covering material drawn from an
entire course or semester.

Diagnostic Tests

Diagnostic tests seek to identify those language areas in which a
student needs further help. Harris and McCann (1994, p. 29) point out that
where other types of tests are based on success, diagnostic tests are
based on failure. The information gained from diagnostic tests is crucial for
further course activities and providing students with remediation. Because
diagnostic tests are difficult to write, placement tests often serve a dual
function of both placement and diagnosis (Harris & McCann, 1994; Davies et
al., 1999).

Aptitude Tests

This type of test no longer enjoys the widespread use it once had. An
aptitude test is designed to measure general ability or capacity to learn a
foreign language a priori (before taking a course) and ultimate predicted
success in that undertaking. Language aptitude tests were seemingly
designed to apply to the classroom learning of any language. In the United
States, two common standardised English Language tests once used were
the Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1958) and the
Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966). Since there is
no research to show unequivocally that these kinds of tasks predict
communicative success in a language, apart from untutored language
acquisition, standardised aptitude tests are seldom used today with the
exception of identifying foreign language disability (Stansfield & Reed,
2004).

Progress Tests

These tests measure the progress that students are making towards
defined course or programme goals. They are administered at various
stages throughout a language course to see what the students have
learned, perhaps after certain segments of instruction have been completed.
Progress tests are generally teacher produced and are narrower in focus
than achievement tests because they cover a smaller amount of material
and assess fewer objectives.

Placement Tests

These tests, on the other hand, are designed to assess students' level
of language ability for placement in an appropriate course or class. This
type of test indicates the level at which a student will learn most
effectively. The main aim is to create groups which are homogeneous in
level. In designing a placement test, the test developer may choose to base
the test content either on a theory of general language proficiency or on
the learning objectives of the curriculum. In the former, institutions may
choose to use a well-established proficiency test such as the TOEFL or
IELTS exam and link it to curricular benchmarks. In the latter, tests are
based on aspects of the syllabus taught at the institution concerned.

In some contexts, students are placed according to their overall rank in
the test results. At other institutions, students are placed according to their
level in each individual skill area. Elsewhere, placement test scores are
used to determine if a student needs any further instruction in the language
or could matriculate directly into an academic programme.

Discuss and present the various types of tests and assessment
tasks that students have experienced.

Discuss the extent to which tests or assessment tasks serve their purpose.



The end of the topic. Happy reading!











TOPIC 3

BASIC TESTING TERMINOLOGY



3.0 SYNOPSIS

Topic 3 provides input on basic testing
terminology. It looks at the definitions, purposes and differences of various
tests.


3.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

7. explain the meaning and purpose of different types of
language tests;
8. compare Norm-Referenced Test and Criterion-Referenced Test,
Formative and Summative Tests, and Objective and Subjective Tests.


3.2 FRAMEWORK OF TOPICS

Types of Tests
   - Norm-Referenced and Criterion-Referenced
   - Formative and Summative
   - Objective and Subjective


CONTENT

SESSION THREE (3 hours)


3.3 Norm-Referenced Test (NRT)

According to Brown (2010), in NRTs an individual test-taker's score is
interpreted in relation to a mean (average score), median (middle score),
standard deviation (extent of variance in scores), and/or percentile rank.
The purpose of such tests is to place test-takers along a mathematical
continuum in rank order. Scores are commonly reported back to the
test-taker in the form of a numerical score, for example 250 out of 300,
and a percentile rank, for instance the 78th percentile, which denotes that
the test-taker's score was higher than 78 percent of the total number of
test-takers but lower than 22 percent in the administration. In other
words, an NRT is administered to compare an individual's performance with
that of his or her peers and/or to compare a group with other groups. In
School-Based Evaluation, NRT is used for summative evaluation, such as in
the end-of-year examination for the streaming and selection of students.
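
To make these statistics concrete, the short Python sketch below computes
the mean, median, standard deviation and percentile rank for a small set of
scores. The scores and the helper function percentile_rank are hypothetical
and serve only as an illustration; percentile rank is taken here to mean the
percentage of test-takers who scored below a given score, which is one
common convention among several.

```python
from statistics import mean, median, pstdev


def percentile_rank(score, all_scores):
    """Percentage of test-takers who scored strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical scores (out of 300) for ten test-takers; not real data.
scores = [120, 150, 165, 180, 195, 210, 225, 240, 250, 280]

average = mean(scores)                   # mean (average score)
middle = median(scores)                  # median (middle score)
spread = pstdev(scores)                  # standard deviation of the whole group
rank_250 = percentile_rank(250, scores)  # percentile rank of a score of 250

print(f"mean={average}, median={middle}, sd={spread:.1f}, rank={rank_250}")
```

With these made-up scores, a test-taker who scores 250 out of 300 has a
percentile rank of 80, that is, the score is higher than that of 80 percent
of the group.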

3.4 Criterion-Referenced Test (CRT)

Gottlieb (2006), on the other hand, refers to criterion-referenced tests as
the collection of information about student progress or achievement in
relation to a specified criterion. In a standards-based assessment model,
the standards serve as the criteria or yardstick for measurement. Following
Glaser (1973), the word criterion means the use of score values that can be
accepted as the index of attainment for a test-taker. Thus, CRTs are
designed to provide feedback to test-takers, mostly in the form of grades, on
specific course or lesson objectives. The Curriculum Development Centre
(2001) defines CRT as an approach that provides information on students'
mastery based on the criteria determined by the teacher. These criteria are
based on learning outcomes or objectives as specified in the syllabus. The
main advantage of CRTs is that they allow testers to make inferences
about how much language proficiency, in the case of language proficiency
tests, or knowledge and skills, in the case of academic achievement tests,
test-takers/students originally have and their successive gains over time.
As opposed to NRTs, CRTs focus on students' mastery of a subject matter
(represented in the standards) along a continuum instead of ranking students
on a bell curve. Table 3 below shows the differences between Norm-Referenced
Test (NRT) and Criterion-Referenced Test (CRT).

                 Norm-Referenced Test            Criterion-Referenced Test
Definition       A test that measures students'  An approach that provides
                 achievement as compared to      information on students'
                 other students in the group     mastery based on a criterion
                                                 specified by the teacher
Purpose          Determine performance           Determine learning mastery
                 difference among individuals    based on specified criterion
                 and groups                      and standard
Test Item        From easy to difficult level    Guided by minimum achievement
                 and able to discriminate        in the related objectives
                 examinees' ability
Frequency        Continuous assessment in the    Continuous assessment
                 classroom
Appropriateness  Summative evaluation            Formative evaluation
Example          Public exams: UPSR, PMR, SPM,   Mastery test: monthly test,
                 and STPM                        coursework, project, exercises
                                                 in the classroom

Table 3: The differences between Norm-Referenced Test (NRT) and
Criterion-Referenced Test (CRT)
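
The contrast between the two approaches can be illustrated with a minimal
Python sketch. The student names, the scores and the cut score of 80 are
assumptions made up for this example, and the two function names are
hypothetical; they do not come from any actual test or syllabus. The same
set of scores is reported twice: once against a fixed criterion (CRT) and
once as a rank order among peers (NRT).

```python
def criterion_referenced(scores, cut_score):
    """Report mastery against a fixed criterion, independent of peers."""
    return {
        name: "mastered" if score >= cut_score else "not yet mastered"
        for name, score in scores.items()
    }


def norm_referenced(scores):
    """Rank students against one another (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Hypothetical class results; the cut score of 80 is an assumed criterion.
scores = {"Aina": 85, "Ben": 82, "Chong": 90, "Devi": 58}

mastery = criterion_referenced(scores, cut_score=80)
ranks = norm_referenced(scores)

for name in scores:
    print(name, scores[name], mastery[name], "rank", ranks[name])
```

In this made-up class, Ben is ranked third out of four in the norm-referenced
report, yet the criterion-referenced report shows that he has mastered the
objective, because his result depends only on the criterion and not on how
his classmates performed.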


3.5 Formative Test

Formative test or assessment, as the name implies, is a kind of
feedback teachers give students while the course is progressing. Formative
assessment can be seen as assessment for learning. It is part of the
instructional process. We can think of formative assessment as practice.
With continual feedback, teachers may assist students to improve their
performance. The teachers point out what the students have done wrong
and help them to get it right. This can take place when teachers examine
the results of achievement and progress tests. Based on the results of
formative test or assessment, the teachers can suggest changes to the
focus of the curriculum or emphasis on some specific lesson elements. On the
other hand, students may also need to change and improve. Due to the
demanding nature of formative testing, numerous teachers prefer not to
adopt it, although giving back any assessed homework or achievement test
presents both teachers and students with healthy learning opportunities.

3.6 Summative Test

Summative test or assessment, on the other hand, refers to the kind of
measurement that summarises what the student has learnt or gives a one-off
measurement. In other words, summative assessment is assessment of
student learning. Students are more likely to experience assessment carried
out individually where they are expected to reproduce discrete language
items from memory. The results are then used to yield a school report and to
determine what students know and do not know. It does not necessarily
provide a clear picture of an individual's overall progress or even his/her full
potential, especially if s/he is hindered by the fear factor of physically sitting
for a test, but may provide straightforward and invaluable results for
teachers to analyse. It is given at a point in time to measure student
achievement in relation to a clearly defined set of standards, but it does not
necessarily show the way to future progress. It is given after learning is
supposed to occur. End-of-year tests in a course and other general
proficiency or public exams are some examples of summative tests or
assessment. Table 3.1 shows formative and summative assessments that
are common in schools.

Formative Assessment      Summative Assessment
Anecdotal records         Final exams
Quizzes and essays        National exams (UPSR, PMR, SPM, STPM)
Diagnostic tests          Entrance exams
Table 3.1: Common formative and summative assessments in schools



3.7 Objective Test

According to BBC Teaching English, an objective test is a test that
consists of right or wrong answers or responses and thus it can be marked
objectively. Objective tests are popular because they are easy to prepare
and take, quick to mark, and provide a quantifiable and concrete result.
They tend to focus more on specific facts than on general ideas and
concepts.

The types of objective tests include the following:
i. Multiple-choice items/questions;
ii. True-false items/questions;
iii. Matching items/questions; and
iv. Fill-in-the-blank items/questions.

In this topic, let us focus on multiple-choice questions, which may
look easy to construct but are in reality very difficult to build
correctly. This is congruent with the viewpoint of Hughes (2003, pp. 76-78),
who warns against many weaknesses of multiple-choice questions. The
weaknesses include:

It may limit beneficial washback;
It may enable cheating among test-takers;
It is very challenging to write successful items;
This technique strictly limits what can be tested;
This technique tests only recognition knowledge;
It may encourage guessing, which may have a considerable effect on
test scores.

Let's look at some important terminology used when designing
multiple-choice questions. There are five key terms, namely:

1. Receptive or selective response
Items in which the test-taker chooses from a set of responses rather than
creating a response (creating a response is commonly called a supply type
of response).
2. Stem
Every multiple-choice item consists of a stem (the body of the item
that presents a stimulus). The stem is the question or assignment in an
item. It may be a complete or an open sentence, in positive or negative
form. The stem must be short, simple, compact and clear. However, it must
not easily give away the right answer.



3. Options or alternatives
These are the list of possible responses to a test item. There are
usually between three and five options/alternatives to choose from.

4. Key
This is the correct response: either the only correct option or the best
of the options. In a good item, the correct answer is usually not obvious
when compared with the distractors.

5. Distractors
A distractor is an incorrect option included to distract students from
selecting the correct answer. An excellent distractor is almost the same as
the correct answer, but it is not correct.


When building multiple-choice items for both classroom-based and
large-scale standardised tests, consider the four guidelines below:

i. Design each item to measure a single objective;
ii. State both stem and options as simply and directly as possible;
iii. Make certain that the intended answer is clearly the only correct one;
iv. (Optional) Use item indices to accept, discard or revise items.
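
The text does not say which item indices are meant. A minimal sketch, assuming the two indices most commonly used for this purpose (item facility, the proportion of test-takers who answer correctly, and item discrimination, the difference between high and low scorers), might look like this. The function names and the response data are hypothetical.

```python
def item_facility(responses):
    """Proportion of test-takers who answered the item correctly.

    `responses` is a list of 1 (correct) and 0 (incorrect).
    """
    return sum(responses) / len(responses)


def item_discrimination(high_group, low_group):
    """Difference in proportion correct between high and low scorers.

    Assumes (for this sketch) that both groups are the same size.
    """
    if len(high_group) != len(low_group):
        raise ValueError("groups must be the same size")
    return (sum(high_group) - sum(low_group)) / len(high_group)


# Hypothetical responses to one item from the five highest and
# five lowest scorers on the whole test.
high = [1, 1, 1, 1, 0]
low = [1, 0, 0, 0, 0]

facility = item_facility(high + low)              # 5 correct out of 10
discrimination = item_discrimination(high, low)   # (4 - 1) / 5
print(f"facility={facility:.2f}, discrimination={discrimination:.2f}")
```

An item that almost everyone gets right or wrong, or one that high and low scorers answer equally well, would be a candidate for revision under this reading of guideline iv.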

3.8 Subjective Test

Contrary to an objective test, a subjective test is evaluated by giving an
opinion, usually based on agreed criteria. Subjective tests include essay,
short-answer, vocabulary, and take-home tests. Some students become
very anxious about these tests because they feel their writing skills are
not up to par.
In reality, a subjective test gives test-takers more opportunity to
demonstrate their understanding and/or in-depth knowledge and skills
in the subject matter. In this case, test-takers might provide some
acceptable, alternative responses that the tester, teacher or test developer
did not predict. Generally, subjective tests test the higher-order skills of
analysis, synthesis, and evaluation. In short, a subjective test enables
students to be more creative and critical. Table 3.2 shows various types of
objective and subjective assessments.

Objective Assessments       Subjective Assessments
True/False Items            Extended-response Items
Multiple-choice Items       Restricted-response Items
Multiple-response Items     Essay
Matching Items
Table 3.2: Various types of objective and subjective assessments


Some have argued that the distinction between objective and
subjective assessments is neither useful nor accurate because, in reality,
there is no such thing as objective assessment. In fact, all assessments are
created with inherent biases built into decisions about relevant subject
matter and content, as well as cultural (class, ethnic, and gender) biases.

Reflection

1. Objective test items are items that have only one answer or
correct response. Describe in-depth the multiple-choice test
item.

2. Subjective test items allow subjectivity in the response given by
the test-takers. Explain in detail the various types of subjective
test items.

Discussion

1. Identify at least three differences between formative and summative
assessment.

2. What are the strengths of multiple-choice items compared to essay
items?

3. Informal assessments are often unreliable, yet they are still
important in classrooms. Explain why this is the case, and defend your
explanation with examples.

4. Compare and contrast Norm-Referenced Test with Criterion-
Referenced Test.

































TOPIC 4

BASIC PRINCIPLES OF ASSESSMENT


4.0 SYNOPSIS

Topic 4 defines the basic principles of assessment (reliability, validity,
practicality, washback, and authenticity) and the essential sub-
categories within reliability and validity.
4.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:

1. define the basic principles of assessment (reliability, validity,
practicality, washback, and authenticity) and the essential
sub-categories within reliability and validity;

2. explain the differences between validity and reliability;



3. distinguish the different types of validity and reliability in tests
and other instruments in language assessment.

4.2 FRAMEWORK OF TOPICS




CONTENT

SESSION FOUR (3 hours)

4.3 INTRODUCTION

Assessment is a complex, iterative process requiring skills,
understanding, and knowledge in the exercise of professional judgment.
In this process, there are five important criteria that testers ought to
use for testing a test: reliability, validity, practicality, washback and
authenticity. Since these five principles are context dependent, no
priority is implied by the order of presentation.


















[Figure: framework of topics. Labels: Types of Tests; Reliability;
Validity; Practicality; Objectivity; Interpretability; Authenticity;
Washback Effect]




4.4 RELIABILITY

Reliability means the degree to which an assessment tool produces stable
and consistent results. It is a concept that is easily misunderstood
(Feldt & Brennan, 1989).
Reliability essentially denotes consistency, stability, dependability,
and accuracy of assessment results (McMillan, 2001a, p.65 in Brown, G. et
al., 2008). Since there is tremendous variability from one teacher or
tester to another that affects student performance, reliability in planning,
implementing, and scoring student performances is needed to give rise to
valid assessment.
Fundamentally, a reliable test is consistent and dependable. If a
tester administers the same test to the same test-taker or matched
test-takers on two occasions, the test should give the same results. In a
validity chain, test administrators need to be sure that scoring has been
carried out properly. If the scores used by the tester do not accurately
reflect what the test-taker actually did, would not be awarded by another
marker, or would not be obtained on a similar assessment, then these scores
lack reliability. Errors can occur in many ways: for example, giving Level 2
when another rater would give Level 4, adding up marks wrongly,
transcribing scores from test paper to database inaccurately, students
performing really well on the first half of the assessment and poorly on
the second half due to fatigue, and so on. Thus, lack of reliability in the
scores students receive is a threat to validity.
According to Brown (2010), a reliable test can be described as follows:

Is consistent in its conditions across two or more administrations
Gives clear directions for scoring / evaluation
Has uniform rubrics for scoring / evaluation
Lends itself to consistent application of those rubrics by the scorer
Contains items / tasks that are unambiguous to the test-taker

4.4.1 Rater Reliability
When humans are involved in the measurement procedure, there is a
tendency towards error, bias and subjectivity in determining the scores
of similar tests. There are two kinds of rater reliability, namely
inter-rater reliability and intra-rater reliability.

Inter-rater reliability refers to the degree of similarity between
different testers or raters: can two or more testers/raters, without
influencing one another, give the same marks to the same set of scripts?
(Contrast this with intra-rater reliability.)

One way to test inter-rater reliability is to have each rater
assign each test item a score. For example, each rater might
score items on a scale from 1 to 10. Next, you would calculate the
correlation between the two ratings to determine the level of inter-
rater reliability. Another means of testing inter-rater reliability is to
have raters determine which category each observation falls into and
then calculate the percentage of agreement between the raters. So, if
the raters agree 8 out of 10 times, the test has an 80% inter-rater
reliability rate. Rater reliability is assessed by having two or more
independent judges score the test. The scores are then compared to
determine the consistency of the raters' estimates.
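
The two checks described above can be sketched as follows. This is only an illustration: the function names and the rating data are hypothetical, and the correlation used here is the Pearson coefficient, which is one common choice (the text does not specify which correlation to use).

```python
from math import sqrt


def percent_agreement(rater_a, rater_b):
    """Percentage of observations placed in the same category by both raters."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * matches / len(rater_a)


def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical category judgements for ten observations (8 agreements).
cats_a = ["A", "B", "A", "C", "B", "A", "C", "B", "A", "C"]
cats_b = ["A", "B", "A", "C", "B", "A", "C", "B", "C", "A"]
print(percent_agreement(cats_a, cats_b))  # 80.0

# Hypothetical scores (scale 1 to 10) given by two raters to five items.
scores_a = [7, 5, 9, 6, 8]
scores_b = [6, 5, 9, 7, 8]
print(round(pearson_r(scores_a, scores_b), 2))  # 0.9
```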
Intra-rater reliability is an internal factor: its main concern is
consistency within the rater. For example, if a rater (teacher) has many
examination papers to mark and does not have enough time to mark them,
s/he might take much more care with the first, say, ten papers than with
the rest. This inconsistency will affect the students' scores; the first
ten might get higher scores. In other words, while inter-rater reliability
involves two or more raters, intra-rater reliability is the consistency of
grading by a single rater. Scores on a test are rated by a single
rater/judge at different times. When we grade tests at different times, we
may become inconsistent in our grading for various reasons. Some papers
that are graded during the day may get our full and careful attention,
while others that are graded towards the end of the day are very quickly
glossed over. As such, intra-rater reliability determines the consistency
of our grading.



Both inter- and intra-rater reliability deserve close attention in
that test scores are likely to vary from rater to rater or even from the
same rater (Clark, 1979).

4.4.2 Test Administration Reliability
A number of factors influence test administration reliability.
Unreliability occurs due to outside interference like noise, variations in
photocopying, temperature variations, the amount of light in various parts
of the room, and even the condition of desks and chairs. Brown (2010)
stated that he once witnessed the administration of a test of aural
comprehension in which an audio player was used to deliver items for
comprehension, but due to street noise outside the building, test-takers
sitting next to open windows could not hear the stimuli clearly. According
to him, that was a clear case of unreliability caused by the conditions of
the test administration.

4.4.3 Factors influencing Reliability


The outcome of a test is influenced by many factors. Assuming that the
factors are constant and not subject to change, a test is considered to be
reliable if the scores are consistent and not different from other
equivalent and reliable test scores. However, tests are not free from
errors. Factors that affect the reliability of a test include test length
factors, teacher and student factors, environment factors, test
administration factors, and marking factors.

[Figure 4.4.3: Factors that affect the reliability of a test. Labels: Test
Factor; Teacher and Student Factor; Environment Factor; Test
Administration Factor; Marking Factor]

a. Test length factors

In general, longer tests produce higher reliabilities. Because chance and
guessing have less influence on the total score when there are more items,
the scores will be more accurate if the test is longer. An objective test
has higher consistency because it is not exposed to a variety of
interpretations. A valid test is said to be reliable, but a reliable test
need not be valid: a consistent score does not necessarily measure what it
is intended to measure. In addition, test items are only a sample of the
subject being tested, and variation between the samples used in two
equivalent tests can be one cause of unreliable test outcomes.
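
The text gives no formula for the effect of test length. One standard way to illustrate it is the Spearman-Brown prediction formula, which estimates the reliability of a test lengthened by a factor k with comparable items. The sketch below assumes that formula; the function name and the starting reliability of 0.60 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by `length_factor`.

    Spearman-Brown formula: r_new = k * r / (1 + (k - 1) * r),
    where r is the current reliability and k the length factor.
    Assumes the added items are comparable to the existing ones.
    """
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)


# Hypothetical test with reliability 0.60, doubled in length.
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```

Under these assumptions, doubling the test raises the predicted reliability from 0.60 to 0.75, while halving it (a length factor of 0.5) would lower it.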

b. Teacher-Student factors

In most cases, it is normal for teachers to construct and
administer tests for students. Thus, a good teacher-student
relationship would help increase the consistency of the results. Other
factors that contribute positively to the reliability of a test
include teachers' encouragement, positive mental and physical
condition, familiarity with the test formats, and perseverance and
motivation.


c. Environment factors

An examination environment certainly influences test-takers and
their scores. Any favourable environment with comfortable chairs and
desks, good ventilation, sufficient light and space will improve the
reliability of the test. On the contrary, a non-conducive environment will
affect test-takers' performance and test reliability.

d. Test administration factors



Because students' grades are dependent on the way tests are
administered, test administrators should strive to provide clear and
accurate instructions, sufficient time and careful monitoring of tests to
improve the reliability of their tests. A test-retest technique can be
used to determine test reliability.

e. Marking factors

Unfortunately, we human judges have many opportunities to introduce
error in our scoring of essays (Linn & Gronlund, 2000; Weigle, 2002). It
is possible that our scoring invalidates many of the interpretations we
would like to make based on this type of assessment. Brennan (1996)
has reported that in large-scale, high-stakes marking panels that are
tightly trained and monitored, marker effects are small. By contrast, in
low-stakes, small-scale marking, individual markers can potentially
introduce a large error. It is also common for different markers to award
different marks for the same answer even with a prepared mark scheme. A
marker's assessment may vary from time to time and with different
situations. This does not happen with objective tests, since the responses
are fixed. Thus, objectivity is a condition for reliability.


4.5 VALIDITY

Validity refers to the evidence base that can be provided about the
appropriateness of the inferences, uses, and consequences that come from
assessment (McMillan, 2001a). Appropriateness has to do with the
soundness, trustworthiness, or legitimacy of the claims or inferences that
testers would like to make on the basis of obtained scores. Clearly, we have
to evaluate the whole assessment process and its constituent parts by how
soundly we can defend the consequences that arise from the inferences and
decisions we make. Validity, in other words, is not a characteristic of a
test or assessment but a judgment, which can have varying degrees of
strength.




So, the second characteristic of good tests is validity, which refers to
whether the test is actually measuring what it claims to measure. This is
important for us as we do not want to make claims concerning what a student
can or cannot do based on a test when the test is actually measuring
something else. Validity is usually determined logically, although several
types of validity may use correlation coefficients.

According to Brown (2010), a valid test of reading ability actually
measures reading ability and not 20/20 vision, or previous knowledge of a
subject, or some other variables of questionable relevance. To measure
writing ability, one might ask students to write as many words as they can in
15 minutes, then simply count the words for the final score. Such a test is
practical (easy to administer) and the scoring quite dependable (reliable).
However, it would not constitute a valid test of writing ability without taking
into account its comprehensibility, rhetorical discourse elements, and the
organisation of ideas.

The following are the different types of validity:

Face validity: Do the assessment items appear to be appropriate?

Content validity: Does the assessment content cover what you want to
assess? Have satisfactory samples of language and language skills been
selected for testing?

Construct validity: Are you measuring what you think you're measuring? Is
the test based on the best available theory of language and language
use?

Concurrent validity: Can you use the current test score to estimate scores
of other criteria? Does the test correlate with other existing measures?

Predictive validity: Is it accurate for you to use your existing students'
scores to predict future students' scores? Does the test successfully
predict future outcomes?

It is fairly obvious that a valid assessment should have a good coverage
of the criteria (concepts, skills and knowledge) relevant to the purpose of
the examination. The important notion here is the purpose.







4.5.1 Face validity

Face validity is validity which is determined impressionistically,
for example by asking students whether the examination was
appropriate to their expectations (Henning, 1987). Mousavi (2009)
refers to face validity as the degree to which a test looks right, and
appears to measure the knowledge or abilities it claims to measure,
based on the subjective judgement of the examinees who take it, the
administrative personnel who decide on its use, and other
psychometrically unsophisticated observers.
It is important that a test looks like a test even at first impression.
If students taking a test do not feel that the questions given to them
are a test or part of a test, then the test may not be valid, as the
students may not make a serious attempt at the questions. The test,
hence, will not be able to measure what it claims to measure.



Figure 4.5: Types of Validity
Types of Validity
a. Face validity
b. Content Validity
c. Construct Validity
d. Concurrent Validity
e. Predictive Validity


4.5.2 Content validity

Content validity is concerned with whether or not the content of the test
is sufficiently representative and comprehensive for the test to be a valid
measure of what it is supposed to measure (Henning, 1987). The most
important step in ensuring content validity is to make sure all content
domains are represented in the test. Another way to verify content validity
is to use a Table of Test Specification, which gives detailed information
on each content area, level of skills, level of difficulty, number of
items, and item representation for rating in each content area, skill or
topic.
We can quite easily imagine taking a test after going through an
entire language course. How would you feel if at the end of the
course, your final examination consists of only one question that
covers one element of language from the many that were introduced
in the course? If the language course was a conversational course
focusing on the different social situations that one may encounter,
how valid is a final examination that requires you to demonstrate your
ability to place an order at a posh restaurant in a five-star hotel?

4.5.3 Construct validity

A construct is a psychological concept used in measurement.
Construct validity is the most obvious reflection of whether a test
measures what it is supposed to measure, as it directly addresses the
issue of what it is that is being measured. In other words, construct
validity refers to whether the underlying theoretical constructs that the
test measures are themselves valid. Proficiency, communicative
competence, and fluency are examples of linguistic constructs;
self-esteem and motivation are psychological constructs.
Fundamentally, every issue in language learning and teaching
involves theoretical constructs. When you are assessing a student's
oral proficiency, for instance, the test must, to possess construct
validity, cover the various components of fluency: speed, rhythm,
juncture, (lack of) hesitations, and other elements within the construct
of fluency. Tests are, in a manner of speaking, operational definitions of
constructs in that their test tasks are the building blocks of the entity
that is being measured (see Davidson, Hudson, & Lynch, 1985; T.
McNamara, 2000).

4.5.4 Concurrent validity

Concurrent validity is the use of another more reputable and
recognised test to validate one's own test. For example, suppose you
come up with your own new test and would like to determine the
validity of your test. If you choose to use concurrent validity, you
would look for a reputable test and compare your students'
performance on your test with their performance on the reputable and
acknowledged test. In concurrent validity, a correlation coefficient is
obtained and used to generate an actual numerical value. A high
positive correlation of 0.7 to 1 indicates that the learners' scores are
relatively similar for the two tests or measures.
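
As a rough sketch of this comparison, the snippet below correlates scores on a teacher's own test with scores on an established test and checks the result against the 0.7 to 1 band mentioned above. The score lists and function names are hypothetical, and the Pearson coefficient is assumed as the correlation measure.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def is_high_positive(r, lower=0.7):
    """True if r falls in the 'high positive' band (0.7 to 1) used in the text."""
    return lower <= r <= 1.0


# Hypothetical scores of the same five students on both tests.
own_test = [55, 62, 70, 48, 81]
reputable_test = [58, 60, 75, 50, 85]

r = pearson_r(own_test, reputable_test)
print(f"r = {r:.2f}, high positive: {is_high_positive(r)}")
```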

For example, in a course unit whose objective is for students to
be able to orally produce voiced and unvoiced stops in all possible
phonetic environments, the results of one teacher's unit test might
be compared with an independent assessment such as a
commercially produced test of similar phonemic proficiency. Since
criterion-related evidence usually falls into one of two categories,
concurrent and predictive validity, a classroom test designed to
assess mastery of a point of grammar in communicative use will
have criterion validity if test scores are verified either by observed
subsequent behaviour or by other communicative measures of the
grammar point in question.

4.5.5 Predictive validity

Predictive validity is closely related to concurrent validity in that it too
generates a numerical value. For example, the predictive validity of a
university language placement test can be determined several
semesters later by correlating the scores on the test with the GPA of the
students who took the test. Therefore, a test with high predictive validity
is a test that would yield predictable results in a later measure. A simple
example of tests that may be concerned with predictive validity is the
trial national examinations conducted at schools in Malaysia, as they are
intended to predict the students' performance on the actual SPM
national examinations (Norleha Ibrahim, 2009).

As mentioned earlier, validity is a complex concept, yet it is
crucial to the teacher's understanding of what makes a good test. It is
good to heed Messick's (1989, p. 36) caution that validity is not an
all-or-none proposition and that various forms of validity may need to be
applied to a test in order to be satisfied with its overall effectiveness.


What are reliability and validity? What determines the reliability of a
test?

What are the different types of validity? Describe any three types and
cite examples.

http://www.2dix.com/pdf-2011/testing-and-evaluation-in-esl-pdf.php


4.5.6 Practicality

Although practicality is an important characteristic of tests, it is
more often a limiting factor in testing. There will be situations in
which, after we have already determined what we consider to be the most
valid test, we need to reconsider the format purely because of practicality
issues. A valid test of spoken interaction, for example, would require
that the examinees be relaxed, interact with peers and speak on
topics that they are familiar and comfortable with. This sounds like the
kind of conversation that people have with their friends while sipping
afternoon tea at roadside stalls. Of course, such a situation would
be a highly valid measure of spoken interaction if we could set it up.
Imagine if we even tried to do so. It would require hidden cameras as
well as a lot of telephone calls and money.



Therefore, a more practical form of the test, especially if it is to be
administered at the national level as a standardised test, is to have a
short interview session of about fifteen minutes using perhaps a picture
or reading stimulus that the examinees would describe or discuss.
Practicality issues, although limiting in a sense, cannot be dismissed if
we are to come up with a useful assessment of language ability.
Practicality issues can involve economics or costs, administration
considerations such as time and scoring procedures, as well as the ease of
interpretation. Tests are only as good as how well they are interpreted.
Therefore, tests that cannot be easily interpreted will definitely cause
many problems.


4.5.7 Objectivity

The objectivity of a test refers to the extent to which the
teachers/examiners who mark the answer scripts award the same score to
the same answer script. A test is said to have high objectivity
when examiners, guided by the mark scheme, give the same score to similar
answers. An objective test has the highest level of objectivity because
its scoring is not influenced by the examiner's skills and emotions.
A subjective test, meanwhile, is said to have the lowest objectivity.
According to various studies, different examiners tend to award different
scores to the same essay. It is also possible that the same examiner would
give different scores to the same essay if s/he re-marks it at a
different time.

4.5.8 Washback effect

The term 'washback' or 'backwash' (Hughes, 2003, p.1) refers to
the impact that tests have on teaching and learning. Such impact is
usually seen as being negative: tests are said to force teachers to do
things they do not necessarily wish to do. However, some have
argued that tests are potentially also 'levers for change' in language
education, the argument being that if a bad test has negative impact, a
good test should or could have positive washback (Alderson, 1986b; Pearson,
1988).

Cheng, Watanabe, and Curtis (2004) devoted an entire anthology
to the issue of washback, while Spratt (2005) challenged teachers to
become agents of beneficial washback in their language classrooms.
Brown (2010) discusses the factors that provide beneficial washback
in a test. He mentions that such a test can positively influence what
and how teachers teach and students learn, offers learners a chance to
prepare adequately, gives learners feedback that enhances their
language development, is more formative than summative in nature,
and provides conditions for peak performance by the learners.

In large-scale assessment, washback often refers to the effects
that tests have on instruction in terms of how students prepare for the
test. In classroom-based assessment, washback can have a number
of positive manifestations, ranging from the benefit of preparing and
reviewing for a test to the learning that accrues from feedback on
one's performance. Teachers can provide information that washes
back to students in the form of useful diagnoses of strengths and
weaknesses.

The challenge to teachers is to create classroom tests that serve
as learning devices through which washback is achieved. Students'
incorrect responses can become a platform for further improvement.
Their correct responses, on the other hand, need to be complimented,
especially when they represent accomplishments in a student's
developing competence. Teachers can use various strategies in
providing guidance or coaching. Washback enhances a number of
basic principles of language acquisition, namely intrinsic motivation,
autonomy, self-confidence, language ego, interlanguage, and
strategic investment, among others.


Washback is generally said to be either positive or negative.
Unfortunately, students and teachers tend to think of the negative effects
of testing, such as test-driven curricula and only studying and learning
what they need to know for the test. Positive washback, or what we prefer
to call 'guided washback', can benefit teachers, students and
administrators. Positive washback assumes that testing and
curriculum design are both based on clear course outcomes, which
are known to both students and teachers/testers. If students perceive
that tests are markers of their progress towards achieving these
outcomes, they have a sense of accomplishment. In short, tests must
be part of learning experiences for all involved. Positive washback
occurs when a test encourages good teaching practice.

Washback is particularly obvious when the tests or examinations
in question are regarded as being vital and having a definite
impact on the students' or test-takers' future. We would expect, for
example, that national standardised examinations would have stronger
washback effects than a school-based or classroom-based test.

4.5.9 Authenticity

Another major principle of language testing is authenticity. It is a
concept that is difficult to define, particularly within the art and science
of evaluating and designing tests. Bachman and Palmer (1996), cited
in Brown (2010), define authenticity as the degree of correspondence of the
characteristics of a given language test task to the features of a target
language task (p.23), and then suggest an agenda for identifying
those target language tasks and for transforming them into valid test
items.

Language learners are motivated to perform when they are
faced with tasks that reflect real-world situations and contexts. Good
testing or assessment strives to use formats and tasks that reflect the
types of situation in which students would authentically use the target
language. Whenever possible, teachers should attempt to use authentic
materials in testing language skills.

4.6.0 Interpretability
Test interpretation encompasses all the ways that meaning is
assigned to the scores. Proper interpretation requires knowledge
about the test, which can be obtained by studying its manual and other
materials along with current research literature with respect to its
use; no one should undertake the interpretation of scores on any test
without such study. In any test interpretation, the following
considerations should be taken into account.
A. Consider Reliability: Reliability is important because it is a
prerequisite to validity and because the degree to which a score may
vary due to measurement error is an important factor in its
interpretation.
B. Consider Validity: Proper test interpretation requires knowledge of
the validity evidence available for the intended use of the test. Its
validity for other uses is not relevant. Indeed, use of a measurement
for a purpose for which it was not designed may constitute misuse.
The nature of the validity evidence required for a test depends upon its
use.
C. Scores, Norms, and Related Technical Features: The result of
scoring a test or subtest is usually a number called a raw score, which
by itself is not interpretable. Additional steps are needed to translate
the number either into a verbal description (e.g., pass or
fail) or into a derived score (e.g., a standard score). Less than full
understanding of these procedures is likely to produce errors in
interpretation and ultimately in counselling or other uses.
D. Administration and Scoring Variation: Stated criteria for score
interpretation assume standard procedures for administering and
scoring the test. Departures from standard conditions and procedures
modify and often invalidate these criteria.

Study some commercially produced tests and evaluate the authenticity of
these tests or test items.

Discuss the importance of authenticity in testing.

Based on samples of formative and summative assessments, discuss
aspects of reliability/validity that must be considered in these assessments.

Discuss measures that a teacher can take to ensure high validity of
language assessment for the primary classroom.




































TOPIC 5

DESIGNING CLASSROOM LANGUAGE
TEST

5.0 SYNOPSIS

Topic 5 introduces you to the stages of test construction, the preparation
of a test blueprint or test specifications, the elements of test specification
guidelines, and the importance of following the guidelines when constructing
test items. We then look at the various test formats that are appropriate for
language assessment.


5.1 LEARNING OUTCOMES

By the end of this topic, you will be able to:
1. identify the different stages of test construction
2. describe the features of a test specification
3. draw up a test specification that reflects both the purpose and
the objectives of the test
4. compare and contrast Bloom's taxonomy and SOLO taxonomy
5. categorise test items according to Bloom's taxonomy
6. discuss the elements of test items of high quality, reliability and
validity
7. identify the elements of test specification guidelines
8. demonstrate an understanding of the importance of following
the guidelines for constructing test items
9. illustrate test formats that are appropriate and meet the
requirements of the learning outcomes

5.2 FRAMEWORK OF TOPICS




CONTENT

SESSION FIVE (3 hours)


5.3 Stages of Test Construction

Constructing a test is not an easy task; it requires a variety of skills
along with deep knowledge of the area for which the test is to be
constructed. The steps are:

i. determining
ii. planning
iii. writing
iv. preparing
v. reviewing
vi. pre-testing
vii. validating


5.3.1 Determining
The essential first step in testing is to make oneself perfectly
clear about what it is one wants to know and for what purpose. When
we start to construct a test, the following questions have to be
answered.

Who are the examinees?
What kind of test is to be made?
What is the precise purpose?
What abilities are to be tested?
How detailed and how accurate must the results be?
How important is the backwash effect?
What constraints are set by the unavailability of expertise,
facilities, and time (for construction, administration and scoring)?
What is the scope of the test?

[Framework of topics diagram (Section 5.2): Stages of Test Construction;
Preparing Test Blueprint / Test Specifications; Bloom's and SOLO
Taxonomies; Guidelines for Constructing Test Items; Test Format]

5.3.2 Planning
The first form that the solution takes is a set of specifications for
the test. This will include information on content, format and timing,
criteria, levels of performance, and scoring procedures. In this
stage, the test constructor has to determine the content by carrying out
the following tasks:
describing the purpose of the test;
describing the characteristics of the test-takers, that is, the nature of
the population of examinees for whom the test is being designed;
defining the nature of the ability we want to measure;
developing a plan for evaluating the qualities of test usefulness, which is
the degree to which a test is useful for teachers and students and which
comprises six qualities: reliability, validity, authenticity, practicality,
interactiveness and impact;
identifying resources and developing a plan for their allocation and
management;
determining the format and timing of the test;
determining levels of performance; and
determining scoring procedures.

5.3.3 Writing
Although writing items is time-consuming, writing good items is an art.
No one can expect to be able consistently to produce perfect items.
Some items will have to be rejected, others reworked. The best way
to identify items that have to be improved or abandoned is through
teamwork. Colleagues must really try to find fault; and despite the
seemingly inevitable emotional attachment that item writers develop
to items that they have created, they must be open to, and ready to
accept, the criticisms that are offered to them. Good personal
relations are a desirable quality in any test writing team.

Test item writers should possess the following characteristics:
They have to be experienced in test construction.
They have to be quite knowledgeable about the content of the test.
They should have the capacity to use language clearly and economically.
They have to be ready to sacrifice time and energy.

Another basic aspect of writing the items of the test is sampling.
Sampling means that test constructors choose widely from the whole
area of the course content. It is most unlikely that everything found
under the heading of 'Content' in the specifications can be included in
any one version of the test. Choices have to be made for content
validity and for beneficial backwash. One should not concentrate solely
on elements known to be easy to test. Rather, the content of the test
should be a representative sample of the course material.

5.3.4 Preparing
One has to understand the major principles and techniques of preparing
test items, and to have experience of doing so. Not every teacher can make a
good tester. To construct different kinds of tests, the tester should
observe some principles. In production-type tests, we have to
bear in mind that no comments are necessary. Test writers should
also try to avoid test items that can be answered through
test-wiseness. Test-wiseness refers to the capacity of examinees to
utilise the characteristics and formats of the test to guess the correct
answer.

5.3.5 Reviewing
Principles for reviewing test items:
The test should not be reviewed immediately after its construction,
but after some considerable time.
Other teachers or testers should review it. In a language test, it is
preferable if native speakers are available to review the test.

5.3.6 Pre-testing
After the test has been reviewed, it should be submitted to pre-testing.
The tester should administer the newly developed test to a group of
examinees similar to the target group; the purpose is to analyse every
individual item as well as the whole test.
Numerical data (test results) should be collected to check the
efficiency of each item; these data should include item facility and
discrimination.

5.3.7 Validating
Item Facility (IF) shows to what extent the item is easy or difficult.
Items should be neither too easy nor too difficult. To measure the
facility, or easiness, of an item, the following formula is used:

IF = c / N

where c is the number of correct responses and N is the total number of
candidates. To measure item difficulty, the number of incorrect responses
(w) is used instead:

Item difficulty = w / N

The results of such equations range from 0 to 1. An item with a facility
index of 0 is too difficult, and one with an index of 1 is too easy. The
ideal item is one with a value of 0.5, and the acceptable range for item
facility is 0.37 to 0.63; that is, an item with a value below 0.37 is
difficult and one with a value above 0.63 is easy.
Tests that are too easy or too difficult for a given sample
population often show low reliability. As noted in Topic 4, reliability is
one of the complementary aspects of measurement.
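
The item facility calculation can be sketched in a few lines of Python. This is only an illustration: the function names and the pre-test figures are invented for the example, and the 0.37 to 0.63 band is simply the acceptability range quoted above.

```python
def item_facility(correct: int, total: int) -> float:
    """Item facility: proportion of candidates answering correctly (c / N)."""
    if total <= 0:
        raise ValueError("total number of candidates must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct responses must lie between 0 and total")
    return correct / total


def classify_item(facility: float, low: float = 0.37, high: float = 0.63) -> str:
    """Label an item using the acceptability range quoted in the text."""
    if facility < low:
        return "difficult"
    if facility > high:
        return "easy"
    return "acceptable"


# Hypothetical pre-test data: item label -> correct responses out of 40 candidates.
pretest = {"Q1": 36, "Q2": 20, "Q3": 9}
for label, correct in pretest.items():
    value = item_facility(correct, 40)
    print(label, classify_item(value))
```

Items labelled "difficult" or "easy" in such a check are candidates for rewording or replacement before the test is used with the target group.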

5.4 Preparing Test Blueprint / Test Specifications

Test specifications (specs) for classroom use can be an outline of
your test, showing what it will look like (Brown, 2010). Consider your test
specs as a blueprint of the test that includes the following:
a description of its content
item types (methods, such as multiple-choice, cloze, etc.)
tasks (e.g. written essay, reading a short passage, etc.)
skills to be included
how the test will be scored
how it will be reported to students



For classroom purposes (Davidson & Lynch, 2002), the specs
are your guiding plan for designing an instrument that effectively fulfils
your desired principles, especially validity.
It is vital to note that for large-scale standardised tests such as the Test
of English as a Foreign Language (TOEFL), the International
English Language Testing System (IELTS), the Michigan English
Language Assessment Battery (MELAB), and the like, which are
intended to be widely distributed and thus broadly generalised,
test specifications are much more formal and detailed (Spaan, 2006).
They are also usually confidential so that the institution that is
designing the test can ensure the validity of subsequent forms of a
test.
Many language teachers claim that it is difficult to construct an item.
In reality, it is rather easy to develop an item if we are committed to
planning the measuring instruments used to evaluate students'
achievement.
However, what exactly is an item for a test? An item is a tool,
instrument, instruction or question used to get feedback from
test-takers, which serves as evidence of something that is being
measured. This feedback is useful information to consider when measuring
or making claims about a construct. Items can be classified as recall
items and thinking items. A recall item requires test-takers to recall
information in order to answer, and a thinking item requires
test-takers to use their thinking skills to answer.
For instance, consider a grammar unit test that will be administered at
the end of a three-week grammar course for high-beginning adult
learners (Level 2). The students will be taking a test that covers verb
tenses and two integrated skills (listening/speaking and
reading/writing), and the grammar class they attend serves to reinforce the
grammatical forms that they have learnt in the two earlier classes.
Based on the scenario above, the test specs that you design
might consist of the following four sequential steps:
1. a broad outline of how the test will be organised
2. which of the eight sub-skills you will test
3. what the various tasks and item types will be
4. how results will be scored, reported to students, and used
in future classes (washback)
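
One way to keep such a blueprint explicit is to record the four steps as a small data structure. The sketch below is only an illustration: the class names, the sub-skill names and the item counts are all invented for the example and do not come from Brown (2010).

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """One task or item type in the blueprint (values are illustrative)."""
    sub_skill: str
    item_type: str
    num_items: int
    marks_per_item: int = 1


@dataclass
class Blueprint:
    """A classroom test blueprint organised around the four steps above."""
    outline: str                                          # step 1
    sub_skills: list[str]                                 # step 2
    tasks: list[TaskSpec] = field(default_factory=list)   # step 3
    scoring_and_reporting: str = ""                       # step 4

    def total_marks(self) -> int:
        """Sum of marks across all tasks."""
        return sum(t.num_items * t.marks_per_item for t in self.tasks)

    def untested_sub_skills(self) -> list[str]:
        """Sub-skills named in step 2 that no task in step 3 covers."""
        covered = {t.sub_skill for t in self.tasks}
        return [s for s in self.sub_skills if s not in covered]


# Hypothetical blueprint for the grammar unit test; sub-skill names and
# item counts are assumptions made for this example only.
spec = Blueprint(
    outline="Two sections: multiple-choice, then short written answers",
    sub_skills=["simple past", "present perfect", "future with 'will'"],
    tasks=[
        TaskSpec("simple past", "multiple-choice", num_items=10),
        TaskSpec("present perfect", "gap-fill", num_items=5, marks_per_item=2),
    ],
    scoring_and_reporting="Score out of 20; written feedback to each student",
)
print(spec.total_marks())
print(spec.untested_sub_skills())
```

Checking for untested sub-skills before writing items is a simple guard for content validity: every sub-skill the blueprint promises should be covered by at least one task.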

Besides knowing the purpose of the test you are creating, you
are required to know as precisely as possible what it is you want to
test. Do not conduct a test hastily. Instead, you need to examine the
objectives for the unit you are testing carefully.
5.5 Bloom's and SOLO Taxonomies
5.5.1 Bloom's Taxonomy (Revised)
Bloom's Taxonomy is a systematic way of describing how a
learner's performance develops from simple to complex levels in the
affective, psychomotor and cognitive domains of learning. The original
Taxonomy provided carefully developed definitions for each of the six
major categories in the cognitive domain. The categories were
Knowledge, Comprehension, Application, Analysis, Synthesis, and
Evaluation. With the exception of Application, each of these was
broken into subcategories. The complete structure of the original
Taxonomy is shown in Figure 5.1.


Figure 5.1: Original Terms of Bloom's Taxonomy
Retrieved from: http://www.kurwongbss.qld.edu.au/thinking/Bloom/blooms.htm

The categories were ordered from simple to complex and from concrete to
abstract. Further, it was assumed that the original Taxonomy represented a
cumulative hierarchy; that is, mastery of each simpler category was
prerequisite to mastery of the next more complex one. In the cognitive
domain, there are six stages, namely Knowledge, Comprehension,
Application, Analysis, Synthesis and Evaluation. Unfortunately, traditional
education tends to base student learning mainly in this domain. In the
original Taxonomy, the Knowledge category embodied both noun and verb
aspects. The noun or subject matter aspect was specified in Knowledge's
extensive subcategories. The verb aspect was included in the definition given
to Knowledge, in that the student was expected to be able to recall or
recognise knowledge. This brought uni-dimensionality to the framework at the
cost of a Knowledge category that was dual in nature and thus different from
the other Taxonomic categories. In the 1990s, Anderson (a former student of
Bloom) eliminated this inconsistency in the revised Taxonomy by allowing
these two aspects, the noun and the verb, to form separate dimensions, the
noun providing the basis for the Knowledge dimension and the verb forming
the basis for the Cognitive Process dimension, as shown in Figure 5.2.

Figure 5.2: Bloom's Revised Taxonomy
Retrieved from: http://www.kurwongbss.qld.edu.au/thinking/Bloom/blooms.htm

In the revised Bloom's Taxonomy, the names of the six major
categories were changed from noun to verb forms. As the taxonomy
reflects different forms of thinking, and thinking is an active process,
verbs were used instead of nouns.


In addition, the subcategories of the six major categories were
replaced by verbs, and some subcategories were re-organised. The
knowledge category was renamed. Knowledge is an outcome or product
of thinking, not a form of thinking per se. Consequently, the word
knowledge was inappropriate to describe a category of thinking and
was replaced with the word remembering. Comprehension and
synthesis were retitled understanding and creating respectively, in order
to better reflect the nature of the thinking defined in each category.
Table 3 below provides a summary of the above.
Table 3: The Cognitive Process Dimension

| Categories & Cognitive Processes | Alternative Names | Definition |
| --- | --- | --- |
| Level 1 (C1): Remember | | Retrieve knowledge from long-term memory |
| Recognising | Identifying | Locating knowledge in long-term memory that is consistent with presented material |
| Recalling | Retrieving | Retrieving relevant knowledge from long-term memory |
| Level 2 (C2): Understand | | Construct meaning from instructional messages, including oral, written, and graphic communication |
| Interpreting | Clarifying, Paraphrasing, Representing, Translating | Changing from one form of representation to another |
| Exemplifying | Illustrating, Instantiating | Finding a specific example or illustration of a concept or principle |
| Classifying | Categorising, Subsuming | Determining that something belongs to a category |
| Summarising | Abstracting, Generalising | Abstracting a general theme or major point(s) |
| Inferring | Concluding, Extrapolating, Interpolating, Predicting | Drawing a logical conclusion from presented information |
| Comparing | Contrasting, Mapping, Matching | Detecting correspondences between two ideas, objects, and the like |
| Explaining | Constructing models | Constructing a cause-and-effect model of a system |
| Level 3 (C3): Apply | | Carry out or use a procedure in a given situation |
| Executing | Carrying out | Applying a procedure to a familiar task |
| Implementing | Using | Applying a procedure to an unfamiliar task |
| Level 4 (C4): Analyse | | Break material into its constituent parts and determine how the parts relate to one another and to an overall structure or purpose |
| Differentiating | Discriminating, Distinguishing, Focusing, Selecting | Distinguishing relevant from irrelevant parts or important from unimportant parts of presented material |
| Organising | Finding coherence, Integrating, Outlining, Parsing, Structuring | Determining how elements fit or function within a structure |
| Attributing | Deconstructing | Determining a point of view, bias, values, or intent underlying presented material |
| Level 5 (C5): Evaluate | | Make judgments based on criteria and standards |
| Checking | Coordinating, Detecting, Monitoring, Testing | Detecting inconsistencies or fallacies within a process or product; determining whether a process or product has internal consistency; detecting the effectiveness of a procedure as it is being implemented |
| Critiquing | Judging | Detecting inconsistencies between a product and external criteria; determining whether a product has external consistency; detecting the appropriateness of a procedure for a given problem |
| Level 6 (C6): Create | | Put elements together to form a coherent or functional whole; reorganise elements into a new pattern or structure |
| Generating | Hypothesising | Coming up with alternative hypotheses based on criteria |
| Planning | Designing | Devising a procedure for accomplishing some task |
| Producing | Constructing | Inventing a product |
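
When categorising test items, it can help to look up the level of the cognitive process an item targets. The lookup below is a minimal sketch using the standard process names of the revised taxonomy; the dictionary and function names are invented for this illustration.

```python
# Categories in order from Level 1 (C1) to Level 6 (C6), each with its
# cognitive processes as named in the revised taxonomy.
COGNITIVE_PROCESSES = {
    "Remember": ["recognising", "recalling"],
    "Understand": ["interpreting", "exemplifying", "classifying",
                   "summarising", "inferring", "comparing", "explaining"],
    "Apply": ["executing", "implementing"],
    "Analyse": ["differentiating", "organising", "attributing"],
    "Evaluate": ["checking", "critiquing"],
    "Create": ["generating", "planning", "producing"],
}


def level_of(process: str) -> tuple[int, str]:
    """Return (level number, category name) for a cognitive process."""
    wanted = process.strip().lower()
    for number, (category, processes) in enumerate(
            COGNITIVE_PROCESSES.items(), start=1):
        if wanted in processes:
            return number, category
    raise KeyError(f"unknown cognitive process: {process!r}")


# Example: an item asking pupils to predict what happens next in a story
# targets "inferring"; one asking them to judge a text targets "critiquing".
print(level_of("inferring"))
print(level_of("Critiquing"))
```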

The Knowledge Dimension

| Knowledge Category | Definition |
| --- | --- |
| Factual Knowledge | The basic elements students must know to be acquainted with a discipline or solve problems in it |
| Conceptual Knowledge | The interrelationships among the basic elements within a larger structure that enable them to function together |
| Procedural Knowledge | How to do something, methods of inquiry, and criteria for using skills, algorithms, techniques, and methods |
| Metacognitive Knowledge | Knowledge of cognition in general as well as awareness and knowledge of one's own cognition |



5.5.2 SOLO Taxonomy

The SOLO taxonomy, which stands for the Structure of the
Observed Learning Outcome, is likewise a systematic way of
describing how a learner's performance develops from simple to
complex levels in their learning. Biggs & Collis first introduced it in their
1982 study. There are five stages: Prestructural, Unistructural and
Multistructural, which are in a quantitative phase, and Relational and
Extended Abstract, which are in a qualitative phase.
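
Because the five stages are ordered, they can be represented as an ordered enumeration, which makes it easy to compare two responses or to check which phase a response falls in. This is only an illustrative sketch; the class and function names are assumptions, and the phase boundary follows the grouping given in the paragraph above.

```python
from enum import IntEnum


class Solo(IntEnum):
    """The five SOLO stages, ordered from least to most complex."""
    PRESTRUCTURAL = 1
    UNISTRUCTURAL = 2
    MULTISTRUCTURAL = 3
    RELATIONAL = 4
    EXTENDED_ABSTRACT = 5


def phase(level: Solo) -> str:
    """Quantitative up to Multistructural, qualitative from Relational on."""
    return "qualitative" if level >= Solo.RELATIONAL else "quantitative"


# Example: comparing a first draft with a revised essay (hypothetical ratings).
first_draft = Solo.MULTISTRUCTURAL
revised = Solo.RELATIONAL
print(revised > first_draft, phase(first_draft), phase(revised))
```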

Students find learning more complex as it advances. SOLO is
a means of classifying learning outcomes in terms of their complexity,
enabling teachers to assess students' work in terms of its quality, not of
how many bits of this and of that they got right. At first we pick up only
one or a few aspects of the task (unistructural), then several aspects
that are unrelated (multistructural), then we learn how to integrate
them into a whole (relational), and finally, we are able to generalise
that whole to as yet untaught applications (extended abstract). The
diagram below lists verbs typical of each level.


Figure 5.3: SOLO Taxonomy

The SOLO taxonomy maps the complexity of a student's work by linking
it to one of five phases: little or no understanding (Prestructural), through a
simple and then more developed grasp of the topic (Unistructural and
Multistructural), to the ability to link the ideas and elements of a task
together (Relational), and finally (Extended Abstract) to understand the topic
for themselves, possibly going beyond the initial scope of the task (Biggs &
Collis, 1982; Hattie & Brown, 2004). In their later research into multimodal
learning, Biggs & Collis noted that there was an increase in the structural
complexity of their (the students') responses (1991:64).

It may be useful to view the SOLO taxonomy as an integrated strategy,
to be used in lesson design, in task guidance, and in formative and summative
assessment (Smith & Colby, 2007; Black & Wiliam, 2009; Hattie, 2009;
Smith, 2011). The structure of the taxonomy encourages viewing learning as
an on-going process, moving from simple recall of facts towards a deeper
understanding; that is, learning is a series of interconnected webs that can
be built upon and extended. Nückles et al. (2009:261) elaborate:

Cognitive strategies such as organization and elaboration are at
the heart of meaningful learning because they enable the
learner to organize learning into a coherent structure and
integrate new information with existing knowledge, thereby
enabling deep understanding and long-term retention.

This would help to develop Smith's (2011:92) self-regulating, self-evaluating
learners who are well motivated by learning.

A range of SOLO-based techniques exists to assist teachers and
students. Use of constructive alignment (Biggs & Tang, 2009) encourages
teachers to be more explicit when creating learning objectives, focusing on
what the student should be able to do and at which level. This is essential
for a student to make progress and allows for the creation of rubrics, for use
in class (Black & Wiliam, 2009; Nückles et al., 2009; Huang, 2012), to make
the process explicit to the student. HOTS (Higher Order Thinking
Skills) maps (Hook & Mills, 2011) can be used in English to scaffold in-depth
discussion, encouraging students to:

Develop interpretations, use research and critical thinking
effectively to develop their own answers, and write essays that
engage with the critical conversation of the field (Linkon,
2005:247, cited in Allen, 2011).

It may also be helpful in providing a range of techniques for differentiated
learning (Anderson, 2007; Hook & Mills, 2012).

The SOLO taxonomy has a number of proponents. Hook & Mills
(2011:5) refer to it as a model of learning outcomes that helps schools
develop a common understanding. Moseley et al. (2005:306) advocate its
use as a framework for developing the quality of assessment, noting that it is
easily communicable to students. Hattie (2012:54), in his wide-ranging
investigation into effective teaching and visible learning, outlines three
levels of understanding: surface, deep and conceptual. He indicates that:

The most powerful model for understanding these three levels
and integrating them into learning intentions and success criteria
is the SOLO model.

However, the taxonomy is not without critics. Chick (1998:20) believes
that there is potential to misjudge the level of functioning, and Chan et al.
(2002:512) criticise its conceptual ambiguity, stating that the
categorisation is unstable. In these two studies, the SOLO taxonomy was
used primarily for assessing completed work, so use throughout the
teaching process may alleviate these issues.

An additional criticism, in particular when the taxonomy is compared
with that of Bloom (1956), concerns the SOLO taxonomy's structure. Biggs &
Collis (1991) refer to the structure as a hierarchy, as do Moseley et al.
(2005); naturally, there are concerns when complex processes, such as human
thought, are categorised in this manner. However, Campbell et al. (1992)
explained the structure of the SOLO taxonomy as consisting of a series of
cycles (especially between the Unistructural, Multistructural and Relational
levels), which would allow for a development of breadth of knowledge as
well as depth.

Nevertheless, the SOLO taxonomy can be used not only in designing the
curriculum in terms of the intended learning outcomes, but also in
assessment. It can be used effectively by students to deconstruct exam
questions in order to understand how marks are awarded, and as a vehicle for
self-assessment and peer-assessment.

5.6 Guidelines for constructing test items

Tests do not work without well-written test items. Test-takers appreciate
clearly written questions that do not attempt to trick or confuse them into
incorrect responses. The following presents the major characteristics of
well-written test items.

5.6.1 Aim of the test
Test item development is a critical step in building a test that properly
meets certain standards. A good test is only as good as the quality of the
test items. If the individual test items are not appropriate and do not perform
well, how can the test scores be meaningful? The topic to be evaluated
(construct) and where the evaluation is done (title/context) must be part of
the curriculum. If it is evaluated outside the curriculum, the curricular validity
of the item can be disputed. Therefore, test items must be developed to
precisely measure the objectives prescribed by the blueprint and meet
quality standards.

5.6.2 Range of the topics to be tested
A test must measure the test-takers' ability or proficiency in applying the
knowledge and principles of the topics that they have learnt. Ample
opportunity must be given to students to learn the topics that are to be
evaluated. This opportunity would include the availability of language
teachers, well-equipped facilities, and the expertise of the language teachers
in conducting the lessons and providing the test-takers or students with the
skills and knowledge that will be evaluated.

5.6.3 Range of skills to be tested
Test item writers should always attempt to write test items that
measure higher levels of cognitive processing. This is not an easy task. It
should be a goal of writers to ensure that their items have cognitive
characteristics exemplifying understanding, problem-solving, critical
thinking, analysis, synthesis, evaluation and interpreting rather than just
declarative knowledge. There are many theories that provide frameworks
on levels of thinking, and Bloom's taxonomy is often cited as a tool to use in
item writing. Always keep to writing important questions that represent, and
can predict, that a test-taker is proficient at high levels of cognitive
processing.



5.6.4 Test format
Test items should always follow a consistent design so that the
questioning process in itself does not add unnecessary difficulty to
answering questions. A logical and consistent stimulus format for
writing test items can help expedite the laborious process of writing test
items as well as supply a format for asking basic questions. A format that
provides an initial starting structure to use in writing questions can be
valuable for item writers. When these formats are used, test-takers can
quickly read and understand the questions, since the format is expected. For
example, to measure understanding of knowledge or facts, questions can
begin with the following:

What best defines ...?

What is not a characteristic of ...?

What is an example of ...?

5.6.5 Level of difficulty

A test has a planned number of questions at a level of difficulty and
discrimination that best determines mastery and non-mastery performance
states. Test-takers should clearly understand what is needed in education
and language assessment to prepare for the examination and how much
experience performing certain activities would help in preparation. This
should be the road map that helps item writers create test items and helps
test-takers understand what will be required of them to pass an examination.
In any test item construction, we must ensure that weak students can
answer easy items, students of intermediate language proficiency can answer
easy and moderate items, and students of high language proficiency can
answer easy, moderate and advanced items. A reliable and valid test
instrument should encompass all three levels of difficulty.
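
A quick check that a draft test covers all three difficulty levels can be sketched as follows. The item labels, the level names used as tags, and the function names are assumptions made for this illustration; how each item is assigned a level is left to the item writer.

```python
from collections import Counter

LEVELS = ("easy", "moderate", "advanced")


def difficulty_coverage(items: list[tuple[str, str]]) -> dict[str, int]:
    """Count items per difficulty level; items are (label, level) pairs."""
    counts = Counter(level for _, level in items)
    unknown = set(counts) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown difficulty levels: {sorted(unknown)}")
    return {level: counts.get(level, 0) for level in LEVELS}


def missing_levels(items: list[tuple[str, str]]) -> list[str]:
    """Levels that no item in the draft test represents."""
    coverage = difficulty_coverage(items)
    return [level for level in LEVELS if coverage[level] == 0]


# Hypothetical draft test with no advanced items yet.
draft = [("Q1", "easy"), ("Q2", "easy"), ("Q3", "moderate")]
print(difficulty_coverage(draft))
print(missing_levels(draft))
```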

5.6.6 International and Cultural Considerations (bias)

In standardised tests, when exams are distributed internationally, either
in a single language or translated into other languages, always refrain from
the use of slang, geographic references, historical references or dates
(holidays) that may not be understood by an international examinee. Tests
need to be adapted to other societies so that meaning is translated
correctly and no advantage is given to a particular group of test-takers.
Steps should be taken to avoid item content that may be biased with respect
to gender, race or other cultural groups.
What are the characteristics of a good test item?
Explain each characteristic of a test item in a graphic
organiser.
http://books.google.com.my/books/about/Constructing_Test_Items.html?id=Ia3SGDfbaV0C&redir_esc=y



6.0 Test format

What is the difference between test format and test type? Suppose, for
example, that you want to introduce a new kind of test, such as a reading
test, which is organised a little differently from the existing test items.
Would you call this a change of test format or of test type? Test format
refers to the layout of questions on a test. For example, the format of a
test could be two essay questions, 50 multiple-choice questions, etc. For the
sake of brevity, only the outlines of some large-scale standardised tests are
provided here.




UPSR
The Primary School Evaluation Test, also known as Ujian Penilaian
Sekolah Rendah in Malay (commonly abbreviated as UPSR), is a national
examination taken by all pupils in our country at the end of their sixth year in
primary school before they leave for secondary school. It is prepared and
examined by the Malaysian Examinations Syndicate. This test consists of two
papers, namely Paper 1 and Paper 2.
Paper 1 consists of multiple-choice questions answered on a standardised
optical answer sheet that uses optical mark recognition to detect answers.
Paper 2 comprises three sections, namely Sections A, B, and C.

TOEFL (Test of English as a Foreign Language)

The TOEFL test is administered in two ways: as an Internet-based test
(TOEFL iBT) and as a paper-based test (TOEFL PBT). Most of the 4,500+
test sites in the world use the TOEFL iBT. The TOEFL iBT test is given in
English and administered via the Internet. There are four sections (listening,
reading, speaking and writing), which take a total of about four and a half
hours to complete.

IELTS Test Format

IELTS is a test of all four language skills: Listening, Reading, Writing
and Speaking. Test-takers take the Listening, Reading and Writing tests all
on the same day, one after the other, with no breaks in between. Depending
on the examinee's test centre, the Speaking test may be on the same day
as the other three tests, or up to seven days before or after that. The total
test time is under three hours. The test format is illustrated below.





Figure 6: IELTS Test Format






















TOPIC 6
ASSESSING LANGUAGE SKILLS
CONTENT

6.0 SYNOPSIS
Topic 6 focuses on ways to assess language skills and language
content. It defines the types of test items used to assess language
skills and language content. It also provides teachers with
suggestions on ways a teacher can assess the listening, speaking,
reading and writing skills in a classroom. It also discusses concepts
of and differences between discrete point test, integrative test and
communicative test.

6.1 LEARNING OUTCOMES
At the end of Topic 6, teachers will be able to:
Identify and carry out the different types of assessment to assess
language skills and language content
Understand and differentiate between objective and subjective
testing
Understand and differentiate between discrete point test,
integrative test and communicative test in assessing language.

6.2 FRAMEWORK OF TOPICS


[Framework diagram: Assessing Language Skills and Language Content.
Language Skills: Objective and Subjective Testing; Listening; Speaking;
Reading; Writing. Language Content: Discrete Test; Integrative Test;
Communicative Test]


CONTENT

SESSION SIX (6 hours)

6.2.1 Types of test items to assess language skills
a. Listening
Basically, there are two kinds of listening tests: tests that test specific
aspects of listening, such as sound discrimination; and task-based tests, which
test skills in accomplishing different types of listening tasks considered
important for the students being tested. In addition, Brown (2010)
identified four types of listening performance from which assessment could
be considered.
i. Intensive: listening for perception of the components (phonemes, words,
intonation, discourse markers, etc.) of a larger stretch of language.
ii. Responsive: listening to a relatively short stretch of language (a
greeting, question, command, comprehension check, etc.) in order to
make an equally short response.
iii. Selective: processing stretches of discourse, such as short monologues
lasting several minutes, in order to scan for certain information. The
purpose of such performance is not necessarily to look for global or
general meaning but to be able to comprehend designated information in a
context of longer stretches of spoken language (such as classroom
directions from a teacher, TV or radio news items, or stories).
Assessment tasks in selective listening could ask students, for
example, to listen for names, numbers, grammatical category,
directions (in a map exercise), or certain facts and events.
iv. Extensive: listening to develop a top-down, global
understanding of spoken language. Extensive performance
ranges from listening to lengthy lectures to listening to a
conversation and deriving a comprehensive message or
purpose. Listening for the gist or the main idea and
making inferences are all part of extensive listening.



b. Speaking


In the assessment of oral production, both discrete feature objective
tests and integrative task-based tests are used. The first type
assesses such skills as pronunciation, knowledge of what language is
appropriate in different situations, and the language required for
doing different things like describing, giving directions, giving
instructions, etc. The second type involves finding out if pupils can
perform different tasks using spoken language that is appropriate for
the purpose and the context. Task-based activities involve describing
scenes shown in a picture, participating in a discussion about a given
topic, narrating a story, etc. As with the listening performance
assessment tasks, Brown (2010) cited categories for oral assessment,
five in this case.

1. Imitative. At one end of a continuum of types of speaking
performance is the ability to imitate a word or phrase or
possibly a sentence. Although this is a purely phonetic level of
oral production, a number of prosodic (intonation, rhythm, etc.),
lexical, and grammatical properties of language may be
included in the performance criteria. We are interested only in
what is traditionally labelled pronunciation; no inferences are
made about the test-taker's ability to understand or convey
meaning or to participate in an interactive conversation. The
only role of listening here is in the short-term storage of a
prompt, just long enough to allow the speaker to retain the
short stretch of language that must be imitated.
2. Intensive. The production of short stretches of oral language
designed to demonstrate competence in a narrow band of
grammatical, phrasal, lexical, or phonological relationships.
Examples of intensive assessment tasks include directed
response tasks (requests for specific production of speech),
reading aloud, sentence and dialogue completion, limited
picture-cued tasks including simple sentences, and translation
up to the simple sentence level.


3. Responsive. Responsive assessment tasks include interaction and
test comprehension, but at the somewhat limited level of very
short conversations, standard greetings and small talk, simple
requests and comments, etc. The stimulus is almost always a
spoken prompt (to preserve authenticity) with one or two
follow-up questions or retorts:

A. Liza : Excuse me, do you have the time?
Don : Yeah. Six-fifteen.

B. Jo : What is the most urgent social problem today?
Sue : I would say bullying.

C. Lan : Hey, Shan, how's it going?
Shan: Not bad, and yourself?
Lan : I'm good.
Shan: Cool. Okay, gotta go.

4. Interactive. The difference between responsive and interactive
speaking is in the length and complexity of the interaction, which
sometimes includes multiple exchanges and/or multiple
participants. Interaction can be broken down into two types : (a)
transactional language, which has the purpose of exchanging
specific information, and (b) interpersonal exchanges, which
have the purpose of maintaining social relationships. (In the
three dialogues cited above, A and B are transactional, and C is
interpersonal).
5. Extensive (monologue). Extensive oral production tasks include
speeches, oral presentations, and storytelling, during which the
opportunity for oral interaction from listeners is either highly
limited (perhaps to nonverbal responses) or ruled out altogether.
Language style is more deliberative (planning is involved) and
formal for extensive tasks. It can also include informal monologue
such as casually delivered speech (e.g., recalling a vacation in
the mountains, conveying recipes, recounting the plot of a novel
or movie).

c. Reading
Cohen (1994) discussed various types of reading and the kinds of
meaning that are assessed. He describes skimming and scanning as two
different types of reading. In both, a respondent is given a lengthy
passage and is required either to inspect it rapidly (skim) or to
read it to locate specific information (scan) within a short period
of time. He also discusses receptive or intensive reading, which
refers to a form of reading aimed at discovering exactly what the
author seeks to convey (p. 218). This is the most common form of
reading, especially in test or assessment conditions. Another type of
reading is responsive reading, where respondents are expected to
respond to some point in a reading text through writing or by
answering questions.

A reading text can also convey various kinds of meaning, and reading
involves the interpretation or comprehension of these meanings.
First, grammatical meaning is meaning that is expressed through
linguistic structures such as complex and simple sentences, and
comprehension depends on the correct interpretation of those
structures. A second meaning is informational meaning, which refers
largely to the concepts or messages contained in the text.
Respondents may be required to comprehend merely the information or
content of the passage, and this may be assessed through various
means such as summary and précis writing. Compared to grammatical or
syntactic meaning, informational meaning requires a more general
understanding of a text rather than close attention to the linguistic
structure of sentences. A third meaning contained in many texts is
discourse meaning. This refers to the perception of rhetorical
functions conveyed by the text. One typical function is discourse
marking, which adds cohesiveness to a text. Discourse markers such as
unless, however, thus and therefore are crucial to the correct
interpretation of a text, and students may be assessed on their
ability to understand the discoursal meaning that these words bring
to the passage. Finally, a fourth meaning which may also be an object
of assessment in a reading test is the meaning conveyed by the
writer's tone. The writer's tone, whether it is cynical, sarcastic,
sad and so on, is important in reading comprehension but may be quite
difficult to identify, especially for less proficient learners.
Nevertheless, there can be many situations where the reader is
completely wrong in comprehending a text simply because he has failed
to perceive the correct tone of the author.
d. Writing
Brown (2004) identifies three different genres of writing, namely
academic writing, job-related writing and personal writing, each of
which can be expanded to include many different examples. Fiction,
for example, may be considered as personal writing according to
Brown's taxonomy. Brown (2010) identified four categories of
written performance that capture the range of written production
and can be used to assess writing skill.

1. Imitative. To produce written language, the learner must attain
the skills in the fundamental, basic tasks of writing letters,
words, punctuation, and brief sentences. This category includes
the ability to spell correctly and to perceive phoneme-grapheme
correspondences in the English spelling system. At this stage
the learners are trying to master the mechanics of writing. Form
is the primary focus while context and meaning are of
secondary concern.
2. Intensive (controlled). Beyond the fundamentals of imitative
writing are skills in producing appropriate vocabulary within a
context, collocation and idioms, and correct grammatical
features up to the length of a sentence. Meaning and context
are important in determining correctness and appropriateness
but most assessment tasks are more concerned with a focus on
form and are rather strictly controlled by the test design.
3. Responsive. Assessment tasks require learners to perform at a
limited discourse level, connecting sentences into a paragraph
and creating a logically connected sequence of two or three
paragraphs. Tasks relate to pedagogical directives, lists of
criteria, outlines, and other guidelines. Genres of writing
include brief narratives and descriptions, short reports, lab
reports, summaries, brief responses to reading, and
interpretations of charts and graphs. Form-focused attention is
mostly at the discourse level, with a strong emphasis on context
and meaning.
4. Extensive. Extensive writing implies successful management of
all the processes and strategies of writing for all purposes, up to
the length of an essay, a term paper, a major research project
report, or even a thesis. Focus is on achieving a purpose,
organizing and developing ideas logically, using details to
support or illustrate ideas, demonstrating syntactic and lexical
variety, and in many cases, engaging in the process of multiple
drafts to achieve a final product. Focus on grammatical form is
limited to occasional editing and proofreading of a draft.

6.2.2 Objective and Subjective Tests
Tests have been categorised in many different ways. The most
familiar terms regarding tests are objective and subjective
tests. We normally associate objective tests with multiple choice
question type tests and subjective tests with essays. However, to
be more accurate, we will consider how the test is graded.
Objective tests are tests that are graded objectively, while
subjective tests are thought to involve subjectivity in grading.

There are many examples of each type of test. Objective type
tests include the multiple choice test, true false items and
matching items because each of these is graded objectively. In
these examples of objective tests, there is only one correct
response and the grader does not need to assess the response
subjectively.

Examples of the subjective test include essays and short answer
questions. However, some other common types of tests, such as
the dictation test, fill in the blank type tests, interviews and
role plays, can be considered both subjective and objective.
These tests fall on a continuum in which some tests are more
objective than others, so each lies closer to one end of the
continuum or the other.

Two other terms, select type tests and supply type tests are
related terms when we think of objective and subjective tests. In
most cases, objective tests are similar to select type tests where
students are expected to select or choose the answer from a list
of options. Just as a multiple choice question test is an objective
type test, it can also be considered a select type test. Similarly,
tests involving essay type questions are supply type as the
students are expected to supply the answer through their essay.
How then would you classify a fill in the blank type test? Definitely
for this type of test, the students need to supply the answer, but
what is supplied is merely a single word or a short phrase which
differs tremendously from an essay. It may therefore be helpful to
once again consider a continuum with supply type and select type
items at each end of the continuum respectively.

It is possible to now combine both continua as shown in Figure
6.1 with the two different test formats placed within the two
continua:


Figure 6.1: Continua for different types of test formats
It is not by accident that we find there are few, if any, test formats that
are either supply type and objective or select type and subjective.
Select type tests tend to be objective while supply type tests tend to be
subjective.



In addition to the above, Brown and Hudson (1998) have also suggested
three broad categories to differentiate tests according to how students
are expected to respond. These categories are the selected response
tests, the constructed response tests, and the personal response tests.
Examples of each of these types of tests are given in Table 6.1.

Table 6.1: Types of Tests According to Students' Expected Response

Selected response    Constructed response    Personal response
True false           Fill-in                 Conferences
Matching             Short answer            Portfolios
Multiple choice      Performance test        Self and peer assessments


Selected response assessments, according to Brown and Hudson
(1998), are assessment procedures in which students typically do not
create any language but rather select the answer from a given list (p.
658). Constructed response assessment procedures require students
to produce language by writing, speaking, or doing something else (p.
660). Personal response assessments, on the other hand, require
students to produce language but also allow each student's response
to be different from the others and allow students to communicate what
they want to communicate (p. 663). These three types of tests,
categorised according to how students respond, are useful when we
wish to determine what students need to do when they attempt to
answer test questions.

6.2.3 Types of test items to assess language content

a. Discrete Point Test and Integrative Test
Language tests may also be categorised as either discrete point or
integrative. Discrete point tests examine one element at a time.
Integrative tests, on the other hand, require the candidate to
combine many language elements in the completion of a task
(Hughes, 1989: 16). An integrative test is a simultaneous measure
of knowledge and ability across a variety of language features,
modes, or skills.

A multiple choice type test is usually cited as an example of a
discrete point test while essays are commonly regarded as the
epitome of integrative tests. However, both the discrete point test
and the integrative test are a matter of degree. A test may be
more discrete point than another and similarly a test may be more
integrative than another. Perhaps the more important aspect is to
be aware of the discrete point or integrative nature of a test as we
must be careful of what we believe the test measures.

This brings us to the question of how discrete point a multiple
choice question type item really is. While it is definitely more
discrete point than an essay, it may still require more than just one
skill or ability to complete. Let's say you are interested in testing
a student's knowledge of the relative pronoun and decide to do so
by using a multiple choice test item. If he fails to answer this test
item correctly, would you conclude that the student has problems
with the relative pronoun? The answer may not be as
straightforward as it seems. The test is presented in textual form
and therefore requires the student to read. As this example shows,
even a multiple choice test item involves some integration of
language skills: in addition to grammatical knowledge of relative
pronouns, the student must also be able to read and understand
the question.

Perhaps a clearer way of viewing the distinction between the
discrete point and the integrative test is to examine the
perspective each takes toward language. In the discrete point
test, language is seen to be made up of smaller units, and it may
be possible to test language by testing each unit one at a time.
Testing knowledge of the relative pronoun, for example, is
certainly assessing the students on a particular unit of language
and not on the language as a whole. In an integrative test, on the
other hand, the perspective of language is that of an integrated
whole which cannot be broken up into smaller units or elements.
Hence, the testing of language should maintain the integrity or
wholeness of the language.

b. Communicative Test
As language teaching has emphasised the importance of
communication through the communicative approach, it is not
surprising that communicative tests have also been given
prominence. A communicative emphasis in testing involves many
aspects, two of which revolve around communicative elements in
tests and meaningful content. Both these aspects are briefly
addressed in the following sub sections:
Integrating Communicative Elements into Examinations
Alderson and Banerjee (2002) report on various studies that seem to
point to the difficulty in achieving authenticity in tests. They cite
Spence-Brown (2001), who posits that the very act of assessment
changes the nature of a potentially authentic task and compromises
authenticity, and that authenticity must be related to the
implementation of an activity, not to its design (p. 99). In her study,
students were required to interview native speakers outside the
classroom and submit a tape-recording of the interview. While this
activity seems quite authentic, the students were observed to prepare
for the interview by rehearsing the interview, editing the results, and
engaging in spontaneous, but flawed discourse (Alderson &
Banerjee, 2002: 99), all of which are inauthentic when viewed in
terms of real life situations. Alderson himself argues that because
candidates in language tests are interested not in communicating but
in displaying their language abilities, the test situation is a
communicative event in itself and therefore cannot be used to
replicate any real world event (p. 98).

Chalhoub-Deville (2003) argues for tests that take context into
consideration. She believes that there should be a shift in focus of
our measurement from traditional examinations of the construct in
terms of response consistency, to investigations that systematically
explore inconsistent (which does not mean random) performances
across contexts (p. 378). In the future, besides context, tests will also
need to integrate elements of communication such as topic initiation,
topic maintenance, and topic change in order for the test to become
more authentic and realistic. Due to issues of practicality, involving
especially the amount of time and extent of organisation to allow for
such communicative elements to emerge, it will not be an easy task to
achieve.

The idea of bringing communicative elements into the language test is
not a new one. In his review of communicative tests, Fulcher (2000)
notes the descriptors of a communicative test as suggested by
several theorists. The three principles of communicative tests that he
highlights are that communicative tests:
involve performance;
are authentic; and
are scored on real-life outcomes.

In short, the kinds of tests that we should expect more of in the future
will be communicative tests in which candidates actually have to
produce the language in an interactive setting involving some degree
of unpredictability which is typical of any language interaction
situation. These tests would also take the communicative purpose of
the interaction into consideration and require the student to interact
with language that is actual and unsimplified for the learner. Fulcher
finally points out that in a communicative test, the only real criterion
of success is the behavioural outcome, or whether the learner was
able to achieve the intended communicative effect (p. 493). It is
obvious from this description that the communicative test may not be
so easily developed and implemented. Practical reasons may hinder
some of the demands listed. Nevertheless, a solution to this problem
has to be found in the near future in order to have valid language
tests that are purposeful and can stimulate positive washback in
teaching and learning.







Exercise 1

1. In your opinion and based on your teaching experience, how would
you conduct the testing of reading, writing and speaking skills of
your own students? What are the methods that you employ? Share
this with your classmates and exchange ideas.

2. Describe three different types of writing performance as suggested
by Brown (2004) and relate them to academic writing, job-related
writing and personal writing.






















TOPIC 7

SCORING, GRADING AND
ASSESSMENT CRITERIA


7.0 SYNOPSIS
Topic 7 focuses on scoring, grading and assessment criteria. It
provides teachers with brief descriptions of the different approaches to
scoring, namely objective, holistic and analytic.

7.1 LEARNING OUTCOMES
By the end of Topic 7, teachers will be able to:
Identify and differentiate the different approaches used in scoring
Use the different approaches used in scoring in assessing language


7.2 FRAMEWORK OF TOPICS

[Framework diagram: Approaches to scoring: Objective, Holistic, Analytic]

CONTENT

SESSION SEVEN (3 hours)

7.2.1 Objective approach

One scoring approach is the objective approach, which relies on
quantified methods of evaluating students' writing. Bailey (1998)
gives the following sample of how objective scoring is conducted:
1. Establish standardization by limiting the length of the
assessment: Count the first 250 words of the essay.
2. Identify the elements to be assessed: Go through the essay up to the
250th word underlining every mistake from spelling and mechanics
through verb tenses, morphology, vocabulary, etc. Include every error
that a literate reader might note.
3. Operationalise the assessment: Assign a weight score to each error,
from 3 to 1. A score of 3 is a severe distortion of readability or flow of
ideas; 2 is a moderate distortion; and 1 is a minor error that does not
affect readability in any significant way.
4. Quantify the assessment: Calculate the essay Correctness Score by
using 250 words as the numerator of a fraction and the sum of all the
error scores as the denominator:

Correctness Score = 250 / (sum of error scores)
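
The arithmetic of the objective approach can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
function name and the list of error weights are hypothetical and are
not taken from Bailey; only the 250-word limit, the 1 to 3 weights and
the fraction itself come from the description above.

```python
# Sketch of the objective scoring arithmetic: 250 divided by the sum of
# the weighted errors. The error weights used below are hypothetical.

WORD_LIMIT = 250  # only the first 250 words of the essay are scored


def correctness_score(error_weights, word_limit=WORD_LIMIT):
    """Return word_limit divided by the sum of the error weights."""
    if any(w not in (1, 2, 3) for w in error_weights):
        raise ValueError("each error weight must be 1, 2 or 3")
    total = sum(error_weights)
    if total == 0:
        # An essay with no marked errors has no defined fraction.
        raise ValueError("no errors marked; the fraction is undefined")
    return word_limit / total


# Hypothetical essay: two severe, three moderate and four minor errors.
weights = [3, 3, 2, 2, 2, 1, 1, 1, 1]
score = correctness_score(weights)
print(score)  # 250 / 16 = 15.625
```

Under this formula a larger score means fewer or lighter errors, so two
essays can be compared only if both were cut at the same word limit.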

7.2.2 Holistic approach
In holistic scoring, the reader reacts to the student's composition as a
whole and a single score is awarded to the writing. Normally this score is
on a scale of 1 to 4, or 1 to 6, or even 1 to 10 (Bailey, 1998: 187). Each
score on the scale is accompanied by general descriptors of
ability. The following is an example of a holistic scoring scheme based on
a 6 point scale.












Table 7.1: Holistic Scoring Scheme

Rating  Criteria

5-6     - Vocabulary is precise, varied, and vivid.
        - Organization is appropriate to writing assignment and
          contains clear introduction, development of ideas, and
          conclusion.
        - Transition from one idea to another is smooth and provides
          reader with clear understanding that topic is changing.
        - Meaning is conveyed effectively.
        - A few mechanical errors may be present but do not disrupt
          communication.
        - Shows a clear understanding of writing and topic
          development.

4       - Vocabulary is adequate for grade level.
        - Events are organized logically, but some part of the sample
          may not be fully developed.
        - Some transition of ideas is evident.
        - Meaning is conveyed but breaks down at times.
        - Mechanical errors are present but do not disrupt
          communication.
        - Shows a good understanding of writing and topic
          development.

3       - Vocabulary is simple.
        - Organization may be extremely simple or there may be
          evidence of disorganization.
        - There are a few transitional markers or repetitive
          transitional markers.
        - Meaning is frequently not clear.
        - Mechanical errors affect communication.
        - Shows some understanding of writing and topic development.

2       - Vocabulary is limited and repetitious.
        - Sample is comprised of only a few disjointed sentences.
        - No transitional markers.
        - Meaning is unclear.
        - Mechanical errors cause serious disruption in
          communication.
        - Shows little evidence of discourse understanding.

1       - Responds with a few isolated words.
        - No complete sentences are written.
        - No evidence of concepts of writing.

0       - No response.

Source: S.S. Moya, Evaluation Assistance Center (EAC)-East, Georgetown
University, Washington

The 6 point scale above includes broad descriptors of what a student's
essay reflects for each band. It is quite apparent that graders using
this scale are expected to pay attention to vocabulary, meaning,
organisation, topic development and communication. Mechanics such as
punctuation are secondary to communication.


Bailey also describes another type of scoring related to the holistic
approach, which she refers to as primary trait scoring. In primary trait
scoring, a particular functional focus is selected based on the purpose of
the writing, and grading is based on how well the student is able to
express that function. For example, if the function is to persuade, scoring
would be on how well the author has been able to persuade the grader rather
than how well organised the ideas were or how grammatical the structures in
the essay were. This approach to grading emphasises functional and
communicative ability rather than discrete linguistic ability and accuracy.

7.2.3 Analytic approach
Analytical scoring is a familiar approach to many teachers. In analytical
scoring, raters assess students' performance on a variety of categories
which are hypothesised to make up the skill of writing. Content, for
example, is often seen as an important aspect of writing, i.e. is there
substance to what is written? Is the essay meaningful? Similarly, we may
also want to consider the organisation of the essay. Does the writer begin
the essay with an appropriate topic sentence? Are there good transitions
between paragraphs? Other categories that we may also want to consider
include vocabulary, language use and mechanics. The following are some
possible components used in assessing writing ability using an analytical
scoring approach and the suggested weightage assigned to each:


Components Weight
Content 30 points
Organisation 20 points
Vocabulary 20 points
Language Use 25 points
Mechanics 5 points

The points assigned to each component reflect the importance of
each of the components.
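
As a rough illustration of how the weightage works in practice, the
sketch below adds up the points a rater awards for each component and
checks each against its maximum. The component maxima come from the
table above; the function name and the marks for the sample essay are
hypothetical.

```python
# Maximum points per component, as listed in the weightage table.
MAX_POINTS = {
    "Content": 30,
    "Organisation": 20,
    "Vocabulary": 20,
    "Language Use": 25,
    "Mechanics": 5,
}


def analytic_total(awarded):
    """Sum the awarded points after checking each against its maximum."""
    for component, points in awarded.items():
        if component not in MAX_POINTS:
            raise KeyError(f"unknown component: {component}")
        if not 0 <= points <= MAX_POINTS[component]:
            raise ValueError(
                f"{component} must be between 0 and {MAX_POINTS[component]}"
            )
    return sum(awarded.values())


# Hypothetical marks for one essay (not taken from the source).
essay = {
    "Content": 24,
    "Organisation": 15,
    "Vocabulary": 14,
    "Language Use": 18,
    "Mechanics": 4,
}
total = analytic_total(essay)
print(total, "out of", sum(MAX_POINTS.values()))
```

Because the component scores are kept separately before being added,
the same record can also be used to give the student diagnostic
feedback on each aspect of writing.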

Comparing the Three Approaches



Each of the three scoring approaches has its own advantages and
disadvantages. These are summarised in Table 7.2.


Table 7.2: Comparison of the Advantages and Disadvantages of the
Three Approaches to Scoring Essays

Holistic
  Advantages:
  - Quickly graded
  - Provides a public standard that is understood by the teachers and
    students alike
  - Relatively higher degree of rater reliability
  - Applicable to the assessment of many different topics
  - Emphasises the students' strengths rather than their weaknesses
  Disadvantages:
  - The single score may actually mask differences across individual
    compositions
  - Does not provide a lot of diagnostic feedback

Analytical
  Advantages:
  - Provides clear guidelines in grading in the form of the various
    components
  - Allows the graders to consciously address important aspects of
    writing
  Disadvantages:
  - Writing ability is unnaturally split up into components

Objective
  Advantages:
  - Emphasises the students' strengths rather than their weaknesses
  Disadvantages:
  - Still some degree of subjectivity involved
  - Accentuates negative aspects of the learner's writing without
    giving credit for what they can do well




EXERCISE

1. Based on your understanding, draw a mind map to indicate the
advantages and disadvantages of the three approaches to
scoring essays.




















TOPIC 8

ITEM ANALYSIS AND INTERPRETATION

8.0 SYNOPSIS
Topic 8 focuses on item analysis and interpretation. It provides teachers
with brief descriptions of basic statistical terms such as mode, median,
mean, standard deviation and standard score, and of the interpretation of
data. It also looks at item analysis in terms of item difficulty and item
discrimination. Teachers will also be introduced to distractor analysis in
language assessment.

8.1 LEARNING OUTCOMES
By the end of Topic 8, teachers will be able to:
Identify and differentiate some basic statistics terminologies used
Determine how well items discriminate using item discrimination
Analyse how well a distractor in a test item performs

8.2 FRAMEWORK OF TOPICS

[Framework diagram: Item Analysis and Interpretation
 Basic statistics: mode, median, mean, standard deviation, standard
 score, interpretation of data
 Item analysis: item difficulty, item discrimination, distractor
 analysis]

CONTENT


SESSION EIGHT (6 hours)


8.2.1 Basic Statistics
Let us assume that you have just graded the test papers for your class. You
now have a set of scores. If a person were to ask you about the
performance of the students in your class, it would be very difficult to give
all the scores in the class. Instead, you may prefer to cite only one score.
Or perhaps you would like to report on the performance by giving some
values that would help provide a good indication of how the students in your
class performed. What values would you give? In this section, we will look
at two kinds of measures, namely measures of central tendency and
measures of dispersion. Both these types of measures are useful in score
reporting.

Measures of central tendency indicate the value around which a set of
scores gathers. There are three major measures of central tendency. They
are the mode, median and mean.
MODE The mode is the most frequently occurring raw score in a set of
scores.
The following is a set of scores:
15, 13, 12, 12, 13, 16, 13, 17, 14, 18
What is the mode for this set of scores? If you said 13, then
you are correct, as it occurs more often than the others. It is
possible to have more than one mode in a set of scores. If
there are two modes, then the set of scores is referred to as
being bimodal.
MEDIAN The median refers to the score that is in the middle of the
set of scores when the scores are arranged in ascending or
descending order. Consider the following set of seven scores:
47, 65, 45, 54, 50, 52, 51
If we arrange it in order based on value, it would be
45, 47, 50, 51, 52, 54, 65. In this set of scores, the
median will be 51 as it is the middle score. There are three
scores lower than it and an equal number of scores higher
than it.
What happens when there is an even number of scores?
Let's take the following set of scores as an example:
45, 47, 50, 51, 52, 53, 54, 65
As there is no one score that is in the middle, we need to
take the two in the middle, add them up and divide by two.
As such, the median is 51.5, as (51 + 52)/2 = 103/2 = 51.5.
Always remember, however, that when we wish to find the
median, we have to first arrange the scores in either
ascending or descending order of value.



MEAN
The mean of a set of test scores is the arithmetic mean or
average and is calculated as ΣX/N, where Σ (sigma) refers
to the sum of, X refers to the raw or observed scores, and
N is the number of observed scores. Look at the following
set of scores:
47, 65, 45, 54, 50, 52, 51
The mean for this set of scores is 364/7 = 52
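
The three measures can be checked quickly with Python's standard
statistics module. This is just a convenience sketch using the score
sets quoted in this section; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Score sets quoted in this section.
mode_scores = [15, 13, 12, 12, 13, 16, 13, 17, 14, 18]
odd_scores = [47, 65, 45, 54, 50, 52, 51]
even_scores = [45, 47, 50, 51, 52, 53, 54, 65]

# multimode returns every most frequent value, so a bimodal set
# would give a list of two values.
modes = multimode(mode_scores)

# median sorts the scores internally; with an even number of scores
# it averages the two middle values.
odd_median = median(odd_scores)
even_median = median(even_scores)

average = mean(odd_scores)  # sum of scores divided by their number

print(modes, odd_median, even_median, average)
```

Using multimode rather than mode makes it easy to spot a bimodal
distribution, since both modes are returned instead of only the first.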


8.2.2 Standard deviation
Standard deviation refers to how much the scores deviate from the mean.
There are two methods of calculating standard deviation which are the
deviation method and raw score method which are illustrated by the
following formulae.


To illustrate this, we will use the scores 20, 25 and 30. Using the
deviation method, we come up with the following table:

Table 8.1:Calculating the Standard Deviation Using the Deviation Method




Using the raw score method, we can come up with the following:





Table 8.2 : Calculating the Standard Deviation Using the Raw Score
Method




Both methods result in the same final value of 5. If you are
calculating standard deviation with a calculator, it is suggested that
the deviation method be used when there are only a few scores and
the raw score method be used when there are many scores. This is
because when there are many scores, it will be tedious to calculate
the square of the deviations and their sum.
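
The two methods can be compared with a short Python sketch. The
formulae and tables are not reproduced in this copy, so the sketch
assumes the usual textbook forms with N - 1 in the denominator; that
assumption is made because it is the version that yields the value of
5 for the scores 20, 25 and 30. The function names are illustrative.

```python
from math import sqrt


def sd_deviation_method(scores):
    """Square each deviation from the mean, sum, divide by N - 1, take the root."""
    n = len(scores)
    mean = sum(scores) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in scores)
    return sqrt(sum_sq_dev / (n - 1))


def sd_raw_score_method(scores):
    """Use only the sum of the scores and the sum of the squared scores."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_sq = sum(x ** 2 for x in scores)
    return sqrt((sum_x_sq - sum_x ** 2 / n) / (n - 1))


scores = [20, 25, 30]
by_deviation = sd_deviation_method(scores)
by_raw_score = sd_raw_score_method(scores)
print(by_deviation, by_raw_score)
```

The raw score method never needs the individual deviations, which is
why it is less tedious by hand when there are many scores.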


8.2.3 Standard score

Standardised scores are necessary when we want to make
comparisons across tests and measurements. Z scores and T scores
are the more common forms of standardised scores although you
may come up with your own standardised score. A standardised
score can be computed for every raw score in a set of scores for a
test.




i. The Z score
The Z score is the basic standardised score. It is referred to as the
basic form as other computations of standardised scores must first
calculate the Z score. The formula used to calculate the Z score is
as follows:


Table 8.3: Calculating the Z Score for a Set of Scores




Z score values are very small and usually range only from -2 to 2.
Such small values make them inappropriate for score reporting,
especially for those unaccustomed to the concept. Imagine what a
parent may say if his child comes home with a report card with a Z
score of 0.47 in English Language! Fortunately, there is another
form of standardised score - the T score with values that are more
palatable to the relevant parties.

ii. The T score
The T score is a standardised score which can be computed using the
formula 10(Z) + 50. As such, the T scores for students A, B, C, and D in
Table 8.3 are 10(-1.28) + 50; 10(-0.23) + 50; 10(0.47) + 50; and
10(1.04) + 50, or 37.2, 47.7, 54.7, and 60.4 respectively. These values
seem far more appropriate for reporting than the Z scores. The T score
average or mean is always 50 (i.e. a Z score of 0), which
connotes an average ability and the mid point of a 100 point scale.
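
The two conversions can be sketched as follows. The Z score formula is not reproduced in this copy, so the standard definition, Z = (raw score - mean) / standard deviation, is assumed here; the T score formula is the one given above. The Z values for students A to D are taken from the text and the function names are illustrative.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

def t_score(z):
    """T score as defined above: 10(Z) + 50."""
    return 10 * z + 50

# Z scores quoted in the text for students A, B, C and D.
z_values = {"A": -1.28, "B": -0.23, "C": 0.47, "D": 1.04}
t_values = {student: round(t_score(z), 1) for student, z in z_values.items()}

print(t_values)  # {'A': 37.2, 'B': 47.7, 'C': 54.7, 'D': 60.4}
print(t_score(z_score(25, 25, 5)))  # a score equal to the mean gives T = 50.0
```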

8.2.4 Interpretation of data

The standardised score is actually a very important score if we want to
compare performance across tests and between students. Let us take the
following scenario as an example:

How can En. Abu solve this problem? He would have to
have standardised scores in order to decide. This would
require the following information:



Test 1: mean (X̄) = 42, standard deviation = 7
Test 2: mean (X̄) = 47, standard deviation = 8

Using the information above, En. Abu can find the Z score for each
raw score reported as follows:

Table 8.4: Z Score for Form 2A

Based on Table 8.4, both Ali and Chong have a negative total Z score
across the two tests. However, Chong has a higher Z score
total (i.e. -1.07 compared to -1.34) and therefore performed better
when we take the performance of all the other students into
consideration.
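
The comparison can be sketched in code. The means and standard deviations below are the ones given for Test 1 and Test 2; the raw scores in Table 8.4 are not reproduced in this copy, so the two students and their scores here are hypothetical and serve only to show how a Z score total can rank students differently from a raw score total.

```python
# (mean, standard deviation) for each test, as given in the text.
tests = {"Test 1": (42, 7), "Test 2": (47, 8)}

# Hypothetical raw scores, for illustration only (not the Table 8.4 data).
raw_scores = {
    "Student P": {"Test 1": 28, "Test 2": 47},
    "Student Q": {"Test 1": 42, "Test 2": 32},
}

def z_total(student_scores):
    """Sum of the student's Z scores across all tests."""
    return sum(
        (student_scores[name] - mean) / sd
        for name, (mean, sd) in tests.items()
    )

for student, marks in raw_scores.items():
    print(student, sum(marks.values()), z_total(marks))
# Student P has the higher raw total (75 against 74),
# but Student Q has the higher Z score total (-1.875 against -2.0).
```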





THE NORMAL CURVE

The normal curve is a hypothetical curve that is supposed to represent all
naturally occurring phenomena. It is assumed that if we were to sample a
particular characteristic such as the height of Malaysian men, then we will
find that while most will have an average height of perhaps 5 feet 4 inches,
there will be a few who will be relatively shorter and an equal number who
are relatively taller. By plotting the heights of all Malaysian men according
to frequency of occurrence, it is expected that we would obtain something
similar to a normal distribution curve. Similarly, test scores that measure
any characteristic such as intelligence, language proficiency or writing
ability of a specific population are also expected to provide us with a
normal curve. The following diagram illustrates what the normal curve
looks like.




Figure 8.1: The normal distribution or Bell curve


The normal curve in Figure 8.1 is partitioned according to standard
deviations (i.e. -4s, -3s, +3s, +4s) which are indicated on the
horizontal axis. The area of the curve between standard deviations is
indicated in percentage on the diagram. For example, the area between
the mean (0 standard deviation) and +1 standard deviation is 34.13%.
Similarly, the area between the mean and -1 standard deviation is also
34.13%. As such, the area between -1 and +1 standard deviations is
68.26%.
In using the normal curve, it is important to make a distinction between
standard deviation values and standard deviation scores. A standard
deviation value is a constant and is shown on the horizontal axis of the
diagram above. The standard deviation score, on the other hand, is the
obtained score when we use the standard deviation formula provided
earlier. So, if we find the score to be 5 as in the earlier example, then
the score for the standard deviation value of 1 is 5, for the value of
2 is 5 x 2 = 10, for the value of 3 is 15, and so on. Standard
deviation values of -1, -2, and -3 will have corresponding negative
scores of -5, -10, and -15.
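
The percentages on the normal curve and the conversion from standard deviation values to scores can be sketched as follows. This is a minimal illustration that uses the error function from Python's math module to obtain areas under the standard normal curve; the function names are illustrative. Note that the exact area between -1 and +1 rounds to 68.27%, while doubling the rounded figure of 34.13% gives the 68.26% quoted above.

```python
import math

def normal_cdf(z):
    """Proportion of a standard normal distribution lying below z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def area_between(z_low, z_high):
    """Proportion of the curve between two standard deviation values."""
    return normal_cdf(z_high) - normal_cdf(z_low)

print(round(area_between(0, 1) * 100, 2))   # 34.13
print(round(area_between(-1, 1) * 100, 2))  # 68.27

# Standard deviation values (constants on the horizontal axis) converted
# to scores, using the standard deviation of 5 from the earlier example.
sd_score = 5
offsets = {value: value * sd_score for value in range(-3, 4)}
print(offsets)  # {-3: -15, -2: -10, -1: -5, 0: 0, 1: 5, 2: 10, 3: 15}
```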
8.2.5 Item analysis



a. Item difficulty
Item difficulty refers to how easy or difficult an item is. The
formula used to measure item difficulty is quite straightforward. It
involves finding out how many students answered an item correctly
and dividing it by the number of students who took the test. The
formula is therefore:

Item difficulty = number of students who answered the item correctly /
number of students who took the test

For example, if twenty students took a test and 15 of them correctly
answered item 1, then the item difficulty for item 1 is 15/20 or 0.75.
Item difficulty is always reported as a decimal and can range
from 0 to 1. An item difficulty of 0 refers to an extremely difficult item
with no students getting the item correct and an item difficulty of 1
refers to an easy item which all students answered correctly.
The appropriate difficulty level will depend on the purpose of the test.
According to Anastasi & Urbina (1997), if the test is to assess
mastery, then items with a difficulty level of 0.8 can be accepted.
However, they go on to describe that if the purpose of the test is for
selection, then we should utilise items whose difficulty values come
closest to the desired selection ratio; for example, if we want to
select 20%, then we should choose items with a difficulty index of
0.20.
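
The item difficulty calculation described above can be sketched in a few lines. The function name is illustrative.

```python
def item_difficulty(num_correct, num_test_takers):
    """Proportion of test takers who answered the item correctly (0 to 1)."""
    if num_test_takers <= 0:
        raise ValueError("At least one student must have taken the test.")
    return num_correct / num_test_takers

print(item_difficulty(15, 20))  # 0.75, as in the example for item 1
print(item_difficulty(0, 20))   # 0.0, nobody answered correctly
print(item_difficulty(20, 20))  # 1.0, everybody answered correctly
```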

b. Item discrimination

Item discrimination is used to determine how well an item is able to
discriminate between good and poor students. Item discrimination
values range from -1 to 1. A value of -1 means that the item
discriminates perfectly, but in the wrong direction. This value would tell
us that the weaker students performed better on an item than the better
students. This is hardly what we want from an item and if we obtain
such a value, it may indicate that there is something not quite right
with the item. It is strongly recommended that we examine the item to
see whether it is ambiguous or poorly written. A discrimination value of
1 shows positive discrimination with the better students performing
much better than the weaker ones, as is to be expected.

Let's use the following instance as an example. Suppose you have just
conducted a twenty item test and obtained the following results:








Table 8.5: Item Discrimination

As there are twelve students in the class, 33% of this total would be 4
students. Therefore, the upper group and lower group will each consist
of 4 students. Based on their total scores, the upper group would
consist of students L, A, E, and G while the lower group would consist
of students J, H, D and I.


We now need to look at the performance of these students for each
item in order to find the item discrimination index of each item. For
item 1, all four students in the upper group (L, A, E, and G) answered
correctly while only student H in the lower group answered correctly.
Using the formula described earlier, we can plug in the numbers as
follows:
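
The discrimination formula and the worked result are not reproduced in this copy. The sketch below therefore assumes a commonly used form of the index: the number of correct answers in the upper group minus the number in the lower group, divided by the number of students in each group. Under that assumption, item 1 has an index of 0.75. The function names are illustrative.

```python
def group_size(num_students, proportion=0.33):
    """Number of students in each of the upper and lower groups."""
    return round(num_students * proportion)

def discrimination_index(upper_correct, lower_correct, size):
    """(correct in upper group - correct in lower group) / group size."""
    return (upper_correct - lower_correct) / size

size = group_size(12)                       # twelve students in the class
item_1 = discrimination_index(4, 1, size)   # 4 correct in upper, 1 in lower
print(size, item_1)  # 4 0.75
```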


Two points should be noted. First, item discrimination is especially
important in norm-referenced testing and interpretation, as in such
instances there is a need to discriminate between good students who
do well in the measure and weaker students who perform poorly. In
criterion-referenced tests, item discrimination does not have as
important a role. Secondly, the use of 33.3% of the total number of
students who took the test in the formula is not inflexible, as it is
possible to use any percentage between 27.5% and 35% as the value.

c. Distractor analysis

Distractor analysis is an extension of item analysis, using techniques
that are similar to item difficulty and item discrimination. In distractor
analysis, however, we are no longer interested in how test takers
select the correct answer, but how the distractors were able to
function effectively by drawing the test takers away from the correct
answer. The number of times each distractor is selected is noted in
order to determine the effectiveness of the distractor. We would
expect that the distractor is selected by enough candidates for it to be
a viable distractor.



What exactly is an acceptable value? This depends to a large extent on
the difficulty of the item itself and what we consider to be an
acceptable item difficulty value for test items. If we are to assume
that 0.7 is an appropriate item difficulty value, then we should expect
that the remaining 0.3 be about evenly distributed among the distractors.



Let us take the following test item as an example:

In the story, he was unhappy because_____________________________

A. it rained all day

B. he was scolded

C. he hurt himself

D. the weather was hot

Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors? If all 30 selected D, then distractors B and C are useless in
their role as distractors. Similarly, if 15 students selected D and another
15 selected B, then C is not an effective distractor and should be
replaced.
Therefore, the ideal situation would be for each of the three distractors to
be selected by an equal share of the students who did not get the
answer correct, i.e. in this case 10 students each. The effectiveness
of each distractor can then be quantified as 10/100 or 0.1, where 10 is the
number of students who selected the distractor and 100 is the total number
of students who took the test. This technique is similar to a difficulty index
although the result does not indicate the difficulty of each item, but rather
the effectiveness of the distractor. In the first situation described above,
options A, B, C and D would have a difficulty index of 0.7, 0, 0,
and 0.3 respectively. If the distractors worked equally well, then the
indices would be 0.7, 0.1, 0.1, and 0.1. Unlike in determining the difficulty
of an item, the value of the difficulty index formula for the distractors must
be interpreted in relation to the indices for the other distractors.
From a different perspective, the item discrimination formula can also be
used in distractor analysis. The concept of upper groups and lower groups
would still remain, but the analysis and expectation would differ slightly
from the regular item discrimination that we have looked at earlier. Instead
of expecting a positive value, we should logically expect a negative value as
more students from the lower group should select distractors. Each
distractor can have its own item discrimination value in order to analyse
how the distractors work and ultimately refine the effectiveness of the test
item itself.

Table 8.6: Selection of Distractors

For Item 1, the discrimination index for each distractor can be calculated
using the discrimination index formula. From Table 8.5, we know that all the
students in the upper group answered this item correctly and only one
student from the lower group did so. If we assume that the three remaining
students from the lower group all selected distractor B, then the
discrimination index for item 1, distractor B will be:


This negative value indicates that more students from the lower group
selected the distractor compared to students from the upper group. This
result is to be expected of a distractor and a value of -1 to 0 is preferred.
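
Both views of distractor analysis can be sketched in code. The option counts are those of the 100-student example above. The discrimination formula is not reproduced in this copy, so the same assumed form as for item discrimination is used (upper group count minus lower group count, divided by the group size); under that assumption, distractor B of Item 1 has an index of -0.75. The function names are illustrative.

```python
def selection_proportions(option_counts):
    """Proportion of all test takers who selected each option."""
    total = sum(option_counts.values())
    return {option: count / total for option, count in option_counts.items()}

def distractor_discrimination(upper_selected, lower_selected, size):
    """(upper group selections - lower group selections) / group size."""
    return (upper_selected - lower_selected) / size

# A is the answer; B, C and D are the distractors.
first_case = selection_proportions({"A": 70, "B": 0, "C": 0, "D": 30})
ideal_case = selection_proportions({"A": 70, "B": 10, "C": 10, "D": 10})
print(first_case)  # {'A': 0.7, 'B': 0.0, 'C': 0.0, 'D': 0.3}
print(ideal_case)  # {'A': 0.7, 'B': 0.1, 'C': 0.1, 'D': 0.1}

# Item 1, distractor B: chosen by no one in the upper group and by
# three of the four students in the lower group.
print(distractor_discrimination(0, 3, 4))  # -0.75
```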

EXERCISE

1. Calculate the mean, mode, median and range of the following set of scores:
23, 24, 25, 23, 24, 23, 23, 26, 27, 22, 28.







2. What is a normal curve and what does it show? Does the final result
always show a normal curve and how does this relate to
standardised tests?



























TOPIC 9

REPORTING OF ASSESSMENT DATA

9.0 SYNOPSIS
Topic 9 focuses on reporting assessment data. It provides teachers with
brief descriptions on the purposes of reporting and the reporting methods.

9.1 LEARNING OUTCOMES
By the end of Topic 9, teachers will be able to:
1. Understand the purposes of reporting assessment data
2. Understand and use the different reporting methods in language
assessment


9.2 FRAMEWORK OF TOPICS




CONTENT

SESSION NINE (3 hours)








9.2.1 Purposes of reporting

We can say that the main purpose of tests is to obtain information
concerning a particular behaviour or characteristic. Based on
information obtained from tests, several different types of decisions can
be made. Kubiszyn & Borich (2000) mention eight different types of
decisions made on the basis of information obtained from tests. These
educational decisions are shown in Figure 9.1.
[Framework diagram: Reporting of Assessment Data (Purposes of Reporting;
Reporting Methods)]



Figure 9.1: Eight Types of Decisions Made


Instructional decisions are made based on test results when, for
example, teachers decide to change or maintain their instructional
approach. If a teacher finds out that most of his class have failed
his test, there are many possible reactions he can have. The
teacher could evaluate the effectiveness of his own teaching or
instructional approach and implement the necessary changes.
Tests yield scores, and teachers have to make grading decisions about
the kind of grades to give students. As grades are indicators of
student performance, teachers need to decide whether a student
deserves a high grade, perhaps an A, on the basis of some form of
assessment. Traditionally, and perhaps for a long time to come, this
assessment will be in the form of tests.
Sometimes, we give tests to find out the strengths and weaknesses of
our students; these are diagnostic decisions.
Decisions related to selection, placement, counselling and guidance,
programme or curriculum, and administrative policy are all made at
levels higher than the classroom. Administrators, educational agencies
and institutions may be involved in these decisions.
Selection and placement decisions are somewhat similar. However, a
selection decision relates to whether or not a student is selected for a
programme or for admission into an institution based on a test score.
Tests such as TOEFL and IELTS are often used by universities to decide
whether a candidate is suitable, and hence selected for admission.
A placement decision, however, deals with where a candidate
should be placed based on performance on the test. A clear
example is the language placement examination for newly admitted
students commonly administered by many local and foreign
universities. Based on their performance on such a test, students are
placed into different language classes that are arranged according to
proficiency levels.
Counselling and guidance decisions are also made by relevant parties
such as counsellors and administrators on the basis of exam results.
Counsellors often give advice in terms of appropriate vocations for
some of their students. This advice is likely to be made on the basis
of the students' own test scores. Programme or curriculum decisions
reflect the kinds of changes made to the educational programme or
curriculum based on examination results. Finally, there are also
administrative policy decisions that need to be made which are also
greatly influenced by test scores.





9.2.2 Reporting methods
Student achievement and progress can be reported using:

i. Norm-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in
comparison to other students.
ii. Criterion-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in
comparison to predetermined criteria.
iii. An Outcomes Approach
Acknowledges that students, regardless of their class or grade,
can be working towards syllabus outcomes anywhere along the
learning continuum. An outcomes approach to assessment will provide
information about student achievement to enable reporting against a
standards framework.



Principles of effective and informative assessment and reporting
Effective and informative assessment and reporting practice:
Has clear, direct links with outcomes
The assessment strategies employed by the teacher in the
classroom need to be directly linked to and reflect the syllabus
outcomes. Syllabus outcomes in stages will describe the
standard against which student achievement is assessed and
reported.

Is integral to teaching and learning
Effective and informative assessment practice involves selecting
strategies that are naturally derived from well structured teaching
and learning activities. These strategies should provide
information concerning student progress and achievement that
helps inform ongoing teaching and learning as well as the
diagnosis of areas of strength and need.


Is balanced, comprehensive and varied
Effective and informative assessment practice involves teachers
using a variety of assessment strategies that give students
multiple opportunities, in varying contexts, to demonstrate what
they know, understand and can do in relation to the syllabus
outcomes. Effective and informative reporting of student
achievement takes a number of forms including traditional
reporting, student profiles, Basic Skills Tests, parent and student
interviews, annotations on student work, comments in
workbooks, portfolios, certificates and awards.
Is valid
Assessment strategies should accurately and appropriately assess
clearly defined aspects of student achievement. If a strategy does
not accurately assess what it is designed to assess, then its use is
misleading.
Valid assessment strategies are those that reflect the actual
intention of teaching and learning activities, based on
syllabus outcomes.
Where values and attitudes are expressed in syllabus outcomes,
these too should be assessed as part of student learning.
Is fair
Effective and informative assessment strategies are designed to
ensure equal opportunity for success regardless of students' age,
gender, physical or other disability, culture, background
language, socio-economic status or geographic location.
Engages the learner
Effective and informative assessment practice is student centred.
Ideally there is a cooperative interaction between teacher and
students, and among the students themselves.
The syllabus outcomes and the assessment processes to be used
should be made explicit to students. Students should participate in
the negotiation of learning tasks and actively monitor and reflect
upon their achievements and progress.
Values teacher judgement
Good assessment practice involves teachers making
judgements, on the weight of assessment evidence, about
student progress towards the achievement of outcomes.
Teachers can be confident a student has achieved an outcome
when the student has successfully demonstrated that outcome a
number of times, and in varying contexts.
The reliability of teacher judgement is enhanced when teachers
cooperatively develop a shared understanding of what constitutes
achievement of an outcome. This is developed through
cooperative programming and discussing samples of student
work and achievements within and between schools. Teacher
judgement based on well defined standards is a valuable and rich
form of student assessment.
Is time efficient and manageable
Effective and informative assessment practice is time efficient and
supports teaching and learning by providing constructive feedback to
the teacher and student that will guide further learning.
Teachers need to plan carefully the timing, frequency and nature of
their assessment strategies. Good planning ensures that assessment
and reporting is manageable and maximises the usefulness of the
strategies selected (for example, by addressing several outcomes in
one assessment task).
Recognises individual achievement and progress
Effective and informative assessment practice acknowledges that students
are individuals
who develop differently. All students must be given appropriate opportunities
to demonstrate achievement. Effective and informative assessment and
reporting practice is sensitive to the self esteem and general well-being of
students, providing honest and constructive feedback.
Values and attitudes outcomes are an important part of learning that
should be assessed and reported. They are distinct from knowledge,
understanding and skill outcomes.
Involves a whole school approach
An effective and informative assessment and reporting policy is
developed through a planned and coordinated whole school
approach. Decisions about assessment and reporting cannot be
taken independently of issues relating to curriculum, class groupings,
timetabling, programming and resource allocation.
Actively involves parents
Schools and their communities are responsible for jointly developing
assessment and reporting practices and policies according to their
local needs and expectations.
Schools should ensure full and informed participation by parents in the
continuing development and review of the school policy on reporting
processes.
Conveys meaningful and useful information
Reporting of student achievement serves a number of purposes, for a
variety of audiences. Students, parents, teachers, other schools and
employers are potential audiences. Schools can use student
achievement information at a number of levels including individual,
class, grade or school. This information helps identify students for
targeted intervention and can inform school improvement programs.
The form of the report must clearly serve its intended purpose and
audience.
Effective and informative reporting acknowledges that students can be
demonstrating progress and achievement of syllabus outcomes across
stages, not just within stages.
Good reporting practice takes into account the expectations of the
school community and system requirements, particularly the need for
information about standards that will enable parents to know how their
children are progressing.
Student achievement and progress can be reported by comparing
students' work against a standards framework of syllabus outcomes,
comparing their prior and current learning achievements, or comparing
their achievements to those of other students. Reporting can involve a
combination of these methods. It is important for schools and parents
to explore which methods of reporting will provide the most meaningful
and useful information.
















TOPIC 10

ISSUES AND CONCERNS RELATED TO
ASSESSMENT IN MALAYSIAN PRIMARY
SCHOOLS

10.0 SYNOPSIS
Topic 10 focuses on the issues and concerns related to assessment in
Malaysian primary schools. It will look at how assessment is viewed and
used in Malaysia.

10.1 LEARNING OUTCOMES
By the end of Topic 10, teachers will be able to:
1. Understand some issues and concerns regarding assessment in
Malaysian primary schools
2. Understand Chapter 4 of the Malaysian Education Blueprint 2013-2025
3. Use the different types of assessment in assessing language in school
(cognitive-level, school-based and alternative assessment)

10.2 FRAMEWORK OF TOPICS




CONTENT
SESSION TEN (3 hours)


10.3 Exam-oriented System
The educational administration in Malaysia is highly centralised with four
hierarchical levels; that is, federal, state, district and the lowest level,
school. Major decision- and policy-making takes place at the federal level
represented by the Ministry of Education (MoE), which consists of the
Curriculum Development Centre, the school division, and the Malaysian
Examination Syndicate (MES).
The current education system in Malaysia is too examination-oriented and
over-emphasizes rote-learning with institutions of higher learning fast
becoming mere diploma mills. Like most Asian countries (e.g., Gang 1996;
Lim and Tan 1999; Choi 1999), Malaysia so far has focused on public
examination results as important determinants of students' progression to
higher levels of education or occupational opportunities (Chiam 1984).
The Malaysian education system requires all students to sit for public
examinations at the end of each level of schooling. There are four public
examinations from primary to postsecondary education. These are the
Primary School Achievement Test (UPSR) at the end of six years of
primary education, the Lower Secondary Examination (PMR) at the end of
another three years of schooling, the Malaysian Certificate of Education
(SPM) at the end of 11 years of schooling, and the Malaysian Higher School
Certificate Examination (STPM) or the Higher Malaysian Certificate for
Religious Education (STAM) at the end of 13 years of schooling (MoE 2004).

[Framework diagram: Issues and Concerns in Malaysian Schools (Exam-Oriented
System; Cognitive Levels of Assessment; School-based Assessment;
Alternative Assessment)]


Malaysia Education Blueprint 2013-2025

In October 2011, the Ministry of Education launched a comprehensive
review of the education system in Malaysia in order to develop a new
National Education Blueprint. This decision was made in the context of
rising international education standards, the Government's aspiration
of better preparing Malaysia's children for the needs of the 21st
century, and increased public and parental expectations of education
policy. Over the course of 11 months, the Ministry drew on many sources
of input, from education experts at UNESCO, World Bank, OECD, and six
local universities, to principals, teachers, parents, and students from
every state in Malaysia. The result is a preliminary Blueprint that
evaluates the performance of Malaysia's education system against
historical starting points and international benchmarks. The Blueprint
also offers a vision of the education system and students that Malaysia
both needs and deserves, and suggests 11 strategic and operational
shifts that would be required to achieve that vision. The Ministry hopes
that this effort will inform the national discussion on how to
fundamentally transform Malaysia's education system, and will seek
feedback from across the community on this preliminary effort before
finalising the Blueprint in December 2012.


The Examined Curriculum
In public debate, the issue of teaching to the test has often translated
into debates over whether the UPSR, PMR, and SPM examinations
should be abolished. Summative national examinations should not in
themselves have any negative impact on students. The challenge is
that these examinations do not currently test the full range of skills
that the education system aspires to produce. An external review by
Pearson Education Group of the English examination papers at
UPSR and SPM level noted that these assessments would benefit from the
inclusion of more questions testing higher-order thinking skills, such
as application, analysis, synthesis and evaluation. For example, their
analysis of the 2010 and 2011 English Language UPSR papers
showed that approximately 70% of the questions tested basic skills of
knowledge and comprehension.
The Examinations Syndicate (Lembaga Peperiksaan, LP) has started a series
of reforms to ensure that, as per policy, assessments are evaluating
students holistically. In 2011, in parallel with the KSSR (the Primary
School Standard Curriculum), the LP rolled out the new PBS (Pentaksiran
Berasaskan Sekolah, or School-Based Assessment) format that is intended
to be more holistic, robust, and aligned to the new standard-
referenced curriculum. There are four components to the new PBS:
School assessment refers to written tests that assess subject
learning. The test questions and marking schemes are developed,
administered, scored, and reported by school teachers based on
guidance from LP;
Central assessment refers to written tests, project work, or
oral tests (for languages) that assess subject learning. LP
develops the test questions and marking schemes. The tests
are, however, administered and marked by school teachers;
Psychometric assessment refers to aptitude tests and a
personality inventory to assess students' skills, interests, aptitude,
attitude and personality. Aptitude tests are used to assess students'
innate and acquired abilities, for example in thinking and problem
solving. The personality inventory is used to identify key traits and
characteristics that make up the students' personality. LP develops
these instruments and provides guidelines for use. Schools are,
however, not required to comply with these guidelines; and
Physical, sports, and co-curricular activities assessment refers
to assessments of student performance and participation in physical
and health education, sports, uniformed bodies, clubs, and other
non-school sponsored activities. Schools are given the flexibility to
determine how this component will be assessed.

The new format enables students to be assessed on a broader range of
output over a longer period of time. It also provides teachers with more
regular information to take the appropriate remedial actions for their
students. These changes are hoped to reduce the overall emphasis on
teaching to the test, so that teachers can focus more time on delivering
meaningful learning as stipulated in the curriculum.

In 2014, the PMR national examinations will be replaced with school and
centralised assessment. In 2016, a student's UPSR grade will no longer be
derived from a national examination alone, but from a combination of PBS
and the national examination. The format of the SPM remains the same,
with most subjects assessed through the national examination, and some
subjects through a combination of examinations and centralised
assessments.




10.4 Cognitive Levels of Assessment
Bloom's Taxonomy of Cognitive Levels
Knowledge
Comprehension
Application
Analysis
Synthesis
Evaluation
Knowledge
Recalling memorized information. May involve remembering a wide range
of material from specific facts to complete theories, but all that is required is
the bringing to mind of the appropriate information. Represents the lowest
level of learning outcomes in the cognitive domain.
Learning objectives at this level: know common terms, know specific facts,
know methods and procedures, know basic concepts, know principles.
Question verbs: Define, list, state, identify, label, name, who? when?
where? what?


Comprehension
The ability to grasp the meaning of material.
Translating material from one form to another (words to numbers),
interpreting material (explaining or summarizing), estimating future trends
(predicting consequences or effects). Goes one step beyond the simple
remembering of material, and represents the lowest level of understanding.
Learning objectives at this level: understand facts and principles, interpret
verbal material, interpret charts and graphs, translate verbal material to
mathematical formulae, estimate the future consequences implied in data,
justify methods and procedures.
Question verbs: Explain, predict, interpret, infer, summarize, convert,
translate, give example, account for, paraphrase x?
Application
The ability to use learned material in new and concrete situations. Applying
rules, methods, concepts, principles, laws, and theories. Learning
outcomes in this area require a higher level of understanding than those
under comprehension.
Learning objectives at this level: apply concepts and principles to new
situations, apply laws and theories to practical situations, solve
mathematical problems, construct graphs and charts, demonstrate the
correct usage of a method or procedure.
Question verbs: How could x be used to y? How would you show, make use
of, modify, demonstrate, solve, or apply x to conditions y?

Analysis
The ability to break down material into its component parts. Identifying
parts, analysis of relationships between parts, recognition of the
organizational principles involved. Learning outcomes here represent a
higher intellectual level than comprehension and application because they
require an understanding of both the content and the structural form of the
material.
Learning objectives at this level: recognize unstated assumptions,
recognize logical fallacies in reasoning, distinguish between facts and
inferences, evaluate the relevancy of data, analyze the organizational
structure of a work (art, music, writing).
Question verbs: Differentiate, compare / contrast, distinguish x from y, how
does x affect or relate to y? why? how? What piece of x is missing /
needed?



Synthesis
(By definition, synthesis cannot be assessed
with multiple-choice questions. It appears here to complete Bloom's
taxonomy.)
The ability to put parts together to form a new whole. This may involve the
production of a unique communication (theme or speech), a plan of
operations (research proposal), or a set of abstract relations (scheme for
classifying information). Learning outcomes in this area stress creative
behaviors, with major emphasis on the formulation of new patterns or
structure.
Learning objectives at this level: write a well organized paper, give a well
organized speech, write a creative short story (or poem or music), propose
a plan for an experiment, integrate learning from different areas into a plan
for solving a problem, formulate a new scheme for classifying objects (or
events, or ideas).
Question verbs: Design, construct, develop, formulate, imagine, create,
change, write a short story and label the following elements:

Evaluation
The ability to judge the value of material (statement, novel, poem, research
report) for a given purpose. The judgments are to be based on definite
criteria, which may be internal (organization) or external (relevance to the
purpose). The student may determine the criteria or be given them.
Learning outcomes in this area are highest in the cognitive hierarchy
because they contain elements of all the other categories, plus conscious
value judgments based on clearly defined criteria.
Learning objectives at this level: judge the logical consistency of written
material, judge the adequacy with which conclusions are supported by data,
judge the value of a work (art, music, writing) by the use of internal criteria,
judge the value of a work (art, music, writing) by use of external standards
of excellence.
Question verbs: Justify, appraise, evaluate, judge x according to given
criteria. Which option would be better/preferable to party y?


10.5 School-based Assessment
The traditional system of assessment no longer satisfies the educational
and social needs of the third millennium. In the past few decades, many
countries have made profound reforms in their assessment systems. Several
educational systems have in turn introduced school-based assessment
as part of, or instead of, external assessment in their certification.
Examination bodies acknowledge the immense potential of school-based
assessment in terms of validity and flexibility, yet at the same
time they have to guard against or deal with difficulties related to
reliability, quality control and quality assurance. In the debate on
school-based assessment, the question of why it should be used has been
widely written about, and there is general agreement on the principles of
validity of this form of assessment.
Izard (2001) as well as Raivoce and Pongi (2001) explain that school-based
assessment (SBA) is often perceived as the process put in place
to collect evidence of what students have achieved, especially in
important learning outcomes that do not easily lend themselves to
pen-and-paper tests. Daugherty (1994) clarifies that this type of
assessment has been recommended because of the gains in
validity which can be expected when students' performance on assessed
tasks can be judged in a greater range of contexts and more frequently
than is possible within the constraints of time-limited, written
examinations. However, as Raivoce and Pongi (2001) suggest, the validity
of SBA depends to a large extent on the various assessment tasks
students are required to perform.
Burton (1992) provides the following five rules of thumb that may be
applied in the planning stage of school-based assessment:
1. The assessment should be appropriate to what is being assessed.
2. The assessment should enable the learner to demonstrate positive
achievement and reflect the learner's strengths.
3. The criteria for successful performance should be clear to all
concerned.
4. The assessment should be appropriate to all persons being
assessed.
5. The style of assessment should blend with the learning
pattern so it contributes to it.

In the Malaysian SBA context, assessment has the following features:
Assessment for and of learning
Standard-referenced assessment
Holistic
Integrated
Balanced
Robust

Components of SBA / PBS
1. Academic:
School Assessment (using Performance Standards)
Centralised Assessment
2. Non-academic:
Physical Activities, Sports and Co-curricular Assessment (Pentaksiran
Aktiviti Jasmani, Sukan dan Kokurikulum - PAJSK)
Psychometric/Psychological Tests

Centralised Assessment
Conducted and administered by teachers in schools using instruments,
rubrics, guidelines, timelines and procedures prepared by LP
Monitoring and moderation conducted by the PBS Committee at School,
District and State Education Department level, and by LP

School Assessment
The emphasis is on collecting first-hand information about pupils' learning
based on curriculum standards
Teachers plan the assessment, prepare the instrument and administer
the assessment during the teaching and learning process
Teachers mark pupils' responses and report their progress continuously.

10.6 Alternative Assessment

Alternative assessments are assessment procedures that differ from
the traditional notions and practice of tests with respect to format,
performance, or implementation. It is likely that alternative
assessment found its roots in writing assessment because of the need
to provide continuous assessment rather than a single impromptu
evaluation (Alderson & Banerjee, 2001).

As the term indicates, alternative assessments are assessment
proposals that present alternatives to the more traditional
examination formats. They have become more popular of late
because of doubts raised regarding the ability of traditional
assessment to elicit a fair and accurate measure of a student's
performance. Alternative assessment brings with it a
complete set of perspectives that contrast with traditional tests
and assessments. Table 10.1 illustrates some of the major
differences between traditional and alternative assessments.


Table 10.1: Contrasting Traditional and Alternative Assessment
Source: Adapted from Bailey (1998: 207) and Puhl (1997: 5)




In discussing alternative assessments, Herman et al. (1992: 6) list
several of their common characteristics. They describe alternative
assessments as doing the following:
Ask the students to perform, create, produce, or do something.
Tap higher-level thinking and problem-solving skills.
Use tasks that represent meaningful instructional activities.
Invoke real-world applications.
Have people, not machines, do the scoring, using human judgment.
Require new instructional and assessment roles for teachers.
Alternative assessments are suggested largely due to a growing concern
that traditional assessments are not able to accurately measure the ability
we are interested in. They are also seen to be more student-centred as
they cater for different learning styles, cultural and educational
backgrounds as well as language proficiencies.
Tannenbaum (1996) comments that alternative assessments focus on
documenting individual strengths and development, which would assist
in the teaching and learning process.


Nevertheless, although alternative assessments are compatible with the
contemporary emphases on the process as well as product of learning
(Croker, 1999), several shortcomings of alternative assessments have been
noted.

Perhaps one of the major limitations of alternative assessments is that
accounts of their benefits tend to be descriptive
and persuasive, rather than research-based (Alderson & Banerjee, 2001:
229). Alternative assessments are also said to be limited to the classroom
and have not become part of mainstream assessment. Brown and Hudson,
in advocating alternative assessment, seem to have taken a safer approach
by suggesting the term "alternatives in assessment". They believe that
educators should be familiar with all possible formats of assessment and
decide on the format that best measures the ability or construct that they
are interested in. Hence, these alternatives would include all possible
assessment formats, both traditional and informal.

Despite these limitations, alternative assessments present a viable and
exciting option in eliciting and assessing the students' actual abilities.
There are a number of test formats that are considered alternative
assessment formats:
Physical demonstration
Pictorial products
Reading response logs
K-W-L (what I know/what I want to know/what I've learned) charts
Dialogue journals
Checklists
Teacher-pupil conferences
Interviews
Performance tasks
Portfolios
Self-assessment
Peer assessment

Portfolios
A well-known and commonly used alternative assessment is portfolio
assessment. The contents of the portfolio become evidence of abilities,
much as we would use a test to measure the abilities of our
students.
Bailey (1998: 218) describes a portfolio as containing four primary
elements.
First, it should have an introduction to the portfolio itself
which provides an overview of the content of the portfolio.
Bailey even suggests that this section include a reflective
essay by the student in order to help express the student's
thoughts and feelings about the portfolio, perhaps explaining
strengths and possible weaknesses as well as why
certain pieces are included in the portfolio.
Secondly, she argues that portfolios should have what
she refers to as an academic works section. This section is
meant to demonstrate the student's improvement or
achievement in the major skill areas (p. 218).
The third section is described as a personal section in
which students may wish to include their journals, score
reports of tests that they have sat for, as well as photographs
and other items that illustrate their experiences with, as well as
achievements in, the English language.
Finally, an assessment section may contain evaluations
made by peers and teachers as well as self-evaluations.
Table 10.2: Contents of a Portfolio
Source: Adapted from Bailey (1998: 218)

Introductory Section
Overview
Reflective essay

Academic Works Section
Samples of best work
Samples of work demonstrating development

Personal Section
Journals
Score reports
Photographs
Personal items

Assessment Section
Evaluation by peers
Self-evaluation

The portfolio can be said to be a student's personal documentation that
helps demonstrate his or her ability and successes in the language. It
may even require students to consciously select items that can
document their own progress as learners. The actual compilation of the
content of the portfolio is in itself a learning experience. Some suggest
that students should attach a short reflection on each piece or item
placed in the portfolio. Portfolio assessment, therefore, is both a
learning and an assessment experience. This dual function can be
considered one of the benefits of portfolio assessment.

Brown and Hudson (1998) summarise several other advantages of
using portfolios in assessment. They discuss these advantages
according to how the portfolio strengthens students' learning, enhances
the teacher's role and improves the testing process. With respect to
testing, portfolios used as an assessment instrument are said to
(pp. 664-665):
enhance student and teacher involvement in assessment;
provide opportunities for teachers to observe students using
meaningful language to accomplish various authentic tasks in a
variety of contexts and situations;
permit the assessment of the multiple dimensions of language learning;
provide opportunities for both students and teachers to work together
and reflect on what it means to assess students' language growth;
increase the variety of information collected on students; and
make teachers' ways of assessing student work more systematic.

Self-Assessment and Peer Assessment
Two other common forms of alternative assessment are the
self-assessment and peer-assessment procedures. Both these
forms of assessment are strongly advocated by Puhl (1997) as
she believes that they are essential to continuous assessment,
a cornerstone of alternative assessment. The benefits of self
and peer assessment are especially found in the formative stages
of assessment, in which the development of the students'
abilities is emphasised.

Self-appraisals are also thought to be quite accurate and are
said to increase student motivation. Puhl (1997) describes a
case study in which she believes self-assessment forced the
students to reread their essays and thereby make the necessary edits and
corrections before they handed them in.
Nevertheless, in order for self-assessment to be useful and not
a futile exercise, the learners need to be trained and initially
guided in performing their self-assessment. This training
involves providing students with the rationale for
self-assessment, how it is intended to work and how it is
capable of helping them.

In language teaching and learning, self-assessment is relevant
in assessing all the language skills. An example of the
self-assessment of the listening skill, specifically the
comprehension of questions asked, is suggested by Cohen
(1994) as follows:

Comprehension of questions asked:
5. I can always understand the questions with no difficulties and
without having to ask for repetition
4. I can usually understand questions, but I might occasionally ask for
repetition
3. I have difficulty with some questions, but I generally get the meaning
2. I have difficulty understanding most questions even after repetition
1. I don't understand questions well at all



These questions are useful in the formative stages of
assessment as they help students identify their own strengths
and weaknesses and respond accordingly. Through asking
these types of self-assessment questions, the students are
expected to become more sensitive to their own learning and
ultimately perform better in the final summative evaluation at
the end of the instructional programme.

Peer assessment differs from self-assessment in that it involves
the social and emotional dimensions to a much greater extent.
Peer assessment can be defined as a response in some form
to other learners' work (Puhl, 1997). It can be given by a
group or an individual and it can take any of a variety of
coding systems: the spoken word, the written word, checklists,
questionnaires, nonverbal symbols, numbers along a scale,
colours, etc. (p. 8). Peer assessment requires that a student
take up the role of a critical friend to another student in order
to support, challenge, and extend each other's learning
(Brooks, 2002: 73). The reported benefits of peer assessment
are that it can:

remind learners they are not working in isolation;
help create a community of learners;
improve the product (two heads are better than one);
improve the process; motivate, even inspire;
help learners be reflective; and
stimulate meta-cognition.

EXERCISE
In your opinion, what are the advantages of using portfolios as a form
of alternative assessment?











REFERENCES

Allen, I. J. (2011). Reprivileging reading: The negotiation of
uncertainty. Pedagogy: Critical Approaches to Teaching
Literature, Language, Composition, and Culture, 12 (1) pp. 97-120.
Available at:
http://pedagogy.dukejournals.org/cgi/doi/10.1215/153142001416540
(Retrieved September 26, 2013)

Alderson, J. C. (1986b). Innovations in language testing? In M.
Portal (Ed.), Innovations in language testing. pp. 93-105.
Windsor: NFER/Nelson.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test
construction and evaluation. Cambridge: Cambridge University
Press.

Anderson, L.W. (Ed.), Krathwohl, D.R. (Ed.), Airasian, P.W.,
Cruikshank, K.A., Mayer, R.E., Pintrich, P.R., Raths, J., &
Wittrock, M.C. (2001). A taxonomy for learning, teaching, and
assessing: A revision of Bloom's Taxonomy of Educational
Objectives (Complete edition). New York: Longman.

Anderson, K. M. (2007). Differentiating instruction to include all
students. Preventing School Failure, 51 (3) pp. 49-54.

Bachman, L. F. (2004). Statistical Analyses for Language
Assessment. pp. 22-23. Cambridge, UK: Cambridge University Press.

Biggs, J. B. & Collis, K. F. (1982). Evaluating the Quality of
Learning: The SOLO taxonomy. New York, NY: Academic Press.

Biggs, J. B., & Collis, K. F. (1991). Multimodal learning and the
quality of intelligent behaviour. In H. Rowe (Ed.), Intelligence:
Reconceptualization and measurement. Hillsdale, NJ: Lawrence
Erlbaum. pp. 57-75.

Biggs, J.B. & Tang, C. (2009). Applying constructive alignment to
outcomes-based teaching and learning. Training Material. Quality
Teaching for Learning in Higher Education Workshop for
Master Trainers. Ministry of Higher Education. Kuala Lumpur.

Black, P. & Wiliam, D. (2009). Developing the theory of formative
assessment. Educational Assessment, Evaluation and
Accountability, 21 (1), pp. 5-31.
Available at: http://eprints.ioe.ac.uk/1119/ (Retrieved 23 August
2013)

Bloom, B. S. (Ed.), Engelhart, M.D., Furst, E.J., Hill, W.H., &
Krathwohl, D.R. (1956). Taxonomy of educational objectives:
The classification of educational goals. Handbook 1: Cognitive
domain. New York: David McKay.

Bloom, B. S. (1956). Taxonomy of Educational Objectives,
Handbook I: The Cognitive Domain. New York: David McKay Co
Inc.

Brennan, R. L. (1996). Generalizability of performance
assessments. In G. W. Phillips (Ed.), Technical issues in large-scale
performance assessment (NCES 96-802) (pp. 19-58).
Washington, DC: National Center for Education Statistics.

Brown, H. D., & Abeywickrama, P. (2010). Language
Assessment: Principles and Classroom Practices. New York,
NY: Pearson Education.

Brown, G., & Yule, G. (1983). Teaching the spoken language.
Cambridge: Cambridge University Press.



Brown, H.D. (1994). Teaching by principles: An interactive
approach to language pedagogy. Englewood Cliffs, NJ: Prentice
Hall Regents.

Campbell, K. J., Watson, J. M., & Collis, K. F. (1992). Volume
measurement and intellectual development. Journal of Structural
Learning, 11, pp. 279-298.

Carroll, J. B., & Sapon, S. M. (1958). Modern Language Aptitude
Test. New York, NY: The Psychological Corporation.

Cheng, L., Watanabe, Y., & Curtis, A. (Eds.). (2004). Washback
in language testing: Research contexts and methods. Mahwah,
NJ: Lawrence Erlbaum Associates.

Chick, H. (1998). Cognition in the Formal Modes: Research
mathematics and the SOLO taxonomy. Mathematics
Education Research Journal, 10 (2) pp. 4-26.

Clark, J. (1979). Direct vs. semi-direct tests of speaking ability. In
E. Briere & F. Hinofotis (Eds.), Concepts in language testing:
Some recent studies (pp. 35-49). Washington, DC: TESOL.

Davidson, F., Hudson, T. & Lynch, B. (1985). Language testing:
Operationalization in classroom measurement and L2 research.
In M. Celce-Murcia (Ed.), Beyond basics: Issues and research
in TESOL. pp. 137-152. Rowley, MA: Newbury House.

Davidson, F., & Lynch, B. (2002). Testcraft: A teacher's guide to
writing and using language test specifications. New Haven, CT:
Yale University Press.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T.
and McNamara, T. (1999). Dictionary of language
testing. Cambridge: University of Cambridge Local
Examinations Syndicate and Cambridge University
Press.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn
(Ed.), Educational Measurement (3rd ed.). pp. 105-146. New
York, NY: Macmillan.

Gottlieb, M. (2006). Assessing English Language Learners:
Bridges from Language Proficiency to Academic Achievement.
USA: Corwin Press.



Grotjahn, R. (1986). Test validation and cognitive psychology: Some
methodological considerations. Language Testing, 3, pp. 158-85.

Hattie, J. (2009). Visible Learning. New York: Routledge.

Hattie, J. (2012). Visible Learning for Teachers: Maximizing Impact
on Learning. Abingdon: Routledge.

Hattie, J. & Brown, G. (2004). Cognitive processes in asTTle: The
SOLO taxonomy. University of Auckland/Ministry of Education.
asTTle Technical Report 43.

Hook, P. & Mills, J. (2011). SOLO Taxonomy: A Guide for Schools.
Book 1: A common language of learning. Laughton, UK:
Essential Resources Educational Publishers.

Huang, S.C. (2012). English Teaching: Practice and Critique, 11 (4),
pp. 99-119.

Hughes, A. (2003). Testing for language teachers (2nd ed.).
Cambridge: Cambridge University Press.

Gavin, B. et al. (2008). An introduction to educational
assessment, measurement and evaluation (2nd ed.). Australia:
Pearson Education New Zealand.

McNamara, T. (2000). Language testing. Oxford, UK: Oxford
University Press.

Linn, R. L., & Gronlund, N. E. (2000). Measurement and
assessment in teaching (8th ed.). Upper Saddle River, NJ:
Merrill/Prentice Hall.

Malaysia Education Blueprint 2013-2025.

McMillan, J. H. (2001a). Classroom assessment: Principles and
practice for effective instruction (2nd ed.). Boston, MA: Allyn &
Bacon.

Messick, S. (1989). Validity. In R. Linn (Ed.), Educational
measurement. pp. 13-103. New York, NY: Macmillan.

Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S.,
Miller, J., & Newton, D. (2005). Frameworks for Thinking: A
handbook for teaching and learning. Cambridge:
Cambridge University Press.

Mousavi, S. A. (2009). An encyclopedic dictionary of
language testing (4th ed.). Tehran: Rahnama Publications.

Norleha Ibrahim. (2009). Management of measurement and
evaluation Module. Selangor: Open University Malaysia.

Nückles, M., Hübner, S. & Renkl, A. (2009). Enhancing
self-regulated learning by writing learning protocols. Learning
and Instruction, 19 (3), pp. 259-271. Available
at: http://linkinghub.elsevier.com/retrieve/pii/S0959475208000558
(Retrieved March 26, 2013).

Oller, J. W. (1979). Language tests at school: A pragmatic
approach. London: Longman.

Pearson, I. (1988). Tests as levers for change. In D. Chamberlain
& R. Baumgardner (Eds.), ESP in the classroom: Practice and
evaluation (Vol. 128, pp. 98-107). London: Modern
English Publications.

Pimsleur, P. (1966). Pimsleur Language Aptitude Battery. New
York, NY: Harcourt, Brace & World.

Shepard, L. A. (2000). The role of assessment in a learning
culture. Paper presented at the Annual Meeting of the
American Educational Research Association.
Available at:
http://www.aera.net/meeting/am2000/wrap/praddr01.htm
(Retrieved 10.8.2013)

Smith, A. (2011). High Performers: The Secrets of Successful
Schools. Carmarthen: Crown House Publishing.

Smith, T.W. & Colby, S.A. (2007). Teaching for Deep Learning.
The Clearing House, 80 (5) pp. 205-211.

Spaan, M. (2006). Test and item specifications
development. Language Assessment Quarterly, 3, pp. 71-79.

Spratt, M. (2005). Washback and the classroom: The implications
for teaching and learning of studies of washback from exams.
Language Teaching Research, 9, pp. 5-29.

Stansfield, C., & Reed, D. (2004). The story behind the
Modern Language Aptitude Test: An interview with
John B. Carroll (1916-2003). Language
Assessment Quarterly, 1, pp. 43-56.

Websites

http://www.catforms.com/pages/Introduction-to-Test-Items.html
(Retrieved 9.8.2013)

http://myenglishpages.com/blog/summative-formativeassessment/
(Retrieved 10.8.2013)

http://www.teachingenglish.org.uk/knowledgedatabase/objective-test
(Retrieved 12.8.2013)

http://assessment.tki.org.nz/Using-evidence-for-learning/Concepts/Concept/Reliability-and-validity

MODULE WRITING PANEL
GRADUATE TEACHER PROGRAMME (PROGRAM PENSISWAZAHAN GURU)
DISTANCE LEARNING MODE
(PRIMARY EDUCATION)

NAME / QUALIFICATIONS

NURLIZA BT OTHMAN
othmannurliza@yahoo.com

ANG CHWEE PIN
chweepin819@yahoo.com

QUALIFICATIONS:
M.A. TESL, University of North Texas, USA
B.A. (Hons) English, North Texas State University, USA
Graduate Teacher Training Certificate (Ministry of Education
Malaysia)

WORK EXPERIENCE
4 years as a secondary school teacher
21 years as a lecturer at IPG

QUALIFICATIONS
M.Ed. TESL, Universiti Teknologi Malaysia
B.Ed. (Hons.) Agri. Science/TESL, Universiti Pertanian
Malaysia

WORK EXPERIENCE
23 years as a secondary school teacher
7 years as a lecturer at IPG