
Criteria for a Good Test

Posted by Yoga arif wijaya, 20.41, under Bahan Kuliah

3.1 VALIDITY

A test is said to be valid if it measures accurately what it is intended to measure. This seems simple enough. When closely examined, however, the concept of validity reveals a number of aspects, each of which deserves our attention.

3.1.1 Content validity

A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what the relevant structures are will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners.

In order to judge whether or not a test has content validity, we need a specification of the skills or structures etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It is not to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test.

What is the importance of content validity? First, the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure. Secondly, a test that under-represents the areas identified in the specification is likely to have a harmful backwash effect: areas which are not tested are likely to become areas ignored in teaching and learning. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these.

The effectiveness of a content validity strategy can be enhanced by making sure that the experts are truly experts in the appropriate field and that they have adequate and appropriate tools, in the form of rating scales, so that their judgments can be sound and focused. However, testers should never rest on their laurels. Once they have established that a test has adequate content validity, they must immediately explore other kinds of validity of the test, in terms related to the specific performances of the types of students for whom the test was designed in the first place.
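The idea of checking test content against a specification lends itself to a simple mechanical check. Below is a minimal sketch of such a check; the specified structures and item tags are invented purely for illustration, not taken from any real test:

```python
# Illustrative sketch: how well do a test's items cover a content
# specification? All structure names and item labels are invented.

specification = {
    "past simple", "present perfect", "conditionals",
    "passive voice", "reported speech", "modal verbs",
}

# The specified structure that each test item is meant to assess.
test_items = {
    "item_01": "past simple",
    "item_02": "present perfect",
    "item_03": "past simple",
    "item_04": "passive voice",
}

covered = set(test_items.values()) & specification
missing = specification - covered

print(f"Coverage: {len(covered)}/{len(specification)} specified structures")
print("Not represented in this test:", ", ".join(sorted(missing)))
```

A check like this only flags under-representation; whether the sampled items are a fair reflection of the specification still calls for expert judgment.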

3.1.2 Criterion-related validity / Empirical validity

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of functions which students are expected to perform orally, to test all of which might take 45 minutes for each student. This could well be impractical. A practical solution is to administer a much shorter oral test and to check, by correlating the two sets of scores, how closely it agrees with the full 45-minute version, which serves as the criterion (a numerical sketch of this follows at the end of this section).

The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates' future performance. An example would be how well a proficiency test could predict a student's ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student's English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail, etc.).

3.1.3 Construct validity

A test, part of a test, or a testing technique is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word "construct" refers to an underlying ability (or trait) which is hypothesized in a theory of language ability. One might hypothesize, for example, that the ability to read involves a number of sub-abilities, such as the ability to guess the meaning of unknown words from the context in which they are met. It would be a matter of empirical research to establish whether or not such a distinct ability existed and could be measured. If we attempted to measure that ability in a particular test, then that part of the test would have construct validity only if we were able to demonstrate that we were indeed measuring just that ability.

Construct validity is the most important form of validity because it asks the fundamental validity question: what is this test really measuring? We have seen that all variables derive from constructs, and that constructs are non-observable traits, such as intelligence, anxiety, and honesty, invented to explain behavior. Constructs underlie the variables that researchers measure. You cannot see a construct; you can only observe its effects. Why does one person act this way and another person a different way? Because one is intelligent and the other is not, or one is honest and the other is not. We cannot prove that constructs exist, just as we cannot perform brain surgery on a person to see his or her intelligence, anxiety, or honesty.
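Criterion-related validity of either kind is usually reported as a correlation coefficient between scores on the test and scores on the criterion. Here is a minimal sketch of the oral-test example above; the scores for ten students are invented for illustration:

```python
from statistics import correlation  # available in Python 3.10+

# Invented scores for ten students: a short oral test and the full
# 45-minute criterion assessment, administered at about the same time.
short_test = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
criterion  = [48, 55, 40, 63, 52, 45, 58, 50, 42, 60]

# Pearson correlation serves as the coefficient of concurrent validity:
# the closer to 1.0, the more confidently the short test can stand in
# for the impractically long criterion measure.
r = correlation(short_test, criterion)
print(f"Concurrent validity coefficient: r = {r:.2f}")
```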

3.1.4 Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which pretended to measure pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity. This would be true even if the test's construct and criterion-related validity could be demonstrated. Face validity is hardly a scientific concept, yet it is very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates' reaction to it may mean that they do not perform on it in a way that truly reflects their ability.

3.1.5 The use of validity

What use is the reader to make of the notion of validity? First, every effort should be made in constructing tests to ensure content validity. Where possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used.

3.2 RELIABILITY

Reliability is a necessary characteristic of any good test: for it to be valid at all, a test must first be reliable as a measuring instrument. If a test is administered to the same candidates on different occasions (with no language practice work taking place between these occasions), then, to the extent that it produces differing results, it is not reliable. Reliability measured in this way is commonly referred to as test/re-test reliability, to distinguish it from mark/re-mark reliability. In short, in order to be reliable, a test must be consistent in its measurements.

Factors affecting the reliability of a test include:

- The extent of the sample of material selected for testing. Whereas validity is concerned chiefly with the content of the sample, reliability is concerned with its size. The larger the sample (i.e. the more tasks the testees have to perform), the greater the probability that the test as a whole is reliable; hence the favoring of objective tests, which allow a wide field to be covered.

- The administration of the test. Is the same test administered to different groups under different conditions or at different times? Clearly, this is an important factor in deciding reliability, especially in tests of oral production and listening comprehension.

One method of measuring the reliability of a test is to re-administer the same test after a lapse of time. It is assumed that all candidates have been treated in the same way in the interval: that they have either all been taught or that none of them has. Another means of estimating the reliability of a test is by administering parallel forms of the test to the same group. This assumes that two similar versions of a particular test can be constructed; such tests must be identical in the nature of their sampling, difficulty, length, rubrics, etc.
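Test/re-test reliability is estimated by correlating the two sets of scores: the closer the coefficient is to 1.0, the more reliable the test. A minimal sketch with invented scores, writing the Pearson correlation out in full to show the computation:

```python
# Invented scores from two administrations of the same test to the
# same seven candidates, with no language work in between.
first_sitting  = [34, 41, 28, 45, 38, 30, 43]
second_sitting = [36, 40, 27, 46, 37, 32, 42]

def pearson_r(x, y):
    """Pearson correlation coefficient, written out step by step."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

r = pearson_r(first_sitting, second_sitting)
print(f"Test/re-test reliability: r = {r:.2f}")
```

The same computation, applied to scores from two parallel forms rather than two sittings, gives the parallel-forms estimate of reliability.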

3.2.1 How to make tests more reliable

As we have seen, there are two components of test reliability: the consistency of candidates' performance from occasion to occasion, and the reliability of the scoring.

Take enough samples of behavior. Other things being equal, the more items that you have on a test, the more reliable that test will be. This seems intuitively right, and the relationship can be quantified (see the sketch following this group of suggestions). While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the candidates become so bored or tired that the behavior they exhibit becomes unrepresentative of their ability. At the same time, it may often be necessary to resist pressure to make a test shorter than is appropriate. The usual argument for shortening a test is that it is not practical.

Do not allow candidates too much freedom. In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones they have chosen. Such a procedure is likely to have a depressing effect on the reliability of the test. The more freedom that is given, the greater the likely difference between one performance and another.

Write unambiguous items. It is essential that candidates should not be presented with items whose meaning is not clear, or to which there is an acceptable answer which the test writer has not anticipated.

Provide clear and explicit instructions. This applies both to written and to oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. Test writers should not rely on the students' powers of telepathy to elicit the desired behavior.

Ensure that tests are well laid out and perfectly legible. Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not meant to measure their language ability. Their variable performance on these unwanted tasks will lower the reliability of the test.

Make sure candidates are familiar with the format and testing techniques. If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would otherwise (on subsequently taking a parallel version, for example). For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them.

Provide uniform and non-distracting conditions of administration. The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's performance on the two occasions. Great care should be taken to ensure uniformity.

Use items that permit scoring which is as objective as possible. This may appear to be a recommendation to use multiple-choice items, which permit completely objective scoring. An alternative to multiple choice is the item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in fact problems with such matters as spelling, which can make a candidate's meaning unclear, often make demands on the scorer's judgment. The longer the required response, the greater the difficulties of this kind.

Make comparisons between candidates as direct as possible. This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring compositions all on one topic will be more reliable than if candidates are allowed to choose from six topics, as has been the case in some well-known tests. The scoring should be all the more reliable if the compositions are guided. This echoes the earlier advice: do not allow candidates too much freedom.
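The first suggestion above, take enough samples of behavior, can be quantified with the Spearman-Brown prophecy formula from classical test theory. The formula is not mentioned in the text above, but it is the standard way of predicting how reliability changes as a test is lengthened or shortened, and it shows the diminishing returns that justify not making a test needlessly long. A minimal sketch:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by the given
    factor (e.g. 2.0 = doubled in length, 0.5 = halved)."""
    k, r = length_factor, reliability
    return (k * r) / (1 + (k - 1) * r)

# Starting from an illustrative reliability of 0.70, see how length
# changes the prediction. Note the diminishing returns: doubling helps
# a lot, tripling adds comparatively little.
for k in (0.5, 1.0, 2.0, 3.0):
    print(f"length x{k}: predicted reliability = {spearman_brown(0.70, k):.2f}")
```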

Provide a detailed scoring key. This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points.

Train scorers. This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions accurately, using compositions from past administrations. After each administration, patterns of scoring should be analyzed; individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.

Identify candidates by number, not name. Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. The identification of candidates only by number will reduce such effects.

Employ multiple, independent scoring. As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior colleague, who compares the two sets of scores and investigates discrepancies.
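The double-scoring procedure just described can be partly automated: the senior colleague need only be shown the scripts on which the two independent scorers disagree by more than some tolerance. A minimal sketch with invented scores; the two-point tolerance is chosen purely for illustration:

```python
# Invented scores from two independent scorers, keyed by candidate
# number (candidates are identified by number, not name).
scorer_a = {101: 14, 102: 18, 103: 9, 104: 16, 105: 12}
scorer_b = {101: 15, 102: 13, 103: 10, 104: 16, 105: 17}

TOLERANCE = 2  # maximum acceptable disagreement, for illustration only

for candidate in sorted(scorer_a):
    a, b = scorer_a[candidate], scorer_b[candidate]
    if abs(a - b) > TOLERANCE:
        # Flag the discrepancy for the senior colleague to investigate.
        print(f"Candidate {candidate}: scores {a} vs {b} -- investigate")
    else:
        print(f"Candidate {candidate}: agreed score {(a + b) / 2:.1f}")
```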

3.3 ADMINISTRATION

A test must be practicable; in other words, it must be fairly straightforward to administer. It is only too easy to become so absorbed in the actual construction of test items that the most obvious practical considerations concerning the test are overlooked.

The length of time available for the administration of the test is frequently misjudged, even by experienced test writers, especially when the complete test consists of a number of sub-tests. In such cases sufficient time may not be allowed for the administration of the whole test; this is best discovered by trialling the test (i.e. trying it out on a small but representative group of testees) beforehand.

Another practical consideration concerns the answer sheets and the stationery used. Many tests require the testees to enter their answers on the actual question paper (e.g. circling the letter of the correct option), thereby unfortunately reducing the speed of the scoring and preventing the question paper from being used a second time. In some tests the candidates are presented with a separate answer sheet, but too often insufficient thought has been given to possible errors arising from the (mental) transfer of answers to the answer sheet itself.

A final point concerns the presentation of the test paper itself. Where possible, it should be printed or typewritten and appear neat, tidy and aesthetically pleasing. Nothing is worse or more disconcerting to the testee than an untidy test paper, full of misspellings, omissions and corrections.
