
VALIDITY AND RELIABILITY

1. Which of the following is the best definition of reliability? a. Reliability refers to whether the data collection process measures what it is supposed to measure. b. Reliability refers to the degree to which the data collection process covers the entire scope of the content it is supposed to cover. c. Reliability refers to whether or not the data collection process is appropriate for the people to whom it will be administered. d. Reliability refers to the consistency with which the data collection process measures whatever it measures.

2. Mary took a test on which she received a score of 75. The teacher's house burned down, and the tests were destroyed. Mary took the same test over again the next day and again received a score of 75. a. There is evidence to suggest that Mary's test was reliable. b. There is evidence to suggest that Mary's test was unreliable. c. There is no evidence upon which to base even a tentative judgment about the reliability of the test.

3. Marvin's final exam was scored by his teacher, who gave him a 64. This would have caused him to fail the course. He protested to the school officials, and two other teachers scored the same test. One of them gave him a 75, and the other an 85. a. There is evidence that Marvin's test was scored reliably. b. There is evidence that Marvin's test was scored unreliably. c. There is no evidence upon which to base even a tentative judgment about the reliability of the scoring process.

4. Ella May's teacher rated her behavior as indicating that she was quite popular. The teacher's aide assigned to the same classroom rated Ella May as being uncooperative when given assignments. a. There is evidence that the rating process was reliable. b. There is evidence that the rating process was unreliable. c. There is no evidence upon which to base even a tentative judgment about the reliability of the rating process.

5. Donald was rated by his teacher as being unable to perform the mathematical skills necessary for the next math unit. Because of this low rating, Donald received a special programmed unit to help him review his skills. A day later, after completing the programmed materials, Donald was rated by the same teacher as able to perform the skills necessary for the next unit. a. There is evidence that the teacher's rating system was reliable. b. There is evidence that the rating system was unreliable. c. There is no evidence upon which to base even a tentative judgment about the reliability of the rating process.

Questions 6 through 8 go together. 6. Miss Curtis was planning to teach a unit on English grammar to her ninth graders. She planned to give one test as a pretest, and another as a posttest. Then she planned to compare the two sets of scores to determine whether or not the students had profited from the unit. Examine the following list of statements and indicate which ones suggest that her tests lacked reliability. (Choose more than one answer.) a. Each form of the test contained 50 items, worth two points apiece. b. When the tests were scored by two separate persons, the results were exactly the same. c. Thirty-five of the items on each of the alternate forms of the test were answered correctly by everyone. d. Rather than giving highly structured instructions, she allowed the students to ask questions as they went along, and provided information as it was requested. e. The average score on the posttest was substantially higher than the average score on the pretest.

7. After Miss Curtis had administered both forms of her English grammar test, she decided to revise it. By doing this, she hoped to make it a more reliable test the next year. Examine the following list of statements and indicate which ones would be likely to increase the reliability of the test. (Choose more than one.) a. She wrote out a detailed set of instructions based on the questions which had arisen this time, and she attached these instructions to the tests. b. She increased the length of the test from 50 to 75 items on each form. c. She eliminated several of the items which everyone had answered correctly, because she found that these had included irrelevant clues which enabled the students to get them right. She replaced these with items that she felt contained no irrelevant clues. d. She eliminated each of the items which had been missed by 40-60% of the students on the pretest and by 70-90% of the students on the posttest. e. She decided to base many of her new items on contemporary music, since nearly all the students seemed to be interested in such music.

8. Miss Curtis decides to compute statistical reliability to help determine the degree of reliability her tests possess. The tests are multiple choice/true-false in format. She considers them to be criterion-referenced rather than norm-referenced tests. Her main concern is that the decisions she would make on the basis of the results would be based on actual abilities of the students rather than on unique aspects of the testing situation. She is also concerned that any differences between the pretest and the posttest should indicate real differences, rather than merely differences between the two tests. Which of the following types of statistical reliability would help Miss Curtis make useful decisions about her tests? (Choose as many as necessary.) a. Test-retest reliability. b. Equivalent-forms reliability. c. Internal consistency reliability. d. Interscorer reliability. e. Interobserver agreement.

9. Which of the following types of statistical reliability require that the same test be administered to the same persons two times? (Choose as many as necessary.) a. Test-retest reliability. b. Equivalent-forms reliability. c. Internal consistency reliability. d. Interscorer reliability. e. Interobserver agreement.

10. Which of the following is a major weakness of the statistical techniques for estimating reliability? (Choose only one.) a. When respondents give different answers because of chance factors such as health problems or luck, this lowers the statistical estimate of reliability. b. When a large number of persons master a skill and therefore get the answer right, this lowers statistical reliability. c. Changes in the directions as they are given or as they are perceived by the respondents will lower the statistical reliability. d. Essay tests receive lower estimates of statistical reliability than more objective tests, because there are more likely to be subjective factors influencing the scoring process.

11. If a teacher has access to two tests which attempt to measure the same research variable, she should almost always choose the test which is the more reliable (provided they take about the same amount of time to administer and score). a. True. b. False.

12. Which of the following is the best definition of validity? a. Validity deals with whether the data collection process actually measures what it purports to be measuring. b. Validity deals with whether the data collection process is designed at the appropriate level of difficulty. c. Validity deals with whether the data collection process is consistent in measuring whatever it measures. d. Validity deals with the question of how subjectivity can best be controlled in the scoring process. e. Validity deals with the standardization of procedures for administering, scoring, and interpreting data collection processes.

13. All but one of the following are factors which directly influence the validity of a data collection process. Choose the exception. a. The logical appropriateness of the operational definition. b. The match between the tasks in the data collection process and the operational definition. c. The difficulty of the data collection process. d. The reliability of the data collection process.

14. Mr. Gomez wants to help his students become familiar with educational television. He defines "familiarity" with educational television as meaning that the students will be able to name several of the shows on the local educational television station. To measure this research variable, he asks the students one day to write down the name of as many shows as they can think of which were on the local educational channel the night before. He then concludes that the students who can name more shows are more familiar with educational television than those who name few or no shows accurately. What is the most obvious reason why this measurement strategy is likely to be invalid? a. It is likely to be unreliable. b. The task doesn't match the operational definition. c. The operational definition is logically inappropriate. d. The task requires that the students be familiar with the local educational television station.

15. Miss Chesterton is teaching her students to use English grammar correctly. She operationally defines using English grammar correctly as meaning that they will follow all the rules of normal English grammar in the compositions they write. On the exam, she determines how well the students have met this goal by requiring them to diagram twenty sentences of varying levels of complexity. What is the most obvious reason why this measurement strategy is likely to be invalid? a. It is likely to be unreliable. b. The task doesn't match the operational definition. c. The operational definition is logically inappropriate. d. The task requires that the students be capable of following the rules of English grammar in their writing.

16. Professor Carter wants her students to develop a genuine appreciation of Shakespeare's plays. She operationally defines this to mean that the students will be able to recall lines of the plays from memory. She measures this by giving the students several important scenes with lines omitted and having them fill in the missing lines. What is the most obvious reason why this measurement strategy is likely to be invalid? a. It is likely to be unreliable. b. The task doesn't match the operational definition. c. The operational definition is logically inappropriate. d. The students may not be able to recall the lines.

Questions 17 through 20 are based on the following information. Mrs. Green wants to measure Kathy's reading comprehension by having her read a story and then relate it to her own experience. Examine each of the following statements (assuming they are all true), and indicate whether each would or would not weaken the validity of Mrs. Green's testing strategy. 17. Even outside reading situations, Kathy has a great deal of trouble relating any stories at all to her personal life. a. Weakens the validity of the data collection process. b. Does not weaken the validity of the data collection process.

18. Kathy has trouble understanding the passage. a. Weakens the validity of the data collection process. b. Does not weaken the validity of the data collection process.

19. Kathy becomes anxious because she has to take the test aloud in front of the class, and anxiety makes her perform poorly. a. Weakens the validity of the data collection process. b. Does not weaken the validity of the data collection process.

20. The passage is extremely short. a. Weakens the validity of the data collection process. b. Does not weaken the validity of the data collection process.

21. Ms. Monroe has developed a questionnaire to measure her students' attitudes toward the practicum in her nursing training program. She is concerned about whether the questions apply proportionately to all the aspects of the program. What tool for estimating aspects of validity would help Ms. Monroe make a sound judgment in this regard? a. Content validity. b. Criterion-related validity. c. Construct validity. d. None of the above.

22. Mr. Shepard has developed a criterion-referenced test on basic mathematic abilities. He wants to be sure it gives appropriate coverage to all the topics covered during the semester. What tool for estimating aspects of validity would help Mr. Shepard make a sound judgment in this regard? a. Content validity. b. Criterion-related validity. c. Construct validity. d. None of the above.

23. Professor DuParc has developed an observational strategy to measure a person's "independence from peer pressure." What tool for estimating aspects of validity would help Professor DuParc to demonstrate that his strategy really measures "independence from peer pressure" rather than some other characteristic? a. Content validity. b. Criterion-related validity. c. Construct validity. d. None of the above.

24. Mrs. Masters has been admitting persons into her Advanced Composition course on the basis of their performance in Introductory English. She decides that she could make better selections if she would have the applicants take a special test, and then successful candidates would be those who scored highest on the test. What tool for estimating aspects of validity would help Mrs. Masters demonstrate that her new procedure is better than the old one? a. Content validity. b. Criterion-related validity. c. Construct validity. d. None of the above.

Review Quiz

1. (d). This is a paraphrase of the definition given in the textbook. If you chose (a), you selected the definition of validity.

2. (a). This doesn't conclusively prove that the test is entirely reliable, but it does give some evidence in that direction, since it demonstrates consistency.

3. (b). If the measurement process were reliable, Marvin should be evaluated about the same on each occasion. He has received three different scores on three occasions.

4. (c). The teacher and the aide are rating different characteristics (popularity and cooperativeness), so it is reasonable that the ratings may differ. There has been no attempt here to measure the same thing twice, and thus no evidence on which to base a judgment regarding consistency.

5. (c). Donald's score changed; but since he had training between the two testing occasions, there is no reason to expect the ratings to remain stable. Although there has been an attempt here to measure the same thing twice, there was no reason to expect the ratings to remain the same. There is not enough evidence on which to base a judgment regarding consistency.

6. (c) and (d). Statement (c) indicates that several of the items were excessively easy. Statement (d) indicates that the instructions might change on different testing occasions (because students might ask different questions). Statement (a) indicates a strength (50 items is a reasonably large number of questions). Statement (b) suggests consistency among people scoring the tests (interscorer reliability). Statement (e) gives no real evidence: the scores changed, but we would expect them to change after instruction. Since (apart from the test results, which would involve circular reasoning) we don't know whether the instruction was effective, we don't know whether the test was reliable.

7. (a), (b), and (c). Statement (a) describes a way to standardize the measurement process, thereby eliminating some extraneous influences. Statement (b) would increase reliability by expanding the sample of items, assuming that the additional 25 items measure the same outcomes as the original 50. Statement (c) would increase reliability by increasing the number of effective items, since excessively easy items do not add to the reliability of a test. Statement (d) describes a bad strategy; eliminating items of medium difficulty on the pretest would actually reduce the reliability of the test. Statement (e) may be a good idea, but it is irrelevant to the concept of reliability.

8. (a) and (b). Since she is concerned about unique aspects of the testing situation, she needs test-retest reliability. Since she wants to compare posttest results to pretest results, she would like to have parallel tests, and so she also needs equivalent-forms reliability.

9. (a). Test-retest is the only one of those listed that fits this description. Equivalent-forms reliability requires giving two forms of the same test to one group of people. Internal consistency requires giving the test just once and then analyzing the results with coefficient alpha. Interscorer reliability requires giving the test just once and then having two persons score it. Interobserver agreement requires two persons to observe the same set of behaviors and to compare their results to see whether they agreed on what they observed.

10. (b). Restrictions in the range of scores lower statistical reliability. Since such restrictions sometimes occur for good reasons (e.g., student mastery of the information), this is a possible weakness of these techniques. Statements (a), (c), and (d) describe strengths of reliability coefficients, since reliability is supposed to notice and rule out these extraneous factors.

11. (b). Validity (not reliability) is the most important factor in test design and selection. It is easy to develop tests with high reliability that nevertheless lack validity.

12. (a). This is a paraphrase of the textbook's definition of validity. If you chose (c), you chose the definition of reliability.

13. (c). The difficulty of the data collection process may be an important consideration, but it does not directly influence validity. The other three are the factors listed by the textbook as influencing validity.

14. (a). The measurement process is likely to be unreliable (and hence invalid), because Mr. Gomez has used a single question focusing on a single night. He should sample several nights, ask more questions, or use multiple operational definitions. The operational definition seems appropriate (naming shows is close to familiarity), and the task matches the operational definition (he asked students to name shows). To the extent that statement (d) is true, Mr. Gomez would have evidence of validity, not invalidity, since it states exactly what he is trying to measure.

15. (b). Diagramming sentences is not even remotely synonymous with following the rules of normal English grammar in written compositions.
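The coefficients discussed in items 8 and 9 can be made concrete. The sketch below (hypothetical scores, pure standard-library Python, not part of the original quiz) estimates test-retest or equivalent-forms reliability as a Pearson correlation between two sets of scores, and internal consistency as coefficient (Cronbach's) alpha:

```python
from statistics import mean, pvariance

def pearson_r(x, y):
    """Correlation between two score sets, e.g. the same students'
    scores on two testing occasions (test-retest) or on two forms
    of the same test (equivalent-forms)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pvariance(x) ** 0.5 * pvariance(y) ** 0.5)

def cronbach_alpha(items):
    """Internal consistency from a single administration.
    `items` is a list of per-item score lists, one list per item,
    each containing every student's score on that item."""
    k = len(items)                                  # number of items
    totals = [sum(s) for s in zip(*items)]          # each student's total
    item_var = sum(pvariance(i) for i in items)     # sum of item variances
    return (k / (k - 1)) * (1 - item_var / pvariance(totals))
```

Applied to Miss Curtis's situation, `pearson_r` would take the same students' scores on the two forms, while `cronbach_alpha` would take the item-by-item responses from one administration.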

16. (c). Few people would seriously argue that recalling lines from a play is synonymous with appreciating that play. The tasks do match the operational definition, and there is no reason to believe that the test is unreliable; but the faulty operational definition ruins the validity of this data collection process.

17. (a). If Kathy has problems relating stories in general to her own life, then making her perform this task to indicate reading comprehension is invalid. She would be doing two tasks: (a) comprehending (at which she might succeed) and (b) relating the story to her own life (at which she might fail). By failing at the second task, she would appear to have failed at the first. Other students, who have no trouble relating stories to their personal lives, would be performing only one real task (comprehending), and so failure at that task would indicate a lack of comprehension.

18. (b). This does not weaken the validity of the test. Quite the contrary, it is evidence that the test is measuring what it is supposed to measure.

19. (a). The test is supposed to require her to comprehend (presumably under normal circumstances). The task she is actually required to perform is to comprehend under conditions of extreme anxiety.

20. (a). This is likely to weaken the reliability of the test, because it is an inadequate sample of behavior. Weakening reliability is one way to weaken validity.

21. (a). She is concerned that the data collection process covers the entire range of what it should cover. This is a good paraphrase of the definition of content validity.

22. (a). He is concerned that the data collection process covers the entire range of what it should cover. This is a good paraphrase of the definition of content validity.

23. (c). "Independence from peer pressure" is an internalized concept (construct) that Professor DuParc wants to measure. Construct validity would help him demonstrate that he has done so correctly.

24. (b). She is trying to predict performance in the Advanced Composition course. Predictive validity (a form of criterion-related validity) would help her determine whether the special test was useful for this purpose.
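The predictive validity discussed in item 24 can also be sketched numerically. Using invented applicant data (hypothetical names and scores, not from the source), criterion-related validity is commonly estimated by correlating selection-test scores with the later criterion measure, here course performance:

```python
from statistics import mean, pstdev

def criterion_validity(test_scores, criterion):
    """Pearson correlation between a selection test and a later
    criterion measure. A strong positive correlation supports
    predictive (criterion-related) validity."""
    mx, my = mean(test_scores), mean(criterion)
    cov = sum((x - mx) * (y - my)
              for x, y in zip(test_scores, criterion)) / len(test_scores)
    return cov / (pstdev(test_scores) * pstdev(criterion))

# Hypothetical example: applicants' screening-test scores versus
# their eventual grades in Advanced Composition.
test_scores = [52, 61, 70, 78, 85]
course_grades = [55, 60, 72, 75, 88]
r = criterion_validity(test_scores, course_grades)
```

A validity coefficient near 1.0 would suggest the new test predicts course performance well; one near 0 would suggest Mrs. Masters's old procedure is no worse than the new one.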
