Chapter 5: Reliability

Chapter 5
Reliability

Brief Chapter Outline

I. The Concept of Reliability


A. Sources of Error Variance
Test construction
Test administration
Test scoring and interpretation
Other sources of error

II. Reliability Estimates


A. Test-Retest Reliability Estimates
B. Parallel-Forms and Alternate-Forms Reliability Estimates
C. Split-Half Reliability Estimates
The Spearman–Brown formula
D. Other Methods of Estimating Internal Consistency
The Kuder–Richardson formulas
Coefficient alpha
Average proportional distance (APD)
E. Measures of Inter-Scorer Reliability

III. Using and Interpreting a Coefficient of Reliability


A. The Purpose of the Reliability Coefficient
B. The Nature of the Test
Homogeneity versus heterogeneity of test items
Dynamic versus static characteristics
Restriction or inflation of range
Speed tests versus power tests
Criterion-referenced tests
C. The True Score Model of Measurement and Alternatives to It
Domain sampling theory and generalizability theory
Item response theory (IRT)

IV. Reliability and Individual Scores


A. The Standard Error of Measurement
B. The Standard Error of the Difference Between Two Scores

Cohen: Psychological Testing and Assessment, 9e



Close-up: Psychology’s Replicability Crisis

Everyday Psychometrics: The Importance of the Method Used for Estimating Reliability

Meet an Assessment Professional: Meet Dr. Bryce B. Reeve

Self-Assessment

Term to Learn

Reliability: A synonym for dependability or consistency. In the language of psychometrics,
reliability refers to consistency in measurement.

Some Relevant Reference Citations

Green, C. E., Chen, C. E., Helms, J. E., & Henze, K. T. (2011). Recent Reliability Reporting
Practices in Psychological Assessment: Recognizing the People Behind the Data.
Psychological Assessment, 23(3), 656–669.

Kieffer, K. M., & MacDonald, G. (2011). Exploring Factors that Affect Score Reliability and
Variability in the Ways of Coping Questionnaire Reliability Coefficients: A Meta-Analytic
Reliability Generalization Study. Journal of Individual Differences, 32(1), 26–38.

Markon, K. E., Chmielewski, M., & Miller, C. J. (2011). The Reliability and Validity of Discrete
and Continuous Measures of Psychopathology: A Quantitative Review. Psychological
Bulletin, 137(5), 856–879.

For Class Consideration

What does it mean when a tool of assessment is characterized as reliable? Under what conditions
might one expect an otherwise useful tool of assessment to be unreliable?

Class Discussion Questions

Here is a list of questions that may be used to stimulate a class discussion, as well as critical and
generative thinking, with regard to some of the material presented in this chapter of the text.

1. Ask the class to draw parallels between a reliable person and a reliable test. What are the
similarities and differences between the two?


2. A hypothetical situation for a class discussion is as follows: One student’s measured IQ on
an intelligence test is 100. Another student’s measured IQ on a different intelligence test
is 110. What can one say about the two students and their IQ scores on these tests? How can
the standard error of the difference be used to judge whether the two scores differ
meaningfully? What factors affect the magnitude of the standard error of the difference?

3. Create a list of the various types of tests and measurements on the board. Ask students to
respond with the corresponding type(s) of reliability estimates that would be appropriate
for each of the tests. You may also ask them to name the statistic of choice to be calculated.
A sample list of tests is as follows:
a. Typing test (timed)
b. Color blindness
c. Weight of an individual, measured monthly from 6 weeks of age until 21 years of age
d. Intelligence
e. Mood
f. Weight of melting ice cubes
g. Test of reaction time
h. Multiple choice exams in this course for midterm and final (two different exams)
i. Test of test anxiety
j. Essay exam in an English literature course
k. Presidential preference poll
l. Art aptitude test, which includes judging the quality of a clay sculpture

4. Why is there a need for different methods of estimating the reliability for norm-referenced
tests and criterion-referenced tests?

5. Why is an observed score always represented as the sum of the true score and error?

6. What factors affect the test-retest reliability of the tests designed to measure the developing
cognitive and motor skills of infants?
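
For question 2 above, the comparison of the two IQ scores can be worked through numerically. The following is a minimal sketch, assuming both tests report scores on the same scale (standard deviation of 15) and assuming hypothetical reliability coefficients of .90 and .85; none of these values comes from the question itself. It uses the standard error of the difference computed as SD × sqrt(2 − r1 − r2).

```python
import math


def se_difference(sd, r1, r2):
    """Standard error of the difference between two scores on the same scale.

    sd is the standard deviation shared by both tests; r1 and r2 are the
    reliability coefficients of the two tests.
    """
    if not (0.0 <= r1 <= 1.0 and 0.0 <= r2 <= 1.0):
        raise ValueError("reliability coefficients must lie in [0, 1]")
    return sd * math.sqrt(2.0 - r1 - r2)


# Assumed values for illustration only (not taken from the text).
SD = 15.0        # common IQ-scale standard deviation
R_TEST_1 = 0.90  # hypothetical reliability of the first test
R_TEST_2 = 0.85  # hypothetical reliability of the second test

sed = se_difference(SD, R_TEST_1, R_TEST_2)
observed_gap = 110 - 100

# A gap larger than about 1.96 standard errors of the difference is
# conventionally treated as significant at the .05 level.
critical_gap = 1.96 * sed
is_significant = observed_gap > critical_gap

print(f"standard error of the difference: {sed:.2f}")
print(f"gap needed at the .05 level:      {critical_gap:.2f}")
print(f"observed gap of {observed_gap} significant? {is_significant}")
```

Under these assumed values the 10-point gap falls short of 1.96 standard errors of the difference, so it would not be treated as a dependable difference at the .05 level. Lower reliabilities or a larger standard deviation make the standard error of the difference larger.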

In-Class Demonstrations

1. Bring Something to Class.

Bring the following materials to class to demonstrate some of the concepts discussed in the
chapter.


a. Test manuals of tests that do and do not employ a true score model

Bring some manuals for tests that do and do not employ a true score model of
measurement to class. Discuss the similarities and differences between the
information presented in the manuals.

b. Test manual of a test that has alternate forms

To supplement a class discussion of alternate-forms reliability, bring copies of a test
manual for a test for which alternate forms have been developed (for example, the Peabody
Picture Vocabulary Test, the Key Math, or the Woodcock Diagnostic Reading
Battery). Using information from the test’s technical manual, summarize how the
alternate forms were developed.

c. Quiz data for analysis

Bring students’ scores on the past two quizzes or examinations in your course to
class. Scores from the measurement course or from another course could be used
instead. Have students analyze the data for test-retest reliability.

d. A copy of the Standards

Bring a copy of the most recent edition of the Standards to the class, and discuss the
material dealing with reliability and the errors of measurement.

e. Two forms of a test on basic arithmetic operations

Create two forms of a “Basic Arithmetic” test (“Form A” and “Form B”) that could
reasonably be administered to your class in five minutes. The test should tap
students’ knowledge of basic arithmetic operations such as addition, subtraction,
multiplication, and division. Administer Form A with a five-minute time limit, under
regular test-taking conditions. Collect the exam papers. Now, administer Form B also
with a five-minute time limit. During the administration of Form B, however, create
adverse testing conditions for students by playing loud or distracting music, flashing
the lights on and off, and so on. Collect the papers. Now, redistribute the papers
so that no student marks his or her own test. Using the group data, calculate a
rank-order correlation coefficient (Spearman’s rho) to serve as the measure of
test-retest reliability. Is it lower than expected due to the various sources of
error variance introduced?


f. A simple speed test

Create a simple speed test in which the test taker’s task is to correctly alphabetize a
list of words. Administer the test to the class. As a class, calculate the split-half
reliability of the resulting data. Then, discuss why such an approach to estimating
reliability is inappropriate for use in determining the reliability of a speeded test.
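
The split-half calculation for the speed test above can be sketched as follows. This is a minimal illustration, assuming an odd-even split, a hypothetical 20-item test, and made-up data in which each test taker answers every attempted item correctly and simply runs out of time at a different point (a pure speed test). The function names are illustrative, not from the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_brown(r_half, n=2):
    """Spearman-Brown estimate for a test n times as long as the half."""
    return (n * r_half) / (1 + (n - 1) * r_half)


def split_half_reliability(responses):
    """Odd-even split-half estimate from rows of 0/1 item scores."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)


# Hypothetical data: number of items each test taker finished in time.
N_ITEMS = 20
items_completed = [4, 7, 10, 12, 15, 9]

# In a pure speed test every attempted item is correct (1) and every
# item not reached is scored 0.
responses = [[1] * k + [0] * (N_ITEMS - k) for k in items_completed]

r_half, r_full = split_half_reliability(responses)
print(f"half-test correlation:       {r_half:.3f}")
print(f"Spearman-Brown corrected:    {r_full:.3f}")
```

Because each test taker's odd and even half-scores can differ by at most one point in data like these, the two halves correlate very highly no matter how consistent the test actually is. That built-in inflation is the reason a split-half estimate from a single administration is considered inappropriate for a speeded test.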

2. Bring Someone to Class.

Invite a guest speaker to class. The guest speaker could be any of the individuals listed
below.

a. A faculty member

Invite a faculty member (from your university or a neighboring one) who is an expert
in item response theory, classical test theory, or the area of reliability to elaborate on
the material presented in this chapter.

b. A local test user

Invite a local user of psychological tests from any setting who can elaborate on the
principles of reliability as used in everyday work.

In-Class Role Play and Debate Exercises

1. Role Play: The Willingness to Demonstrate Helpfulness Inventory (WDHI)

Divide the class into three groups: (1) WDHI test developers, (2) investors, and (3)
advisers to investors. Members of Group 1 play the role of the test developers, and their
task is to pitch a well-informed proposal to Group 2 for investment funds to develop
their test. Members of Group 2, who are also well informed about assessment issues,
particularly reliability, should question Group 1 about the test and the reasons for
investing in it. After the exchange between Groups 1 and 2, members of Group 3 act as
advisers to Group 2 and advise them on whether or not the WDHI would be a good
investment.

2. Debate: Classical Test Theory (CTT) versus Item Response Theory (IRT)

Give students the following topic: Should all psychological tests developed from the
present day forward rely on CTT or IRT? Ask them to research the relevant issues and come
prepared for a debate in the next class. One half of the class will be assigned to the
“CTT” team and the other half to the “IRT” team. The use of homemade T-shirts to create
“uniforms” with the appropriate lettering could be encouraged for the debate.

Out-of-Class Learning Experiences

1. Take a Field Trip.

Arrange for the class to go on a field trip to the locations given below.

a. A corporate human resources department

Arrange for the class to visit the human resources (HR) department of a local
business or a large corporation that employs psychological tests to see how they are
used in practice, with a specific focus on reliability issues.

b. A local consumer research firm

Arrange for the class to visit a local consumer research firm that employs statistics
and/or statistical methods. Ask a member of the firm to discuss the testing that is
conducted in the firm and the measures in place to ensure the reliability of data from
the test.

Suggested Assignments

1. Critical Thinking Exercise: Error in Ability Tests

Critically evaluate any existing ability test with regard to all of the possible sources of error
that may be inherent in measurement. Explain why some sources of error are likely to be
greater than others in magnitude for this particular test.

2. Generative Thinking Exercise: Appropriate Measures of Reliability

Create a table with two headings: Appropriate and Inappropriate. Under these two
headings, list at least one type of test for which different methods of estimating reliability
would or would not be appropriate. Your listing should include, for example, one type of
test for which the test-retest method of estimating reliability would be appropriate and one
type of test for which it would be inappropriate. Continue in this way for estimates of
inter-item consistency, inter-scorer reliability, and alternate-forms reliability. Explain
why you have listed each type of test in the Appropriate or Inappropriate column.

3. Read-then-Discuss Exercises

a. Generalizability theory versus classical test theory

Prior to class, do some independent reading about generalizability theory and
classical test theory. During class, be prepared to discuss how generalizability theory
differs from classical test theory.

b. Item response theory versus classical test theory

Prior to class, do some independent reading about item response theory and classical
test theory. During class, be prepared to discuss how the two theories differ.

c. Reliability reporting in a test manual

Check out and review a test manual from your college or university test library.
Then write a report on the types of reliability estimates provided for your test of
choice.

4. Other Exercises and Assignments

a. Demonstrating measurement error

Buck (1991) provides suggestions for an in-class demonstration of measurement
error and reliability. Allen (1992) objected to some aspects of this demonstration, but
Buck (1992) responded to the criticisms. The demonstration, its criticisms, and the
rebuttal could be a lively class exercise.

b. Moore on more assignments for teaching concepts of reliability

Moore (1981) provided some suggestions for teaching related reliability concepts
such as true score, true variance, and the standard error of measurement using the
measurements of lines.
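
The concepts named above (true score, true variance, and the standard error of measurement) can be illustrated with a small numerical sketch. The line lengths and measurement errors below are made up for illustration and are not data or procedures from Moore (1981). The sketch assumes the true score model, in which each observed score is a true score plus an error that averages zero and is uncorrelated with the true score.

```python
import math
from statistics import pvariance

# Hypothetical data: true line lengths in millimetres and the error made
# when each line was measured once. The errors average zero and are
# uncorrelated with the true lengths, as the true score model assumes.
true_lengths = [50.0, 60.0, 70.0, 80.0, 90.0]
errors = [2.0, -2.0, 0.0, -2.0, 2.0]

# Observed score = true score + error (X = T + E).
observed = [t + e for t, e in zip(true_lengths, errors)]

# Population variances of each component.
true_variance = pvariance(true_lengths)
error_variance = pvariance(errors)
observed_variance = pvariance(observed)

# Reliability is the proportion of observed variance that is true variance.
reliability = true_variance / observed_variance

# Standard error of measurement: SD of observed scores times sqrt(1 - r).
sem = math.sqrt(observed_variance) * math.sqrt(1.0 - reliability)

print(f"true variance:      {true_variance:.2f}")
print(f"error variance:     {error_variance:.2f}")
print(f"observed variance:  {observed_variance:.2f}")
print(f"reliability:        {reliability:.3f}")
print(f"SEM (mm):           {sem:.3f}")
```

With these values the observed variance equals the sum of the true and error variances (200 + 3.2 = 203.2), and the standard error of measurement equals the standard deviation of the errors. With real tests the true scores are unknown, so reliability has to be estimated rather than computed directly in this way.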

Media Resources


On the Web

http://www.ehd.org/science_technology_testresults.php
This website presents applied examples, drawn from fields other than psychology, that
help in understanding the reliability of a test.

http://www.youtube.com/watch?v=DS8Hw0Ort4w&feature=related
http://www.youtube.com/watch?v=LolwQXYjuh8&feature=related
https://www.youtube.com/watch?v=CZQlqVswAq8
This three-part video series explains aspects of the reliability and validity of tests.

Validity and Reliability
http://www.youtube.com/watch?v=56jYpFkdqW8&feature=related
This video explores the relationship between validity and reliability.

http://edres.org/irt/
This website provides information on item response theory.

References

Allen, M. J. (1992). Comments on “A Demonstration of Measurement Error and Reliability.”
Teaching of Psychology, 19(2), 111.

Buck, J. (1991). A Demonstration of Measurement Error and Reliability. Teaching of
Psychology, 18(1), 46–47.

Buck, J. (1992). When True Scores Equal Zero: A Reply to Allen. Teaching of Psychology, 19,
111–112.

Moore, M. (1981). An Empirical Investigation and a Classroom Demonstration of Reliability
Concepts. Teaching of Psychology, 8, 163–164.
