
It is a common mistake to assume the terms “validity” and “reliability” have the same meaning. While they are related, the two concepts are very different. In an effort to clear up any misunderstandings about validity and reliability, I have defined each here for you.

Reliability

Of the two terms, reliability is the simpler concept to explain and understand. If you are focusing on the reliability of a test, all you need to ask is: are the results of the test consistent? If I take the test today, a week from now, and a month from now, will my results be the same?

If an assessment is reliable, your results will be very similar no matter when you take
the test. If the results are inconsistent, the test is not considered reliable.
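As a concrete illustration, here is a minimal sketch (mine, not from the original post) of the test-retest idea: the same people take the same test twice and the two sets of results are correlated. All scores below are hypothetical.

```python
# A minimal sketch of test-retest consistency (hypothetical scores):
# seven people take the same test today and again a week later,
# and we correlate the two sets of results.
from statistics import correlation  # requires Python 3.10+

scores_today = [72, 85, 90, 65, 78, 88, 70]
scores_next_week = [74, 83, 91, 66, 80, 86, 69]

r = correlation(scores_today, scores_next_week)
print(f"test-retest reliability (Pearson r): {r:.2f}")  # near 1.0 = consistent
```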

Validity

Validity is a bit more complex because it is more difficult to assess than reliability. There
are various ways to assess and demonstrate that an assessment is valid, but in simple
terms, validity refers to how well a test measures what it is supposed to measure.

There are several approaches to determining the validity of an assessment, including assessments of content validity, criterion-related validity, and construct validity.

 An assessment demonstrates content validity when the criteria it is measuring align with the content of the job. The extent to which that content is essential to job performance (versus merely useful to know) is also part of determining how well the assessment demonstrates content validity. For example, the ability to type quickly would likely be considered a large and crucial aspect of the job for an executive secretary compared to an executive. While the executive is probably required to type, such a skill is not nearly as important to performing that job. Ensuring an assessment demonstrates content validity entails judging the degree to which test items and job content match each other.

 An assessment demonstrates criterion-related validity if the results can be used to predict a facet of job performance. Determining whether an assessment predicts performance requires that assessment scores be statistically evaluated against a measure of employee performance (a minimal sketch of this kind of comparison appears after this list). For example, an employer interested in understanding how well an integrity test identifies individuals who are likely to engage in counterproductive work behaviors might compare applicants’ integrity test scores to how many accidents or injuries those individuals have on the job, whether they engage in on-the-job drug use, or how many times they ignore company policies. The degree to which the assessment is effective in predicting such behaviors is the extent to which it exhibits criterion-related validity.
 An assessment demonstrates construct validity if it is related to other assessments measuring the same psychological construct, a construct being a concept used to explain behavior (e.g., intelligence, honesty). For example, intelligence is a construct that is used to explain a person’s ability to understand and solve problems. Construct validity can be evaluated by comparing intelligence scores on one test to intelligence scores on other tests (e.g., the Wonderlic Cognitive Ability Test to the Wechsler Adult Intelligence Scale).
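As a rough illustration of the statistical comparisons described above, here is a minimal sketch (my own, not from the original post) that expresses both criterion-related and construct validity checks as correlations. All names and numbers are hypothetical.

```python
# A minimal sketch of two validity checks expressed as correlations.
# All data below are hypothetical, invented for illustration.
from statistics import correlation  # requires Python 3.10+

# Criterion-related validity: do integrity-test scores predict a measure
# of counterproductive behavior? A strong negative r would support validity.
integrity_scores = [55, 80, 62, 90, 45, 75, 68]
policy_violations = [6, 1, 4, 0, 8, 2, 3]
print(f"criterion-related r: {correlation(integrity_scores, policy_violations):.2f}")

# Construct validity: do two tests of the same construct agree?
# (e.g., scores from two different intelligence tests)
test_a_scores = [98, 110, 105, 120, 92, 101, 115]
test_b_scores = [100, 108, 103, 118, 95, 99, 117]
print(f"construct r: {correlation(test_a_scores, test_b_scores):.2f}")
```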

Reliable and Valid?

The tricky part is that a test can be reliable without being valid. However, a test cannot be valid unless it is reliable. An assessment can provide you with consistent results, making it reliable, but unless it is measuring what it is supposed to measure, it is not valid.

What are the biggest questions you have surrounding reliability and validity?

19 Responses to “A valid test is always reliable but a reliable test is not necessarily valid”

1. raw2392 says:
December 8, 2011 at 3:26 pm

Really good blog; you can tell you have a great understanding of reliability and validity and have clearly done some good research into these areas before writing this blog. Validity and reliability are both important when it comes to research and the output this research produces: if a test is not valid, then the results that come from it cannot be trusted. Validity is essential for the results to be taken seriously. As you have quoted, ‘a valid test is always reliable but a reliable test is not necessarily valid’.
Reliability is also important so that a test can be repeated later on by different researchers and the same results found; research needs to hold up to that kind of retesting before it can be published.

A brilliant blog, though, and it really explained the two concepts well!

o Mudasir says:
November 20, 2017 at 12:34 pm

I agree with you, but the quote “a valid test is always reliable but a reliable test is not necessarily valid” is correct. Reliability carries more weight because it has been proven over time, so it is correct. Validity is an on-the-spot judgment about a test or questionnaire and is also important, but not as important as reliability.

2. psud6e says:
December 8, 2011 at 5:20 pm

I think this blog was very good, with a clear understanding of reliability and validity – so much so that you were able to put it in simple terms the majority of people will be able to understand. Today, we seem to find that the reliability of research in psychology is only ever tested by new researchers investigating a topic. When researchers complete an experiment and have gathered all their data, they don’t run the experiment again to look for the same results to test reliability; instead they compare it to other research in the same field. If other researchers support their findings, then the research can be assumed to be reliable. In the same way, the reliability of ground-breaking research – work completely new to the field – is only established when another experiment in the same field is run.
Validity, however, is a harder concept to check. To be honest, can we really be sure something is 100% valid? We often use hypothetical constructs, especially in psychology, whereby we assume one observable thing stands in for another. For example, in personality questionnaires, we ask questions that we assume are related to the personality type we are trying to measure.
I definitely agree with you that reliability and validity are both important in testing, and
a good example of this is in medicine. We want the drug trial to be reliable, because then
we know that the drug is safe to use when it is manufactured, the next stage after the
trialling process. We also want the trial to be valid. This means that we want it to test the
effects of the drug. If it didn’t do this, we may not be able to see and measure the side
effects, and therefore a drug that is actually dangerous may be manufactured. Therefore,
reliability and validity are definitely important.
As mentioned in Key Concepts, reliability and validity are closely related. To better understand this relationship, let's step out of the world of testing and onto a bathroom scale.

If the scale is reliable, it tells you the same weight every time you step on it, as long as your weight has not actually changed. However, if the scale is not working properly, this number may not be your actual weight. If that is the case, this is an example of a scale that is reliable, or consistent, but not valid. For the scale to be valid and reliable, not only does it need to tell you the same weight every time you step on it, but it also has to measure your actual weight.

Switching back to testing, the situation is essentially the same. A test can be
reliable, meaning that the test-takers will get the same score no matter when or
where they take it, within reason of course. But that doesn't mean that it is valid
or measuring what it is supposed to measure. A test can be reliable without being
valid. However, a test cannot be valid unless it is reliable.

The answer above gives an excellent definition of reliable and unreliable tests, what they are,
and how they work. Because of this, I am going to elaborate a little more on your question. An
unreliable test is simply unreliable, but how and why can a reliable test become or be
unreliable?

First, I will talk about how a test is conducted. (Even though your question is about the Social
Sciences, it is a little easier to understand if I talk about Science. A bit further down I will
change the process over to the Social Sciences.)

Let us begin with an elementary science experiment: I place Plant A in the sunlight. I place Plant B under a box. Over the course of a week, I monitor the plants in order to discover how important light is to their growth.

Part of the reliability of the test is based on whether all the control factors are the same. Jimmy, who is running the test, must keep all other variables the same. Both plants need to be watered on the same schedule, one that is optimal for the plant type. Jimmy must also maintain the integrity of the test: he cannot leave the box off on some days because "he forgets." If he does so, the test becomes unreliable because the integrity has not been maintained; it is invalid. Invalid tests are unreliable because no conclusions can be drawn from them.

If Jimmy maintains the integrity of the test, then it becomes valid, and thus reliable on the surface. However, researchers know that a single test could have bad results. (Maybe one plant is particularly hardy or has a genetic mutation.) Adding more experiments with the results all ending up the same makes the test more reliable. If I conducted the plant experiment three times and all three results matched my first experiment, I would increase my test's reliability. If the test were run 300 times and half of the time the plant under the box lived and half of the time it died, the test would be deemed unreliable. Researchers would recognize that something in the experiment is askew and must be resolved, and the experiment re-run, before conclusions could be drawn. The closer the experiment gets to 100% (the same results every time), the more reliable the experiment.

In the social sciences, tests are not as easy to control, because most of the time you are dealing with some form of human nature. All efforts in an experiment must be made to keep the integrity of the test: using a control group, randomizing subjects (with blind or double-blind protocols), and maintaining ethics. If these are not followed, the test is considered invalid and thus unreliable.
Results from a test that is invalid (unreliable) cannot be made reliable. The process whereby the results were obtained is corrupted, thus they cannot be used.

However, if the experiment is conducted properly and on the surface is considered valid, but the results are conflicting, it becomes unreliable. By performing multiple tests, the reliability of the test is increased. As with the plant experiment, if the test is performed 300 times and 90% of the results are the same, the reliability is supported. The larger the discrepancy between experiments, the more unreliable the test becomes.

For example: Let us say I observe a kindergarten class to determine whether self-control is an important skill for classroom learning in kindergarten. My research assistant notes the behavior of each child: focus, following directions, and behavior issues (temper tantrums) are observed and documented over the course of a month. A different research assistant (who knows nothing of the first research assistant's work) is then assigned to take each kindergartner into a room and put a cookie on a plate. He then tells the kindergartner that he is going to leave the room and will come back in three minutes. If the cookie is still on the plate, the child will get two cookies instead of one, but if the child eats the cookie, he or she will not get another. The research assistant leaves the room and the child is observed, filmed or monitored by a third research assistant. The child's reaction to the test is noted and the ending result of the experiment is documented. (Who received two cookies and who ate the one?) The results are compared to the first researcher's notes. Did the students who did better in class wait for the second cookie?

As long as the protocols are maintained, this is a valid experiment. However, in order to establish the reliability of the research, the study should be conducted multiple times with exactly the same protocols. The higher the frequency of similar results, the more reliable the test and its results. If the results vary widely with each experiment, or if a separate researcher finds conflicting results through a second experiment, the test is considered unreliable. However, if a third and fourth researcher find the same results as the first researcher, the reliability is restored and the second researcher's results are called into question. (This is common in the social sciences.)

POHNPEI397 | CERTIFIED EDUCATOR

A test is reliable if it gets the same value over and over. A test is valid if it is truly measuring
what the researcher thinks it is measuring.

A test can be reliable but not valid. Let's say I wanted to measure how smart people are by
measuring their heads. I'd get the same value (in inches or centimeters or whatever) every time I
measured their head so the test would be reliable. But I wouldn't be actually measuring their
intelligence so the test wouldn't be valid.

A test cannot be valid if it's not reliable. If the test is not reliable, that means it gives different results every time I do it. If it keeps giving different results, it cannot possibly be measuring what I think it is. Let's say I give a person multiple tests to measure intelligence and they get wildly different results on the tests. Clearly, the tests are not really measuring intelligence, because a test that truly measured intelligence would have to yield results that were nearly the same every time (since we assume a person's intelligence doesn't change from moment to moment).
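To make the head-measurement example concrete, here is a toy sketch (my own illustration, with hypothetical data) of a measure that agrees with itself almost perfectly yet is only weakly related to the construct it claims to capture.

```python
# A toy illustration (hypothetical data) of "reliable but not valid":
# repeated head measurements agree with each other almost perfectly,
# yet are only weakly related to the intelligence scores they
# supposedly measure.
from statistics import correlation  # requires Python 3.10+

head_cm_monday = [56.1, 57.3, 54.8, 58.0, 55.5]
head_cm_friday = [56.0, 57.4, 54.8, 58.1, 55.4]  # consistent -> reliable
iq_scores = [101, 99, 98, 102, 118]              # unrelated to head size

print(f"reliability (Monday vs. Friday): {correlation(head_cm_monday, head_cm_friday):.2f}")
print(f"validity check (head size vs. IQ): {correlation(head_cm_monday, iq_scores):.2f}")
```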
C. Reliability and Validity

In order for assessments to be sound, they must be free of bias and distortion. Reliability and validity are
two concepts that are important for defining and measuring bias and distortion.

Reliability refers to the extent to which assessments are consistent. Just as we enjoy having reliable cars
(cars that start every time we need them), we strive to have reliable, consistent instruments to measure
student achievement. Another way to think of reliability is to imagine a kitchen scale. If you weigh five
pounds of potatoes in the morning, and the scale is reliable, the same scale should register five pounds for
the potatoes an hour later (unless, of course, you peeled and cooked them). Likewise, instruments such as
classroom tests and national standardized exams should be reliable – it should not make any difference
whether a student takes the assessment in the morning or afternoon; one day or the next.

Another measure of reliability is the internal consistency of the items. For example, if you create a quiz to
measure students’ ability to solve quadratic equations, you should be able to assume that if a student gets an
item correct, he or she will also get other, similar items correct. The following table outlines three common
reliability measures.

Type of Reliability -- How to Measure

Stability or Test-Retest: Give the same assessment twice, separated by days, weeks, or months. Reliability is stated as the correlation between scores at Time 1 and Time 2.

Alternate Form: Create two forms of the same test (vary the items slightly). Reliability is stated as the correlation between scores on Test 1 and Test 2.

Internal Consistency (Alpha, α): Compare one half of the test to the other half, or use methods such as Kuder-Richardson Formula 20 (KR20) or Cronbach's Alpha.
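For the internal-consistency row, here is a minimal sketch of how Cronbach's alpha can be computed from a small item-by-student score matrix. The matrix is hypothetical and far too small for a real analysis; it only illustrates the calculation.

```python
# A minimal sketch of Cronbach's alpha for a tiny, hypothetical
# item-by-student score matrix (1 = correct, 0 = incorrect).
# Real analyses would use far more students and items.
from statistics import variance

scores = [          # rows = students, columns = items
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

k = len(scores[0])                                  # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])  # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```

With realistic data sets, an alpha above .80 would line up with the "very good reliability" benchmark discussed next.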

The values for reliability coefficients range from 0 to 1.0. A coefficient of 0 means no reliability and 1.0
means perfect reliability. Since all tests have some error, reliability coefficients never reach 1.0. Generally,
if the reliability of a standardized test is above .80, it is said to have very good reliability; if it is below .50,
it would not be considered a very reliable test.

Validity refers to the accuracy of an assessment -- whether or not it measures what it is supposed to
measure. Even if a test is reliable, it may not provide a valid measure. Let’s imagine a bathroom scale that
consistently tells you that you weigh 130 pounds. The reliability (consistency) of this scale is very good, but
it is not accurate (valid) because you actually weigh 145 pounds (perhaps you re-set the scale in a weak
moment)! Since teachers, parents, and school districts make decisions about students based on assessments
(such as grades, promotions, and graduation), the validity inferred from the assessments is essential -- even
more crucial than the reliability. Also, if a test is valid, it is almost always reliable.

There are three ways in which validity can be measured. In order to have confidence that a test is valid (and
therefore the inferences we make based on the test scores are valid), all three kinds of validity evidence
should be considered.

Type of Validity -- Definition -- Example/Non-Example

Content: The extent to which the content of the test matches the instructional objectives. Example: A semester or quarter exam that only includes content covered during the last six weeks is not a valid measure of the course's overall objectives -- it has very low content validity.

Criterion: The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) an external criterion. Example: If the end-of-year math tests in 4th grade correlate highly with the statewide math tests, they would have high concurrent validity.

Construct: The extent to which an assessment corresponds to other variables, as predicted by some rationale or theory. Example: If you can correctly hypothesize that ESOL students will perform differently on a reading test than English-speaking students (because of theory), the assessment may have construct validity.

So, does all this talk about validity and reliability mean you need to conduct statistical analyses on your
classroom quizzes? No, it doesn't. (Although you may, on occasion, want to ask one of your peers to
verify the content validity of your major assessments.) However, you should be aware of the basic tenets of
validity and reliability as you construct your classroom assessments, and you should be able to help parents
interpret scores for the standardized exams.
Try This

Reflect on the following scenarios.

1. A parent called you to ask about the reliability coefficient on a recent standardized test. The coefficient
was reported as .89, and the parent thinks that must be a very low number. How would you explain to the
parent that .89 is an acceptable coefficient?
2. Your school district is looking for an assessment instrument to measure reading ability. They have
narrowed the selection to two possibilities -- Test A provides data indicating that it has high validity, but
there is no information about its reliability. Test B provides data indicating that it has high reliability, but
there is no information about its validity. Which test would you recommend? Why?

A good classroom test is valid and reliable.

Validity is the quality of a test which measures what it is supposed to measure. It is the degree to
which evidence, common sense, or theory supports any interpretations or conclusions about a student
based on his/her test performance. More simply, it is how one knows that a math test measures
students' math ability, not their reading ability. Another aspect of test validity of particular importance
for classroom teachers is content-related validity. Do the items on a test fairly represent the items that should be on the test? Reasonable sources for "items that should be on the test" are class
objectives, key concepts covered in lectures, main ideas, and so on. Classroom teachers who want to
make sure that they have a valid test from a content standpoint often construct a table of
specifications which specifically lists what was taught and how many items on a test will cover those
topics. The table can even be shared with students to guide them in studying for the test and as an
outline of what was most important in a unit or topic.
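As a sketch of what such a table of specifications might look like in practice (topics and item counts invented purely for illustration):

```python
# A sketch of a simple table of specifications: a hypothetical mapping
# from what was taught to how many test items will cover each topic.
# Topic names and item counts are invented for illustration.
blueprint = {
    "solve linear equations":  {"emphasis": "high",   "items": 8},
    "graph linear functions":  {"emphasis": "medium", "items": 5},
    "interpret word problems": {"emphasis": "medium", "items": 5},
    "history of algebra":      {"emphasis": "low",    "items": 2},
}

total_items = sum(row["items"] for row in blueprint.values())
for topic, row in blueprint.items():
    share = row["items"] / total_items
    print(f"{topic:<26} {row['items']:>2} items ({share:.0%} of the test)")
```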

Reliability is the quality of a test which produces scores that are not affected much by chance.
Students sometimes randomly miss a question they really knew the answer to or sometimes get an
answer correct just by guessing; teachers can sometimes make an error or score inconsistently with
subjectively scored tests. These are problems of low reliability. Classroom teachers can solve the
problem of low reliability in some simple ways. First, a test with many items will usually be more
reliable than a shorter test, as whatever random fluctuations in performance occur over the course of
a test will tend to cancel themselves out across many items. By the same token, a class grade will itself be
more reliable if it reflects many different assignments or components. Second, the more objective a
test is, the fewer random errors there will be in scoring, so teachers concerned about reliability are
often drawn to objectively scored tests. Even when using a subjective format, such as supply items,
teachers often use a detailed scoring rubric to make the scoring as objective, and, therefore, as
reliable as possible.
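The first point, that longer tests are usually more reliable, is classically quantified by the Spearman-Brown prophecy formula. The sketch below uses that standard formula (general psychometrics, not a formula given in the text above) with a hypothetical starting reliability.

```python
# A sketch of why longer tests tend to be more reliable, using the
# standard Spearman-Brown prophecy formula:
#   predicted_r = (n * r) / (1 + (n - 1) * r)
# where r is the current reliability and n is the factor by which
# the test is lengthened with comparable items.
def spearman_brown(r: float, n: float) -> float:
    return (n * r) / (1 + (n - 1) * r)

r = 0.60                      # hypothetical reliability of a 10-item quiz
for n in (1, 2, 3):           # the same quiz at 10, 20, and 30 items
    print(f"{10 * n} items -> predicted reliability {spearman_brown(r, n):.2f}")
```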

Classroom tests can also be categorized based on what they are intended to measure. Traditional
paper-and-pencil classroom tests (e.g. multiple-choice, matching, true-false) are best used to
measure knowledge. They are typically objectively scored (a computer with an answer key could score
it). Performance-based tests, sometimes called authentic or alternative tests, are best used to assess
student skill or ability. They are typically subjectively scored (a teacher must apply some degree of
opinion in evaluating the quality of a response). Performance-based tests are discussed in a separate
area on this website.

Tests designed to measure knowledge are usually made up of a set of individual questions. Questions can be of two types: a) selection (or select) items, which allow students to select the correct answer from a list of possible answers (e.g. multiple-choice, matching) and b) supply items, which require students to supply the correct answer (e.g. fill-in-the-blank, short answer). Scoring selection items is usually quicker and more objective. Scoring supply items tends to take more time and is usually more subjective. Sometimes teachers decide to use selection items when they are interested in measuring basic, lower levels of understanding (at the knowledge or comprehension level in a Bloom's taxonomy sense; Bloom et al., 1956) and use supply items if they are interested in higher levels of understanding, but a well-written selection item can still get at higher levels of understanding.

Teacher-made tests can also be distinguished by when they are given and how the results are used. Tests given at the end of a unit or semester, or after learning has occurred, are called summative tests. Their purpose is to assess learning and performance, and they usually affect a student's class grade. Tests can also be given while learning is occurring, and these are called formative tests. Their purpose is to provide feedback so students can adjust how they are learning or teachers can adjust how they are teaching. Usually these tests do not affect student grades.

Classroom assessment is an integral part of teaching (Chase, 1999; Popham, 2002; Trice, 2000;
Ward & Murray-Ward, 1999) and may take more than one-third of a teacher's professional time
(Stiggins, 1991). Most classroom assessment involves tests that teachers have constructed
themselves. It is estimated that 54 teacher-made tests are used in a typical classroom per year
(Marso & Pigge, 1988) which results in perhaps billions of unique assessments yearly world-wide
(Worthen, Borg, & White, 1993). Regardless of the exact frequency, teachers regularly use tests they
have constructed themselves (Boothroyd, McMorris, & Pruzek, 1992; Marso & Pigge, 1988; Williams,
1991). Further, teachers place more weight on their own tests in determining grades and student
progress than they do on assessments designed by others or on other data sources (Boothroyd, et al.,
1992; Fennessey, 1982; Stiggins & Bridgeford, 1985; Williams, 1991).

Most teachers believe that they need strong measurement skills (Wise, Lukin & Roos, 1991).
While some report that they are confident in their ability to produce valid and reliable tests (Oescher &
Kirby, 1990; Wise, et al., 1991), others report a level of discomfort with the quality of their own tests
(Stiggins & Bridgeford, 1985) or believe that their training was inadequate (Wise, et al.). Indeed, most
state certification systems and half of all teacher education programs have no assessment course
requirement or even an explicit requirement that teachers have received training in assessment
(Boothroyd, et al.; Stiggins, 1991; Trice, 2000; Wise, et al.). In addition, teachers have historically
received little or no training or support after certification (Herman & Dorr-Bremme, 1984). The formal
assessment training teachers do receive often focuses on large-scale test administration and
standardized test score interpretation rather than on the test construction strategies or item-writing
rules that teachers need (Stiggins, 1991; Stiggins & Bridgeford, 1985).

A quality teacher-made test should follow valid item-writing rules. However, empirical studies
establishing the validity of item-writing rules are in short supply and often inconclusive, and "item-writing rules are based primarily on common sense and the conventional wisdom of test experts" (Millman & Greene, 1993, p. 353). Even after half a century of psychometric theory and research,
Cronbach (1970) bemoaned the almost complete lack of scholarly attention paid to achievement test
items. Twenty years after Cronbach's warning, Haladyna and Downing (1989) reasserted this claim,
stating that the body of knowledge about multiple-choice item writing, for example, was still quite
limited and, when revisiting the issue a decade later, added that "item writing is still largely a creative
act" (Haladyna, Downing & Rodriguez, 2002, p. 329).

The current empirical research literature for item-writing rules-of-thumb focuses on studies which
look at the relationship between a given item format and either test performance or psychometric
properties of the test related to the format choice. There are some guidelines supported by
experimental or quasi-experimental designs, but the foundation of best practices in this area remains,
essentially, only recommendations of experts. Common sense, along with an understanding of the
nature of the two characteristics of all quality tests (validity and reliability), provides the framework
that teachers use to make the best choices when designing student assessments.

Developed by: Bruce B. Frey, Ph.D., University of Kansas
