
Reliability and Validity

Levels of Measurement

Scales clarify the characteristics of measurement processes.
Scales indicate which statistical procedures are appropriate.

Nominal: categories without order. Examples: colors, gender, political party, nationality.
Ordinal: categories with order. Examples: size (S, M, L), social class, agreement (strong, some, low, none).
Interval: the distance between categories is meaningful. Examples: temperature, ACT scores, shoe size, IQ.
Ratio: the scale has an absolute zero. Examples: age, income, all rates and percents, vacation time.
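To make "scales indicate which statistical procedures are appropriate" concrete, here is a minimal sketch that lists the summary statistics usually considered meaningful at each level. The pairing of statistics to levels follows a common textbook convention and is an assumption, not a list taken from these slides; the names `Level`, `MINIMUM_LEVEL`, and `appropriate_statistics` are hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels of measurement, ordered so that higher levels keep all lower-level properties."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Lowest level at which each statistic is usually treated as meaningful
# (assumed textbook convention, for illustration only).
MINIMUM_LEVEL = {
    "mode": Level.NOMINAL,
    "frequency counts": Level.NOMINAL,
    "median": Level.ORDINAL,
    "percentiles": Level.ORDINAL,
    "mean": Level.INTERVAL,
    "standard deviation": Level.INTERVAL,
    "ratios (e.g., twice as much)": Level.RATIO,
}


def appropriate_statistics(level: Level) -> list[str]:
    """Return every statistic whose minimum level is at or below `level`."""
    return [name for name, minimum in MINIMUM_LEVEL.items() if level >= minimum]


if __name__ == "__main__":
    # Agreement (strong, some, low, none) is ordinal: a median is fine, a mean is not.
    print(appropriate_statistics(Level.ORDINAL))
```

Because each level inherits the properties of the levels below it, an `IntEnum` comparison is enough to express "this statistic needs at least interval data."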

Does my measurement procedure give the same accurate measurement each time it is used?

Reliability is consistency in measurement.
Does this procedure or test yield the same results if you repeat the measurement, so long as conditions have not changed? How do we know?

Stern Tone Variator, from The Archives of the History of American Psychology

Validity is truth in measurement.
Does this procedure or test actually measure the construct or dimension it is intended to measure? How do we know?
Lavery Psychograph, from The Archives of the History of American Psychology

Reliable: the pattern shows that the shots hit the same part of the target each time. They are consistent, so the shooting is reliable. Not valid: the goal is to hit the center of the target, but the shots are not in that area.

Valid: the pattern is evenly distributed around the correct goal (the center), so the person probably tried to hit the correct place. Not reliable: the shots are off the mark in every possible direction; they are not consistent.

Not reliable: the shots are not tightly clustered together; they are not consistent.
Not valid: to the extent there is any pattern, it is not at the true target, the center.

Reliable: the darts land close together, so the red player can reliably hit the same part of the target. Valid: the darts are clustered at the center, where they were aimed.

Bullseye!! by modenadude at http://www.flickr.com/photos/modenadude/3280286776
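The target analogy can be put into numbers. In this sketch, spread (average distance of the shots from their own center) stands in for unreliability, and bias (distance from that center to the bullseye) stands in for invalidity. This mapping is an illustrative analogy, not a formal reliability or validity statistic, and the shot coordinates and the name `summarize_shots` are hypothetical.

```python
import math

Shot = tuple[float, float]

# Hypothetical shot patterns matching the four target pictures; bullseye at (0, 0).
PATTERNS: dict[str, list[Shot]] = {
    "reliable, not valid": [(3.0, 3.1), (3.2, 2.9), (2.9, 3.0), (3.1, 3.0)],
    "valid, not reliable": [(4.0, 0.0), (-4.0, 0.0), (0.0, 4.0), (0.0, -4.0)],
    "neither": [(2.0, 5.0), (6.0, 1.0), (5.0, 6.0), (1.0, 2.0)],
    "reliable and valid": [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)],
}


def summarize_shots(shots: list[Shot], target: Shot = (0.0, 0.0)) -> tuple[float, float]:
    """Return (bias, spread) for a group of shots.

    bias:   distance from the group's average position to the target (systematic miss)
    spread: mean distance of each shot from the group's average position (inconsistency)
    """
    n = len(shots)
    cx = sum(x for x, _ in shots) / n
    cy = sum(y for _, y in shots) / n
    bias = math.hypot(cx - target[0], cy - target[1])
    spread = sum(math.hypot(x - cx, y - cy) for x, y in shots) / n
    return bias, spread


if __name__ == "__main__":
    for name, shots in PATTERNS.items():
        bias, spread = summarize_shots(shots)
        print(f"{name:22s} bias={bias:.2f} spread={spread:.2f}")
```

Small spread with large bias is the "reliable, not valid" picture; small bias with large spread is the "valid, not reliable" one.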

All of our research uses data.

Data are gathered through measurement procedures.
The scores only have meaning if they measure what they are supposed to measure (validity) and do so with accuracy and consistency (reliability).
Evaluating whether data are reliable and valid is a key element in applying research findings.
We use statistical techniques to evaluate the reliability and validity of data.

Reliability
Does the value observed and recorded accurately reflect the true value of the object?
Test by measuring the object multiple times or in multiple ways.
Every researcher must either use a known instrument or test and demonstrate the reliability of a new tool.

Validity
Does the value observed and recorded reflect the concept and dimension of interest?
Test by comparing with other data or similar processes.
Every researcher must either use a known instrument or test and demonstrate the validity of a new tool.

For both: the literature search is a huge labor-saving device, and using a known instrument improves research quality.

Unreliable measurement tools introduce error.
Reliability improves with a new tool or method.
Test-retest is the simplest way to assess reliability.
It can be used whenever the process of measuring will not, by itself, affect the data.
Test with the correlation of the first scores with the second scores.
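A minimal sketch of the test-retest check: compute the Pearson correlation between the first and second administrations for the same people. The scores below are hypothetical, and `pearson_r` is written out by hand here (rather than calling a library) so each step is visible.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between paired scores x[i] and y[i]."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around each mean.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant set of scores has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for six people tested twice under unchanged conditions.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 12]

if __name__ == "__main__":
    r = pearson_r(first_administration, second_administration)
    print(f"test-retest r = {r:.2f}")
```

A value close to 1 means people kept roughly the same relative standing from the first session to the second.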

Reliable measurements allow researchers to test their theories and hypotheses. The more error in the data, whether random or systematic, the less likely researchers are to find true and significant results. Researchers in psychology and human service fields spend months and years developing a set of questions or observations that has high reliability.

The meaning of the questions is unclear or produces random answers.
Raters are not adequately trained in the method of making ratings.
Some of the questions or items measure a subtly different dimension; they don't go with the others.
Instructions may be unclear or inconsistent, even if the test questions are fine.
Outside events may be having an effect.

The testing situation requires the use of different forms of the same tool.
Examples: school and licensing exams. Learning occurs during testing, so the same test can't be repeated.
Test by computing the correlation of two or more forms taken under the same circumstances.

Abstract or complex dimensions can't be measured directly.
Several related items or observations are more likely to produce an accurate result.
We need to verify that all the items relate to the same dimension.
Cronbach's alpha (α) is commonly reported. Interpret its strength on the same scale as a correlation.
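A minimal sketch of Cronbach's alpha using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores), where k is the number of items. The data layout (one list per item, one score per respondent) and the function name `cronbach_alpha` are assumptions for illustration; the item scores are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha for k items, each a list with one score per respondent."""
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(items[0])
    if any(len(item) != n for item in items):
        raise ValueError("every item needs a score for every respondent")
    # Sum of the variances of the individual items.
    sum_item_variances = sum(pvariance(item) for item in items)
    # Each respondent's total score across all items, and its variance.
    totals = [sum(scores) for scores in zip(*items)]
    total_variance = pvariance(totals)
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical 1-5 ratings: three items answered by five respondents.
items = [
    [4, 5, 2, 3, 4],
    [4, 4, 2, 3, 5],
    [5, 5, 1, 3, 4],
]

if __name__ == "__main__":
    print(f"alpha = {cronbach_alpha(items):.2f}")
```

When items rise and fall together across respondents, the total-score variance grows faster than the sum of item variances, pushing alpha toward 1.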

The observation process is skilled and requires individual judgment.
Trained researchers make observations of the same subject independently.
Test with the correlation of ratings (interval or ratio data) or percent agreement (nominal or ordinal data).
Taking In the View by Randy Son of Robert at http://www.flickr.com/photos/randysonofrobert/2384256036/
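For nominal or ordinal ratings, percent agreement is simply the share of cases where two raters gave the same rating. A minimal sketch, with hypothetical ratings and a hypothetical function name:

```python
def percent_agreement(rater_a: list[str], rater_b: list[str]) -> float:
    """Percentage of cases where two independent raters assigned the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)


# Hypothetical categories assigned by two trained observers to the same eight subjects.
observer_1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]
observer_2 = ["yes", "no", "no", "yes", "no", "yes", "no", "yes"]

if __name__ == "__main__":
    print(f"agreement = {percent_agreement(observer_1, observer_2):.1f}%")
```

Percent agreement does not adjust for agreement expected by chance; chance-corrected indices such as Cohen's kappa exist for that purpose.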

The measurement process needs accuracy across all possible outcomes.
Ceiling effect: the process cannot measure extremely high scores.
Floor effect: the process cannot measure extremely low scores.
Both are scale attenuation problems.
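One quick screen for scale attenuation is to check what share of scores sit at the lowest or highest possible value. In this sketch the 15% flag level is an assumed rule of thumb, not a figure from these slides, and the quiz scores and function name are hypothetical.

```python
FLAG_SHARE = 0.15  # assumed cutoff for "too many scores at an extreme"


def extreme_shares(scores: list[float], min_score: float, max_score: float) -> tuple[float, float]:
    """Return (share at floor, share at ceiling) for scores on a bounded scale."""
    if not scores:
        raise ValueError("no scores given")
    n = len(scores)
    at_floor = sum(1 for s in scores if s <= min_score) / n
    at_ceiling = sum(1 for s in scores if s >= max_score) / n
    return at_floor, at_ceiling


# Hypothetical results on a 0-10 quiz that turned out to be too easy.
quiz_scores = [10, 10, 9, 10, 8, 10, 10, 7, 10, 9]

if __name__ == "__main__":
    floor_share, ceiling_share = extreme_shares(quiz_scores, 0, 10)
    if ceiling_share > FLAG_SHARE:
        print(f"possible ceiling effect: {ceiling_share:.0%} of scores at the maximum")
    if floor_share > FLAG_SHARE:
        print(f"possible floor effect: {floor_share:.0%} of scores at the minimum")
```

Here six of ten people hit the maximum, so the quiz cannot tell the strongest of them apart.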


The real world is messy. It is not always clear how to sort out all the reliability questions.
Reliability issues are intertwined with validity. Learn by seeing what competent researchers do.

math problems for girls by woodleywonderworks at http://www.flickr.com/photos/wwworks/3597217248/


W. Andrew Harrell describes his study

Two controversies are connected to Dr. Harrell's study:

Do physical traits such as beauty have an evolutionary impact (i.e., are people with these traits more likely to have children and pass along their genes)?

Isn't beauty a subjective judgment rather than a trait that can be objectively measured in a research study?


University of Alberta (2005, April 13). Researchers Show Parents Give Unattractive Children Less Attention. ScienceDaily. Retrieved July 25, 2009, from http://www.sciencedaily.com/releases/2005/04/050412213412.htm

Dr. Harrell addresses both of those questions in this 4-minute audio clip.

Direct observation of seat belt safety
Previously validated measures of beauty
Two trained researchers evaluate attractiveness (inter-rater reliability)
Two different trained researchers observe safety (inter-rater reliability and avoiding bias)

All researchers should use identical instructions.

A larger number of items (questions) will provide a more stable measure of a complex dimension.
Some questions will be eliminated because they evoke varied and inconsistent responses.
The items need to cover the entire range of the dimension in order to observe the extreme values.
Reliability may change if observations are made in very different populations or situations (e.g., college students vs. seniors in assisted living).
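The claim that more items give a more stable measure can be quantified with the Spearman-Brown prophecy formula, a standard psychometric result that these slides do not themselves name. It predicts the reliability of a test lengthened by a factor n, under the key assumption that the added items are parallel to the existing ones (equally good measures of the same dimension). The function name and the numbers below are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after multiplying test length by `length_factor`.

    Assumes the new items are parallel to the old ones; adding weak or
    off-dimension items will not deliver the predicted gain.
    """
    if not 0 <= reliability <= 1 or length_factor <= 0:
        raise ValueError("reliability must be in [0, 1] and length_factor positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    # Hypothetical 10-item scale with reliability .60, doubled to 20 comparable items.
    print(f"predicted reliability: {spearman_brown(0.60, 2):.2f}")
```

The gains shrink as the test grows, which is one reason eliminating inconsistent items matters as much as adding new ones.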

Does my measurement procedure give a measurement of the construct or variable that I intend? Or is it measuring something else?

Validity is an overarching concern of research:

Measurement: are the observations directly and truly linked to the dimension or concept claimed?
Research design: how well does the experiment or study control the situation so that we are confident that the relationships or results observed were due to the impact of the independent variable?
In this course, we consider validity in measurement now and validity in research design later.

Face validity
Content validity: Are all aspects of the dimension or concept covered? Are any aspects over- or under-emphasized? Does the measure differentiate this dimension from other similar ones?
Improving content validity: a thorough search of the literature; consulting with experts who disagree with your perspective.

Manuscripts and checklists by Muffett at http://www.flickr.com/photos/calliope/173797447/

A measure is valid if it has a strong relationship to an external criterion.
A music audition is a valid measure if it selects the better players over those with less ability (concurrent validity).
The GRE is a valid measure if people who do well on the GRE succeed in graduate school (predictive validity).
It is often as hard to demonstrate the link between the test and the criterion as between the test and the dimension.

GRE flashcards by NEPMET at http://www.flickr.com/photos/blahman/2168064272/

The dimension to be measured is a construct, an abstract idea related to a group of interrelated variables.
The construct itself might be socially constructed: classic studies in obedience are being reinterpreted, and some cultures lack a word or idea for schizophrenia.
Researchers make their case for validity, but must be open to reconsideration.

Milgram's obedience experiment

Incorrect theory produced this psychograph. It yielded mostly random data.
The procedure measures a dimension, but not the one intended by the researcher.
The procedure measures a dimension, but there is more than one interpretation of its meaning.

Lavery Psychograph, from The Archives of the History of American Psychology

Both reliability and validity are measured on a continuum and evaluated in terms of degree. If a measurement has very low reliability, it can't be valid, because it is not even consistent. The maximum possible validity coefficient is the square root of the reliability coefficient.
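Written out as a worked bound: let r_xx be the test's reliability, r_yy the criterion's reliability, and r_xy the validity coefficient (the test's correlation with the criterion). This notation is ours, not the slides'. The square-root ceiling follows from the fact that a test cannot correlate with a criterion more strongly than the reliable parts of both allow:

```latex
r_{xy} \;\le\; \sqrt{r_{xx}\, r_{yy}} \;\le\; \sqrt{r_{xx}}
\qquad\text{e.g. } r_{xx} = .64 \;\Rightarrow\; r_{xy} \le \sqrt{.64} = .80
```

So a test with reliability .64 cannot show a validity coefficient above .80, even against a perfectly reliable criterion.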

Especially in inferential statistics, when relationships are tested, the analyst must remember to examine the quality of measurement.
