WRITTEN REPORT
Reliability
Validity
Face Validity
Content Validity
Criterion-related Validity
Concurrent Validity
Internal reliability assesses the consistency of results across items within a test.
External reliability refers to the extent to which a measure varies from one use to
another.
Internal Reliability
Split-half Method
The split-half method assesses the internal consistency of a test, such as
psychometric tests and questionnaires. That is, it measures the extent to which all parts
of the test contribute equally to what is being measured.
This is done by comparing the results of one half of a test with the results from
the other half. A test can be split in half in several ways, e.g. first half and second half,
or by odd- and even-numbered items. If the two halves of the test provide similar results, this
would suggest that the test has internal reliability.
The reliability of a test can be improved through using this method. For
example, any items on separate halves of a test which have a low correlation (e.g. r = .25)
should either be removed or re-written.
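The computation described above can be sketched in Python. The sketch below splits hypothetical item scores into odd- and even-numbered halves, correlates the half-scores, and applies the Spearman-Brown correction — a standard step for estimating full-test reliability from a half-test correlation that the text does not mention explicitly. All data and names are invented for illustration.

```python
# Split-half reliability sketch with hypothetical item scores.

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(item_scores):
    """Split items into odd/even halves, correlate the half totals,
    then apply the Spearman-Brown correction to estimate the
    reliability of the full-length test."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return (2 * r_half) / (1 + r_half)              # Spearman-Brown

# Rows = participants, columns = item scores (e.g. 1-5 Likert ratings).
scores = [
    [4, 4, 5, 4, 3, 4],
    [2, 3, 2, 2, 3, 2],
    [5, 5, 4, 5, 5, 4],
    [3, 2, 3, 3, 2, 3],
    [1, 2, 1, 2, 1, 1],
]
print(round(split_half_reliability(scores), 2))
```

A value near 1 would suggest the two halves measure the same construct; items that drag the half-correlation down are candidates for removal or rewriting.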
The split-half method is a quick and easy way to establish reliability. However, it
can only be effective with large questionnaires in which all questions measure the same
construct. This means it would not be appropriate for tests which measure different
constructs.
For example, the Minnesota Multiphasic Personality Inventory has subscales
measuring different behaviors, such as depression, schizophrenia, and social introversion.
Therefore, the split-half method would not be an appropriate way to assess the reliability
of this personality test.
External Reliability
Test-retest
The test-retest method assesses the external consistency of a test. Examples of
appropriate tests include questionnaires and psychometric tests. It measures the
stability of a test over time.
A typical assessment would involve giving participants the same test on two
separate occasions. If the same or similar results are obtained then external reliability is
established. A disadvantage of the test-retest method is that it takes a long time for
results to be obtained.
Beck et al. (1996) studied the responses of 26 outpatients at two separate
therapy sessions one week apart. They found a correlation of .93, thereby
demonstrating high test-retest reliability of the depression inventory.
This is an example of why reliability in psychological research is necessary: if
such tests were not reliable, some individuals might not be successfully
diagnosed with disorders such as depression and consequently would not be given
appropriate therapy.
The timing of the test is important; if the duration between tests is too brief, then
participants may recall information from the first test, which could bias the results.
Alternatively, if the duration is too long, it is feasible that the participants could have
changed in some important way, which could also bias the results.
Inter-rater Reliability
Inter-rater reliability assesses the external consistency of a test. It refers to
the degree to which different raters give consistent estimates of the same behavior.
Inter-rater reliability can be used for interviews.
Note, it can also be called inter-observer reliability when referring to
observational research. Here, researchers observe the same behavior
independently (to avoid bias) and compare their data. If the data are similar, then the
measure is reliable.
Where observers' scores do not significantly correlate, reliability can be
improved by:
Training observers in the observation techniques being used and making sure
everyone agrees with them.
Ensuring behavior categories have been operationalized. This means that they
have been objectively defined.
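Comparing two observers' data can be made concrete with a short Python sketch. It computes simple percent agreement and also Cohen's kappa, a common chance-corrected agreement statistic that the text does not mention; the observers' codings below are hypothetical.

```python
# Inter-observer reliability sketch with hypothetical behavior codings.
from collections import Counter

def percent_agreement(a, b):
    """Proportion of observations coded identically by both observers."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_obs = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both observers pick a category at random
    # in proportion to how often they each used it.
    p_chance = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)

obs1 = ["play", "play", "aggress", "idle", "play", "idle", "aggress", "play"]
obs2 = ["play", "play", "aggress", "idle", "idle", "idle", "aggress", "play"]

print(percent_agreement(obs1, obs2))        # 0.875
print(round(cohens_kappa(obs1, obs2), 2))   # 0.81
```

By convention, kappa values above roughly .80 are usually read as strong agreement; note how operationalized categories ("play", "aggress", "idle") make this comparison possible at all.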
A distinction can be made between internal and external validity. These types of
validity are relevant to evaluating the validity of a research study / procedure.
Internal validity refers to whether the effects observed in a study are due to the
manipulation of the independent variable and not some other factor. In other words,
there is a causal relationship between the independent and dependent variable.
Internal validity can be improved by controlling extraneous variables, using
standardized instructions, counterbalancing, and eliminating demand characteristics
and investigator effects.
External validity refers to the extent to which the results of a study can be
generalized to other settings (ecological validity), other people (population validity) and
over time (historical validity).
External validity can be improved by setting experiments in a more natural setting
and using random sampling to select participants.
FACE VALIDITY
What is Face Validity?
It is built upon the principle of reading through the plans and assessing the
viability of the research, with little objective measurement.
This 'common sense' approach often saves a lot of time, resources and stress.
For example, imagine a research paper about Global Warming. A layperson could read
through it and think that it was a solid experiment, highlighting the processes behind
Global Warming.
On the other hand, a distinguished climatology professor could read through it and find
the paper, and the reasoning behind the techniques, to be very poor.
This example shows the importance of face validity as a useful filter for eliminating
shoddy research from the field of science, through peer review.
If Face Validity is so Weak, Why is it Used?
Especially in the social and educational sciences, it is very difficult to measure
the content validity of a research program. Often, there are so many interlinked factors
that it is practically impossible to account for them all. Many researchers send their
plans to a group of leading experts in the field, asking them if they think that it is a good
and representative program.
This face validity should be good enough to withstand scrutiny and helps a
researcher to find potential flaws before they waste a lot of time and money. In the
social sciences, it is very difficult to apply the scientific method, so experience and
judgment are valued assets.
Before any physical scientists think that this has nothing to do with their more
quantifiable approach, face validity is something that pretty much every scientist uses.
Every time you conduct a literature review, and sift through past research papers, you
apply the principle of face validity.
Although you might look at who wrote the paper, where the journal was from and
who funded it, ultimately, you ask 'Does this paper do what it sets out to?' This is face
validity in action.
Process of Face Validity
There are two important steps in this process is first, experts or people who
understand your topic read through your questionnaire. The experts should evaluate
whether the questions effectively capture the topic under investigation. The second is to
have a psychometrician (i.e., one who is expert on questionnaire construction). A
psychometrician is required to check the survey for common errors like double-barreled,
confusing, and leading questions.
CONTENT VALIDITY
Test Blueprint
A test blueprint is a plan of the content areas a test is meant to cover and the
number of items devoted to each area; comparing a test's items against its blueprint
is one way to judge content validity.
Lawshe's content validity ratio (CVR) is computed from each expert panelist's
rating of whether an item is "essential":
CVR = (ne - N/2) / (N/2)
Where CVR = content validity ratio
ne = number of panelists indicating "essential"
N = total number of panelists.
Assuming a panel of ten experts, the following three examples illustrate the
meaning of the CVR when it is negative, zero, and positive.
1. Negative CVR:
When fewer than half the panelists indicate essential, the CVR is negative.
Assume four of ten panelists indicated essential; then:
CVR = (4 - 10/2) / (10/2)
= -0.2
2. Zero CVR:
When exactly half the panelists indicate essential, the CVR is zero:
CVR = (5 - 10/2) / (10/2)
= 0.0
3. Positive CVR:
When more than half but not all the panelists indicate essential, the CVR
ranges between .00 and .99. Suppose that nine of ten indicated essential; then:
CVR = (9 - 10/2) / (10/2)
= 0.80
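The three worked examples above can be reproduced with a short function (a sketch; the function and variable names are my own):

```python
# Sketch of Lawshe's content validity ratio (CVR).

def cvr(n_essential, n_panelists):
    """CVR = (ne - N/2) / (N/2)."""
    half = n_panelists / 2
    return (n_essential - half) / half

print(cvr(4, 10))  # -0.2 (negative CVR)
print(cvr(5, 10))  # 0.0  (zero CVR)
print(cvr(9, 10))  # 0.8  (positive CVR)
```

For ten panelists, an item would also need to clear Lawshe's minimum CVR (.62 at the 5% level, per the discussion below the examples) in order to be retained.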
In validating a test, the content validity ratio is calculated for each item. Lawshe
recommended that if the amount of agreement observed is more than 5% likely to occur
by chance, then the item should be eliminated. The minimal CVR values corresponding
to this 5% level are presented in Table 6-1. In the case of ten panelists, an item would
need a minimum CVR of .62. In our third example (in which nine of ten panelists
agreed), the CVR of .80 is significant and so the item could be retained. Subsequently,
in our discussion of criterion-related validity, our attention shifts to an index of validity
based not on test content but on test scores. First, some perspective on culture as it
relates to a test's validity.
CRITERION-RELATED VALIDITY
What is Criterion-Related Validity?
It is the degree to which test scores indicate a result on a specific measure that is
consistent with some other criterion of the characteristic being assessed.
It measures how well scores on one measure predict an outcome on another measure.
It is the relationship between test scores and some type of criterion or outcome.
Two types of validity are subsumed under criterion-related validity:
1 Concurrent validity, an index of the degree to which a test score is
related to some criterion measure obtained at the same time (concurrently).
2 Predictive validity, an index of the degree to which a test score
predicts some criterion measure.
What is a Criterion?
It is a standard against which a test or test score is evaluated. It can be a test
score, a specific behavior or group of behaviors, an amount of time, a rating, a
psychiatric diagnosis, a training cost, an index of absenteeism, an index of alcohol
intoxication, and so on. Ideally, it is relevant, valid, and uncontaminated.
Characteristics of a Criterion
1 Relevant
A criterion should be pertinent to the matter at hand.
2 Valid
An adequate criterion measure must also be valid for the purpose for which it
is used.
3 Uncontaminated
An uncontaminated criterion is one whose outcome has not been biased.
Criterion contamination is the term applied to a criterion measure that has
been based, at least in part, on predictor measures. We should not rely on a
single factor as both the predictor and the criterion, because when criterion
contamination occurs, the results of the validation study cannot be taken
seriously. There are no methods or statistics to correct for such contamination
or to gauge the extent to which it has taken place.
Examples of a Criterion
A diagnosis made by a panel of experts, for example, can serve as the criterion
for validating a diagnostic test; this in turn raises the issue of whether the criterion
(in this case, the diagnoses made by panel members) was indeed valid.
Examples of Criterion Contamination
1 Consider a hypothetical Inmate Violence Potential Test (IVPT) designed to
predict a prisoner's potential for violence in the cell block. In part, this evaluation
entails ratings from fellow inmates, guards, and other staff in order to come up
with a number that represents each inmate's violence potential. After all of the
inmates have been given scores on this test, the study authors attempt to
validate the test by asking guards to rate each inmate on their violence potential.
Because the guards' opinions were used to formulate the inmates' test scores in
the first place (the predictor variable), the guards' opinions cannot be used as a
criterion against which to judge the soundness of the test. Since the guards'
opinions were used both as a predictor and as a criterion, criterion
contamination has occurred.
Concurrent Validity
As the name suggests, concurrent validity relies upon tests that took place at the
same time. Ideally, this means testing the subjects at exactly the same moment,
but some approximation is acceptable. This gives us confidence that the two
measurement procedures are measuring the same thing (i.e., the same construct).
1 Suppose researchers develop a new test of mathematical aptitude and
administer it to a group of students at the same time as an established
aptitude test. Cross-referencing the scores for each student allows the
researchers to check if there is a correlation, evaluate the accuracy of their
test, and decide whether it measures what it is supposed to. The key element
is that the two methods were compared at about the same time.
If the researchers had measured the mathematical aptitude, implemented a new
educational program, and then retested the students after six months, this would
be predictive validity.
2 Imagine that you are a psychologist developing a new psychological test
designed to measure depression, called the Rice Depression Scale. Once your
test is fully developed, you decide that you want to make sure that it is valid; in
other words, you want to make sure that the test accurately measures what it is
supposed to measure. One way to do this is to look for other tests that have
already been found to be valid measures of your construct, administer both tests,
and compare the results of the tests to each other.
Since the construct, or psychological concept, that you want to measure is
depression, you search for psychological tests that measure depression. In your
search, you come across the Beck Depression Inventory, which researchers
have determined through several studies is a valid measure of depression. You
recruit a sample of individuals to take both the Rice Depression Scale and the
Beck Depression Inventory at the same time.
You analyze the results and find that scores on the Rice Depression Scale have a
high positive correlation with scores on the Beck Depression Inventory. That is, the
higher the individual scores on the Rice Depression Scale, the higher their score
on the Beck Depression Inventory. Likewise, the lower the score on the Rice
Depression Scale, the lower the score on the Beck Depression Inventory. You
conclude that the scores on the Rice Depression Scale correspond to the scores
on the Beck Depression Inventory. You have just established concurrent validity.
3 Concurrent validity can also occur between two different groups. For example,
let's say a group of nursing students take two final exams to assess their
knowledge. One exam is a practical test and the second exam is a paper test. If
the students who score well on the practical test also score well on the paper
test, then concurrent validity has occurred. If, on the other hand, students who
score well on the practical test score poorly on the paper test (and vice versa),
then you have a problem with concurrent validity. In this particular example, you
would question the ability of either test to assess knowledge.
Advantages and Disadvantages of Concurrent Validity
Advantages:
Because the test and the criterion measure are obtained at about the same
time, results are available quickly; there is no need to wait for an outcome, as
in predictive validation.
Disadvantages:
If you are testing different groups, like people who want jobs and people who
have jobs, responses may differ between groups. For example, people who
already have jobs may be less inclined to put their best foot forward.
Process of Validation
In concurrent validation, the Pearson product-moment correlation coefficient, or
simply Pearson's r, may be used.
Pearson Product-Moment Correlation Coefficient
A measure of the strength of a linear association between two variables, denoted
by r. Basically, a Pearson product-moment correlation attempts to draw a line
of best fit through the data of two variables, and the Pearson correlation coefficient, r,
indicates how far away all the data points are from this line of best fit (i.e., how well the
data points fit this new model/line of best fit).
The general formula for the Pearson correlation coefficient (r) calls for the
following quantities, computed over the n pairs of scores:
the sum of the x values (Σx)
the sum of the y values (Σy)
the sum of the squared x values (Σx²)
the sum of the squared y values (Σy²)
the sum of the cross-products (Σxy)
With these, the computational form of the equation is:
r = [nΣxy - (Σx)(Σy)] / √{[nΣx² - (Σx)²][nΣy² - (Σy)²]}
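As a sketch, the computational form of the formula can be coded directly; the two score lists below stand in for hypothetical concurrent-validation data (a new test and an established test taken by the same people).

```python
# Pearson's r via the computational ("raw score") formula.

def pearson_r(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # cross-products
    sum_x2 = sum(a * a for a in x)              # squared x values
    sum_y2 = sum(b * b for b in y)              # squared y values
    num = n * sum_xy - sum_x * sum_y
    den = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
    return num / den

# Hypothetical scores: a new test (x) and an established test (y)
# administered concurrently to the same six people.
new_test = [12, 15, 9, 20, 17, 11]
old_test = [14, 16, 10, 21, 18, 13]
print(round(pearson_r(new_test, old_test), 2))
```

A high positive r here would support the concurrent validity of the new test against the established one.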
A table of values is used to evaluate the strength of the relationship between the
variables, given the result of the test:
0.00 to 0.20    Negligible/ inverse r
0.21 to 0.40    Low/ slight r
0.41 to 0.70    Marked/ substantial r
0.71 to 1.00    High/ very high r
References:
Cohen-Swerdlik, Psychological Testing and Assessment: An Introduction to Tests and
Measurement, 7th Edition
http://dissertation.laerd.com/criterion-validity-concurrent-and-predictive-validity-p2.php
https://explorable.com/concurrent-validity
http://jfmueller.faculty.noctrl.edu/toolbox/howstep3.htm
https://www.nap.edu/read/1862/chapter/10
http://www.statisticshowto.com/concurrent-validity/
http://study.com/academy/lesson/concurrent-validity-definition-examples.html