
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES

COLLEGE OF SOCIAL SCIENCES AND DEVELOPMENT


DEPARTMENT OF PSYCHOLOGY

Bernardo, Isabel Joyce


Martin, Monica
Nabablit, Diala
Oberas, Diana Mae
Padilla, Kiela Louise
Pagulayan, Jehan Josef
Pico, Kookai Camille
Santos, Angela Marinella
Sicabalo, John Patrick
Turingan, Jensken Bremel
BS Psychology 3-1
Prof. Rodrigo Lopiga, RPm, RPsy

WRITTEN REPORT

Reliability
Validity
Face Validity
Content Validity
Criterion-related Validity
Concurrent Validity

RELIABILITY AND VALIDITY


What is Reliability?

Reliability in statistics and psychometrics is the overall consistency of a measure.


A measure is said to have a high reliability if it produces similar results under consistent
conditions. "It is the characteristic of a set of test scores that relates to the amount of
random error from the measurement process that might be embedded in the scores.
Scores that are highly reliable are accurate, reproducible, and consistent from one
testing occasion to another. That is, if the testing process were repeated with a group of
test takers, essentially the same results would be obtained. Various kinds of reliability
coefficients, with values ranging between 0.00 (much error) and 1.00 (no error), are
usually used to indicate the amount of error in the scores."
Example of Reliability
For example, if a person weighs themselves during the course of a day they
would expect to see a similar reading. Scales which measured weight differently each
time would be of little use.
The same analogy could be applied to a tape measure which measures inches
differently each time it was used. It would not be considered reliable.
If findings from research are replicated consistently they are reliable. A
correlation coefficient can be used to assess the degree of reliability. If a test is reliable
it should show a high positive correlation.
Of course, it is unlikely the exact same results will be obtained each time as
participants and situations vary, but a strong positive correlation between the results of
the same test indicates reliability.
Types of Reliability
There are two types of reliability: internal and external reliability.

Internal reliability assesses the consistency of results across items within a test.

External reliability refers to the extent to which a measure varies from one use to
another.

Internal Reliability
Split-half Method
The split-half method assesses the internal consistency of a test, such as
psychometric tests and questionnaires. That is, it measures the extent to which all parts
of the test contribute equally to what is being measured.
This is done by comparing the results of one half of a test with the results from
the other half. A test can be split in half in several ways, e.g. first half and second half,
or by odd and even numbers. If the two halves of the test provide similar results this
would suggest that the test has internal reliability.
The reliability of a test could be improved through using this method. For
example, any items on separate halves of a test which have a low correlation (e.g. r = .25)
should either be removed or re-written.
The split-half method is a quick and easy way to establish reliability. However, it
can only be effective with large questionnaires in which all questions measure the same
construct. This means it would not be appropriate for tests which measure different
constructs.
For example, the Minnesota Multiphasic Personality Inventory has subscales
measuring different behaviors, such as depression, schizophrenia, and social introversion.
Therefore the split-half method would not be an appropriate way to assess the reliability
of this personality test.
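For a test whose items all measure the same construct, the computation is straightforward. Below is a minimal sketch in Python of the split-half procedure using an odd/even split; the item scores and function name are hypothetical, and the Spearman-Brown correction applied at the end is the standard adjustment for having halved the test's length.

import numpy as np

def split_half_reliability(item_scores):
    # item_scores: rows = test takers, columns = items
    scores = np.asarray(item_scores, dtype=float)
    odd_half = scores[:, 0::2].sum(axis=1)   # totals on items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)  # totals on items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    # Spearman-Brown correction: estimated reliability of the full-length test
    return (2 * r_half) / (1 + r_half)

# Hypothetical responses of five test takers to a six-item questionnaire
data = [[4, 5, 4, 4, 5, 4],
        [2, 1, 2, 2, 1, 2],
        [3, 3, 4, 3, 3, 3],
        [5, 5, 5, 4, 5, 5],
        [1, 2, 1, 2, 2, 1]]
print(round(split_half_reliability(data), 2))  # a value near 1 suggests internal reliability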
External Reliability
Test-retest
The test-retest method assesses the external consistency of a test. Examples of
appropriate tests include questionnaires and psychometric tests. It measures the
stability of a test over time.
A typical assessment would involve giving participants the same test on two
separate occasions. If the same or similar results are obtained then external reliability is
established. The main disadvantage of the test-retest method is that it takes a long time for
results to be obtained.
Beck et al. (1996) studied the responses of 26 outpatients across two therapy
sessions one week apart and found a correlation of .93, demonstrating the high
test-retest reliability of the depression inventory.
This is an example of why reliability in psychological research is necessary: if it
were not for the reliability of such tests, some individuals might not be successfully
diagnosed with disorders such as depression and consequently would not be given
appropriate therapy.
The timing of the test is important; if the interval is too brief, participants may
recall information from the first test, which could bias the results. Alternatively, if the
interval is too long, it is feasible that the participants could have changed in some
important way, which could also bias the results.
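As a minimal sketch of the procedure, the test-retest coefficient is simply the correlation between scores from the two administrations. The outpatient scores below are invented for illustration; they are not Beck et al.'s (1996) actual data.

import numpy as np

# Hypothetical depression-inventory scores for the same eight outpatients,
# tested one week apart
time1 = [12, 25, 8, 30, 18, 22, 5, 27]
time2 = [14, 24, 9, 29, 17, 23, 6, 25]

# Test-retest reliability is the correlation between the two occasions
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest r = {r:.2f}")  # a strong positive r indicates stability over time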
Inter-rater Reliability
Like the test-retest method, inter-rater reliability concerns the external consistency of
a test. It refers to the degree to which different raters give consistent estimates of the
same behavior. Inter-rater reliability can be used for interviews.
Note, it can also be called inter-observer reliability when referring to
observational research. Here, researchers observe the same behavior
independently (to avoid bias) and compare their data. If the data are similar, then they are
reliable.
Where observer scores do not significantly correlate then reliability can be
improved by:

Training observers in the observation techniques being used and making sure
everyone agrees with them.

Ensuring behavior categories have been operationalized. This means that they
have been objectively defined.

For example, if two researchers are observing the aggressive behavior of children
at a nursery, they would both have their own subjective opinion regarding what aggression
comprises. In this scenario it would be unlikely they would record aggressive behavior
the same way, and the data would be unreliable.
However, if they were to operationalize the behavior category of aggression this
would be more objective and make it easier to identify when a specific behavior occurs.
For example, while aggressive behavior is subjective and not operationalized,
pushing is objective and operationalized. Thus researchers could simply count how
many times children push each other over a certain duration of time.
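Given such operationalized counts, agreement between the two observers can be quantified. The sketch below uses Cohen's kappa, a standard chance-corrected agreement index; kappa is not discussed in the text above, and the interval codes are hypothetical.

def cohens_kappa(rater_a, rater_b):
    # Observed agreement corrected for the agreement expected by chance
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_chance = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical interval coding by two independent observers:
# 1 = a push occurred in the interval, 0 = no push
obs_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
obs_b = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(round(cohens_kappa(obs_a, obs_b), 2))  # 0.78: substantial agreement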
What is Validity?
The concept of validity was formulated by Kelly (1927, p. 14), who stated that a
test is valid if it measures what it claims to measure. The word "valid" is derived from the
Latin validus, meaning strong. Validity is important because it can help determine what
types of tests to use, and help to make sure researchers are using methods that are not
only ethical and cost-effective, but that also truly measure the idea or construct in
question.
For example, a test of intelligence should measure intelligence and not something
else (such as memory).
Types of Validity
Internal and External Validity
A distinction can be made between internal and external validity. These types of
validity are relevant to evaluating the validity of a research study / procedure.
Internal validity refers to whether the effects observed in a study are due to the
manipulation of the independent variable and not some other factor. In other words, it
asks whether there is a causal relationship between the independent and dependent variables.
Internal validity can be improved by controlling extraneous variables, using
standardized instructions, counterbalancing, and eliminating demand characteristics
and investigator effects.
External validity refers to the extent to which the results of a study can be
generalized to other settings (ecological validity), other people (population validity) and
over time (historical validity).
External validity can be improved by setting experiments in a more natural setting
and using random sampling to select participants.
FACE VALIDITY
What is Face Validity?

Face validity, as the name suggests, is a measure of how representative a
research project is 'at face value,' and whether it appears to be a good project.
It is built upon the principle of reading through the plans and assessing the
viability of the research, with little objective measurement.
Whilst face validity, sometimes referred to as representation validity, is a weak
measure of validity, its importance should not be underestimated.
This 'common sense' approach often saves a lot of time, resources and stress.

Examples of Face Validity


In many ways, face validity offers a contrast to content validity, which attempts to
measure how accurately an experiment represents what it is trying to measure. The
difference is that content validity is carefully evaluated, whereas face validity is a more
general measure and the subjects often have input.
For example, after a group of students has sat a test, you could ask them for feedback,
specifically whether they thought the test was a good one. This enables refinements for
the next research project and adds another dimension to establishing validity.
Face validity is classed as 'weak evidence' supporting construct validity, but that does
not mean that it is incorrect, only that caution is necessary.
For example, imagine a research paper about Global Warming. A layperson could read
through it and think that it was a solid experiment, highlighting the processes behind
Global Warming.
On the other hand, a distinguished climatology professor could read through it and find
the paper, and the reasoning behind the techniques, to be very poor.
This example shows the importance of face validity as a useful filter for eliminating
shoddy research from the field of science, through peer review.
If Face Validity is so Weak, Why is it Used?
Especially in the social and educational sciences, it is very difficult to measure
the content validity of a research program. Often, there are so many interlinked factors
that it is practically impossible to account for them all. Many researchers send their
plans to a group of leading experts in the field, asking them if they think that it is a good
and representative program.
This face validity should be good enough to withstand scrutiny and helps a
researcher to find potential flaws before they waste a lot of time and money. In the
social sciences, it is very difficult to apply the scientific method, so experience and
judgment are valued assets.
Before any physical scientists think that this has nothing to do with their more
quantifiable approach, face validity is something that pretty much every scientist uses.
Every time you conduct a literature review, and sift through past research papers, you
apply the principle of face validity.
Although you might look at who wrote the paper, where the journal was from and
who funded it, ultimately, you ask 'Does this paper do what it sets out to?' This is face
validity in action.
Process of Face Validity
There are two important steps in this process. First, experts or people who
understand your topic read through your questionnaire; they should evaluate
whether the questions effectively capture the topic under investigation. Second, have a
psychometrician (i.e., an expert in questionnaire construction) check the survey for
common errors such as double-barreled, confusing, and leading questions.

CONTENT VALIDITY
What is Content Validity?

Content validity describes a judgment of how adequately a test samples behavior
representative of the universe of behavior that the test was designed to sample.

With respect to educational achievement tests, it is customary to consider a test
a content-valid measure when the proportion of material covered by the test
approximates the proportion of material covered in the course. A cumulative final
exam in introductory statistics would be considered content-valid if the proportion
and type of introductory statistics problems on the test approximates the
proportion and type of introductory statistics problems presented in the course.

Test Blueprint

A test blueprint is the structure of the evaluation: a plan regarding the types of
information to be covered by the items, the number of items tapping each area of
coverage, and the organization of the items in the test.

The Quantification of Content Validity


The measurement of content validity is important in employment settings, where
tests used to hire and promote people are carefully scrutinized for their relevance to the
job, among other factors (Russell & Peterson, 1997). Courts often require evidence that
employment tests are work related. Several methods for quantifying content validity
have been created (for example, James et al., 1984; Lindell et al., 1999; Tinsley &
Weiss, 1975). One method of measuring content validity, developed by C. H. Lawshe, is
essentially a method for gauging agreement among raters or judges regarding how
essential a particular item is. Lawshe (1975) proposed that each rater respond to the
following question for each item: Is the skill or knowledge measured by this item
essential, useful but not essential, or not necessary to the performance of the job?
Examples of Content Validity
A test of assertiveness, for example, might sample behavior:
1 At home, such as whether the respondent has difficulty in making her or his
views known to fellow family members;
2 On the job, such as whether the respondent has difficulty in asking subordinates
to do what is required of them;
3 And in social situations, such as whether the respondent would send back a
steak not done to order in a fancy restaurant.
Process of Validation
Greater levels of content validity exist as larger numbers of panelists agree that a
particular item is essential. Using these assumptions, Lawshe developed a formula
termed the content validity ratio (CVR):
CVR = (ne − N/2) / (N/2)
where CVR = content validity ratio,
ne = number of panelists indicating "essential," and
N = total number of panelists.
Assuming a panel of ten experts, the following three examples illustrate the
meaning of the CVR when it is negative, zero, and positive.
1. Negative CVR:
When fewer than half the panelists indicate essential, the CVR is negative.
Assume four of ten panelists indicated essential; then:
CVR = (4 − 10/2) / (10/2) = −0.20
2. Zero CVR:
When exactly half the panelists indicate essential, the CVR is zero:

CVR = (5 − 10/2) / (10/2) = 0.00
3. Positive CVR:
When more than half but not all the panelists indicate essential, the CVR
ranges between .00 and .99. Suppose that nine of ten indicated essential; then:
CVR = (9 − 10/2) / (10/2) = 0.80
In validating a test, the content validity ratio is calculated for each item. Lawshe
recommended that if the amount of agreement observed is more than 5% likely to occur
by chance, then the item should be eliminated. The minimal CVR values corresponding
to this 5% level are presented in Lawshe's (1975) table of critical values; in the case of
ten panelists, an item would need a minimum CVR of .62. In our third example (in which
nine of ten panelists agreed), the CVR of .80 is significant, and so the item could be
retained.
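The CVR computation is easy to automate. The short Python sketch below reproduces the three worked examples above and applies the .62 minimum for a ten-person panel; critical values for other panel sizes would have to be taken from Lawshe's (1975) table.

def content_validity_ratio(n_essential, n_panelists):
    # Lawshe's CVR = (ne - N/2) / (N/2)
    half = n_panelists / 2
    return (n_essential - half) / half

# The three worked examples, each with a panel of ten experts
for n_essential in (4, 5, 9):
    print(n_essential, "of 10 essential -> CVR =",
          content_validity_ratio(n_essential, 10))
# prints -0.2, 0.0, and 0.8 respectively

# Retention rule for ten panelists: keep the item only if CVR >= .62
print(content_validity_ratio(9, 10) >= 0.62)  # True, so the item is retained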
CRITERION-RELATED VALIDITY
What is Criterion-Related Validity?

It is the degree to which test scores indicate a result on a specific measure that is
consistent with some other criterion of the characteristic being assessed.

It measures how well one measure predicts an outcome on another measure.

It is the relationship between test scores and some type of criterion or outcome.

There are two types of validity subsumed under criterion-related validity:
1 Concurrent validity, which is an index of the degree to which a test score is
related to some criterion measure obtained at the same time (concurrently).
2 Predictive validity, which is an index of the degree to which a test score
predicts some criterion measure.
What is a Criterion?
It is a standard against which a test or test score is evaluated. It can be a test
score, a specific behavior or group of behaviors, an amount of time, a rating, a
psychiatric diagnosis, a training cost, an index of absenteeism, an index of alcohol
intoxication, and so on. Ideally, a criterion is relevant, valid, and uncontaminated.
Characteristics of a Criterion
1 Relevant
A relevant criterion is pertinent to the matter at hand, and an adequate criterion
measure must also be valid for the purpose for which it is used.
2 Uncontaminated
An uncontaminated criterion is one that has not been biased by the predictor.
Criterion contamination is the term applied to a criterion measure that has been
based, at least in part, on predictor measures. We cannot rely on the same factor
as both the predictor and the criterion, because when criterion contamination
does occur, the results of the validation study cannot be taken seriously. There
are no methods or statistics to correct for such contamination or to gauge the
extent to which it has taken place.
Examples of what is a Criterion

1 If a test purports to measure the trait of athleticism, we might expect to employ
membership in a health club or any generally accepted measure of physical
fitness as a criterion in evaluating whether the athleticism test truly measures
athleticism. Operationally, a criterion can be most anything: pilot performance in
flying a Boeing 767, grade on an examination in Advanced Hairweaving, number of
days spent in psychiatric hospitalization; the list is endless.
2 Here is a standard from the Special Education collection of examples:
The student will conduct banking transactions. The authentic task this teacher
assigned to students to assess the standard was to make deposits, withdrawals
or cash checks at a bank. To identify the criteria for good performance on this
task, the teacher asked herself "what would good performance on this task look
like?" She came up with seven essential characteristics for successful completion
of the task:

Selects needed form (deposit, withdrawal)

Fills in form with necessary information

Endorses check

Locates open teller

States type of transaction

Counts money to be deposited to teller

Puts money received in wallet


If students meet these criteria then they have performed well on the task and,
thus, have met the standard or, at least, provided some evidence of meeting the
standard.

Examples of a Relevant Criterion


1 Consider a test purporting to advise test users whether individuals share the
same interests as successful actors. Once the test has been administered and
the results obtained, validating the outcome requires the opinions of successful
actors, because they act as the criterion, or the standard.
2 As another example, a test purporting to measure depression is said to have
been validated using as a criterion the diagnoses made by a blue-ribbon panel of
psychodiagnosticians. A test user might wish to probe further regarding variables
such as the credentials of the blue-ribbon panel (that is, their educational
background, training, and experience) and the actual procedures used to validate
a diagnosis of depression. Answers to such questions would help address the
issue of whether the criterion (in this case, the diagnoses made by panel
members) was indeed valid.
Examples of Criterion Contamination
1 Consider a hypothetical Inmate Violence Potential Test (IVPT) designed to
predict a prisoner's potential for violence in the cell block. In part, this evaluation
entails ratings from fellow inmates, guards, and other staff in order to come up
with a number that represents each inmate's violence potential. After all of the
inmates have been given scores on this test, the study authors attempt to
validate the test by asking guards to rate each inmate on their violence potential.
Because the guards' opinions were used to formulate the inmates' test scores in
the first place (the predictor variable), the guards' opinions cannot be used as a
criterion against which to judge the soundness of the test. Once the guards'
opinions were used both as a predictor and as a criterion, we would say that
criterion contamination had occurred.
2 Similarly, suppose researchers administer a test designed to measure the
depression levels of adults in a nursing home, with staff ratings contributing in
part to each resident's score. If the researchers then attempt to validate the
results against the opinions of the same staff, the staff's opinions are used as
both the predictor and the criterion. As in the first example, criterion
contamination has occurred.
CONCURRENT VALIDITY

What is Concurrent Validity?

Concurrent validity is a concept commonly used in psychology, education, and
social science.

It refers to the extent to which the results of a particular test, or measurement,
correspond to those of a previously established measurement for the same
construct.

As the name suggests, concurrent validity relies upon tests that took place at the
same time. Ideally, this means testing the subjects at exactly the same moment,
but some approximation is acceptable.

This gives us confidence that the two measurement procedures are measuring
the same thing (i.e., the same construct).

Examples of Concurrent Validity


1 Researchers give a group of students a new test, designed to measure
mathematical aptitude. They then compare this with the test scores already held
by the school, a recognized and reliable judge of mathematical ability.
Cross referencing the scores for each student allows the researchers to check if
there is a correlation, evaluate the accuracy of their test, and decide whether it
measures what it is supposed to. The key element is that the two methods were
compared at about the same time.
If the researchers had measured the mathematical aptitude, implemented a new
educational program, and then retested the students after six months, this would
be predictive validity.
2 Imagine that you are a psychologist developing a new psychological test
designed to measure depression, called the Rice Depression Scale. Once your
test is fully developed, you decide that you want to make sure that it is valid; in
other words, you want to make sure that the test accurately measures what it is
supposed to measure. One way to do this is to look for other tests that have
already been found to be valid measures of your construct, administer both tests,
and compare the results of the tests to each other.
Since the construct, or psychological concept, that you want to measure is
depression, you search for psychological tests that measure depression. In your
search, you come across the Beck Depression Inventory, which researchers
have determined through several studies is a valid measure of depression. You
recruit a sample of individuals to take both the Rice Depression Scale and the
Beck Depression Inventory at the same time.
You analyze the results and find the scores on the Rice Depression Scale have a
high positive correlation with the scores on the Beck Depression Inventory. That is, the
higher an individual scores on the Rice Depression Scale, the higher their score
on the Beck Depression Inventory. Likewise, the lower the score on the Rice
Depression Scale, the lower the score on the Beck Depression Inventory. You
conclude that the scores on the Rice Depression Scale correspond to the scores
on the Beck Depression Inventory. You have just established concurrent validity
(a computational sketch of this comparison follows the examples below).
3 Concurrent validity can also be examined between two different types of tests.
For example, let's say a group of nursing students take two final exams to assess their
knowledge. One exam is a practical test and the second exam is a paper test. If
the students who score well on the practical test also score well on the paper
test, then concurrent validity has been demonstrated. If, on the other hand, students who
score well on the practical test score poorly on the paper test (and vice versa),
then you have a problem with concurrent validity. In this particular example, you
would question the ability of either test to assess knowledge.
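As referenced in the second example, here is a minimal sketch of how the comparison between the hypothetical Rice Depression Scale and the Beck Depression Inventory might be run; the scores are invented for illustration, and the statistic is the Pearson correlation described under Process of Validation below.

import numpy as np

# Hypothetical scores for six participants who took both tests at the same time
rice = [10, 25, 7, 31, 18, 22]  # new (hypothetical) Rice Depression Scale
beck = [11, 27, 9, 30, 16, 24]  # established Beck Depression Inventory

r = np.corrcoef(rice, beck)[0, 1]
print(f"r = {r:.2f}")  # a high positive r is evidence of concurrent validity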
Advantages and Disadvantages of Concurrent Validity
Advantages:

It is a fast way to validate your data.

It is a highly appropriate way to validate personal attributes (i.e., depression, IQ,
strengths and weaknesses).

Disadvantages:

It is less effective than predictive validity at predicting future performance or
potential, such as job performance or the ability to succeed in college.

If you are testing different groups, like people who want jobs and people who
have jobs, responses may differ between groups. For example, people who
already have jobs may be less inclined to put their best foot forward.

Process of Validation
In concurrent validation, the Pearson product-moment correlation coefficient, or
simply Pearson's r, may be used.
Pearson Product-Moment Correlation Coefficient
Pearson's r is a measure of the strength of a linear association between two variables.
Basically, a Pearson product-moment correlation attempts to draw a line
of best fit through the data of two variables, and the correlation coefficient, r,
indicates how far away all these data points are from this line of best fit (that is, how
well the data points fit this line).

The general (raw-score) formula for the Pearson correlation coefficient calls for
the summation of the x values, the y values, the products of the x and y values,
and the squares of the x and y values:

r = [NΣxy − (Σx)(Σy)] / √{[NΣx² − (Σx)²][NΣy² − (Σy)²]}


An equivalent deviation-score form of the equation may also be used, where Mx
and My are the means of the x and y values:

r = Σ(x − Mx)(y − My) / √[Σ(x − Mx)² · Σ(y − My)²]
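A minimal Python sketch implementing the raw-score formula above, on hypothetical paired scores:

import math

def pearson_r(x, y):
    # Raw-score (computational) formula for Pearson's r
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical scores of five examinees on two measures
x = [10, 20, 30, 40, 50]
y = [12, 18, 33, 38, 52]
print(round(pearson_r(x, y), 2))  # 0.99: a high/very high correlation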
A table of values may then be used to interpret the strength of the relationship between
the variables, given the resulting r:
r value          Interpretation
0.00 to 0.20     Negligible correlation
0.21 to 0.40     Low / slight correlation
0.41 to 0.70     Marked / substantial correlation
0.71 to 1.00     High / very high correlation

References:
Cohen, R. J., & Swerdlik, M. E. Psychological Testing and Assessment: An Introduction
to Tests and Measurement (7th ed.).
http://dissertation.laerd.com/criterion-validity-concurrent-and-predictive-validity-p2.php
https://explorable.com/concurrent-validity
http://jfmueller.faculty.noctrl.edu/toolbox/howstep3.htm
https://www.nap.edu/read/1862/chapter/10
http://www.statisticshowto.com/concurrent-validity/
http://study.com/academy/lesson/concurrent-validity-definition-examples.html
