Correlation Analysis

CORRELATION ANALYSIS
1. What does correlation between variables tell us?

Correlation is the degree to which two or more quantities or variables are linearly associated. In a
two-dimensional plot, the degree of correlation between the values on the two axes is quantified by the so-called
correlation coefficient.
Correlation is a statistical measurement of the relationship between two variables. Possible correlations range
from +1 to -1. A zero correlation indicates that there is no linear relationship between the variables. A correlation
of -1 indicates a perfect negative correlation, meaning that as one variable goes up, the other goes down. A
correlation of +1 indicates a perfect positive correlation, meaning that both variables move in the same direction
together.
Correlation analysis is a statistical technique that can show whether and how strongly pairs of variables are related.
How are they estimated?

Correlation is estimated by quantifying it through a value known as the correlation coefficient, usually represented
by r. Statistics provides various types of correlation coefficients; which one to use depends on several factors, such
as the kind of variables being correlated.
The table below shows how different types of data, categorized according to measurement scale, may be
correlated and which statistical tool provides the appropriate correlation coefficient.
Variable Y\X      Quantitative X        Ordinal X                            Nominal X
Quantitative Y    Pearson r             Biserial rb                          Point Biserial rpb (ad)
                                                                             (naturally dichotomous)
Ordinal Y         Biserial rb           Spearman rho / Tetrachoric rtet      Rank Biserial rrb
                                        (artificially dichotomous)
Nominal Y         Point Biserial rpb    Rank Biserial rrb                    Phi, L, C, Lambda
2. What is the range of values that a correlation coefficient may take? How is the particular range of values of
correlation coefficient interpreted?
The main result of a correlation is called the correlation coefficient (r). It ranges from -1.0 to +1.0. The closer r
is to +1 or -1, the more closely the two variables are related.
If r is close to 0, there is little or no linear relationship between the variables. If r is positive, it means that as
one variable gets larger the other gets larger. If r is negative, it means that as one gets larger, the other gets
smaller (often called an "inverse" correlation).
While correlation coefficients are normally reported as r, squaring them makes them easier to understand. The
square of the coefficient, read as a percentage, is the percent of the variation in one variable that is related to the
variation in the other. For example, r = 0.5 means that 25% of the variation is related.
A correlation report can also show a second result of each test: statistical significance. In this case, the
significance level tells you how likely it is that the correlations reported may be due to chance in the form of random
sampling error. If you are working with small sample sizes, choose a report format that includes the significance
level. This format also reports the sample size.
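The link between r, its square, and significance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name r_squared_and_t is hypothetical, the significance check uses the standard t statistic for a correlation, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom, and the inputs r = 0.79 and n = 6 are borrowed from the Algebra and Trigonometry example in item 3a.

```python
import math

def r_squared_and_t(r, n):
    """Return (r squared, t statistic) for a correlation r from n paired observations.

    Assumes the usual t test for a correlation coefficient,
    t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.
    """
    r2 = r * r
    t = r * math.sqrt((n - 2) / (1.0 - r2))
    return r2, t

# r and n borrowed from the Algebra/Trigonometry example (item 3a).
r2, t = r_squared_and_t(0.79, 6)
print(f"r^2 = {r2:.2f}")
print(f"t   = {t:.2f}  (compare with a t table at n - 2 = 4 degrees of freedom)")
```

With only six pairs, t comes out a little under 2.6, which is below the two-tailed 5% critical value of 2.776 for 4 degrees of freedom. This is the small-sample caution in practice: even a fairly large r may not be statistically significant.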
3. For each correlation coefficient, provide a description and an illustrative example to show its appropriateness and
how it can be computed.
a. Pearson Product-Moment Correlation
Description
The Pearson product-moment correlation coefficient is a measure of the strength of a linear association between
two variables and is denoted by r. Basically, a Pearson product-moment correlation attempts to draw a line of best
fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far away all
these data points are from this line of best fit.
Formula
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
Von Christopher G. Chua, LPT, MST

Interpretation
The Pearson correlation coefficient, r, can take a range of values from +1 to -1. A value of 0 indicates that there
is no linear association between the two variables. A value greater than 0 indicates a positive association; that is,
as the value of one variable increases, so does the value of the other variable. A value less than 0 indicates a
negative association; that is, as the value of one variable increases, the value of the other variable decreases.
-1.0 to -0.7 strong negative association.
-0.7 to -0.3 weak negative association.
-0.3 to +0.3 little or no association.
+0.3 to +0.7 weak positive association.
+0.7 to +1.0 strong positive association.
Example
The following table shows the grades obtained by six students in Algebra and Trigonometry. Compute the Pearson
product-moment correlation coefficient.
Student No.      1    2    3    4    5    6
Algebra         83   78   94   90   88   88
Trigonometry    82   83   93   94   84   86
To solve for the correlation coefficient, some values in the formula must be obtained.
x      y      x^2     y^2     xy
83     82     6889    6724    6806
78     83     6084    6889    6474
94     93     8836    8649    8742
90     94     8100    8836    8460
88     84     7744    7056    7392
88     86     7744    7396    7568
Σx = 521    Σy = 522    Σx^2 = 45397    Σy^2 = 45550    Σxy = 45442
Computation:
$$ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}} $$
$$ r = \frac{(6)(45442) - (521)(522)}{\sqrt{\left[(6)(45397) - (521)^2\right]\left[(6)(45550) - (522)^2\right]}} $$
$$ r = \frac{690}{\sqrt{(941)(816)}} \approx \frac{690}{(30.68)(28.57)} \approx \frac{690}{876.27} \approx 0.79 $$
With a correlation coefficient equal to 0.79, we can conclude that there is a strong positive association between
the grades of the six students in Algebra and Trigonometry.
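The hand computation can be checked with a short Python sketch of the same raw-score formula. The function name pearson_r is an illustrative choice, not something from the handout; the data are the six pairs of grades from the example.

```python
import math

def pearson_r(x, y):
    """Pearson r via the raw-score formula used in the worked example."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

algebra = [83, 78, 94, 90, 88, 88]
trigonometry = [82, 83, 93, 94, 84, 86]

r = pearson_r(algebra, trigonometry)
print(f"r = {r:.2f}")  # r = 0.79
```

The intermediate sums inside the function (521, 522, 45397, 45550, 45442) are the same column totals as in the table, so each line of the code maps onto one step of the manual solution.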
b. Phi Coefficient
Description
The phi coefficient is a measure of the degree of association between two binary or dichotomous variables. This
measure is similar to the Pearson correlation coefficient in its interpretation and was likewise introduced by Karl
Pearson.
Formula
$$ \phi = \frac{ad - bc}{\sqrt{efgh}} $$
         X-    X+    Total
Y-       a     b     e
Y+       c     d     f
Total    g     h     n
Phi compares the product of the diagonal cells (a*d) to the product of the off-diagonal cells (b*c). The denominator
is an adjustment that ensures that Phi is always between -1 and +1.
Interpretation

Two binary variables are considered positively associated if most of the data falls along the diagonal cells (i.e., a and d are
larger than b and c). In contrast, two binary variables are considered negatively associated if most of the data falls off the
diagonal.
Example
The table below shows the first time driving test results of a sample of 200 individuals classified by gender and success or
failure in the examination. We wish to explore the association between the two variables, the null hypothesis being that there
is no relationship between gender and success/failure in driving test results.
Gender    Success    Failure    Total
Male      70         28         98
Female    50         52         102
Total     120        80         200
$$ \phi = \frac{ad - bc}{\sqrt{efgh}} = \frac{(70)(52) - (28)(50)}{\sqrt{(98)(102)(120)(80)}} $$
$$ \phi = \frac{3640 - 1400}{\sqrt{95961600}} = \frac{2240}{9796.00} = 0.23 $$
The data show that gender and success or failure in the driving test have little or no correlation.
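As a cross-check, the phi computation can be written as a small Python function. This is a sketch only; the name phi_coefficient is hypothetical, and the cell labels a, b, c, d follow the 2 x 2 layout given in the Formula part (rows Y-, Y+; columns X-, X+).

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table with rows (a, b) and (c, d)."""
    e, f = a + b, c + d  # row totals
    g, h = a + c, b + d  # column totals
    return (a * d - b * c) / math.sqrt(e * f * g * h)

# Driving test data: Male (70 success, 28 failure), Female (50 success, 52 failure).
phi = phi_coefficient(70, 28, 50, 52)
print(f"phi = {phi:.2f}")  # phi = 0.23
```

Swapping the two rows (or the two columns) flips the sign of phi but not its size, so the sign only has meaning relative to how the categories are ordered in the table.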
c. Point Biserial Correlation Coefficient
Description
The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable is dichotomous;
Y can either be "naturally" dichotomous, like gender, or an artificially dichotomized variable. In most situations
it is not advisable to artificially dichotomize variables.
Formula
To calculate rpb, assume that the dichotomous variable Y has the two values 0 and 1. If we divide the data set into
two groups, group 1 which received the value "1" on Y and group 2 which received the value "0" on Y, then the
point-biserial correlation coefficient (for a population) is calculated as follows:
$$ r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}} $$
Where:
s_n is the standard deviation used when you have data for every member of the population:
$$ s_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} $$
M1 is the mean on the continuous variable X for all data points in group 1, and M0 is the mean on the continuous
variable X for all data points in group 2.
n1 is the number of data points in group 1, n0 is the number of data points in group 2, and n is the total sample
size. There is an equivalent formula that uses s_{n-1}, the point biserial correlation coefficient for a sample:
$$ r_{pb} = \frac{M_1 - M_0}{s_{n-1}}\sqrt{\frac{n_1 n_0}{n(n-1)}} $$
$$ s_{n-1} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} $$
Interpretation
Pett (1997) asserts that the same criteria for evaluating the coefficient of determination in regard to standard
correlation can be applied to rpb^2 because of the close relationship between rpb and the Pearson r. The coefficient
of determination in the form of rpb^2, therefore, is a useful index for drawing conclusions from the data.
Very strong: .81
Strong: .49-.80
Moderate: .25-.48
Weak: .00-.08
Example
An urban planner hypothesizes that the correlation between lack of car ownership and use of public transportation
would be positive in a particular urban location. In this case, the dichotomous variable (X) is car ownership, which
is the independent variable because it is hypothesized as affecting frequency of public transportation use. The
non-dichotomous variable is the number of times in a given time span that a person uses public transportation. The
non-dichotomous variable is the dependent variable in this example.
Next, the researcher collects a small sample of 18 participants for her study, gathering the following information
(Table 1):
Participant    Car Ownership    Use of Public Transportation
1              No               3
2              No               12
3              No               10
4              No               11
5              No               12
6              No               23
7              No               14
8              No               0
9              No               16
10             Yes              0
11             Yes              2
12             Yes              1
13             Yes              0
14             Yes              3
15             Yes              4
16             Yes              0
17             Yes              0
18             Yes              1
The next step would be to code the responses Yes as 0 and No as 1, making vehicle ownership into a numerically
dichotomous variable. At first glance, this may seem counterintuitive because we associate zero with a negative
response (no) and 1 with a positive response (yes). However, because the researcher hypothesizes that not having a
car, rather than having a car, will lead to an increase in public transportation use, the researcher will code No
responses as 1 and Yes responses as 0. Recall that the researcher wants to know about lack of car ownership, not car
ownership, couching the hypothesis in terms of a positive relationship.
The correlation coefficient, 0.735, means that those who do not own cars tend to use public transportation more.
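The handout reports the coefficient without showing the arithmetic. A minimal Python sketch of the population-form formula, applied to the 18 participants with No coded as 1 and Yes coded as 0, reproduces it. The function name point_biserial and the variable names are illustrative assumptions.

```python
import math

def point_biserial(values, codes):
    """Population-form point biserial r; codes holds the 0/1 group labels."""
    n = len(values)
    group1 = [v for v, c in zip(values, codes) if c == 1]
    group0 = [v for v, c in zip(values, codes) if c == 0]
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    mean = sum(values) / n
    s_n = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population SD
    return (m1 - m0) / s_n * math.sqrt(len(group1) * len(group0) / n ** 2)

# Participants 1-9 answered No (coded 1); participants 10-18 answered Yes (coded 0).
usage = [3, 12, 10, 11, 12, 23, 14, 0, 16, 0, 2, 1, 0, 3, 4, 0, 0, 1]
no_car = [1] * 9 + [0] * 9

r_pb = point_biserial(usage, no_car)
print(f"r_pb = {r_pb:.3f}")  # r_pb = 0.735
```

Reversing the coding (Yes = 1, No = 0) would give -0.735: the same strength of relationship, expressed in terms of car ownership rather than lack of it.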
d. Spearman's Rank Correlation Coefficient

Description
The Spearman's rank-order correlation is the nonparametric version of the Pearson product-moment correlation.
Spearman's correlation coefficient (ρ, also signified by rs) measures the strength of association between two
ranked variables.
A monotonic relationship is a relationship that does one of the following: (1) as the value
of one variable increases, so does the value of the other variable; or (2) as the value of
one variable increases, the other variable value decreases. A monotonic relationship is an
important underlying assumption of the Spearman rank-order correlation. It is also
important to recognize the assumption of a monotonic relationship is less restrictive than
a linear relationship.

Formula
There are two methods to calculate Spearman's rank-order correlation depending on whether (1) your data does not
have tied ranks or (2) your data has tied ranks. The formula for when there are no tied ranks is:
$$ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} $$
Where d_i is the difference in the paired ranks and n is the number of cases.
The formula to use when there are tied ranks is:
$$ \rho = \frac{\sum_i \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_i \left(x_i - \bar{x}\right)^2 \sum_i \left(y_i - \bar{y}\right)^2}} $$
where x_i and y_i are the ranks.
Interpretation
The Spearman correlation coefficient, rs, can take values from +1 to -1. An rs of +1 indicates a perfect association
of ranks, an rs of zero indicates no association between ranks, and an rs of -1 indicates a perfect negative
association of ranks. The closer rs is to zero, the weaker the association between the ranks.
Example
The table which follows shows the scores of 10 high school students in an English and a Filipino exam. Both were
40-item tests.
English       18   20   14   34   40   35    7   10   28   38
Filipino      27   30   25   36   38   29   24   22   35   40
To compute the Spearman rho, we construct the table below:
English       18   20   14   34   40   35    7   10   28   38
Filipino      27   30   25   36   38   29   24   22   35   40
Eng (Rank)     7    6    8    4    1    3   10    9    5    2
Fil (Rank)     7    5    8    3    2    6    9   10    4    1
d              0    1    0    1   -1   -3    1   -1    1    1
d^2            0    1    0    1    1    9    1    1    1    1
$$ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(16)}{10(10^2 - 1)} = 1 - \frac{96}{990} = 1 - 0.097 = 0.90 $$
The Spearman rho value of 0.90 indicates a strong positive relationship between the two variables.
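The ranking and the no-ties formula can be sketched in Python as follows. The helper names ranks_descending and spearman_rho are hypothetical, and the ranking helper assumes there are no tied scores, as is the case for these data. Rank 1 goes to the highest score, matching the table.

```python
def ranks_descending(scores):
    """Rank 1 = highest score. Assumes no tied scores."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

def spearman_rho(x, y):
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid only without ties."""
    rank_x, rank_y = ranks_descending(x), ranks_descending(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

english = [18, 20, 14, 34, 40, 35, 7, 10, 28, 38]
filipino = [27, 30, 25, 36, 38, 29, 24, 22, 35, 40]

rho = spearman_rho(english, filipino)
print(f"rho = {rho:.3f}")  # 1 - 96/990, about 0.903
```

The sum of squared rank differences is 16, so rho = 1 - 96/990, which is about 0.903.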
e. Rank Biserial Correlation

Description
The rank-biserial correlation coefficient, rrb, is used for dichotomous nominal data vs rankings (ordinal).
Formula
$$ r_{rb} = \frac{2\left(\bar{Y}_1 - \bar{Y}_0\right)}{n} $$
Where n is the number of data pairs, and Y0 and Y1 are the Y score means for data pairs with an x score of 0 and 1
respectively. These Y scores are ranks and the formula assumes no tied ranks are present.
Example
The table shows the performances of 12 Grade 7 students in Science during the first quarter of the school year.
Student No.   Sex   Grade   Rank      Student No.   Sex   Grade   Rank
1             M     82      8         7             F     79      11
2             M     85      7         8             F     81      9
3             M     87      5         9             F     95      1
4             M     80      10        10            F     86      6
5             M     90      2         11            F     89      3
6             M     88      4         12            F     73      12
Coding male as x = 0 and female as x = 1, the mean ranks are:
$$ \bar{Y}_1 = 7, \quad \bar{Y}_0 = 6, \quad n = 12 $$
$$ r_{rb} = \frac{2\left(\bar{Y}_1 - \bar{Y}_0\right)}{n} = \frac{2(7 - 6)}{12} = \frac{2(1)}{12} = \frac{2}{12} = 0.17 $$
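The same calculation can be sketched in Python. The function name rank_biserial is illustrative, and the coding M = 0, F = 1 is an assumption chosen because it reproduces the mean ranks of 6 and 7 used in the computation.

```python
def rank_biserial(ranks, codes):
    """Rank biserial r = 2 * (mean rank of group 1 - mean rank of group 0) / n."""
    n = len(ranks)
    group1 = [r for r, c in zip(ranks, codes) if c == 1]
    group0 = [r for r, c in zip(ranks, codes) if c == 0]
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return 2 * (mean1 - mean0) / n

# Ranks of students 1-12; students 1-6 are male (0), students 7-12 are female (1).
ranks = [8, 7, 5, 10, 2, 4, 11, 9, 1, 6, 3, 12]
sex = [0] * 6 + [1] * 6

r_rb = rank_biserial(ranks, sex)
print(f"r_rb = {r_rb:.2f}")  # r_rb = 0.17
```

Because rank 1 is the highest grade here, a larger mean rank means lower grades, so the sign should be read with the ranking direction in mind.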
f. Biserial Correlation Coefficient

Description
Another measure of association, the biserial correlation coefficient, termed rb, is similar to the point biserial, but
it sets quantitative data against ordinal data, specifically ordinal data with an underlying continuity that is
measured discretely as two values (dichotomous).
Formula
$$ r_b = \frac{\left(\bar{Y}_1 - \bar{Y}_0\right)}{\sigma_Y}\left(\frac{pq}{Y}\right) $$
Where Y0 and Y1 are the Y score means for the data pairs with an x score of 0 and 1, respectively, q = 1 - p and p
are the proportions of data pairs with x scores of 0 and 1, σ_Y is the population standard deviation for the y data,
and Y is the height of the standardized normal distribution at the point z that divides it into the proportions q
and p.
An example might be test performance vs anxiety, where anxiety is designated as either high or low. Presumably,
anxiety can take on any value in between, perhaps beyond, but it may be difficult to measure. We further assume that
anxiety is normally distributed.
Example
The following data presents the test scores in Math of eight college students together with their anxiety level
during the exam. A two-point scale was used to measure anxiety level where 0 corresponds to relaxed and 1 to anxious.
Test Score       65   78   84   90   88   93   70   83
Anxiety Level     0    0    1    0    1    1    1    0
$$ r_b = \frac{\left(\bar{Y}_1 - \bar{Y}_0\right)}{\sigma_Y}\left(\frac{pq}{Y}\right) $$
$$ \bar{Y}_0 = 79, \quad \bar{Y}_1 = 83.75, \quad p = 0.5, \quad q = 0.5, \quad Y = 0.399, \quad \sigma_Y = 9.16 $$
Here Y = 0.399 is the height of the standard normal curve at z = 0, since p = q = 0.5.
$$ r_b = \frac{(83.75 - 79)}{9.16}\left(\frac{(0.5)(0.5)}{0.399}\right) = (4.75)\left(\frac{0.627}{9.16}\right) = (4.75)(0.0684) = 0.32 $$
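A Python sketch of the biserial formula makes the role of the normal-curve height explicit. It relies on NormalDist from the standard library statistics module (Python 3.8 or later) to find the cut point z and the ordinate there; the function name biserial is an illustrative assumption. For p = q = 0.5 the cut point is z = 0, where the standard normal density is about 0.399.

```python
import math
from statistics import NormalDist

def biserial(scores, codes):
    """Biserial r = (mean1 - mean0) / sigma * (p * q / height of normal curve at the split)."""
    n = len(scores)
    group1 = [s for s, c in zip(scores, codes) if c == 1]
    group0 = [s for s, c in zip(scores, codes) if c == 0]
    p = len(group1) / n          # proportion coded 1
    q = 1 - p                    # proportion coded 0
    mean = sum(scores) / n
    sigma = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)  # population SD
    z = NormalDist().inv_cdf(q)  # point with proportion q below and p above
    height = NormalDist().pdf(z)
    m1 = sum(group1) / len(group1)
    m0 = sum(group0) / len(group0)
    return (m1 - m0) / sigma * (p * q / height)

scores = [65, 78, 84, 90, 88, 93, 70, 83]
anxiety = [0, 0, 1, 0, 1, 1, 1, 0]

r_b = biserial(scores, anxiety)
print(f"r_b = {r_b:.2f}")
```

With the eight scores listed, the group means are 79 and 83.75, the population standard deviation is about 9.16, and the result is roughly 0.32.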
g. Tetrachoric Coefficient
Description
The tetrachoric correlation, for binary data, and the polychoric correlation, for ordered-category data, are
excellent ways to measure rater agreement. They estimate what the correlation between raters would be if ratings were
made on a continuous scale; they are, theoretically, invariant over changes in the number or "width" of rating
categories. The tetrachoric and polychoric correlations also provide a framework that allows testing of marginal
homogeneity between raters. Thus, these statistics let one separately assess both components of rater agreement:
agreement on trait definition and agreement on definitions of specific categories.
The tetrachoric correlation coefficient, rtet, is used when both variables are dichotomous, like the phi, but we
also need to be able to assume both variables really are continuous and normally distributed. Thus it is applied to
ordinal vs. ordinal data which has this characteristic. Ranks are discrete, so in this manner it differs from the
Spearman. The formula involves a trigonometric function called cosine.
Formula
$$ r_{tet} = \cos\left(\frac{180^\circ}{1 + \sqrt{\dfrac{BC}{AD}}}\right) $$
where A, B, C, and D are the four cell frequencies of the 2 x 2 table.
Example
h. Partial Correlation Coefficient

Description
Partial correlation analysis is aimed at finding the correlation between two variables after removing the effects of
other variables. This type of analysis helps spot spurious correlations (i.e., correlations explained by the effect
of other variables) as well as reveal hidden correlations (i.e., correlations masked by the effect of other
variables).
The central concept in partial correlation analysis is the partial correlation coefficient rxy.z between variables x
and y, adjusted for a third variable z. Both x and y are presumed to be linearly related to z:
x = Az + B + dx;
y = Cz + D + dy;
The partial correlation coefficient rxy.z is defined as the correlation coefficient between residuals dx and dy in
this model.
Formula
The partial correlation coefficient rxy.z between x and y adjusted for z may be computed from the pairwise values of
the correlation between variables x, y, and z (rxy, ryz, rxz):
$$ r_{xy.z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{\left(1 - r_{xz}^2\right)\left(1 - r_{yz}^2\right)}} $$
Example
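The handout leaves this example blank. As a hedged illustration only, the Python sketch below uses made-up numbers (they do not come from the handout) to show that the pairwise formula and the residual-based definition give the same value. The function names pearson_r, partial_r, and residuals are hypothetical helpers.

```python
import math

def pearson_r(x, y):
    """Pearson r in deviation-score form."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation of x and y adjusted for z, from the pairwise r values."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def residuals(v, z):
    """Residuals of the least-squares line v = slope * z + intercept."""
    mv, mz = sum(v) / len(v), sum(z) / len(z)
    slope = sum((a - mz) * (b - mv) for a, b in zip(z, v)) / sum((a - mz) ** 2 for a in z)
    intercept = mv - slope * mz
    return [b - (slope * a + intercept) for a, b in zip(z, v)]

# Made-up illustrative data; these numbers are NOT from the handout.
x = [2, 4, 5, 7, 9, 10]
y = [1, 3, 6, 6, 8, 11]
z = [1, 2, 3, 4, 5, 6]

from_formula = partial_r(pearson_r(x, y), pearson_r(x, z), pearson_r(y, z))
from_residuals = pearson_r(residuals(x, z), residuals(y, z))
print(round(from_formula, 4), round(from_residuals, 4))
```

The two printed values agree, which is the point of the formula: it gives the correlation between the residuals dx and dy without having to fit the two regressions on z explicitly.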
