 

5801 – Correlation

 

Learning objectives

1. Pearson correlation
2. Estimating the population Pearson correlation
3. Misleading correlations
4. Impact of range
5. Two limitations in using correlation to infer causality

 

Copyright © 2013 by Ernest Kwan


 

Relationships

 

An important goal in statistics is to describe "relationships" between two variables.

By describing relationships in a sample, we estimate the relationship in the population.

 

What does it mean to say that "there is a relationship" between two variables (e.g., X and Y), or to say that "X and Y are related"?

There are different ways of answering this question. On a strictly quantitative level, we may say X and Y are related if there is a systematic pattern between the values of X and Y.

 
 

Some common ways of saying "X and Y are related":

"X and Y are associated", "an association between X and Y"; "X and Y are correlated", "a correlation between X and Y".

 
 

 

Pearson correlation coefficient

To measure the relationship between two continuous variables, there are different indices available (they measure different aspects of such relationships). We will focus on:

Pearson correlation coefficient:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}}
$$

Important properties of r
• It can only range from -1 to 1.
• It measures the degree and direction of the linear relationship between X and Y.
• r = 0 implies there is no linear relationship.
• The more r differs from 0, the greater the linear relationship.
• r > 0 is called a positive relationship; r = 1 is a "perfect" positive linear relationship.
• r < 0 is called a negative relationship; r = -1 is a "perfect" negative linear relationship.
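The definition of r translates almost line for line into code. The following is a minimal sketch in plain Python and is not part of the original slides; the function name `pearson_r` and the small data sets are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences, from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the two means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, chosen only to illustrate the properties of r.
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2 * v + 1 for v in x])   # points on an increasing line
r_neg = pearson_r(x, [10 - 3 * v for v in x])  # points on a decreasing line
r_mid = pearson_r(x, [2, 1, 4, 3, 5])          # upward trend with scatter
print(r_pos, r_neg, r_mid)
```

With these made-up data the two exact lines give r = 1 and r = -1, and the scattered upward trend gives r = 0.8.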

Examples

[Figure: scatterplots of Y against X.]

[Figure: two scatterplots of Y against X (both axes from -3 to 3), one showing a negative and one a positive relationship, with the labels r = -1.00, r = 0.00, and r = 1.00.]

Negative relationship
• as X increases, Y decreases
• as X decreases, Y increases
• as the data look more and more like a straight line sloping downward, r gets closer and closer to -1

Positive relationship
• as X increases, Y increases
• as X decreases, Y decreases
• as the data look more and more like a straight line sloping upward, r gets closer and closer to 1

Which correlation is stronger?

[Figure: two scatterplots of Y against X, one with r = 0.33 and one with r = -0.80.]

r measures the degree (strength / magnitude) and direction of the linear relationship.
• The degree of the relationship involves the absolute value of r.
• The more |r| differs from 0, the stronger the linear relationship. Here |-0.80| > |0.33|, so r = -0.80 is the stronger correlation.

Correlation in the population

So far we have discussed r as an index of the linear relationship in the sample data; but thinking beyond the sample data, there is always a population.

The linear relationship between X and Y in the population is referred to as ρ ("rho").

So if we could observe every person's values on X and Y in the population, that linear relationship would be represented by ρ. r is the sample estimate of the parameter ρ.

Confidence intervals for ρ

We previously discussed a CI for µ.

ρ is also a parameter, so accordingly, we could construct a CI for ρ based on r.

The same interpretations and principles are at work.

 

The CI for ρ, however, is more complicated to calculate.

This is because the sampling distribution of r is not normal.

Because of this complication, a CI for ρ may not necessarily be symmetric around r.
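The slides do not say which method is used to build the interval. One widely used approach, assumed here purely for illustration, is the Fisher z transformation: transform r with the inverse hyperbolic tangent, build a symmetric normal-theory interval on that scale with standard error 1 / sqrt(n - 3), and transform the endpoints back. The function name `fisher_ci` and the example values r = 0.80, n = 30 are hypothetical.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate CI for rho via the Fisher z transformation.

    z_crit defaults to the standard normal quantile for a 95% interval.
    This is one common approximation, not necessarily the course's method.
    """
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need n > 3")
    z = math.atanh(r)            # transform r to the z scale
    se = 1 / math.sqrt(n - 3)    # approximate standard error on the z scale
    lower = math.tanh(z - z_crit * se)  # back-transform each endpoint
    upper = math.tanh(z + z_crit * se)
    return lower, upper


lo, hi = fisher_ci(0.80, 30)
print(round(lo, 2), round(hi, 2))
```

For r = 0.80 and n = 30 this gives an interval of roughly 0.62 to 0.90. The interval is symmetric on the z scale but not around r: it reaches further below 0.80 than above it.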

Let's take a look at some interesting examples of correlations.

Examples

Do you agree with the correlation coefficients?

[Figure: scatterplots of Y against X, each labelled r = 0.00.]

Examples

Is there a positive linear relationship here?

[Figure: scatterplot of Y against X.]

This small cluster of data has clearly created the positive relationship.

Is there a positive linear relationship here?

[Figure: scatterplot of Y against X.]

The overall relationship is positive, but the within-gender relationship is negative!
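A small numerical sketch can reproduce this pattern. The two groups and their values below are invented for illustration (they are not the data behind the slide): within each group Y falls as X rises, but one group sits higher on both variables, so pooling them yields a positive r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical groups: each has a perfectly negative within-group trend.
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [6, 7, 8], [8, 7, 6]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
# Pooling the groups ignores the grouping variable entirely.
r_all = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)
print(r_a, r_b, round(r_all, 2))
```

Here r is -1 within each group but about 0.81 in the pooled data, because the difference between the group means dominates the pooled covariance.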

Notice the "outliers" here are not outlying at all in terms of Y.
These points are outliers in the sense of having undue influence on r.
[Figure: scatterplot of Y against X, labelled r = 0.40.]
So what is r for this sample?
Is it 0.40 or 0.00?
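The influence of a few points can be checked by computing r with and without them. The sketch below uses invented numbers (not the data in the slide): a small cloud with no linear trend, plus two points that are extreme on X but unremarkable on Y.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Bulk of the hypothetical sample: a symmetric cloud with no linear trend.
bulk_x = [-1, -1, 1, 1]
bulk_y = [-1, 1, -1, 1]

# Two added points: far out on X, but with Y values inside the bulk's range.
extra_x = [5, -5]
extra_y = [1, -1]

r_bulk = pearson_r(bulk_x, bulk_y)
r_all = pearson_r(bulk_x + extra_x, bulk_y + extra_y)
print(r_bulk, round(r_all, 2))
```

In this toy sample r is 0 for the bulk and about 0.56 once the two points are included, so a single reported coefficient hides how much depends on those two observations.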
Using correlations in practice
It is very easy to be misled by a correlation coefficient.
• Just because r = 0.0 doesn't mean there is no relationship, and just because r = 0.9 doesn't necessarily mean there is a strong linear relationship.
What can we do to prevent ourselves from being misled?
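The first half of that warning is easy to demonstrate. In the hypothetical data below, Y is completely determined by X (Y = X squared), yet r is 0 because the relationship is not linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: X symmetric about 0, Y an exact function of X.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(r_curve)
```

The positive and negative cross-products cancel exactly, so r is 0 even though Y can be predicted perfectly from X. The coefficient alone cannot reveal this; a look at the raw (X, Y) pairs or a scatterplot can.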
[Figure: scatterplots of Y against X.]

Example: Height and weight

[Figure: two scatterplots of weight (kg, axis from 0 to 100) against height (m). One shows a sample of n = 100 from various occupations, with the height axis running from 1.00 to 2.00. The other shows a sample of n = 100 basketball players, with the height axis running from 1.90 to 2.00.]

Example: Studying and grades

Yes, the more you study for a test, the higher the grade… Does this mean 70 hours of studying will guarantee a perfect score?

[Figure, left: sample data, GRADE (0 to 100) against HOUR (amount of studying, 3 to 15).]

Based on the data on the left, it is hard to speculate about what happens when you study far beyond 15 hours.

[Figure, right: GRADE (0 to 100) against HOUR (amount of studying, 0 to 25).]

Problem of restricted range

The previous examples illustrate the effect of range restriction on a correlation coefficient.

An important issue to think about when interpreting your correlations: do the data in fact contain the relevant range of the variable you want to make inferences about? For example, if you do want to assess the relationship between weight and height for basketball players, then there is nothing wrong with the data.

Before comparing two correlation coefficients (assessing the same relationship), you should make sure the two data sets cover the same relevant range of interest.
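Range restriction can be illustrated by computing r on a full data set and again on a narrow slice of it. The data below are artificial (X runs from 0 to 19, and Y equals X plus a fixed alternating error of +3 or -3); they are not the height and weight samples from the slides.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Artificial full-range sample: same underlying trend everywhere.
xs = list(range(20))
ys = [x + (3 if x % 2 == 0 else -3) for x in xs]

# Restricted sample: keep only cases with X of at least 15.
kept = [(x, y) for x, y in zip(xs, ys) if x >= 15]
xs_restricted = [x for x, _ in kept]
ys_restricted = [y for _, y in kept]

r_full = pearson_r(xs, ys)
r_restricted = pearson_r(xs_restricted, ys_restricted)
print(round(r_full, 2), round(r_restricted, 2))
```

The trend and the size of the errors are identical in both samples, yet r drops from about 0.88 over the full range to about 0.43 in the restricted slice, because X now varies much less relative to the scatter in Y.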

What does a relationship mean?

At the level of measured variables (quantitative hypotheses), a relationship was previously defined as a systematic pattern between values of X and Y.

[Figure: scatterplot of GRADE (0 to 100) against HOUR (amount of studying, 0 to 25).]

But unless we're doing statistics purely for the sake of statistics, a "relationship" has much more meaning to researchers. Let us now move beyond the statistical / quantitative level.

Beyond just the quantitative variables

To social scientists, relationships have a substantive interpretation beyond just a systematic pattern between values of two variables. A relationship implies causality; to say there is a relationship between X and Y (or symbolically, X Y) can imply one of three things:

Very often, a relationship is further used for explanations: If X and Y are related, we say "X explains Y", or "Y explains X".

Both causality and explanation are actually complicated philosophical concepts. What does it mean exactly for A to cause B? How does an explanation work?

We will not venture too far into philosophy, but we will consider some limitations in using the relationships we observe (e.g., r) to infer the causation we hope to establish.


Limitation 1: Quality of measurement

Our desired level of inference occurs at a more abstract level than that of the variables we can observe. Social phenomena involve variables that cannot be observed directly.

[Diagram: constructs ξ1 and ξ2 are related by φ; their observed indicators X1 and X2 are related by r (ρ).]

Inference concerning how constructs are related depends on the quality of measurement; just how good are the indicators? Simply treating r (or ρ) as indicative of φ without concern for the quality of measurement is a mistake.
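A small simulation can make this concrete. The sketch below rests on assumptions that are not in the slides: each indicator is modelled as its construct plus independent random error (a classical measurement-error model), the construct-level correlation is set to 0.8, and the error variance equals the construct variance. All names and numbers are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000
phi = 0.8               # assumed correlation between the two constructs

# Simulated construct scores (unobservable in real research).
xi1 = [rng.gauss(0, 1) for _ in range(n)]
xi2 = [phi * a + math.sqrt(1 - phi ** 2) * rng.gauss(0, 1) for a in xi1]

# Observed indicators: construct score plus independent measurement error.
x1 = [a + rng.gauss(0, 1) for a in xi1]
x2 = [b + rng.gauss(0, 1) for b in xi2]

r_constructs = pearson_r(xi1, xi2)
r_indicators = pearson_r(x1, x2)
print(round(r_constructs, 2), round(r_indicators, 2))
```

Under these assumptions the correlation between the noisy indicators comes out at roughly half the construct-level correlation, which is why r should not be read as φ without asking how good the indicators are.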


Example: Air fresheners and Grades

A recent nation-wide survey collected data from university students across the country, with the aim of studying the consumer behavior of Canadian students. To the surprise of the researchers, a strong negative correlation was found between students' grades and the amount of money spent on air fresheners. These two variables are shown in the scatterplot below:

[Figure: scatterplot of grade (overall GPA) against money spent on air fresheners, r = -0.66.]

Limitation 2: The crud factor

Let's consider what other variables may be related to the two we have observed:

[Diagram: a chain linking $ spent on air fresheners, crowdedness of living conditions, noise level of living conditions, attentiveness during study, and grades.]

Many social science variables can be related to each other through such a "chain". This is attributable to the complexity (or interrelatedness) of social phenomena.
• It is not difficult to find variables that have a strong relationship – sometimes, unexpected relationships will be stumbled upon.

The prevalence of relationships between arbitrarily paired social variables was deemed the "crud factor" (e.g., Meehl, 1997).
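The chain idea can be sketched with a simulation. Everything below is assumed for illustration: the direction of each link, the strength of 0.8 per link, and the normally distributed scores are invented, not estimated from the survey. Spending and grades never influence each other in the code, yet they end up correlated.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def linked(values, slope, rng):
    """Return slope * value + noise, with noise sized to keep variance near 1."""
    sd = math.sqrt(1 - slope ** 2)
    return [slope * v + rng.gauss(0, sd) for v in values]


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical standardized scores, one link of the chain at a time.
crowdedness = [rng.gauss(0, 1) for _ in range(n)]
spending = linked(crowdedness, 0.8, rng)       # more crowded, more fresheners
noise_level = linked(crowdedness, 0.8, rng)    # more crowded, more noise
attentiveness = linked(noise_level, -0.8, rng) # more noise, less attention
grades = linked(attentiveness, 0.8, rng)       # more attention, better grades

r_spend_grades = pearson_r(spending, grades)
print(round(r_spend_grades, 2))
```

With four links of strength 0.8 (one of them negative), the end points of the chain correlate at around -0.4, a sizeable negative r produced without any direct connection between air fresheners and grades.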

How to refer to correlations

Be very careful then of how you talk about correlations.

Acceptable descriptions:
• X is related to Y
• X is associated with Y
• X is correlated to (with) Y

Misleading descriptions (please avoid):
• X influences Y
• X causes Y
• X affects Y
• X determines Y

To be able to make such claims is a highly desirable goal in science. In the social sciences, however, we should be very cautious of claims of causality.