5801 – Correlation

Learning objectives

1. Pearson correlation
2. Estimating the population Pearson correlation
3. Misleading correlations
4. Impact of range
5. Two limitations in using correlation to infer causality

Copyright © 2013 by Ernest Kwan

Relationships

• An important goal in statistics is to describe "relationships" between two variables.
• By describing relationships in a sample, we estimate the relationship in the population.
• What does it mean to say that "there is a relationship" between two variables (e.g., X and Y), or to say that "X and Y are related"?
• There are different ways of answering this question.
• On a strictly quantitative level, we may say X and Y are related when there is a systematic pattern between values of X and Y.
• Some common ways of saying "X and Y are related":
  – "X and Y are associated", "an association between X and Y";
  – "X and Y are correlated", "a correlation between X and Y".

Pearson correlation coefficient


• To measure the relationship between two continuous variables, there are different indices available (they measure different aspects of such relationships).
• We will focus on:

Pearson correlation coefficient:

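$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}}
$$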

Important properties of r

• It can only range from -1 to 1.
• It measures the degree and direction of linear relationship between X and Y.
• r = 0 implies there is no linear relationship.
• The more r differs from 0, the greater the linear relationship.
• r > 0 is called a positive relationship; r = 1 is a "perfect" positive linear relationship.
• r < 0 is called a negative relationship; r = -1 is a "perfect" negative linear relationship.

Examples

[Figure: scatterplots of Y against X]

[Figure: scatterplots of Y against X, with both axes running from -3 to 3]

negative relationship
• as X increases, Y decreases
• as X decreases, Y increases
• As the data look more and more like such a line, r will get closer and closer to -1.

positive relationship
• as X increases, Y increases
• as X decreases, Y decreases
• As the data look more and more like such a line, r will get closer and closer to 1.

[Figure panels: r = -1.00, r = 0.00, r = 1.00]

Which correlation is stronger?

[Figure: two scatterplots of Y against X, one labeled r = 0.33 and one labeled r = -0.80]

r measures the degree (strength / magnitude) and direction of linear relationship.

• Degree of the relationship involves the absolute value of r.
• The more |r| differs from 0, the stronger the linear relationship.

Correlation in the population


• So far we have discussed r as an index of the linear relationship in the sample data; but thinking beyond the sample data, there is always a population.
• The linear relationship between X and Y in the population is referred to as ρ, "rho".
• So if we could observe every person's value on X and Y in the population, then that linear relationship is represented by ρ.
• r is the sample estimate of the parameter ρ.


Confidence intervals for ρ


• We previously discussed a CI for µ.
• ρ is also a parameter, so accordingly, we could construct a CI for ρ based on r.
• The same interpretations and principles are at work.
• The CI for ρ, however, is more complicated to calculate.
• This is because the sampling distribution of r is not normal.
• Because of this complication, a CI for ρ may not necessarily be symmetric around r.
• Let's take a look at some interesting examples of correlations.
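The slides do not say which method is used to build the CI for ρ. One common approach, assumed here purely for illustration, is Fisher's z transformation: transform r with the inverse hyperbolic tangent, build a symmetric interval on that scale, and transform back. The sketch below assumes bivariate normal data and a 95% level (critical value 1.96); the function name `fisher_ci` and the values r = 0.80, n = 20 are hypothetical.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for rho via Fisher's z transformation (an assumed method)."""
    z = math.atanh(r)             # move r to the z scale
    se = 1.0 / math.sqrt(n - 3)   # approximate standard error on the z scale
    lower_z = z - z_crit * se
    upper_z = z + z_crit * se
    # Back-transform the endpoints to the correlation scale.
    return math.tanh(lower_z), math.tanh(upper_z)

r_obs, n_obs = 0.80, 20           # hypothetical sample correlation and sample size
lo, hi = fisher_ci(r_obs, n_obs)
print(round(lo, 2), round(hi, 2))  # 0.55 0.92
```

The interval is symmetric on the z scale but not around r: it reaches about 0.25 below 0.80 and only about 0.12 above it.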

Examples

Do you agree with the correlation coefficients?

[Figure: two scatterplots of Y against X, each labeled r = 0.00]

Examples

Is there a positive linear relationship here?

[Figure: two scatterplots of Y against X]

• This small cluster of data has clearly created the positive relationship.
• The overall relationship is positive, but the within-gender relationship is negative!
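The second situation, a positive overall correlation hiding negative within-group correlations, can be reproduced with a tiny made-up data set. The two groups and their values below are hypothetical and stand in for the gender groups of the slide; `pearson_r` is a helper written for this sketch.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation from the definitional formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical groups: within each one, Y falls as X rises.
x_a, y_a = [1, 2, 3], [3, 2, 1]
x_b, y_b = [6, 7, 8], [8, 7, 6]

r_a = pearson_r(x_a, y_a)
r_b = pearson_r(x_b, y_b)
# Pooling the groups: group B is higher on both X and Y than group A.
r_pooled = pearson_r(x_a + x_b, y_a + y_b)

print(round(r_a, 2), round(r_b, 2), round(r_pooled, 2))  # -1.0 -1.0 0.81
```

The pooled coefficient is driven by the difference between the groups, not by the pattern inside either one.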

• Notice the "outliers" here are not outlying at all in terms of Y.
• These points are outliers in the sense of having undue influence on r.

[Figure: scatterplot of Y against X, labeled r = 0.40]

So what is r for this sample? Is it 0.40 or 0.00?
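A small sketch of this kind of influence, using made-up numbers rather than the data behind the slide: six core points with no linear relationship, plus two points that are extreme on X but ordinary on Y. The helper `pearson_r` and all values are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation from the definitional formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Core cloud: no linear relationship at all.
x_core = [-1, 0, 1, -1, 0, 1]
y_core = [1, -1, 1, -1, 1, -1]

# Two extra points: far out on X, but with Y values already seen in the core.
x_all = x_core + [6, 7]
y_all = y_core + [1, 1]

r_core = pearson_r(x_core, y_core)
r_all = pearson_r(x_all, y_all)
print(round(r_core, 2), round(r_all, 2))  # 0.0 0.43
```

Two points out of eight are enough to move r from 0 to about 0.43, which is why reporting r with and without such points can be informative.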
Using correlations in practice
It is very easy to be misled by a correlation coefficient.

• r = 0.0 does not mean there is no relationship, and r = 0.9 may not mean there is a strong linear relationship.
• What can we do to prevent ourselves from being misled?
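To illustrate the first point, here is a minimal sketch in which Y is completely determined by X and yet r is zero, because the relationship is not linear. The data and the helper `pearson_r` are made up for this example.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation from the definitional formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # a perfect, but curved, relationship

r = pearson_r(x, y)
print(r)  # 0.0
```

The positive and negative cross-products cancel exactly, so r reports no linear relationship even though knowing X tells you Y exactly. Looking at the scatterplot before interpreting r would reveal the curve.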

Example: Height and weight

[Figure: two scatterplots of weight (kg) against height (m), each from a sample of n = 100. The first panel (various occupations) spans heights from 1.00 to 2.00 m; the second spans only 1.90 to 2.00 m.]

Yes, the more you study for a test, the higher the grade… Does this mean 70 hours of studying will guarantee a perfect score?

[Figure: sample data, scatterplot with a vertical axis from 0 to 100 and amount of studying from 3 to 15 hours]

Based on the data on the left, it is hard to speculate what happens when you study far beyond 15 hours.

[Figure: scatterplot with a vertical axis from 0 to 100 and amount of studying from 5 to 25 hours]

Problem of restricted range

• Previous examples illustrate the effect of range restriction on a correlation coefficient.
• An important issue to think about in the interpretation of your correlations: Do the data in fact contain the relevant range of the variable you want to infer about?
• For example, if you do want to assess the relationship between weight and height for basketball players, then there is nothing wrong with the data.
• Before comparing two correlation coefficients (assessing the same relationship), you should make sure the two data sets cover the same relevant range of interest.
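The effect of range restriction can be sketched with simulated data. Everything below is an assumption made for illustration: heights drawn uniformly between 1.00 and 2.00 m, weight generated as 40 times height plus normal noise, and a cut-off at 1.90 m for the restricted sample. None of these numbers come from the slides' data.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation from the definitional formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Simulated (hypothetical) population: same height-weight rule for everyone.
heights = [random.uniform(1.0, 2.0) for _ in range(2000)]
weights = [40 * h + random.gauss(0, 5) for h in heights]

r_full = pearson_r(heights, weights)

# Restricted sample: keep only people at least 1.90 m tall.
tall = [(h, w) for h, w in zip(heights, weights) if h >= 1.9]
r_tall = pearson_r([h for h, _ in tall], [w for _, w in tall])

print(f"full range:       r = {r_full:.2f} (n = {len(heights)})")
print(f"1.90 m and above: r = {r_tall:.2f} (n = {len(tall)})")
```

The rule linking height and weight is identical in both samples; only the range of height differs, and the restricted sample gives a much smaller r.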

What does a relationship mean?

• At the level of measured variables (quantitative hypotheses), a relationship was previously defined as a systematic pattern between values of X and Y.

[Figure: scatterplot of GRADE (0 to 100) against amount of studying (5 to 25 hours)]

• But unless we're doing statistics purely for the sake of statistics, a "relationship" has much more meaning to researchers.
• Let us now move beyond the statistical / quantitative level.

Beyond just the quantitative variables

• To social scientists, relationships have a substantive interpretation beyond just a systematic pattern between values of two variables.
• A relationship implies causality; to say there is a relationship between X and Y (or symbolically, X ↔ Y) can imply one of three things:
  [Diagram of the three possibilities not recoverable]
• Very often, a relationship is further used for explanations: If X and Y are related, we say "X explains Y", or "Y explains X".
• Both causality and explanation are actually complicated philosophical concepts.
  – What does it mean exactly for A to cause B?
  – How does an explanation work?
• We will not venture too far into philosophy, but we will consider some limitations in trying to use relationships we observe (e.g., r) for the causations we hope to infer.

Limitation 1: Quality of measurement

Our desired level of inference occurs at a more abstract level than that of the variables we can observe. Social phenomena involve variables that cannot be observed directly.

[Diagram: constructs ξ1 and ξ2, related by φ, are measured by indicators X1 and X2, whose correlation is r (ρ)]

Inference concerning how constructs are related depends on the quality of measurement; just how good are the indicators? Simply treating r (or ρ) as indicative of φ without concern for the quality of measurement is a mistake.
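A simulation can make this concrete. The sketch below is an assumption-laden illustration, not the course's own example: two constructs are generated with a correlation of φ = 0.8, and each indicator is its construct plus independent measurement error with the same variance as the construct. Under those assumptions the expected correlation between the indicators is 0.8 / 2 = 0.4.

```python
import math
import random

def pearson_r(x, y):
    """Sample Pearson correlation from the definitional formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the run is repeatable
phi = 0.8       # assumed correlation between the two constructs
n = 5000

# Constructs (unobservable in practice), each with variance 1.
xi1 = [random.gauss(0, 1) for _ in range(n)]
xi2 = [phi * a + math.sqrt(1 - phi ** 2) * random.gauss(0, 1) for a in xi1]

# Indicators: construct plus independent measurement error (variance 1).
x1 = [a + random.gauss(0, 1) for a in xi1]
x2 = [b + random.gauss(0, 1) for b in xi2]

r_constructs = pearson_r(xi1, xi2)
r_indicators = pearson_r(x1, x2)
print(f"constructs: {r_constructs:.2f}, indicators: {r_indicators:.2f}")
```

With noisy indicators, r understates φ considerably, so reading r as if it were φ would misjudge how strongly the constructs are related.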


A recent nation-wide survey collected data from university students across the country, with the aim of studying the consumer behavior of Canadian students. To the surprise of the researchers, a strong negative correlation was found between students' grades and the amount of money spent on air fresheners. These two variables are shown in the scatterplot below:

[Figure: scatterplot of students' grades and money spent on air fresheners, labeled r = 0.66]

Limitation 2: The crud factor


Let's consider what other variables may be related to the two we have observed:

[Diagram: a chain linking $ spent on air fresheners, crowdedness of living conditions, noise level of living conditions, and attentiveness during study]

• Many social science variables can be related to each other through such a "chain".
• This is attributable to the complexity (or interrelatedness) of social phenomena.
• It is not difficult to find variables that have a strong relationship; sometimes, unexpected relationships will be stumbled upon.

The prevalence of relationships between arbitrarily paired social variables has been called the "crud factor" (e.g., Meehl, 1997).

How to refer to correlations


Be very careful then of how you talk about correlations.

Acceptable descriptions:

• X is related to Y
• X is associated with Y
• X is correlated to (with) Y

In the social sciences, however, we should be very cautious of claims of causality:

• X influences Y
• X causes Y
• X affects Y
• X determines Y

To be able to make such claims is a highly desirable goal in science.