
Correlation

A bit about Pearson’s r


Questions
• Why does the maximum value of r equal 1.0?
• What does it mean when a correlation is positive? Negative?
• What is the purpose of the Fisher r to z transformation?
• What is range restriction? Range enhancement? What do they do to r?
• Give an example in which data properly analyzed by ANOVA cannot be used to infer causality.
• Why do we care about the sampling distribution of the correlation coefficient?
• What is the effect of reliability on r?
Basic Ideas
• Nominal vs. continuous IV
• Degree (direction) & closeness
(magnitude) of linear relations
– Sign (+ or -) for direction
– Absolute value for magnitude
• Pearson product-moment correlation
coefficient

r = Σ zX zY / N
Illustrations
[Three scatterplots: Weight by Height (a positive relation), Errors by Study Time (a negative relation), and SAT-V by Toe Size (a zero relation).]

Correlations can be positive, negative, or zero.
Simple Formulas

r = Σxy / (N SX SY), where x = X − X̄ and y = Y − Ȳ

SX = √( Σ(X − X̄)² / N )

Cov(X, Y) = Σxy / N

Use either N throughout or else N − 1 throughout (SD and denominator); the result is the same as long as you are consistent.

r = Σ zX zY / N, where zX = (X − X̄) / SX

Pearson's r is the average cross product of z scores: the product of (standardized) moments from the means.
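The equivalence of these formulas can be checked numerically. A minimal Python sketch with made-up illustrative data (the heights and weights below are invented, not the slide's data):

```python
import math

# Hypothetical heights (in) and weights (lbs), for illustration only
X = [60, 62, 65, 68, 70, 72, 75]
Y = [110, 120, 140, 150, 155, 170, 190]
N = len(X)

mx, my = sum(X) / N, sum(Y) / N
sx = math.sqrt(sum((x - mx) ** 2 for x in X) / N)  # population SD (divide by N)
sy = math.sqrt(sum((y - my) ** 2 for y in Y) / N)

# Formula 1: r = sum(xy) / (N * SX * SY), using deviation scores x and y
r1 = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / (N * sx * sy)

# Formula 2: r = average cross product of z scores
r2 = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)) / N

# Formula 3: covariance divided by the product of the SDs
cov = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / N
r3 = cov / (sx * sy)

print(r1, r2, r3)  # all three agree
```

The three values are identical, and stay identical if N − 1 replaces N in both the SDs and the denominator, as the slide notes.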
Graphic Representation
[Two scatterplots: Weight by Height in raw units (mean weight = 150.7 lbs, mean height = 66.8 inches) and the same data in z scores, with the quadrants of the z-score plot marked + and −.]

1. Conversion from raw scores to z scores.
2. Points & quadrants. Positive & negative products.
3. Correlation is the average of the cross products. Sign & magnitude of r depend on where the points fall.
4. The product is at its maximum (average = 1) when the points fall on the line where zX = zY.
Descriptive Statistics
N Minimum Maximum Mean Std. Deviation
Ht 10 60.00 78.00 69.0000 6.05530
Wt 10 110.00 200.00 155.0000 30.27650
Valid N (listwise) 10

r = 1.0: all points fall exactly on a line.

Leave X alone, add error to Y: r = .99.

Add more error: r = .91.

With 2 variables, the correlation is the z-score slope.
Review
• Why does the maximum value of r
equal 1.0?
• What does it mean when a correlation is
positive? Negative?
Sampling Distribution of r
Statistic is r; parameter is ρ (rho). In general, r is slightly biased.

[Figure: sampling distributions of r for ρ = −.5, ρ = 0, and ρ = .5; the distributions for nonzero ρ are skewed.]

The sampling variance is approximately:

σ²r = (1 − ρ²)² / N

Sampling variance depends both on N and on ρ.
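The approximation can be illustrated by simulation. A Monte Carlo sketch (the setup below is assumed: standard-normal X, with Y constructed to correlate ρ with X):

```python
import math
import random

random.seed(1)
rho, N, reps = 0.5, 100, 4000

rs = []
for _ in range(reps):
    # Bivariate normal sample with population correlation rho
    xs = [random.gauss(0, 1) for _ in range(N)]
    ys = [rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for x in xs]
    mx, my = sum(xs) / N, sum(ys) / N
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    rs.append(sxy / math.sqrt(sxx * syy))

mean_r = sum(rs) / reps                              # slightly below rho (bias)
var_r = sum((r - mean_r) ** 2 for r in rs) / reps    # empirical sampling variance
approx = (1 - rho ** 2) ** 2 / N                     # (1 - rho^2)^2 / N = .005625
print(mean_r, var_r, approx)
```

The empirical variance of r comes out close to (1 − ρ²)²/N, and the mean of r sits just under ρ, matching the slide's note that r is slightly biased.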
Empirical Sampling Distributions of the Correlation Coefficient

[Boxplots of observed r under four simulated conditions: ρ = .5, N = 100; ρ = .5, N = 50; ρ = .7, N = 100; ρ = .7, N = 50. The spread of r is larger for the smaller N and smaller for the larger ρ.]
Fisher’s r to z Transformation
z = .5 ln( (1 + r) / (1 − r) )

 r     z
.10   .10
.20   .20
.30   .31
.40   .42
.50   .55
.60   .69
.70   .87
.80  1.10
.90  1.47

[Figure: z (output) plotted against r (sample value input); nearly linear for small r, rising steeply as r approaches 1.]

The sampling distribution of z is normal as N increases.

The transformation pulls out the short tail to make a better (normal) distribution.

The sampling variance of z, 1/(N − 3), does not depend on ρ.
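The transformation is simply the inverse hyperbolic tangent. A quick Python check of the table above (the function name fisher_z is ours):

```python
import math

def fisher_z(r):
    """Fisher r-to-z: z = .5 * ln((1 + r) / (1 - r)), i.e. atanh(r)."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Reproduce the table, rounded to 2 decimals
for r in [.10, .20, .30, .40, .50, .60, .70, .80, .90]:
    print(r, round(fisher_z(r), 2))

# The formula is identical to the built-in inverse hyperbolic tangent
assert abs(fisher_z(0.8) - math.atanh(0.8)) < 1e-12
```

Note how z ≈ r for small r but stretches rapidly as r nears 1, which is what "pulls out the short tail."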
Hypothesis test 1: H0: ρ = 0

t = r √(N − 2) / √(1 − r²)

The result is compared to t with (N − 2) df for significance.

Say r = .25, N = 100:

t = .25 √98 / √(1 − .25²) = 2.475 / .968 = 2.56, p < .05

t(.05, 98) = 1.984.
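A sketch of this test in Python (the helper name t_for_r is ours; compare the result to a t table rather than computing a p value):

```python
import math

def t_for_r(r, N):
    """t test of H0: rho = 0; compare to t with N - 2 df."""
    return r * math.sqrt(N - 2) / math.sqrt(1 - r ** 2)

t = t_for_r(0.25, 100)
print(round(t, 2))  # 2.56, which exceeds t(.05, 98) = 1.984
```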


Hypothesis test 2: H0: ρ = value

z = [ .5 ln((1 + r)/(1 − r)) − .5 ln((1 + ρ)/(1 − ρ)) ] / √( 1/(N − 3) )

A one-sample z test, where r is the sample value and ρ is the hypothesized population value.

Say N = 200, r = .54, and ρ is .30:

z = [ .5 ln(1.54/.46) − .5 ln(1.30/.70) ] / √(1/197) = (.60 − .31) / .07 = 4.13

Compare to the unit normal, e.g., 4.13 > 1.96, so it is significant. Our sample was not drawn from a population in which ρ is .30.
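The same test as a Python sketch (the helper name z_test_rho is ours):

```python
import math

def z_test_rho(r, rho0, N):
    """One-sample z test of H0: rho = rho0, using Fisher's z transform."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(N - 3)

z = z_test_rho(0.54, 0.30, 200)
print(round(z, 2))  # about 4.1; well beyond 1.96, so reject H0
```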
Hypothesis test 3: H0: ρ1 = ρ2

Testing equality of correlations from 2 INDEPENDENT samples.

z = [ .5 ln((1 + r1)/(1 − r1)) − .5 ln((1 + r2)/(1 − r2)) ] / √( 1/(N1 − 3) + 1/(N2 − 3) )

Say N1 = 150, r1 = .63; N2 = 175, r2 = .70:

z = (.741 − .867) / .112 = −1.12, n.s.
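A Python sketch of the two-sample comparison (the helper name z_two_indep is ours):

```python
import math

def z_two_indep(r1, n1, r2, n2):
    """z test of H0: rho1 = rho2 for two INDEPENDENT samples."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (math.atanh(r1) - math.atanh(r2)) / se

z = z_two_indep(0.63, 150, 0.70, 175)
print(round(z, 2))  # about -1.12; |z| < 1.96, so not significant
```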
Hypothesis test 4: H0: ρ1 = ρ2 = … = ρk

Testing equality of any number of independent correlations.

zbar = Σ (ni − 3) zi / Σ (ni − 3)

Q = Σ (ni − 3)(zi − zbar)²

Compare Q to chi-square with k − 1 df.

Study   r    n     z    (n−3)z   zbar   (z−zbar)²   (n−3)(z−zbar)²
1      .2   200   .20    39.94   .41     .0441         8.69
2      .5   150   .55    80.75   .41     .0196         2.88
3      .6    75   .69    49.91   .41     .0784         5.64
sum         425         170.60                        17.21 = Q

Chi-square at .05 with 2 df = 5.99. Not all ρ are equal.
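The Q statistic is straightforward to compute; a Python sketch (the helper name chi_square_Q is ours):

```python
import math

def chi_square_Q(rs, ns):
    """Q test of H0: all k independent rho are equal.
    Compare Q to chi-square with k - 1 df."""
    zs = [math.atanh(r) for r in rs]        # Fisher z for each sample r
    ws = [n - 3 for n in ns]                # weights (n_i - 3)
    zbar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return sum(w * (z - zbar) ** 2 for w, z in zip(ws, zs))

Q = chi_square_Q([0.2, 0.5, 0.6], [200, 150, 75])
print(round(Q, 2))  # about 17.1 (the slide's 17.21 uses z's rounded to
                    # 2 decimals); Q > 5.99, so reject H0
```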
Hypothesis test 5: dependent r

H0: ρ12 = ρ13 (Hotelling-Williams test)

t(N−3) = (r12 − r13) √[ (N − 1)(1 + r23) / ( 2((N − 1)/(N − 3)) |R| + rbar²(1 − r23)³ ) ]

where rbar = (r12 + r13)/2 and |R| = 1 − r12² − r13² − r23² + 2(r12)(r13)(r23).

Say N = 101, r12 = .4, r13 = .6, r23 = .3:

rbar = (.4 + .6)/2 = .5

|R| = 1 − .4² − .6² − .3² + 2(.4)(.6)(.3) = .534

t(98) = (.4 − .6) √[ (100)(1 + .3) / ( 2(100/98)(.534) + .5²(1 − .3)³ ) ] = −2.10

t(.05, 98) = 1.98, so the two dependent correlations differ significantly.

H0: ρ12 = ρ34: see my notes.
Review
• What is the purpose of the Fisher r to z
transformation?
• Test the hypothesis that ρ1 = ρ2
– Given that r1 = .50, N1 = 103,
– r2 = .60, N2 = 128, and the samples are independent.
• Why do we care about the sampling
distribution of the correlation
coefficient?
Range Restriction/Enhancement
Reliability
Reliability sets the ceiling for validity. Measurement error attenuates correlations.

ρXY = ρTXTY √( ρXX' ρYY' )

If the correlation between true scores is .7 and the reliabilities of X and Y are both .8, the observed correlation is .7 √(.8 × .8) = .7 × .8 = .56.

Disattenuated correlation:

ρTXTY = ρXY / √( ρXX' ρYY' )

If our observed correlation is .56 and the reliabilities of both X and Y are .8, our estimate of the correlation between true scores is .56/.8 = .70.
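The attenuation and disattenuation formulas are inverses of each other, which a short Python sketch makes concrete (the helper names are ours):

```python
import math

def attenuate(rho_true, rel_x, rel_y):
    """Observed correlation, given the true-score correlation
    and the reliabilities of X and Y."""
    return rho_true * math.sqrt(rel_x * rel_y)

def disattenuate(r_obs, rel_x, rel_y):
    """Estimated true-score correlation, given the observed correlation."""
    return r_obs / math.sqrt(rel_x * rel_y)

r_obs = attenuate(0.7, 0.8, 0.8)        # .7 * .8 = .56
r_true = disattenuate(r_obs, 0.8, 0.8)  # back to .70
print(round(r_obs, 2), round(r_true, 2))
```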
Review
• What is range restriction? Range
enhancement? What do they do to r?
• What is the effect of reliability on r?
SAS Power Estimation
proc power;
  onecorr dist=fisherz
  corr = 0.35
  nullcorr = 0.2
  sides = 1
  ntotal = 100
  power = .;
run;

Computed Power
Actual alpha = .05
Power = .486

proc power;
  onecorr
  corr = 0.35
  nullcorr = 0
  sides = 2
  ntotal = .
  power = .8;
run;

Computed N Total
Alpha = .05
Actual Power = .801
Ntotal = 61
Power for Correlations
Rho    N required against H0: ρ = 0
.10    782
.15    346
.20    193
.25    123
.30     84
.35     61

Sample sizes required for powerful conventional significance tests for typical values of the correlation coefficient in psychology. Power = .8, two-tailed, alpha = .05.
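These sample sizes can be approximated without SAS using the Fisher-z normal approximation, N ≈ ((z_α/2 + z_β) / atanh(ρ))² + 3. A Python sketch (the helper name is ours; this approximation lands within a point or two of the SAS values above):

```python
import math
from statistics import NormalDist

def approx_n(rho, alpha=0.05, power=0.80):
    """Approximate N to detect rho against H0: rho = 0 (two-tailed),
    via the Fisher-z normal approximation."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)  # about 1.96
    z_b = nd.inv_cdf(power)          # about 0.84
    return math.ceil(((z_a + z_b) / math.atanh(rho)) ** 2 + 3)

for rho in [.10, .15, .20, .25, .30, .35]:
    print(rho, approx_n(rho))
```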
