
Statistics: part 2 Regression Analysis and SPSS

Correlation
(syllabus chapter 2)
Bjorn Winkens
Methodology and Statistics, University of Maastricht
Bjorn.Winkens@stat.unimaas.nl
11 April 2008
Methodology and Statistics | University of Maastricht Bjorn Winkens 2008

Content
Covariance and correlation
Pearson correlation coefficient
Tests and confidence intervals for correlations
Spearman correlation
Pitfalls
2

Association
Study goal = examine the association between two variables
Some questions arise:
What measure of association should we use?
Is there a positive or negative association?
Is there a linear association?
Is there a significant association?
3

Covariance (1)
= measure of how much two random variables vary together
Difference with variance?
Formula:

cov(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Note: cov(X, X) = var(X)


4
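As a minimal sketch of the formula above (NumPy assumed available; the data are hypothetical, not the lecture's):

```python
import numpy as np

# Hypothetical paired sample (not the slide's data)
x = np.array([165.0, 170.0, 175.0, 180.0, 185.0])  # e.g. height (cm)
y = np.array([60.0, 65.0, 72.0, 80.0, 85.0])       # e.g. weight (kg)

# Sample covariance: sum of (x_i - xbar)(y_i - ybar), divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov uses the same n - 1 denominator by default
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])

# Note from the slide: cov(X, X) = var(X)
assert np.isclose(np.cov(x, x)[0, 1], np.var(x, ddof=1))
```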

Covariance (2)
Example: X = height (cm), Y = weight (kg)
Positive or negative covariance?
x̄ = 181 cm, ȳ = 76.5 kg
Cov(X, Y) = 35.0 → positive association. Strong or weak?

[Scatter plot: Weight (kg), 50 to 110, against Height (cm), 150 to 200]

X* = height in meters: Cov(X*, Y) = 0.35

Correlation (1)
= measure of linear association between two random variables
Notation: population: ρ (rho); sample: r

Can take any value from −1 to +1
Closer to −1: stronger negative association
Closer to +1: stronger positive association
6

Correlation (2)
Pearson's correlation coefficient:

r = cov(X, Y) / (s_X · s_Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )

No dimension
Invariant under linear transformations
Example (X = height in cm, X* = height in m, Y = weight in kg):
Corr(X, Y) = r = 0.38
Corr(X*, Y) = r = 0.38
7
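The claimed invariance under linear transformations is easy to check numerically; a sketch with hypothetical data (any rescaling such as cm → m leaves r unchanged):

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: cov(a, b) / (s_a * s_b)."""
    a_dev, b_dev = a - a.mean(), b - b.mean()
    return np.sum(a_dev * b_dev) / np.sqrt(np.sum(a_dev**2) * np.sum(b_dev**2))

x_cm = np.array([165.0, 170.0, 175.0, 180.0, 185.0])  # hypothetical heights (cm)
y = np.array([60.0, 65.0, 72.0, 80.0, 85.0])          # hypothetical weights (kg)
x_m = x_cm / 100.0                                    # linear transformation: cm -> m

r_cm = pearson_r(x_cm, y)
r_m = pearson_r(x_m, y)
assert np.isclose(r_cm, r_m)                         # invariant under the rescaling
assert np.isclose(r_cm, np.corrcoef(x_cm, y)[0, 1])  # matches NumPy's built-in
```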

Practical examples (1)
[Scatter plot: FEV (l) against Height (in)]
Strong positive correlation, r = 0.9

[Scatter plot: Serum cholesterol (mg/dL) against Dietary intake of cholesterol]
Weak positive correlation, r = 0.3

Practical examples (2)
[Scatter plot: FEV (l) against Number of cigarettes per day]
Weak negative correlation, r = −0.2

[Scatter plot: Difficulty Numerical Task (DNT) against Caffeine]
Correlation r = 0.0. No association?
9

Practical examples (3)


[Scatter plot of Y against X with a curved pattern: r = 0.6. Is a straight line appropriate?]

Always check linearity with a (scatter)plot!


10

Size does not matter, shape is important

11

Test for correlation coefficient(s)


1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   Fisher's z-transformation
   Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)


12

One-sample t-test: H0: ρ = 0

Example: Is there a correlation between serum-cholesterol levels in spouses?
X = serum-cholesterol husband (normally distributed)
Y = serum-cholesterol wife (normally distributed)
H0: ρ = 0; H1: ρ ≠ 0
t-test:

t = r · √( (n − 2) / (1 − r²) )

t-distributed with df = n − 2 when H0: ρ = 0 is true


13

Example: serum-cholesterol (1)


n = 100 spouse pairs
Pearson's correlation coefficient r = 0.25
Is this correlation large enough to reject H0: ρ = 0?

t-test: t = 0.25 · √( (100 − 2) / (1 − 0.25²) ) = 2.56

Conclusion?

14
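The t-statistic above can be reproduced in a few lines (SciPy assumed available for the t-distribution):

```python
import math
from scipy import stats

r, n = 0.25, 100  # the serum-cholesterol example

# t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2 under H0: rho = 0
t = r * math.sqrt((n - 2) / (1 - r**2))
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)

print(round(t, 2), round(p_two_sided, 3))
```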

Example: serum-cholesterol (2)


Two-sided p-value: p = 2 × 0.006 = 0.012

Conclusion?

[t98 distribution: P(t98 ≤ −2.56) = 0.006 and P(t98 ≥ 2.56) = 0.006]
15

Be aware!
Significance depends on sample size:

n:                               10    20    50    100   200
Significant (α = 0.05) if r ≥:   0.63  0.44  0.28  0.20  0.14
16
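The table can be reproduced by inverting the t-test formula: r is significant when t = r·√((n − 2)/(1 − r²)) exceeds the critical t-value, i.e. when r ≥ t_crit / √(t_crit² + n − 2). A sketch (SciPy assumed):

```python
import math
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest r significant at level alpha (two-sided) for sample size n."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    # Invert t = r * sqrt((n - 2) / (1 - r^2)) for r
    return t_crit / math.sqrt(t_crit**2 + n - 2)

for n in (10, 20, 50, 100, 200):
    print(n, round(critical_r(n), 2))  # reproduces the slide's table
```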

Example: Estriol SPSS (1)


Birthweight (g)

Example: Is there an association between estriol level and birthweight? Sample: n = 31

5000 4500 4000 3500 3000 2500 2000 0 5 10 15 20 25 30

Estriol (mg/24 hr)

17

Example: Estriol SPSS (2)


SPSS output (Correlations):

                                  Estriol   Birthweight
Estriol      Pearson Correlation  1         .610**
             Sig. (2-tailed)      .         .000
             N                    31        31
Birthweight  Pearson Correlation  .610**    1
             Sig. (2-tailed)      .000      .
             N                    31        31

**. Correlation is significant at the 0.01 level (2-tailed).

Conclusion? H0: ρ = 0.1? H0: ρ = 0.3?


18

Test for correlation coefficient(s)


1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   Fisher's z-transformation
   Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)


19

One-sample z-test: H0: ρ = ρ0 (1)

If ρ0 ≠ 0, r has a skewed distribution
e.g. H0: ρ = 0.5: more room for deviation below 0.5 than above 0.5
the previous t-test for correlations is invalid!

Solution: Fisher's z-transformation of r

z = 0.5 · ln( (1 + r) / (1 − r) )

ln = natural logarithm (base e ≈ 2.718)
20
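The transformation is a one-liner (it equals the inverse hyperbolic tangent, so `math.atanh` computes the same quantity):

```python
import math

def fisher_z(r):
    """Fisher's z-transformation: z = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Values from the next slide: r = 0.8 -> z = 1.1, r = 0.05 -> z = 0.05
# Equivalent to the inverse hyperbolic tangent:
assert abs(fisher_z(0.8) - math.atanh(0.8)) < 1e-12
```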

Fishers z-transformation
r = −0.8 → z = −1.1
r = 0.05 → z = 0.05
r = 0.8 → z = 1.1

21

One-sample z-test: H0: ρ = ρ0 (2)

z is approximately normally distributed under H0 with mean

z0 = 0.5 · ln( (1 + ρ0) / (1 − ρ0) )

and variance 1/(n − 3)
Equivalently,

λ = (z − z0) · √(n − 3) ~ N(0, 1)

22

One-sample z-test: H0: ρ = ρ0 (3)

In conclusion:
H0: ρ = ρ0 (ρ0 ≠ 0); H1: ρ ≠ ρ0
Compute the sample correlation coefficient r
Transform r and ρ0 to z and z0, respectively, using Fisher's z-transformation
Compute the test statistic λ = (z − z0) · √(n − 3)
Compute the p-value (λ ~ N(0, 1))
Not (yet) available in SPSS!!!
23
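Since the slide notes this test is not available in SPSS, here is a minimal sketch of the steps in Python (SciPy assumed; the numbers come from the body-weight example on the following slides):

```python
import math
from scipy import stats

def corr_z_test(r, rho0, n):
    """One-sample z-test for H0: rho = rho0, via Fisher's z-transformation.

    Returns the test statistic (z - z0) * sqrt(n - 3) and a two-sided p-value.
    """
    z = 0.5 * math.log((1 + r) / (1 - r))
    z0 = 0.5 * math.log((1 + rho0) / (1 - rho0))
    lam = (z - z0) * math.sqrt(n - 3)
    p = 2 * stats.norm.sf(abs(lam))
    return lam, p

# Body-weight example: r = 0.38, rho0 = 0.10, n = 100
lam, p = corr_z_test(0.38, 0.10, 100)
```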

Example: Body weight (1)


Research question: Is the association between body weights of father and son different for biological than for non-biological fathers?
Previous research: a correlation of 0.1 is expected, based on earlier research with sons and non-biological fathers
Sample: n = 100 biological fathers and sons
Pearson's correlation coefficient r = 0.38
24

Example: Body weight (2)


H0: ρ = ρ0 = 0.10; H1: ρ ≠ 0.10
r = 0.38; ρ0 = 0.10
z = 0.5 · ln(1.38/0.62) = 0.40
z0 = 0.5 · ln(1.10/0.90) = 0.10

λ = (0.40 − 0.10) · √(100 − 3) = 2.955
p-value = 0.0031
Conclusion? Confidence interval?


25

Confidence interval for ρ


Step 1: compute the sample correlation r
Step 2: transform r to a Fisher z-score (z)
Step 3: compute a 100% × (1 − α) CI for z:
z1 = z − z_{1−α/2}/√(n − 3),  z2 = z + z_{1−α/2}/√(n − 3)
Step 4: transform this CI to a CI for ρ:
ρ1 = (e^(2·z1) − 1)/(e^(2·z1) + 1),  ρ2 = (e^(2·z2) − 1)/(e^(2·z2) + 1)
26

Example: Body weight (3)


95% confidence interval for ρ:
Step 1: sample correlation r = 0.38 (n = 100)
Step 2: z = 0.5 · ln(1.38/0.62) = 0.40
Step 3: z1 = 0.40 − 1.96/√97 = 0.20; z2 = 0.40 + 1.96/√97 = 0.60
Step 4: ρ1 = (e^(2·0.2) − 1)/(e^(2·0.2) + 1) = 0.20; ρ2 = (e^(2·0.6) − 1)/(e^(2·0.6) + 1) = 0.54

Conclusion?

27
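The four steps can be wrapped in a small function (SciPy assumed; the back-transformation (e^(2z) − 1)/(e^(2z) + 1) is tanh(z)):

```python
import math
from scipy import stats

def corr_ci(r, n, alpha=0.05):
    """Confidence interval for rho via Fisher's z-transformation (steps 1-4)."""
    z = 0.5 * math.log((1 + r) / (1 - r))            # step 2
    half = stats.norm.ppf(1 - alpha / 2) / math.sqrt(n - 3)
    z1, z2 = z - half, z + half                      # step 3
    return math.tanh(z1), math.tanh(z2)              # step 4: back-transform

lo, hi = corr_ci(0.38, 100)  # the body-weight example
```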

Example: Body weight (4)


95% CI for ρ:
1. Compute r: r = 0.38
2. Transform to z-score: z = 0.40
3. Compute CI for z: (0.20; 0.60)
4. Transform the CI for z back to a CI for ρ: (0.20; 0.54)
28

Test for correlation coefficient(s)


1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   Fisher's z-transformation
   Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)


29

Example: Body-weight (1) different design


Research question: Is the association between body weights of father and son different for biological than for non-biological fathers?
No previous research
Two samples:
First group (biological): n1 = 100; r1 = 0.38
Second group (non-biological): n2 = 50; r2 = 0.10
30

Two-sample z-test: H0: ρ1 = ρ2

Samples:
group 1: sample size n1, correlation r1, Fisher z-score z1
group 2: sample size n2, correlation r2, Fisher z-score z2

Test statistic:

λ = (z1 − z2) / √( 1/(n1 − 3) + 1/(n2 − 3) )

λ is approximately N(0, 1)-distributed under H0
Compare with the one-sample z-test (sheet 22)


31

Example: Body-weight (2) different design


Samples:
Group 1 (biological): n1 = 100; r1 = 0.38
Group 2 (non-biological): n2 = 50; r2 = 0.10

Fisher's transformation:
z1 = 0.5 · ln(1.38/0.62) = 0.40
z2 = 0.5 · ln(1.10/0.90) = 0.10

Test statistic: λ = (0.40 − 0.10) / √( 1/97 + 1/47 ) = 1.69

p-value = 0.091
Conclusion?
32
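The two-sample test with these numbers, as a sketch (SciPy assumed):

```python
import math
from scipy import stats

def two_sample_corr_test(r1, n1, r2, n2):
    """Two-sample z-test for H0: rho1 = rho2 (independent samples)."""
    z1 = 0.5 * math.log((1 + r1) / (1 - r1))
    z2 = 0.5 * math.log((1 + r2) / (1 - r2))
    lam = (z1 - z2) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return lam, 2 * stats.norm.sf(abs(lam))

# Biological (n1 = 100, r1 = 0.38) vs non-biological (n2 = 50, r2 = 0.10)
lam, p = two_sample_corr_test(0.38, 100, 0.10, 50)
```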

Rank correlation (1)


So far we assumed that X and Y are normally distributed
If X and/or Y is ordinal or has a distribution far from normal (e.g. due to outliers), significance tests based on the Pearson correlation coefficient are no longer valid
A non-parametric alternative should then be used, for example a test based on the

Spearman rank correlation coefficient


33

Rank correlation (2)


Spearman's rank correlation coefficient:
= Pearson's correlation coefficient based on the ranks of X and Y
Less sensitive to outliers; measures a more general (not specifically linear) association
n ≥ 10 (or 30): similar tests and CIs as for the Pearson correlation
n < 10 (or 30): exact significance levels can be found in a table
Many ties (same value): use Kendall's tau
34
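A sketch contrasting the two coefficients on hypothetical data with one outlier (SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical data; the last x-value is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])

rho_s, p_s = stats.spearmanr(x, y)  # rank-based: barely affected by the outlier
r_p, p_p = stats.pearsonr(x, y)     # strongly affected by the outlier

# Spearman's coefficient is Pearson's coefficient computed on the ranks:
rx, ry = stats.rankdata(x), stats.rankdata(y)
assert np.isclose(rho_s, stats.pearsonr(rx, ry)[0])
```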

Normality check (1)


Use P-P plots and histograms to check normality (symmetry)
Problem with (significance) tests for normality:
small sample size: no or little power to detect a discrepancy from normality
medium or large sample size: no or only a small impact, due to the central limit theorem

Skewed data (outliers) & small sample size → data transformation


35

Normality check (2)


Be aware: significance depends on sample size!

[Two histograms of Outcome (values 1 to 6) with the same shape:
left, small sample: Shapiro-Wilk p = 0.961;
right, larger sample: Shapiro-Wilk p = 0.039]
36
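The sample-size effect is easy to demonstrate: draw a small and a large sample from the same skewed distribution and compare the Shapiro-Wilk p-values (SciPy assumed; the seed and the exponential distribution are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Same (skewed) population shape at both sample sizes
w_small, p_small = stats.shapiro(rng.exponential(size=10))
w_large, p_large = stats.shapiro(rng.exponential(size=500))

# Small n: little power to detect the skewness; large n: clear rejection
print(p_small, p_large)
```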

Example: Apgar scores


Apgar score (physical condition) at 1 and 5 minutes for 24 newborns
Minimal score = 0; maximal score = 10
Spearman rank correlation = 0.593
(Pearson's correlation = 0.845)

t-test for the Spearman rank correlation: t = 3.45, df = 24 − 2 = 22
p-value < 0.01
Conclusion? Remarks?
37

Pitfalls
Spurious correlations
Correlation is not a measure of agreement
Change scores (Y − X) are always related to baseline X (regression to the mean)
Dependent pairs of observations (xi, yi)
Note: no mathematical problem, but the interpretation is incorrect
38

Dependent pairs of observations


Association between study duration and grade
Plot 1: dependency ignored → negative association
Plot 2: dependency taken into account (data from the same subject connected) → positive association

[Two scatter plots: GRADE (3 to 10) against STUDY DURATION (4 to 7)]

Students were measured twice!!!

39

Relation between two variables


Three main purposes:
Association
  Pearson or Spearman correlation coefficient
Agreement (same quantity: X = Y)
  method of Bland and Altman (Lancet, 1986)
Prediction
  regression analysis
40
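For the agreement case, the Bland-Altman approach summarizes the mean difference (bias) and its 95% limits of agreement rather than a correlation; a minimal sketch on hypothetical paired measurements:

```python
import numpy as np

# Hypothetical: the same quantity measured by two methods on six subjects
a = np.array([10.1, 12.3, 9.8, 11.5, 10.9, 12.0])
b = np.array([10.4, 12.0, 10.1, 11.9, 10.7, 12.3])

diff = a - b
bias = diff.mean()                          # mean difference between the methods
sd = diff.std(ddof=1)
loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement
# A Bland-Altman plot shows diff against (a + b) / 2 with these three lines
```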

QUESTIONS?

41