
Statistics: part 2 Regression Analysis and SPSS

Correlation
(syllabus chapter 2)
Bjorn Winkens
Methodology and Statistics, University of Maastricht
Bjorn.Winkens@stat.unimaas.nl
11 April 2008
Methodology and Statistics | University of Maastricht Bjorn Winkens 2008

Content
Covariance and correlation
Pearson correlation coefficient
Tests and confidence intervals for correlations
Spearman correlation
Pitfalls
2

Association
Study goal = examine the association between two variables
Some questions arise:
What measure of association should we use?
Is there a positive or negative association?
Is there a linear association?
Is there a significant association?
3

Covariance (1)
= measure of how much two random variables vary together
Difference with variance?
Formula:

cov(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

Note: cov(X, X) = var(X)


4
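As a minimal sketch of the formula above (NumPy assumed available; the data are hypothetical, not the lecture's):

```python
import numpy as np

# Hypothetical paired sample (not the slide's data)
x = np.array([165.0, 170.0, 175.0, 180.0, 185.0])  # e.g. height (cm)
y = np.array([60.0, 65.0, 72.0, 80.0, 85.0])       # e.g. weight (kg)

# Sample covariance: sum of (x_i - xbar)(y_i - ybar), divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov uses the same n - 1 denominator by default
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])

# Note from the slide: cov(X, X) = var(X)
assert np.isclose(np.cov(x, x)[0, 1], np.var(x, ddof=1))
```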

Covariance (2)
Example: X = height (cm), Y = weight (kg)
Positive or negative covariance?
x̄ = 181 cm, ȳ = 76.5 kg
Cov(X, Y) = 35.0 → positive association. Strong or weak?

[Scatter plot: Weight (kg), 50 to 110, against Height (cm), 150 to 200]

X* = height in meters: Cov(X*, Y) = 0.35

Correlation (1)
= measure of linear association between two random variables
Notation: population: ρ (rho); sample: r

Can take any value from −1 to +1
Closer to −1: stronger negative association
Closer to +1: stronger positive association
6

Correlation (2)
Pearson's correlation coefficient:

r = cov(X, Y) / (s_X · s_Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )

No dimension
Invariant under linear transformations
Example (X = height in cm, X* = height in m, Y = weight in kg):
Corr(X, Y) = r = 0.38
Corr(X*, Y) = r = 0.38
7
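The claimed invariance under linear transformations is easy to check numerically; a sketch with hypothetical data (any rescaling such as cm → m leaves r unchanged):

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: cov(a, b) / (s_a * s_b)."""
    a_dev, b_dev = a - a.mean(), b - b.mean()
    return np.sum(a_dev * b_dev) / np.sqrt(np.sum(a_dev**2) * np.sum(b_dev**2))

x_cm = np.array([165.0, 170.0, 175.0, 180.0, 185.0])  # hypothetical heights (cm)
y = np.array([60.0, 65.0, 72.0, 80.0, 85.0])          # hypothetical weights (kg)
x_m = x_cm / 100.0                                    # linear transformation: cm -> m

r_cm = pearson_r(x_cm, y)
r_m = pearson_r(x_m, y)
assert np.isclose(r_cm, r_m)                         # invariant under the rescaling
assert np.isclose(r_cm, np.corrcoef(x_cm, y)[0, 1])  # matches NumPy's built-in
```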

Practical examples (1)
[Scatter plot: FEV (l) against Height (in)]
Strong positive correlation, r = 0.9

[Scatter plot: Serum cholesterol (mg/dL) against Dietary intake of cholesterol]
Weak positive correlation, r = 0.3

Practical examples (2)
[Scatter plot: FEV (l) against Number of cigarettes per day]
Weak negative correlation, r = −0.2

[Scatter plot: Difficulty Numerical Task (DNT) against Caffeine]
Correlation r = 0.0. No association?
9

Practical examples (3)


[Scatter plot of Y against X with a curved pattern: r = 0.6. Is a straight line appropriate?]

Always check linearity with a (scatter)plot!


10

Size does not matter, shape is important

11

Test for correlation coefficient(s)


1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   Fisher's z-transformation
   Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)


12

One-sample t-test: H0: ρ = 0

Example: Is there a correlation between serum-cholesterol levels in spouses?
X = serum-cholesterol husband (normally distributed)
Y = serum-cholesterol wife (normally distributed)
H0: ρ = 0; H1: ρ ≠ 0
t-test:

t = r · √( (n − 2) / (1 − r²) )

t-distributed with df = n − 2 when H0: ρ = 0 is true


13

Example: serum-cholesterol (1)


n = 100 spouse pairs
Pearson's correlation coefficient r = 0.25
Is this correlation large enough to reject H0: ρ = 0?

t-test: t = 0.25 · √( (100 − 2) / (1 − 0.25²) ) = 2.56

Conclusion?

14
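The t-statistic above can be reproduced in a few lines (SciPy assumed available for the t-distribution):

```python
import math
from scipy import stats

r, n = 0.25, 100  # the serum-cholesterol example

# t = r * sqrt((n - 2) / (1 - r^2)), df = n - 2 under H0: rho = 0
t = r * math.sqrt((n - 2) / (1 - r**2))
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)

print(round(t, 2), round(p_two_sided, 3))
```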

Example: serum-cholesterol (2)


Two-sided p-value: p = 2 × 0.006 = 0.012

Conclusion?

[t98 distribution: P(t98 ≤ −2.56) = 0.006 and P(t98 ≥ 2.56) = 0.006]
15

Be aware!
Significance depends on sample size:

n:                               10    20    50    100   200
Significant (α = 0.05) if r ≥:   0.63  0.44  0.28  0.20  0.14
16
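The table can be reproduced by inverting the t-test formula: r is significant when t = r·√((n − 2)/(1 − r²)) exceeds the critical t-value, i.e. when r ≥ t_crit / √(t_crit² + n − 2). A sketch (SciPy assumed):

```python
import math
from scipy import stats

def critical_r(n, alpha=0.05):
    """Smallest r significant at level alpha (two-sided) for sample size n."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    # Invert t = r * sqrt((n - 2) / (1 - r^2)) for r
    return t_crit / math.sqrt(t_crit**2 + n - 2)

for n in (10, 20, 50, 100, 200):
    print(n, round(critical_r(n), 2))  # reproduces the slide's table
```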

Example: Estriol SPSS (1)


Birthweight (g)

Example: Is there an association between estriol level and birthweight? Sample: n = 31

5000 4500 4000 3500 3000 2500 2000 0 5 10 15 20 25 30

Estriol (mg/24 hr)

17

Example: Estriol SPSS (2)


SPSS output (Correlations):

                                  Estriol   Birthweight
Estriol      Pearson Correlation  1         .610**
             Sig. (2-tailed)      .         .000
             N                    31        31
Birthweight  Pearson Correlation  .610**    1
             Sig. (2-tailed)      .000      .
             N                    31        31

**. Correlation is significant at the 0.01 level (2-tailed).

Conclusion? H0: ρ = 0.1? H0: ρ = 0.3?


18

Test for correlation coefficient(s)


1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   Fisher's z-transformation
   Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)


19

One-sample z-test: H0: ρ = ρ0 (1)

If ρ0 ≠ 0, r has a skewed distribution
e.g. H0: ρ = 0.5: more room for deviation below 0.5 than above 0.5
the previous t-test for correlations is invalid!

Solution: Fisher's z-transformation of r

z = 0.5 · ln( (1 + r) / (1 − r) )

ln = natural logarithm (base e ≈ 2.718)
20
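The transformation is a one-liner (it equals the inverse hyperbolic tangent, so `math.atanh` computes the same quantity):

```python
import math

def fisher_z(r):
    """Fisher's z-transformation: z = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Values from the next slide: r = 0.8 -> z = 1.1, r = 0.05 -> z = 0.05
# Equivalent to the inverse hyperbolic tangent:
assert abs(fisher_z(0.8) - math.atanh(0.8)) < 1e-12
```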

Fishers z-transformation
r = −0.8 → z = −1.1
r = 0.05 → z = 0.05
r = 0.8 → z = 1.1

21

One-sample z-test: H0: ρ = ρ0 (2)

z is approximately normally distributed under H0 with mean

z0 = 0.5 · ln( (1 + ρ0) / (1 − ρ0) )

and variance 1/(n − 3)
Equivalently,

λ = (z − z0) · √(n − 3) ~ N(0, 1)

22

One-sample z-test: H0: ρ = ρ0 (3)

In conclusion:
H0: ρ = ρ0 (ρ0 ≠ 0); H1: ρ ≠ ρ0
Compute the sample correlation coefficient r
Transform r and ρ0 to z and z0, respectively, using Fisher's z-transformation
Compute the test statistic λ = (z − z0) · √(n − 3)
Compute the p-value (λ ~ N(0, 1))
Not (yet) available in SPSS!!!
23
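Since the slide notes this test is not available in SPSS, here is a minimal sketch of the steps in Python (SciPy assumed; the numbers come from the body-weight example on the following slides):

```python
import math
from scipy import stats

def corr_z_test(r, rho0, n):
    """One-sample z-test for H0: rho = rho0, via Fisher's z-transformation.

    Returns the test statistic (z - z0) * sqrt(n - 3) and a two-sided p-value.
    """
    z = 0.5 * math.log((1 + r) / (1 - r))
    z0 = 0.5 * math.log((1 + rho0) / (1 - rho0))
    lam = (z - z0) * math.sqrt(n - 3)
    p = 2 * stats.norm.sf(abs(lam))
    return lam, p

# Body-weight example: r = 0.38, rho0 = 0.10, n = 100
lam, p = corr_z_test(0.38, 0.10, 100)
```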

Example: Body weight (1)


Research question: Is the association between body weights of father and son different for biological than for non-biological fathers?
Previous research: a correlation of 0.1 is expected, based on earlier research with sons and non-biological fathers
Sample: n = 100 biological fathers and sons
Pearson's correlation coefficient r = 0.38
24

Example: Body weight (2)


H0: ρ = ρ0 = 0.10; H1: ρ ≠ 0.10
r = 0.38; ρ0 = 0.10
z = 0.5 · ln(1.38/0.62) = 0.40
z0 = 0.5 · ln(1.10/0.90) = 0.10

λ = (0.40 − 0.10) · √(100 − 3) = 2.955
p-value = 0.0031
Conclusion? Confidence interval?


25

Confidence interval for ρ


Step 1: compute the sample correlation r
Step 2: transform r to a Fisher z-score (z)
Step 3: compute a 100% × (1 − α) CI for z:
z1 = z − z_{1−α/2}/√(n − 3),  z2 = z + z_{1−α/2}/√(n − 3)
Step 4: transform this CI to a CI for ρ:
ρ1 = (e^(2·z1) − 1)/(e^(2·z1) + 1),  ρ2 = (e^(2·z2) − 1)/(e^(2·z2) + 1)
26

Example: Body weight (3)


95% confidence interval for ρ:
Step 1: sample correlation r = 0.38 (n = 100)
Step 2: z = 0.5 · ln(1.38/0.62) = 0.40
Step 3: z1 = 0.40 − 1.96/√97 = 0.20; z2 = 0.40 + 1.96/√97 = 0.60
Step 4: ρ1 = (e^(2·0.2) − 1)/(e^(2·0.2) + 1) = 0.20; ρ2 = (e^(2·0.6) − 1)/(e^(2·0.6) + 1) = 0.54

Conclusion?

27
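The four steps can be wrapped in a small function (SciPy assumed; the back-transformation (e^(2z) − 1)/(e^(2z) + 1) is tanh(z)):

```python
import math
from scipy import stats

def corr_ci(r, n, alpha=0.05):
    """Confidence interval for rho via Fisher's z-transformation (steps 1-4)."""
    z = 0.5 * math.log((1 + r) / (1 - r))            # step 2
    half = stats.norm.ppf(1 - alpha / 2) / math.sqrt(n - 3)
    z1, z2 = z - half, z + half                      # step 3
    return math.tanh(z1), math.tanh(z2)              # step 4: back-transform

lo, hi = corr_ci(0.38, 100)  # the body-weight example
```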

Example: Body weight (4)


95% CI for ρ:
1. Compute r: r = 0.38
2. Transform to z-score: z = 0.40
3. Compute CI for z: (0.20; 0.60)
4. Transform the CI for z back to a CI for ρ: (0.20; 0.54)
28

Test for correlation coefficient(s)


1. One-sample t-test: H0: ρ = 0
2. One-sample z-test: H0: ρ = ρ0
   Fisher's z-transformation
   Confidence interval for ρ
3. Two-sample z-test: H0: ρ1 = ρ2 (independent samples)


29

Example: Body-weight (1) different design


Research question: Is the association between body weights of father and son different for biological than for non-biological fathers?
No previous research
Two samples:
First group (biological): n1 = 100; r1 = 0.38
Second group (non-biological): n2 = 50; r2 = 0.10
30

Two-sample z-test: H0: ρ1 = ρ2

Samples:
group 1: sample size n1, correlation r1, Fisher z-score z1
group 2: sample size n2, correlation r2, Fisher z-score z2

Test statistic:

λ = (z1 − z2) / √( 1/(n1 − 3) + 1/(n2 − 3) )

λ is approximately N(0, 1)-distributed under H0
Compare with the one-sample z-test (sheet 22)


31

Example: Body-weight (2) different design


Samples:
Group 1 (biological): n1 = 100; r1 = 0.38
Group 2 (non-biological): n2 = 50; r2 = 0.10

Fisher's transformation:
z1 = 0.5 · ln(1.38/0.62) = 0.40
z2 = 0.5 · ln(1.10/0.90) = 0.10

Test statistic: λ = (0.40 − 0.10) / √( 1/97 + 1/47 ) = 1.69

p-value = 0.091
Conclusion?
32
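The two-sample test with these numbers, as a sketch (SciPy assumed):

```python
import math
from scipy import stats

def two_sample_corr_test(r1, n1, r2, n2):
    """Two-sample z-test for H0: rho1 = rho2 (independent samples)."""
    z1 = 0.5 * math.log((1 + r1) / (1 - r1))
    z2 = 0.5 * math.log((1 + r2) / (1 - r2))
    lam = (z1 - z2) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return lam, 2 * stats.norm.sf(abs(lam))

# Biological (n1 = 100, r1 = 0.38) vs non-biological (n2 = 50, r2 = 0.10)
lam, p = two_sample_corr_test(0.38, 100, 0.10, 50)
```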

Rank correlation (1)


So far we assumed that X and Y are normally distributed
If X and/or Y is ordinal or has a distribution far from normal (e.g. due to outliers), significance tests based on the Pearson correlation coefficient are no longer valid
A non-parametric alternative should then be used, for example a test based on the

Spearman rank correlation coefficient


33

Rank correlation (2)


Spearman's rank correlation coefficient:
= Pearson's correlation coefficient based on the ranks of X and Y
Less sensitive to outliers; measures a more general (not specifically linear) association
n ≥ 10 (or 30): similar tests and CIs as for the Pearson correlation
n < 10 (or 30): exact significance levels can be found in a table
Many ties (same value): use Kendall's tau
34
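A sketch contrasting the two coefficients on hypothetical data with one outlier (SciPy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical data; the last x-value is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])

rho_s, p_s = stats.spearmanr(x, y)  # rank-based: barely affected by the outlier
r_p, p_p = stats.pearsonr(x, y)     # strongly affected by the outlier

# Spearman's coefficient is Pearson's coefficient computed on the ranks:
rx, ry = stats.rankdata(x), stats.rankdata(y)
assert np.isclose(rho_s, stats.pearsonr(rx, ry)[0])
```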

Normality check (1)


Use P-P plots and histograms to check normality (symmetry)
Problem with (significance) tests for normality:
small sample size: no or little power to detect a discrepancy from normality
medium or large sample size: no or only a small impact, due to the central limit theorem

Skewed data (outliers) & small sample size → data transformation


35

Normality check (2)


Be aware: significance depends on sample size!

[Two histograms of Outcome (values 1 to 6) with the same shape:
left, small sample: Shapiro-Wilk p = 0.961;
right, larger sample: Shapiro-Wilk p = 0.039]
36
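The sample-size effect is easy to demonstrate: draw a small and a large sample from the same skewed distribution and compare the Shapiro-Wilk p-values (SciPy assumed; the seed and the exponential distribution are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Same (skewed) population shape at both sample sizes
w_small, p_small = stats.shapiro(rng.exponential(size=10))
w_large, p_large = stats.shapiro(rng.exponential(size=500))

# Small n: little power to detect the skewness; large n: clear rejection
print(p_small, p_large)
```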

Example: Apgar scores


Apgar score (physical condition) at 1 and 5 minutes for 24 newborns
Minimal score = 0; maximal score = 10
Spearman rank correlation = 0.593
(Pearson's correlation = 0.845)

t-test for the Spearman rank correlation: t = 3.45, df = 24 − 2 = 22
p-value < 0.01
Conclusion? Remarks?
37

Pitfalls
Spurious correlations
Correlation is not a measure of agreement
Change scores (Y − X) are always related to baseline X (regression to the mean)
Dependent pairs of observations (xi, yi)
Note: no mathematical problem, but the interpretation is incorrect
38

Dependent pairs of observations


Association between study duration and grade
Plot 1: dependency ignored → negative association
Plot 2: dependency taken into account (data from the same subject connected) → positive association

[Two scatter plots: GRADE (3 to 10) against STUDY DURATION (4 to 7)]

Students were measured twice!!!

39

Relation between two variables


Three main purposes:
Association
  Pearson or Spearman correlation coefficient
Agreement (same quantity: X = Y)
  method of Bland and Altman (Lancet, 1986)
Prediction
  regression analysis
40
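For the agreement case, the Bland-Altman approach summarizes the mean difference (bias) and its 95% limits of agreement rather than a correlation; a minimal sketch on hypothetical paired measurements:

```python
import numpy as np

# Hypothetical: the same quantity measured by two methods on six subjects
a = np.array([10.1, 12.3, 9.8, 11.5, 10.9, 12.0])
b = np.array([10.4, 12.0, 10.1, 11.9, 10.7, 12.3])

diff = a - b
bias = diff.mean()                          # mean difference between the methods
sd = diff.std(ddof=1)
loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement
# A Bland-Altman plot shows diff against (a + b) / 2 with these three lines
```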

QUESTIONS?

41