
Correlation and Regression

Introduction

• We have been looking at differences between means and at the chi-square test of
the independence of two variables.
• Now we are going to look at the relationship between two variables.
• Two common examples: (1) the relationship between beta-endorphin levels 12
hours before surgery and 10 minutes before surgery. Are high levels at one
reading associated with high levels at the other? (We ran a t test on these data
about two weeks ago.) (2) The relationship between SAT scores and performance
on an SAT-like test when the subjects have not read the passage on which the
questions are based.

Prediction and Relationships

• We want to ask if Y is some function of X, where X and Y are two different
variables.
• Discuss differences between correlation and regression
o Correlation is the word we usually use when we want a single measure of
the degree of relationship between two variables.
o Regression is the word we usually use when we want an equation relating
the variables.
o When we have only one predictor, the two approaches tend to blur into
one--we almost never use regression without also speaking of the
correlation coefficient. When we have multiple predictors, we are much
more interested in the regression side of things.
• Y is almost always thought of as a dependent variable beyond the experimenter's
control.
o In regression, X is usually (traditionally) thought of as a fixed variable,
even when it really isn't.
 This is called the linear regression model.
o In correlation, X is usually thought of as a random variable.
 This is called the bivariate normal model.
• I'm deliberately using a small sample example just to keep things simple. But
don't get the idea that small samples are a good idea.
• The following data refer to beta-endorphin levels 12 hours and 10 minutes before
surgery. Notice that they are paired by patient. (These are real data.)
Subject   12 Hours Before   10 Min. Before   Gain
   1           10.0              20.0        10.0
   2            6.5              14.0         7.5
   3            8.0              13.5         5.5
   4           12.0              18.0         6.0
   5            5.0              14.5         9.5
   6           11.5               9.0        -2.5
   7            5.0              18.0        13.0
   8            3.5               6.5         3.0
   9            7.5               7.5         0.0
  10            5.8               6.0         0.2
  11            4.7              25.0        20.3
  12            8.0              12.0         4.0
  13            7.0              15.0         8.0
  14           17.0              42.0        25.0
  15            8.8              16.0         7.2
  16           17.0              52.0        35.0
  17           15.0              11.5        -3.5
  18            4.4               2.5        -1.9
  19            2.0               2.0         0.0

• We could run a t test here, but we did that before.
o It would address an entirely different question.
• We would presumably like to look at the relationship between people's beta-
endorphin scores at the two times.
o Did people who started out high stay high?
o What would it mean if they didn't?
• The first thing we could do is to plot the data.
o The 10 min. data go on the ordinate, because it is logical to predict
forward, not backward, in time.

• Here we see that there is a positive relationship between the two variables--we'll
talk about significance later.
• If we want a measure of the degree of this relationship, the correlation is 0.699
o As we'll see later, the relationship is significant.
o What does that mean?
• In this particular example both of the variables are random--we don't know what
the values of X, or Y, will be before the experiment begins.
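A quick check of that r = .699, sketched from the endorphin table above using the plain Pearson formula, together with the usual t test of H0: rho = 0 (the critical value 2.110 is t.05 for 17 df, from standard tables):

```python
import math

# Beta-endorphin levels for the 19 patients in the table above
twelve_hr = [10.0, 6.5, 8.0, 12.0, 5.0, 11.5, 5.0, 3.5, 7.5, 5.8,
             4.7, 8.0, 7.0, 17.0, 8.8, 17.0, 15.0, 4.4, 2.0]
ten_min = [20.0, 14.0, 13.5, 18.0, 14.5, 9.0, 18.0, 6.5, 7.5, 6.0,
           25.0, 12.0, 15.0, 42.0, 16.0, 52.0, 11.5, 2.5, 2.0]

n = len(twelve_hr)
mean_x = sum(twelve_hr) / n
mean_y = sum(ten_min) / n

# Sums of squares and cross-products about the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(twelve_hr, ten_min))
sxx = sum((x - mean_x) ** 2 for x in twelve_hr)
syy = sum((y - mean_y) ** 2 for y in ten_min)

r = sxy / math.sqrt(sxx * syy)                     # Pearson correlation
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # test on r, df = n - 2

print(round(r, 3))   # 0.699
print(round(t, 2))   # well beyond the critical value of 2.110
```

Since t exceeds 2.110, the correlation is significant at the .05 level, which is what "the relationship is significant" means here.
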
Example with Fixed X

• This is really a regression problem.


• Data from Langlois and Roggman (1990) on page 411 of the text.
o Describe study
o Here I have entered 1, 2, ..., 5 for the power of 2 corresponding to the
number of pictures that were averaged. I have used the mean rated
attractiveness of the photographs.

Condition   Attract        Condition   Attract
    1        2.201             3        3.226
    1        2.411             3        2.811
    1        2.407             3        2.857
    1        2.403             3        3.422
    1        2.826             4        3.233
    1        3.380             4        3.505
    2        1.893             4        3.192
    2        3.102             4        3.209
    2        2.355             4        2.860
    2        3.644             4        3.111
    2        2.767             5        3.200
    2        2.109             5        3.253
    3        2.906             5        3.357
    3        2.118             5        3.169
                               5        3.291
                               5        3.290
Notice that there is no sampling error in X, whereas there was in the previous example.

What does that statement mean?

The scatterplot for these data is given below.

• Notice how the columns line up. Get them to explain why. (This is common with
fixed X.)
• Notice how judged attractiveness increases with the number of faces included in
the composite.
• Notice how the variability of data points decreases as we increase X. This is a
no-no from the point of view of assumptions behind correlation and regression. It
will also be a problem with the analysis of variance.
o Keep in mind that we are talking about assumptions about populations,
though I'm pretty sure that the assumption is violated.
o Ask why this might be expected to happen.
• The correlation is about the same as in the previous example--r = .56, and it is
significant.
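That r = .56 can be verified from the table above with the same Pearson formula as before; a sketch, treating condition (1 through 5) as the numeric predictor:

```python
import math

# Langlois & Roggman data: condition (power of 2 for the number of
# faces averaged) and mean rated attractiveness, from the table above
condition = [1]*6 + [2]*6 + [3]*6 + [4]*6 + [5]*6
attract = [2.201, 2.411, 2.407, 2.403, 2.826, 3.380,
           1.893, 3.102, 2.355, 3.644, 2.767, 2.109,
           2.906, 2.118, 3.226, 2.811, 2.857, 3.422,
           3.233, 3.505, 3.192, 3.209, 2.860, 3.111,
           3.200, 3.253, 3.357, 3.169, 3.291, 3.290]

n = len(condition)
mx = sum(condition) / n
my = sum(attract) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(condition, attract))
sxx = sum((x - mx) ** 2 for x in condition)
syy = sum((y - my) ** 2 for y in attract)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))   # 0.56
```

Note that Sxx here has no sampling error: it is fixed by the design (six observations at each of 1 through 5), which is the "fixed X" point made above.
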

Third Example--Smoking and Low Birthweight.

• I chose this example because it is one that psychologists deal with, and relates to
an important health problem.
• The question is the relationship between age and low-birthweight (we know they
are related), and what happens when mothers do, and do not, smoke.
• Data on smoking mothers (pooled across 48 states; DV = % low birthweight).

• Mothers who do not smoke:


• Notice several things:


o Neither relationship is exactly linear, though we get away with a straight
line in the first one.
o Both relationships are essentially the same, but exaggerated.
o Notice the difference in the mean %.
o I don't quite know what to make of these data, but they are interesting.
 If you get pregnant, don't smoke--especially if you are old and
creaky.
 I'm not above a little drum beating.

• To beat another drum, Minitab's monthly web page just reported:

"According to 'The World's Women 2000: Trends and Statistics' (a United
Nations compilation of the latest data documenting progress for women
worldwide), an African woman's lifetime risk of dying from pregnancy-related
causes is 1 in 16; in Asia, 1 in 65; and in Europe, 1 in 1,400."

Final Example--Breast Cancer as a function of Solar Radiation

• These data were taken from a 1991 Newsweek article.


• One of my favorite examples.
The Correlation Coefficient

• The covariance is a measure of how two variables vary together, but it is an
"unscaled" measure.

• The definitional formula, covXY = Σ(X − X̄)(Y − Ȳ)/(N − 1), is one we
probably won't see again.


• Discuss the correlation coefficient and its calculation.
• Why is this one negative?
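To make the "unscaled" point concrete, here is a sketch using the endorphin data: the covariance depends on the units of measurement, but dividing by the two standard deviations rescales it to r, which is unit-free. (Doubling every X score doubles the covariance but leaves r untouched.)

```python
import math

# Endorphin levels, 12 hours and 10 minutes before surgery
x = [10.0, 6.5, 8.0, 12.0, 5.0, 11.5, 5.0, 3.5, 7.5, 5.8,
     4.7, 8.0, 7.0, 17.0, 8.8, 17.0, 15.0, 4.4, 2.0]
y = [20.0, 14.0, 13.5, 18.0, 14.5, 9.0, 18.0, 6.5, 7.5, 6.0,
     25.0, 12.0, 15.0, 42.0, 16.0, 52.0, 11.5, 2.5, 2.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Definitional formula: covXY = sum((X - Xbar)(Y - Ybar)) / (N - 1)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))

r = cov / (sx * sy)   # scaling the covariance by the two SDs gives r

# Change the units of X (double every score): cov changes, r does not
cov2 = sum((2*a - 2*mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
r2 = cov2 / ((2 * sx) * sy)
print(round(cov, 2), round(r, 3))
```
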

The Adjusted Correlation Coefficient

• I want them to know what this is, but I don't want them to go away thinking that
we use it very often. (We rarely do.)
• What we want is an unbiased estimate of the correlation in the population.
• Comment that we very rarely use the adjusted coefficient, even though most
programs print it out.
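For the record, the adjustment for the one-predictor case is radj² = 1 − (1 − r²)(N − 1)/(N − 2). A sketch, plugging in r = .699 and N = 19 from the endorphin example:

```python
import math

r = 0.699   # endorphin correlation from earlier
n = 19      # number of pairs

# Adjusted (shrunken) estimate of the squared population correlation
r2_adj = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
r_adj = math.sqrt(r2_adj)
print(round(r_adj, 3))   # slightly smaller than r itself
```

The adjusted value is always a bit smaller than r, and the shrinkage matters most with small samples; with large N the two are nearly identical, which is part of why the adjusted coefficient is rarely reported.
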

The Regression Line

• Here we are looking for the best straight line that can be fit to these data.
• I have included those lines in the plots above.
• We want an equation of the form Ŷ = bX + a,

where b = slope and a = intercept

Define slope and intercept.

• This is a general equation for any straight line.

• We solve a set of equations for a and b such that Σ(Y − Ŷ)² is a minimum.


• There are an infinite number of lines with that slope, and another infinite number
of lines with that intercept, but only one line with both that slope and intercept.
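The least-squares solution has a closed form: b = Sxy/Sxx and a = Ȳ − bX̄. A sketch applying it to the endorphin data:

```python
# Endorphin levels, 12 hours (X) and 10 minutes (Y) before surgery
x = [10.0, 6.5, 8.0, 12.0, 5.0, 11.5, 5.0, 3.5, 7.5, 5.8,
     4.7, 8.0, 7.0, 17.0, 8.8, 17.0, 15.0, 4.4, 2.0]
y = [20.0, 14.0, 13.5, 18.0, 14.5, 9.0, 18.0, 6.5, 7.5, 6.0,
     25.0, 12.0, 15.0, 42.0, 16.0, 52.0, 11.5, 2.5, 2.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: b = Sxy / Sxx; intercept: a = Ybar - b * Xbar
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx

print(round(b, 2), round(a, 2))   # slope ~ 1.99, intercept ~ -0.56
```

So the fitted line for these data is roughly Ŷ = 1.99X − 0.56: each unit increase in the 12-hour reading predicts about a two-unit increase in the 10-minute reading.
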

SPSS Analysis of these Cancer and Solar Radiation Data


• Discuss all parts of this printout:
o Include the Anova table and explain what's going on
o Ask what an intercept of 0 would mean. (In this case I can't imagine that it
would mean much, because I can't imagine a case where solar radiation
really = 0.)
o Discuss the slope
 What if the slope were greater or less than it is?
 What if the slope were 0?
 What if we were plotting the same general variable on both axes
(as we did with endorphins) and we had a slope = 1.0. What would
that mean?
o Point out the tests on these coefficients.
o Go back to the regression line and discuss "least squares."
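One way to make "least squares" concrete is to compute Σ(Y − Ŷ)² for the fitted line and then for a few nearby lines: every alternative fits worse. A sketch with the endorphin data:

```python
# Endorphin levels, 12 hours (X) and 10 minutes (Y) before surgery
x = [10.0, 6.5, 8.0, 12.0, 5.0, 11.5, 5.0, 3.5, 7.5, 5.8,
     4.7, 8.0, 7.0, 17.0, 8.8, 17.0, 15.0, 4.4, 2.0]
y = [20.0, 14.0, 13.5, 18.0, 14.5, 9.0, 18.0, 6.5, 7.5, 6.0,
     25.0, 12.0, 15.0, 42.0, 16.0, 52.0, 11.5, 2.5, 2.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx

def sse(slope, intercept):
    """Sum of squared errors, sum((Y - Yhat)^2), for a candidate line."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

best = sse(b, a)
# Perturb the slope and intercept in every direction: all fit worse
for db in (-0.5, 0.5):
    for da in (-2.0, 2.0):
        assert sse(b + db, a + da) > best

print(round(best, 1))   # the minimum achievable sum of squared errors
```

This is exactly the sense in which the regression line is "best": among all possible straight lines, it is the unique one minimizing Σ(Y − Ŷ)².
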