
# Correlation and Regression

## Introduction

• We have been looking at differences between means and at the chi-square test of
the independence of two variables.
• Now we are going to look at the relationship between two variables.
• Two common examples: (1) the relationship between Beta-endorphin levels 12
hours before surgery and 10 minutes before surgery--are high levels at one
reading associated with high levels at the other? (We ran a t test on these data
about two weeks ago.) (2) The relationship between SAT scores and performance
on an SAT-like test when the subjects have not read the passage on which the
questions are based.

• We want to ask if Y is some function of X, where X and Y are two different
variables.
• Discuss differences between correlation and regression
o Correlation is the word we usually use when we want a single measure of
the degree of relationship between two variables.
o Regression is the word we usually use when we want an equation relating
the variables.
o When we have only one predictor, the two approaches tend to blur into
one--we almost never use regression without also speaking of the
correlation coefficient. When we have multiple predictors, we are much
more interested in the regression side of things.
• Y is almost always thought of as a dependent variable beyond the experimenter's
control.
o In regression, X is usually (traditionally) thought of as a fixed variable,
even when it really isn't.
 This is called the linear regression model.
o In correlation, X is usually thought of as a random variable.
 This is called the bivariate normal model.
• I'm deliberately using a small sample example just to keep things simple. But
don't get the idea that small samples are a good idea.
• The following data refer to beta-endorphin levels 12 hours and 10 minutes before
surgery. Notice that they are paired by patient. (These are real data.)
| Subject | 12 Hours Before | 10 Min. Before | Gain |
| --- | --- | --- | --- |
| … | … | … | … |
| 19 | 2.0 | 2.0 | 0.0 |

• We could run a t test here, but we did that before.
o It would address an entirely different question.
• We would presumably like to look at the relationship between people's beta-
endorphin scores at the two times.
o Did people who started out high stay high?
o What would it mean if they didn't?
• The first thing we could do is to plot the data.
o The 10 min. data go on the ordinate, because it is logical to predict
forward, not backward, in time.

• Here we see that there is a positive relationship between the two variables--we'll
talk about significance later.
• If we want a measure of the degree of this relationship, the correlation is 0.699
o As we'll see later, the relationship is significant.
o What does that mean?
• In this particular example both of the variables are random--we don't know what
the values of X, or Y, will be before the experiment begins.
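The significance claim above can be made concrete. The standard test of H0: ρ = 0 is a t test on N − 2 df. A minimal Python sketch, using r = .699 from this example and assuming N = 19 pairs (taken from the last subject number in the table):

```python
import math

def t_for_r(r, n):
    """t statistic for testing H0: rho = 0, evaluated on n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_for_r(0.699, 19)  # r from the endorphin example; N = 19 is an assumption
print(round(t, 2))      # well beyond the two-tailed .05 cutoff of about 2.11 on 17 df
```

The observed t is about 4.03, which is why the relationship is called significant.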
## Example with Fixed X

• This is really a regression problem.

• Data from Langlois and Roggman (1990) on page 411 of the text.
o Describe study
o Here I have entered 1, 2, ..., 5 for the power of 2 concerning the number of
pictures that were averaged. I have used the mean rated attractiveness of
the photographs.

| Condition | Attractiveness ratings |
| --- | --- |
| 1 | 2.201, 2.411, 2.407, 2.403, 2.826, 3.380 |
| 2 | 1.893, 3.102, 2.355, 3.644, 2.767, 2.109 |
| 3 | 2.906, 2.118, 3.226, 2.811, 2.857, 3.422 |
| 4 | 3.233, 3.505, 3.192, 3.209, 2.860, 3.111 |
| 5 | 3.200, 3.253, 3.357, 3.169, 3.291, 3.290 |
Notice that there is no sampling error in X, whereas there was in the previous example.

The scatterplot for these data is given below.

• Notice how the columns line up. Get them to explain why. (This is common with
fixed X.)
• Notice how judged attractiveness increases with the number of faces included in
the composite.
• Notice how the variability of data points decreases as we increase X. This is a
no-no from the point of view of assumptions behind correlation and regression. It
will also be a problem with the analysis of variance.
o Keep in mind that we are talking about assumptions about populations,
though I'm pretty sure that the assumption is violated.
o Ask why this might be expected to happen.
• The correlation is about the same as in the previous example--r = .56, and it is
significant.
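Because X is fixed here, only the interpretation of r changes, not its calculation. A quick check on the attractiveness data above, using only the standard library (`scipy.stats.pearsonr` would give the same value):

```python
# Langlois & Roggman attractiveness data: condition -> six mean ratings
data = {
    1: [2.201, 2.411, 2.407, 2.403, 2.826, 3.380],
    2: [1.893, 3.102, 2.355, 3.644, 2.767, 2.109],
    3: [2.906, 2.118, 3.226, 2.811, 2.857, 3.422],
    4: [3.233, 3.505, 3.192, 3.209, 2.860, 3.111],
    5: [3.200, 3.253, 3.357, 3.169, 3.291, 3.290],
}
x = [c for c, ys in data.items() for _ in ys]   # condition, repeated per rating
y = [v for ys in data.values() for v in ys]     # flattened ratings

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / (sxx * syy) ** 0.5
print(round(r, 2))  # 0.56, matching the value in the notes
```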

## Third Example--Smoking and Low Birthweight.

• I chose this example because it is one that psychologists deal with, and relates to
an important health problem.
• The question is the relationship between maternal age and low birthweight (we
know they are related), and what happens when mothers do, and do not, smoke.
• Data on smoking mothers (pooled across 48 states; DV = % low birthweight).

• Notice several things:

o Neither relationship is exactly linear, though we get away with a straight
line in the first one.
o Both relationships are essentially the same shape, but the one for smokers is exaggerated.
o Notice the difference in the mean %.
o I don't quite know what to make of these data, but they are interesting.
 If you get pregnant, don't smoke--especially if you are old and
creaky.
 I'm not above a little drum beating.

"According to "The World's Women 2000: Trends and Statistics" (a United
Nations compilation of the latest data documenting progress for women
worldwide), an African woman's lifetime risk of dying from pregnancy-related
causes is 1 in 16; in Asia, 1 in 65; and in Europe, 1 in 1,400."

• These data were taken from a 1991 Newsweek article.

• One of my favorite examples.
## The Correlation Coefficient

• The covariance is a measure of how two variables vary together, but it is an
"unscaled" measure.

cov_XY = Σ(X − X̄)(Y − Ȳ) / (N − 1)

• This is a definitional formula, and we probably won't see it again.

• Discuss the correlation coefficient and its calculation.
• Why is this one negative?
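To make the "unscaled" point concrete: the covariance carries the units of X times the units of Y, and dividing by the two standard deviations rescales it to the unit-free r. A small sketch with made-up numbers showing a negative relationship (the direction presumably at issue in the "Why is this one negative?" example, such as solar radiation vs. cancer rate):

```python
import statistics as st

# Hypothetical paired scores with a negative relationship (high X goes with low Y)
x = [1, 2, 3, 4, 5]
y = [10, 8, 9, 5, 3]

n = len(x)
mx, my = st.mean(x), st.mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)  # definitional formula
r = cov / (st.stdev(x) * st.stdev(y))                          # scale by the two SDs

print(cov)          # negative, because the cross-products are mostly negative
print(round(r, 3))  # same sign as the covariance, but now bounded by -1 and +1
```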

## The Adjusted Correlation Coefficient

• I want them to know what this is, but I don't want them to go away thinking that
we use it very often. (We rarely do.)
• What we want is an unbiased estimate of the correlation in the population.
• Comment that we very rarely use the adjusted coefficient, even though most
programs print it out.
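For reference, a common form of the adjusted coefficient is r_adj = sqrt(1 − (1 − r²)(N − 1)/(N − 2)); this is a sketch of that form, which should be checked against the formula in the text:

```python
import math

def r_adj(r, n):
    """Adjusted (shrunken) correlation -- one common form; verify against the text."""
    return math.sqrt(1 - (1 - r ** 2) * (n - 1) / (n - 2))

# With larger N the adjustment barely matters, one reason it is rarely reported
print(round(r_adj(0.56, 30), 3))
print(round(r_adj(0.56, 10), 3))
```

Note that the adjusted value is always pulled toward zero, and the pull shrinks as N grows.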

## The Regression Line

• Here we are looking for the best straight line that can be fit to these data.
• I have included those lines in the plots above.
• We want an equation of the form:

Ŷ = bX + a

• We solve a set of equations for a and b such that Σ(Y − Ŷ)² is a minimum.

• There are an infinite number of lines with that slope, and another infinite number
of lines with that intercept, but only one line with both that slope and intercept.
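The least-squares solution has a closed form: b = S_xy/S_xx and a = Ȳ − bX̄. A sketch fitting the line to the attractiveness data from the earlier example:

```python
def least_squares(x, y):
    """Return slope b and intercept a minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    a = my - b * mx
    return b, a

# Attractiveness example: x = condition (1-5), six ratings per condition
x = [c for c in range(1, 6) for _ in range(6)]
y = [2.201, 2.411, 2.407, 2.403, 2.826, 3.380,
     1.893, 3.102, 2.355, 3.644, 2.767, 2.109,
     2.906, 2.118, 3.226, 2.811, 2.857, 3.422,
     3.233, 3.505, 3.192, 3.209, 2.860, 3.111,
     3.200, 3.253, 3.357, 3.169, 3.291, 3.290]

b, a = least_squares(x, y)
print(round(b, 3), round(a, 3))  # predicted attractiveness = a + b * condition
```

Any other (slope, intercept) pair gives a larger sum of squared residuals, which is the sense in which this line is "best."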

## SPSS Analysis of these Cancer and Solar Radiation Data

• Discuss all parts of this printout:
o Include the Anova table and explain what's going on
o Ask what an intercept of 0 would mean. (In this case I can't imagine that it
would mean much, because I can't imagine a case where solar radiation
really = 0.)
o Discuss the slope
 What if the slope were greater or less than it is?
 What if the slope were 0?
 What if we were plotting the same general variable on both axes
(as we did with endorphins) and we had a slope of 1.0? What would
that mean?
o Point out the tests on these coefficients.
o Go back to the regression line and discuss "least squares."