

Session 10

Correlation and Simple Linear Regression



Types of relationships and Scatter Plots
Correlation
A Note on Non-Linear Relationships
Simple Linear Regression
Practical session 10

SESSION 10: Correlation and Simple Linear Regression

Types of Relationships and Scatter Plots

What do we mean when we say that two variables are related? Nothing
complicated; simply that knowing the value of one variable tells us something
about the other.

In Session 7 we produced some Scatter Plots. A Scatter Plot of two
variables that are unrelated produces what appears to be a random pattern;
Figure 7.14 was an example of this. The other extreme is a Perfect
Relationship, where knowing the value of one variable can tell you the exact
value of the other. In these cases, the points on the Scatter Plot can be
joined to form a smooth line. We will be interested in Linear Relationships;
that is, where the line would be straight.

Perfect relationships are rare, so we will create some from the GSS data. If
we imagine that all the fathers are exactly thirty years older than their children,
we can create a new variable DADSAGE using

Transform
Compute Variable.

and inserting DADSAGE as the Target Variable and AGE + 30 in the
Numeric Expression box. To plot the new variable against AGE, we select
the following option (an alternative to Graphs > Chart Builder, used in Session 7):

Graphs
Interactive
Scatterplot.

DADSAGE is selected for the Y Axis, and AGE for the X Axis; a title can be
added by clicking on Titles; then click OK. The Scatter Plot produced is in
Figure 10.1(a).

This is a positive relationship; that is, as the value of one variable increases,
so does the value of the other.

In a negative relationship the opposite is true: as the value of one variable
increases, the value of the other decreases. As an example, we can create a
new variable, HUNAGE, which is the number of years each respondent has to
go before they reach 100 years of age. We do this using the expression
100 - AGE in the Compute procedure. Recalling the Scatter Plot dialogue
box and changing the Y Axis variable from DADSAGE to HUNAGE (don't
forget to change the title too!) produces the Scatter Plot in Figure 10.1(b).




Figure 10.1(a) Figure 10.1(b)

But what of the middle ground, where there is some relationship between two
variables (see Figure 10.2 below), but it is not a perfect relationship?

Correlation

The strength of a linear relationship (i.e. how tightly the data points are
clustered around an imaginary line) can be measured for quantitative
(normally distributed) variables using the Pearson Correlation Coefficient.

The values of the Correlation Coefficient can range from -1 to +1. The
following table provides a summary of the types of relationship and their
Correlation Coefficients:

Linear Relationship    Correlation Coefficient

Perfect Negative       -1
Negative               -1 to 0
None                   0
Positive               0 to +1
Perfect Positive       +1

The larger the absolute value of the Correlation Coefficient (that is, regardless of sign), the stronger the
linear relationship between the two variables. A positive correlation coefficient
indicates that as one variable increases, the other increases; a negative
correlation coefficient indicates that as one variable increases, the other
decreases.
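
As a rough check on the two ends of the table, the short Python sketch below computes the Pearson Correlation Coefficient directly from its definition. Python is not part of the SPSS procedure and is used here only for illustration. The ages are made up (they are not GSS values), the function name pearson_r is an assumption for the example, and the two derived lists follow the same rules as DADSAGE and HUNAGE.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ages, for illustration only (not taken from the GSS data).
ages = [23, 31, 38, 45, 52, 67]
dadsage = [age + 30 for age in ages]   # same rule as DADSAGE = AGE + 30
hunage = [100 - age for age in ages]   # same rule as HUNAGE = 100 - AGE

r_pos = pearson_r(ages, dadsage)  # perfect positive relationship
r_neg = pearson_r(ages, hunage)   # perfect negative relationship
print(round(r_pos, 6), round(r_neg, 6))
```

Up to floating-point rounding, the first value should be +1 and the second -1, matching the Perfect Positive and Perfect Negative rows of the table.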


Figure 10.2: the graph shows a moderate positive relationship between Prestige score and Highest Year of School completed (Pearson correlation = 0.52).


Using the GSS91t data, we look at the linear relationships between the
education of the respondent (EDUC), that of the parents (MAEDUC and
PAEDUC), the age of the respondent (AGE), and the Occupational Prestige
Score (PRESTG80).

We want to look at pairs of variables, so we will use Bivariate Correlations.
In SPSS, we select:

Analyze
Correlate
Bivariate.

and then, in the Bivariate Correlations dialogue box, we move the five
variables EDUC, MAEDUC, PAEDUC, AGE and PRESTG80 to the Variables
list (Figure 10.3). All possible pairs of variables from our chosen list will have
the Correlation Coefficient calculated.

Notice the default choices in the Bivariate Correlations dialogue box; we
could choose Kendall's tau-b or the Spearman rank Correlation Coefficient
as well as (or instead of) the Pearson.



Figure 10.3

The Two-tailed Test of Significance means that both positive and negative
linear relationships will be considered. Clicking OK produces the following
table in the Output Viewer (Figure 10.4).


Figure 10.4

Notice that, for each pair of variables, the number of respondents, N, differs.
This is because the default is to exclude missing cases pairwise; that is, if a
respondent has missing values for some of the variables, he or she is
removed from the Correlation calculations involving those variables, but is
included in any others where there are valid values for both variables. Using
the Options button in the Bivariate Correlations dialogue box, we can, if we
require, exclude missing cases listwise (meaning that if there are missing
values in any of the variables in the chosen list, the respondent is excluded
from all the Correlation calculations).
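
The difference between the two exclusion rules can be sketched in a few lines of Python. This is only an illustration of the idea, not SPSS's own code: the six respondents are invented, None stands for a missing value, and the helper names (complete_cases, correlate, pearson_r) are assumptions for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical respondents (not GSS values); None marks a missing value.
rows = [
    {"EDUC": 12, "MAEDUC": 10, "PAEDUC": 11},
    {"EDUC": 16, "MAEDUC": 12, "PAEDUC": None},
    {"EDUC": 10, "MAEDUC": None, "PAEDUC": 8},
    {"EDUC": 14, "MAEDUC": 12, "PAEDUC": 14},
    {"EDUC": 18, "MAEDUC": 16, "PAEDUC": 16},
    {"EDUC": 11, "MAEDUC": 9, "PAEDUC": 12},
]
variables = ["EDUC", "MAEDUC", "PAEDUC"]


def complete_cases(rows, names):
    """Keep only the rows that have a valid value for every variable in names."""
    return [row for row in rows if all(row[name] is not None for name in names)]


def correlate(rows, a, b, listwise=False):
    """Return (r, N) for variables a and b under either exclusion rule."""
    # Pairwise: only a and b must be valid. Listwise: every chosen variable must be.
    needed = variables if listwise else [a, b]
    kept = complete_cases(rows, needed)
    r = pearson_r([row[a] for row in kept], [row[b] for row in kept])
    return r, len(kept)


r_pair, n_pair = correlate(rows, "EDUC", "MAEDUC")
r_list, n_list = correlate(rows, "EDUC", "MAEDUC", listwise=True)
print("pairwise N =", n_pair, "listwise N =", n_list)
```

With pairwise exclusion the respondent who is missing only PAEDUC still contributes to the EDUC and MAEDUC correlation, so N is larger than under listwise exclusion. This is why N can differ from one pair of variables to the next in the output table.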

Using the Sig. (2-tailed) value, we can determine whether the Correlation is a
significant one. The Null Hypothesis is that the Correlation Coefficient is
zero (or close enough to be taken as zero), and we reject this at the 5% level
if the significance is less than 0.05.

SPSS flags the Correlation Coefficients with a single asterisk if they are
significant at the 5% level, and a double asterisk if significant at the 1% level.
This feature can be removed if preferred by clearing the check box in the
Bivariate Correlations dialogue box next to Flag significant correlations.

We can see in our example that there are significant positive Correlations for
each pair of the education variables; age is significantly negatively Correlated
with each of them, and the Occupational Prestige Score has significant
positive Correlations with each. All these Correlations are significant at the
1% level, with the education of mothers and fathers having the strongest
relationship.

The remaining variable pairing, age and Occupational Prestige, does not have
a significant linear relationship; the Correlation Coefficient of 0.007 is not
significantly different from zero, as indicated by the Significance Level of
0.799. This is a formal test of what we saw in Figure 7.14; in the Scatter Plot
of PRESTG80 against AGE, the points seemed randomly scattered.


A Note on Non-Linear Relationships

It must be emphasised that we are dealing with Linear Relationships. You
may find that the Correlation Coefficient indicates no significant Linear
Relationship between two variables, but they may have a Non-Linear
Relationship for which we are not testing.

Figure 10.5 contains the results of the Correlation and Scatter Plot procedures
performed on some hypothetical data.

Figure 10.5


As can be seen, the Correlation Coefficient is not significant, indicating no
Linear Relationship, while the Plot indicates a very obvious quadratic
relationship. It is always a good idea to check for relationships visually using
graphics as well as using formal statistical methods!
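
A minimal numerical illustration of this point, written in Python with invented data (these are not the hypothetical values plotted in Figure 10.5): y is determined exactly by x through y = x squared, yet the Pearson Correlation Coefficient is zero because the relationship is not linear. The function name pearson_r is an assumption for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data: x runs symmetrically around zero and y is exactly x squared.
xs = list(range(-5, 6))      # -5, -4, ..., 4, 5
ys = [x ** 2 for x in xs]    # a perfect quadratic (non-linear) relationship

r = pearson_r(xs, ys)
print(r)
```

The positive and negative halves of the curve cancel out in the calculation, so the coefficient gives no hint of a relationship that is obvious as soon as the points are plotted.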



Simple Linear Regression

We look at the relationship between occupational prestige score PRESTG80
and highest year of school completed EDUC in the GSS data. Is there a
significant linear relationship between these variables? Can we predict
prestige score from highest year of school completed?

Firstly, we produce a Scatter Plot in the normal way, with PRESTG80 as the
dependent variable plotted on the Y Axis, and EDUC as the independent
variable on the X Axis. The data points are scattered, but there appears to be
a positive linear relationship between the variables (Figure 10.2).

To describe this relationship SPSS estimates the Line of best fit. We can
superpose this line on the graph.

After using the Graphs >> Interactive >> Scatterplot option, we
Double Click on the Scatter Plot that has appeared in the Output Viewer to
activate the chart. Then use the menu:

Insert
Fit Line
Regression.

The Line of Best Fit, its equation and the R-square value appear on the plot.
The text label can be moved by clicking on it and dragging it; it can be hidden
by right clicking and choosing the Hide Label option; the connecting line
(Connector) can be deselected. Note the other editing options.

Alternatively, the regression line can be superposed on the graph at the same
time as plotting. After selecting Graphs >> Interactive >> Scatterplot and the
variables to be plotted, select the Fit tab, and for the Method, select
Regression.


Note that if you use the alternative Chart Builder option to draw your graph,
you can superpose the line of best fit on it as follows: double click on the
graph to open the Chart Editor, click on Elements on the menu, then Fit Line
at Total. The linear regression line will be placed on the graph as the default.

Figure 10.6
The Scatter Plot then appears as seen in Figure 10.7 below.

Figure 10.7

Figure 10.7 shows the line of best fit estimated by SPSS, using a technique
called the 'method of least squares'.

Imagine you have measured the vertical distance from every data point to an
imaginary straight line drawn on the plot. Square these distances and add
them together to get a total sum. If you draw a different line through the
points, and go through the same procedure, you will get a different sum. The
'line of best fit' (also called the least squares line) is the line which produces
the smallest sum of squared vertical distances from the observed points to it.
The least squares line is the line that is closest to all the data points
simultaneously.
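
The procedure just described can be sketched in Python. The handout does not show SPSS's internal computation; the closed-form formulas used here are the standard least squares solution for a single predictor. The data points are invented for illustration (they are not GSS values), and the names fit_line and sum_squared_distances are assumptions for the example.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_distances(xs, ys, intercept, slope):
    """Sum of squared vertical distances from the points to the given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Invented (years of education, prestige score) pairs, for illustration only.
educ = [8, 10, 12, 12, 14, 16, 18]
prestige = [30, 38, 36, 45, 44, 52, 60]

intercept, slope = fit_line(educ, prestige)
best = sum_squared_distances(educ, prestige, intercept, slope)

# Drawing a different line (steeper, or shifted upwards) gives a larger sum.
steeper = sum_squared_distances(educ, prestige, intercept, slope + 0.5)
higher = sum_squared_distances(educ, prestige, intercept + 2.0, slope)
print(best < steeper, best < higher)
```

Whatever alternative line is tried, its sum of squared vertical distances cannot be smaller than the sum for the fitted line; that is what makes it the line of best fit.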

We can use this line to estimate the mean prestige score for respondents with
different numbers of years in education.

For example, to get an estimate of the prestige score of a respondent with 10
years in education, we look along the horizontal axis, find 10, and then go up
vertically until we meet the line. From this point we trace a horizontal line left
until we meet the vertical axis. The value on the vertical (prestige score) axis
gives us the value of the estimated prestige score (here about 35). We can
repeat the procedure with a respondent with 15 years in education (estimated
mean score=45).

Estimating the equation of the line

When variables appear to have a linear relationship we use linear regression
to estimate the equation of the least squares line.

Figure 10.7 annotation: when Highest year of school completed = 15, the estimated Prestige score read from the graph = 45.
The line in Figure 10.7 can be defined by the position where it cuts the y
(Prestige score) axis and by its slope. The slope is a measure of how much
the line goes up (vertically) for every unit it goes along horizontally. In this
case the slope is positive, i.e. the Prestige Score increases as Highest Year of
School Completed increases.

The equation of a straight line takes the form

y = a + bx

where a is the intercept (where the line crosses the vertical y-axis) and b is
the gradient or slope of the line (i.e. how steep it is). The sign of b indicates
a positive or negative relationship. If b is zero, this indicates the absence of
any linear relationship between the two variables x and y. If b is large
(either positively or negatively), this indicates that a small change in x would
lead to a large change in y.

Linear regression can be used to estimate the values of the intercept a and the slope b, and also to
indicate whether the relationship between Prestige score and Highest year of
School Completed is significant.

To perform a Linear Regression in SPSS, we select:

Analyze
Regression
Linear.

and this gives the Linear Regression dialogue box seen in Figure 10.8 (we
will see more of this in Session 11).


Figure 10.8

The Dependent variable is the Prestige Score (PRESTG80) and the
Independent variable is EDUC, the Highest Year of School Completed.

Clicking on OK produces 4 tables in the Output Viewer; we will look at the first
three in more detail in Session 11. The fourth table, Coefficients, is in Figure
10.9 below.
Coefficients(a)

                                      Unstandardized       Standardized
                                      Coefficients         Coefficients
Model                                 B        Std. Error  Beta          t       Sig.
1  (Constant)                         13.079   1.340                     9.761   .000
   Highest Year of School Completed   2.295    .100        .520          22.864  .000

a. Dependent Variable: R's Occupational Prestige Score (1980)

Figure 10.9

The estimated values of the intercept a and the slope b are displayed in the
Unstandardized Coefficients, B column. a is the constant (=13.079) and b is
the slope (=2.295).

This tells us that the equation of the Line of Best Fit (y = a + bx) is:

PRESTG80 = 13.079 + (2.295 * EDUC)

Occupational Prestige Score can be estimated by multiplying the Highest
Year of School Completed by 2.295 and adding 13.079.

For each extra year in education, we expect the Occupational Prestige score
to increase by 2.295.

For example, estimated Prestige Score for respondents with 10 years in
education is given by 2.295*10 + 13.079 = 36.03.
The estimated Prestige Score for respondents with 15 years in education is
given by 2.295*15 + 13.079 = 47.5.
Compare these values with those estimated from the graph.
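
The same arithmetic can be written as a small Python function, which is convenient for trying other values of EDUC. The two constants are the B values from Figure 10.9; the function name estimated_prestige is an assumption for the example.

```python
INTERCEPT = 13.079  # (Constant) row, column B in Figure 10.9
SLOPE = 2.295       # Highest Year of School Completed row, column B


def estimated_prestige(years_of_education):
    """Estimated mean Occupational Prestige Score from the fitted line."""
    return INTERCEPT + SLOPE * years_of_education


for years in (10, 15):
    print(years, round(estimated_prestige(years), 2))
```

This prints 36.03 for 10 years and 47.5 for 15 years, as calculated above. The difference between the estimates for any two consecutive years of education is the slope, 2.295.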

Is this relationship between the two variables a significant one? In other
words, is the coefficient of EDUC, 2.295, significantly different from zero?
The Linear Regression procedure performs a test for this, and the results are
produced in the final two columns.

Our Null Hypothesis is that the slope coefficient b is zero (or not significantly
different from zero). On the evidence of the t-test in the EDUC row of the
table, we reject this hypothesis, since the Significance is less than 0.05.
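
In a Coefficients table of this kind, the t value in each row is the coefficient B divided by its Std. Error. The Python sketch below recomputes that ratio from the figures printed in Figure 10.9. Because the table shows values rounded to three decimal places, the recomputed ratios only approximately match the printed t values. The function name t_ratio and the dictionary layout are assumptions for the example.

```python
def t_ratio(b, std_error):
    """t statistic for a coefficient: the estimate divided by its standard error."""
    return b / std_error


# Values read from the Coefficients table (Figure 10.9), as printed (rounded).
table = {
    "(Constant)": {"B": 13.079, "std_error": 1.340, "t_reported": 9.761},
    "EDUC": {"B": 2.295, "std_error": 0.100, "t_reported": 22.864},
}

for name, row in table.items():
    t_from_rounded = t_ratio(row["B"], row["std_error"])
    print(name, round(t_from_rounded, 2), "reported:", row["t_reported"])
```

For EDUC the recomputed ratio is 22.95 against the reported 22.864; the small gap is consistent with the Std. Error having been rounded to .100 for display. Either way the t value is far from zero, which agrees with the Sig. value of .000 in the final column.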

Therefore we say that, at the 5% level, there is evidence that the number of years in
education has a significant effect on Occupational Prestige score.
Practical session 10

1. Weight and height of children

Open the STATLAB data (H:\My Documents\spss data\Statlaba.sav).

Produce a Scatter Plot of the child's weight at age 10 (CTW) against the
child's height at that time (CTH).
(NB 'Y against X')

Superimpose the Line of Best Fit on the plot.

Is there a strong relationship between these two variables? What is the
Pearson Correlation?

Perform a Linear Regression to estimate the Line of Best Fit.

At the 5% level, is there evidence that the child's height significantly affects
how much the child weighs? If so, what happens as the child grows taller?

Estimate the weight of a 10 year old child who is 52 inches tall.


2. The Years of Education of Mother and Child

Open the GSS91t data (H:\My Documents\spss data\Gss91t.sav)

Using the variables EDUC and MAEDUC, investigate whether the mother's
education has a significant effect on her child's.


3. Age and Occupational Prestige

Use the GSS91t data for this question.

Is the Occupational Prestige Score (PRESTG80) significantly affected by the
age of the respondent (AGE)?

Produce some Descriptive Statistics of PRESTG80 and compare them to
your Linear Regression. What values look similar?

How would you estimate the Occupational Prestige Score for a respondent
aged 32? How about one aged 67? (Hint: if AGE is not significant, then
taking age into account in the calculation will not make any real difference;
what value could you use instead for any age?)


Save your output as exer10.spo
