
Chapter 8

SIMPLE LINEAR
REGRESSION ANALYSIS
Correlation Analysis (p. 579)
Given: Bivariate data = {(X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ)}

• Association in bivariate data means a systematic connection between changes in one variable and changes in the other.
• If both variables were measured on at least an ordinal scale, then the direction of the association can be described as either positive or negative.
• When an increase in one variable tends to be accompanied by an increase in the other, the variables are positively associated.
• On the other hand, when an increase in one variable tends to be accompanied by a decrease in the other, the variables are negatively associated.
Correlation Analysis
Objective of Correlation Analysis: to measure the strength and
direction of the linear association between two variables.
Scatter Diagram
• The first step in correlation analysis is to plot the individual pairs of observations on a two-dimensional graph called the scatter diagram.
• This will help you visualize the possible underlying linear relationship between the two variables.
• Using Microsoft Excel:
Step 1. Highlight the data.
Step 2. Click Insert, then choose Scatter.
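As an alternative to the Excel steps above, a minimal sketch in base R (used later in these slides); the x and y values here are hypothetical:

x <- c(1.2, 2.4, 3.1, 4.8, 5.0)   # hypothetical X values
y <- c(2.0, 4.1, 5.9, 8.3, 9.7)   # hypothetical Y values
plot(x, y, main = "Scatter Diagram", xlab = "X", ylab = "Y")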
Linear Correlation Coefficient ρ (p. 580)
The linear correlation coefficient, denoted by ρ (Greek letter rho), is a measure of the strength of the linear relationship existing between two variables, say X and Y, that is independent of their respective scales of measurement.

\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}
Properties of ρ (p. 580)
• A linear correlation coefficient can only assume values in [−1, +1].
• The sign of ρ describes the direction of the linear relationship between X and Y.
➢A positive value for ρ means that the line slopes upward to the right, and so as X increases, the value of Y increases.
➢A negative value for ρ means that the line slopes downward to the right, and so as X increases, the value of Y decreases.
• If ρ = 0, then there is no linear correlation between X and Y. However, this does not mean a lack of association. It is possible to obtain a zero correlation even if the two variables are related, though their relationship is nonlinear, such as a quadratic relationship.
Properties of ρ
 When  is -1 or 1, there is perfect linear relationship between X and
Y and all the points (x, y) fall on a line whose slope is not equal to
0. ( is undefined when the slope is 0 since Var(Y)=0 in this case).
A  that is close to 1 or -1 indicates a strong linear relationship.
 A strong linear relationship does not necessarily imply that X
causes Y or Y causes X. It is possible that a third variable may
have caused the change in both X and Y, producing the observed
relationship.
➢This is an important point that we should always remember
when studying not just relationships, but also comparing two
populations, say by using a t-test.
Properties of ρ
➢Unless we collected our data using a well-designed experiment
where we were able to randomize the treatments and
substantially control the extraneous variables, we need to use
the more complex “causal” models to study causality.
➢Otherwise, we just describe the observed relationship or the
observed difference between means.
Pearson Product Moment Correlation (p. 581)
The Pearson product moment correlation coefficient between X
and Y, denoted by r, is defined as:
r = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right]\left[n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}

This is a point estimator of ρ.
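As a sketch of how r is computed, the formula above can be coded directly in R and checked against the built-in cor(); the data here are hypothetical:

x <- c(1.2, 2.4, 3.1, 4.8, 5.0)   # hypothetical data
y <- c(2.0, 4.1, 5.9, 8.3, 9.7)
n <- length(x)
r <- (n*sum(x*y) - sum(x)*sum(y)) /
  sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
r           # computed from the textbook formula
cor(x, y)   # R's built-in Pearson correlation gives the same value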
Examples
[Figure: four scatter diagrams illustrating r = 1, r = −1, r = 0.87, and r = 0.]
Remark
• If r = 1, then all the data points lie on a line whose slope is positive.
• If r = −1, then all the data points lie on a line whose slope is negative.
• If r = 0, then we cannot conclude that all the data points lie on a line whose slope is 0 (i.e., a horizontal line).
Example 4

X     Y     XY    X²    Y²
-4    4    -16    16    16
-2    2     -4     4     4
 0    0      0     0     0
 2    2      4     4     4
 4    4     16    16    16
Sum   0 | 12 | 0 | 40 | 40

[Figure: scatter diagram of the five points, showing a V-shaped (nonlinear) pattern.]

r = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right]\left[n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}} = \frac{(5)(0) - (0)(12)}{\sqrt{[(5)(40) - (0)^2][(5)(40) - (12)^2]}} = 0
Test of Hypothesis
Ho: ρ = 0 vs Ha: ρ ≠ 0

Test statistic:
T = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}

Critical region: |t| > t_{α/2}(v = n − 2)
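A sketch of this test in R with hypothetical values r = 0.5 and n = 20 (cor.test() on the raw data performs the same test automatically):

r <- 0.5; n <- 20                         # hypothetical values
t_stat <- r / sqrt((1 - r^2) / (n - 2))   # test statistic T
t_crit <- qt(0.975, df = n - 2)           # critical value for alpha = 0.05
abs(t_stat) > t_crit                      # TRUE means reject Ho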
• Even if we are able to establish that there is a linear relationship between two variables, we still do not conclude that X causes Y.
• There may be a third variable that is correlated with both X and Y that is responsible for the apparent correlation.
Example
• Example 18.2 (page 581) and Example 18.3 (page 583)
• Exercise 1 (page 584). Suppose a breeder of Thoroughbred horses wishes to determine whether a linear relationship exists between the gestation period and the length of life of a horse. The breeder collected the following data (next slide) from various stables across the region.
a) Plot a scatter diagram of the data on the gestation period and the length of life of a horse. Does there appear to be a linear relationship between the variables?
b) Compute the Pearson correlation coefficient between the gestation period and the length of life of a horse. What conclusion can you draw based on the value of the correlation coefficient? Does this support your observation in (a)?
c) Test whether ρ is different from 0 using a 0.05 level of significance.
Example
Horse   X: Gestation Period (in days)   Y: Length of Life (in years)   X²        Y²         XY
1       416                             24                             173056    576        9984
2       280                             25.75                          78400     663.0625   7210
3       290                             20                             84100     400        5800
4       309                             22                             95481     484        6798
5       365                             20                             133225    400        7300
6       356                             21.5                           126736    462.25     7654
7       403                             23.5                           162409    552.25     9470.5
8       300                             21.75                          90000     473.0625   6525
9       265                             21                             70225     441        5565
10      400                             21                             160000    441        8400
Total   3384                            220.5                          1173632   4892.625   74706.5
r = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right]\left[n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}} = \frac{(10)(74706.5) - (3384)(220.5)}{\sqrt{[(10)(1173632) - (3384)^2][(10)(4892.625) - (220.5)^2]}} = 0.0956
Example
Ho: ρ = 0 vs. Ha: ρ ≠ 0 at α = 0.05.

Test statistic:
T = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}

Decision rule: Reject Ho if |t| > t.025(v = 8); that is, reject Ho if t > 2.306 or t < −2.306.

Computed value of the test statistic:
t = \frac{0.0956}{\sqrt{\dfrac{1 - (0.0956)^2}{8}}} = 0.2718

Since |0.2718| ≯ 2.306, do not reject Ho. There is insufficient evidence at the 0.05 level of significance to conclude that there is a linear relationship between gestation period and length of life of horses.
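The same conclusion can be checked in R with cor.test() on the horse data from the slide:

gestation <- c(416, 280, 290, 309, 365, 356, 403, 300, 265, 400)
life <- c(24, 25.75, 20, 22, 20, 21.5, 23.5, 21.75, 21, 21)
cor.test(gestation, life)   # r = 0.0956, t = 0.27, p-value well above 0.05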
Regression Analysis
Regression analysis is used to:
• predict the value of a dependent variable based on the value of at least one independent variable;
• explain the impact of changes in an independent variable on the dependent variable.

Dependent/Response variable: the variable we wish to explain.
Independent/Explanatory variable: the variable used to explain the dependent variable.
Simple Linear Regression Model (p. 585)
The simple linear regression model (SLRM) is given by the equation
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
where
Yᵢ is the value of the dependent/response variable (continuous) for the ith element,
Xᵢ is the value of the independent/explanatory variable (continuous) for the ith element,
β₀ is a regression coefficient that gives the Y-intercept of the regression line,
β₁ is a regression coefficient that gives the slope of the line,
εᵢ is the random error term for the ith element, where the εᵢ's are independent, normally distributed with mean 0 and variance σ² for i = 1, 2, …, n,
n is the number of elements in the sample.
Linear Regression Assumptions
• Error values (ε) are statistically independent of one another.
• Error values are normally distributed for any given value of X.
• The probability distribution of the errors has a mean of 0 and constant variance σ².
• In short, εᵢ ~ Normal(0, σ²) for i = 1, 2, …, n, and the εᵢ's are independent.

These assumptions imply that for any given value of X, say x (i.e., X = x),
Y ~ Normal(β₀ + β₁x, σ²)
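These assumptions are easy to visualize by simulation; a minimal R sketch with assumed values β₀ = 2, β₁ = 0.5, and σ = 1:

set.seed(115)                         # arbitrary seed for reproducibility
x <- runif(100, min = 0, max = 10)
eps <- rnorm(100, mean = 0, sd = 1)   # independent Normal(0, sigma^2) errors
y <- 2 + 0.5*x + eps                  # Y ~ Normal(beta0 + beta1*x, sigma^2)
plot(x, y)
abline(a = 2, b = 0.5, col = "blue")  # the true regression line E(Y | X = x)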
Notes
• β₀ + β₁Xᵢ is also called the linear component of the SLRM, and εᵢ the random error component of the SLRM.
• The SLRM can also be written as Y = β₀ + β₁X + ε (without the subscript).
• The model tells us that two or more observations having the same value for X will not necessarily have the same value for Y.
• The expected value of Y for a given value of X, say xᵢ, is E(Y | X = xᵢ) = β₀ + β₁xᵢ.
Notes
• The random error term, ε, may be thought of as representing the effect of other factors that are not explicitly stated in the model but do affect the response variable to some extent (e.g., omitted variables, unidentified variables).
➢The response variable may not be adequately predicted by a single explanatory variable. (For example, an exam score does not depend solely on the number of hours spent studying.)
➢In real-life applications, we use the Multiple Linear Regression Model (MLRM). [One dependent variable, many independent variables]
Remarks
• Blue line: E(Y given that X = x) = μ_{Y|x} = β₀ + β₁x, since E(ε) = 0.
• The random error, εᵢ, is the vertical gap between the ith observation and the blue line. εᵢ is a random variable, and we will never know its realized value because β₀ and β₁ are unknown.
• The random error term accounts for all other factors that affect the value of Y that cannot be explained by the relationship between X and Y. This includes all other variables related to Y and also measurement errors.
Remarks
• We require that the εᵢ's are independent random variables. For any fixed value of X, these random variables are normally distributed. The mean of any εᵢ is 0 and its variance is σ². (That is, we do not allow the variation of the εᵢ's to differ across the different values of X.)
• Consequently, for a fixed value of X = x, Y ~ Normal(β₀ + β₁x, σ²).
• β₀ is the mean of Y when X = 0.
• β₁ is the change in the average value of Y for every unit increase in the value of X.
Steps in Doing SLR Analysis (p. 587)
1. Obtain the equation that best fits the data.
2. Evaluate the equation to determine strength of the relationship
for prediction and estimation.
3. Determine if assumptions on the error terms are satisfied.
(Diagnostic Checking)
4. If the model fits the data adequately, use the equation for
prediction and for describing the nature of relationship between
the variables.
Estimation Using the Method of Least Squares (p. 588)
The estimated regression equation is given by:
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X
We use this formula to compute the predicted value of Y for a given value of X. We also use it to compute the predicted value of the ith observation in the sample data as follows:
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i
The method of least squares derives the values of \hat{\beta}_0 and \hat{\beta}_1 that minimize
\sum_{i=1}^{n} \left(Y_i - (\beta_0 + \beta_1 X_i)\right)^2 = \sum_{i=1}^{n} \varepsilon_i^2
Estimation Using the Method of Least Squares
Based on this criterion, the following formulas for \hat{\beta}_0, the estimate for β₀, and \hat{\beta}_1, the estimate for β₁, are obtained:

\hat{\beta}_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}

Exercise: Show the derivation of \hat{\beta}_0 and \hat{\beta}_1. (A hint follows below.)
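As a hint for the exercise (a sketch, not the full derivation): differentiate the least-squares criterion with respect to β₀ and β₁, set the derivatives to zero, and solve the resulting normal equations:

\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 = 0 \;\Longrightarrow\; \sum_{i=1}^{n} Y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} X_i

\frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 = 0 \;\Longrightarrow\; \sum_{i=1}^{n} X_i Y_i = \beta_0 \sum_{i=1}^{n} X_i + \beta_1 \sum_{i=1}^{n} X_i^2

Solving these two equations simultaneously for β₀ and β₁ yields the formulas above.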
Example
[Figure: the fitted line Ŷ = β̂₀ + β̂₁X and the true line μ_{Y|x} = β₀ + β₁x plotted together.]
• The random error term: the vertical gap between the ith observation and μ_{Y|Xᵢ} = β₀ + β₁Xᵢ:
εᵢ = Yᵢ − (β₀ + β₁Xᵢ)
• The residual: the vertical gap between the ith observation and Ŷᵢ = β̂₀ + β̂₁Xᵢ:
eᵢ = Yᵢ − (β̂₀ + β̂₁Xᵢ)
Example
• Example 18.5 (page 589)
• Example 11.12 (Mendenhall/Scheaffer). The data below represent a sample of mathematics achievement scores and calculus grades for 10 independently selected college freshmen. Plot the scatter diagram and use the method of least squares to fit a line to the given 10 points.

Student   Math Achievement Score   Calculus Grade
1         39                       65
2         43                       78
3         21                       52
4         64                       82
5         57                       92
6         47                       89
7         28                       73
8         75                       98
9         34                       56
10        52                       75
Example
[Figure: "Scatterplot of Math Achievement Score and Calculus Grade" — Math Achievement Score (X) on the horizontal axis, Calculus Grade (Y) on the vertical axis.]
Example

Student   X     Y     X²      XY
1         39    65    1521    2535
2         43    78    1849    3354
3         21    52    441     1092
4         64    82    4096    5248
5         57    92    3249    5244
6         47    89    2209    4183
7         28    73    784     2044
8         75    98    5625    7350
9         34    56    1156    1904
10        52    75    2704    3900
Total     460   760   23634   36854

\hat{\beta}_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} = \frac{(10)(36854) - (460)(760)}{(10)(23634) - (460)^2} = 0.766

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 76 - (0.766)(46) = 40.78

Estimated regression equation:
Ŷ = 40.78 + 0.766X
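A sketch reproducing this fit in R, both from the formulas and with lm() (data from the slide):

math <- c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52)   # X: math achievement score
calc <- c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)   # Y: calculus grade
n <- length(math)
b1 <- (n*sum(math*calc) - sum(math)*sum(calc)) / (n*sum(math^2) - sum(math)^2)
b0 <- mean(calc) - b1*mean(math)
c(b0, b1)              # approximately 40.78 and 0.766
coef(lm(calc ~ math))  # lm() returns the same least-squares estimates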
Using the Estimated Regression Equation
Estimated regression equation:
Ŷ = 40.78 + 0.766X
where Y = calculus grade and X = math achievement score.

If the model fits the data adequately, we can use this equation for prediction purposes and for describing the nature of the relationship between the variables.

• We can only predict the value of Y within the range of values of X in our data set, that is, for math achievement scores (X) from 21 to 75 only. For example, when X = 50, we predict the calculus grade to be Ŷ = 40.78 + (0.766)(50) = 79.08.
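In R, predict() gives the same fitted value (a self-contained sketch repeating the data from the earlier fit):

math <- c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52)
calc <- c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
fit <- lm(calc ~ math)
predict(fit, newdata = data.frame(math = 50))   # about 79.1, matching the hand computation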
Using the Estimated Regression Equation
Interpretation of β̂₀ and β̂₁
• β̂₁ = 0.766 means that as the math achievement score (X) increases by 1 unit, the mean calculus grade (Y) is estimated to increase by 0.766.
• β̂₀ = 40.78 has no meaningful interpretation because X = 0 is not within the range of values we used in our estimation.
Confidence Interval Estimation
An estimator for σ²:
MSE = \frac{SSE}{n - 2} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}

Confidence interval estimator for β₁:
\left(\hat{\beta}_1 - t_{\alpha/2}(v = n - 2)\, S_{\hat{\beta}_1},\; \hat{\beta}_1 + t_{\alpha/2}(v = n - 2)\, S_{\hat{\beta}_1}\right)
where
S_{\hat{\beta}_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}}

Confidence interval estimator for β₀:
\left(\hat{\beta}_0 - t_{\alpha/2}(v = n - 2)\, S_{\hat{\beta}_0},\; \hat{\beta}_0 + t_{\alpha/2}(v = n - 2)\, S_{\hat{\beta}_0}\right)
where
S_{\hat{\beta}_0} = \sqrt{\frac{MSE \sum_{i=1}^{n} X_i^2}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}}
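A sketch of both intervals in R: confint() on the fitted model gives them directly, and S for β̂₁ can be checked against the formula (continuing the math/calculus example):

math <- c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52)
calc <- c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
fit <- lm(calc ~ math)
confint(fit, level = 0.95)                             # 95% CIs for beta0 and beta1
mse <- sum(residuals(fit)^2) / (length(math) - 2)      # MSE = SSE / (n - 2)
sqrt(mse / (sum(math^2) - sum(math)^2/length(math)))   # S for beta1-hat: 0.174985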
Hypothesis Testing
To test if there is a significant linear relationship between Y and X:
Ho: β₁ = 0 vs Ha: β₁ ≠ 0

Test statistic:
T = \frac{\hat{\beta}_1}{S_{\hat{\beta}_1}}
where
S_{\hat{\beta}_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}}

Critical region: |t| > t_{α/2}(v = n − 2)
Example
Student   Math Score (X)   Calculus Grade (Y)   X²      Predicted Ŷ   Residual      Squared Residual
1         39               65                   1521    70.6410671    -5.6410671    31.821638
2         43               78                   1849    73.7033145    4.29668553    18.46150654
3         21               52                   441     56.8609539    -4.86095392   23.62887302
4         64               82                   4096    89.7801132    -7.78011318   60.53016105
5         57               92                   3249    84.4211803    7.578819725   57.43850843
6         47               89                   2209    76.7655618    12.23443816   149.681477
7         28               73                   784     62.2198868    10.78011318   116.2108401
8         75               98                   5625    98.2012935    -0.20129345   0.040519054
9         34               56                   1156    66.8132579    -10.8132579   116.926546
10        52               75                   2704    80.5933711    -5.59337106   31.2857998
Total     460              760                  23634                               SSE = 606.025869
Example
Predicted Y: Ŷ = 40.78 + 0.766X;  Residuals: e = Y − Ŷ

Ho: β₁ = 0 vs Ha: β₁ ≠ 0 at α = 0.05

Test statistic:
T = \frac{\hat{\beta}_1}{S_{\hat{\beta}_1}} = \frac{0.766}{0.174985} = 4.375
where
S_{\hat{\beta}_1} = \sqrt{\frac{SSE/(n - 2)}{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}} = \sqrt{\frac{606.025869/8}{23634 - (460)^2/10}} = 0.174985

Critical region: |t| > t_{α/2}(v = n − 2); that is, t > 2.306 or t < −2.306.
Since 4.375 > 2.306, reject Ho. At the 0.05 level of significance, there is a significant linear relationship between math achievement score and calculus grade.
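summary() reports this same t test in R (continuing the example):

math <- c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52)
calc <- c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
summary(lm(calc ~ math))$coefficients   # slope row: estimate 0.766, SE 0.175, t = 4.375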
Coefficient of Determination (p. 593)
Definition 18.4
The coefficient of determination, denoted by R², is defined as the proportion of the variability in the observed values of the response variable that can be explained by the explanatory variable through their linear relationship.
Remarks
• We can use the coefficient of determination to assess the goodness of fit of the linear regression model.
• The realized value of the coefficient of determination ranges from 0 to 1. Usually, this value is expressed as a percentage, so that we may interpret it as the percentage of the variation in the values of Y that is explained by the explanatory variable X through the model.
• If the model has perfect predictability, then R² = 1. If the model has no predictive capability, then R² = 0.
Relationship between r and β̂₁

\hat{\beta}_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}

r = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right]\left[n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}}

Note: (Verify!)
r = \hat{\beta}_1 \sqrt{\frac{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}}
Computing R²

Student   X   Y   X²   Y²   XY
1 39 65 1521 4225 2535
2 43 78 1849 6084 3354
3 21 52 441 2704 1092
4 64 82 4096 6724 5248
5 57 92 3249 8464 5244
6 47 89 2209 7921 4183
7 28 73 784 5329 2044
8 75 98 5625 9604 7350
9 34 56 1156 3136 1904
10 52 75 2704 5625 3900
Total 460 760 23634 59816 36854
Computing R²

r = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left[n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right]\left[n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right]}} = \frac{(10)(36854) - (460)(760)}{\sqrt{[(10)(23634) - (460)^2][(10)(59816) - (760)^2]}} = 0.8398

R²(100%) = (0.8398)²(100%) = 70.52%

70.52% of the variability of the grades in calculus can be explained by the math achievement scores.
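The same R² in R, either as the squared Pearson correlation or from summary():

math <- c(39, 43, 21, 64, 57, 47, 28, 75, 34, 52)
calc <- c(65, 78, 52, 82, 92, 89, 73, 98, 56, 75)
cor(math, calc)^2                    # about 0.7052
summary(lm(calc ~ math))$r.squared   # same value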
Software Output (MS Excel)
A real estate agent wishes to examine the
relationship between the selling price of a home
and its size (measured in square feet). A random
sample of 10 houses is selected.
–Dependent variable (𝑌) = house price in $1000s
–Independent variable (𝑋) = square feet
Software Output (R/RStudio)
house.size <- c(1400,1600,1700,1875,1100,1550,2350,2450,1425,1700)
house.price <- c(245,312,279,308,199,219,405,324,319,255)
summary(lm(house.price ~ 1 + house.size))

##
## Call:
## lm(formula = house.price ~ 1 + house.size)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.388 -27.388 -6.388 29.577 64.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.24833 58.03348 1.693 0.1289
## house.size 0.10977 0.03297 3.329 0.0104 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41.33 on 8 degrees of freedom
## Multiple R-squared: 0.5808, Adjusted R-squared: 0.5284
## F-statistic: 11.08 on 1 and 8 DF, p-value: 0.01039
Software Output (SAS)
PROC REG DATA = stat115;
MODEL house_price = house_size;
RUN;
QUIT;
Software Output (SPSS)
Software Output (Stata)
. regress houseprice housesize

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(1, 8)       =   11.08
       Model |  18934.9348     1  18934.9348           Prob > F      =  0.0104
    Residual |  13665.5652     8  1708.19565           R-squared     =  0.5808
-------------+------------------------------           Adj R-squared =  0.5284
       Total |     32600.5     9  3622.27778           Root MSE      =   41.33

  houseprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   housesize |   .1097677   .0329694     3.33   0.010     .0337401    .1857954
       _cons |   98.24833   58.03348     1.69   0.129    -35.57711    232.0738
