
Simple Linear Regression

Ms RU Cruz
Overview
1. Simple Regression Model
2. Least Squares Method
3. Coefficient of Determination
4. Model Assumptions
5. Testing for Significance
6. Residual Analysis
Example
MCAS periodically has a special week-long sale. As part of the advertising
campaign, MCAS runs one or more TV commercials during the weekend preceding
the sale. Data from a sample of 5 previous weekly sales are shown:

Week   TV Ads   Cars Sold
1      1        14
2      3        24
3      2        18
4      1        17
5      3        27
Questions
1. Develop a scatter diagram for these data.
2. What is the slope using the simple linear regression model?
3. What is the y-intercept?
4. What is the estimated regression equation?
Scatter Diagram
[Figure: scatter plot of Cars Sold (y-axis, 0-30) against TV Ads (x-axis, 0-3.5)]
Computation Table for Slope
Week   TV Ads (x)   Cars Sold (y)   xi - x̄   yi - ȳ   (xi - x̄)(yi - ȳ)   (xi - x̄)²
1      1            14              -1        -6        6                    1
2      3            24               1         4        4                    1
3      2            18               0        -2        0                    0
4      1            17              -1        -3        3                    1
5      3            27               1         7        7                    1
       Σx = 10      Σy = 100                            Σ = 20               Σ = 4
       x̄ = 2        ȳ = 20

2. The slope is
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 20/4 = 5
3. The y-intercept is
b0 = ȳ - b1·x̄ = 20 - 5(2) = 10
4. The estimated regression equation is
ŷ = b0 + b1x
ŷ = 10 + 5x
(in slope-intercept form y = mx + b: y = 5x + 10)
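
To make the arithmetic concrete, here is a minimal Python sketch of the same
least-squares computation (plain Python, data taken from the example above):

x = [1, 3, 2, 1, 3]          # TV ads
y = [14, 24, 18, 17, 27]     # cars sold

x_bar = sum(x) / len(x)      # 2.0
y_bar = sum(y) / len(y)      # 20.0

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2) = 20/4
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)

# b0 = y_bar - b1 * x_bar = 20 - 5(2)
b0 = y_bar - b1 * x_bar

print(f"y-hat = {b0:.0f} + {b1:.0f}x")   # y-hat = 10 + 5x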
Coefficient of Determination
Coefficient of Determination is a measure of
the goodness of fit of the estimated regression
equation. It can be interpreted as the proportion
of the variability in the dependent variable y that
is explained by the estimated regression
equation.
The ith residual is the difference between the observed value of the
dependent variable and the value predicted using the estimated regression
equation. For the ith observation, the ith residual is yi - ŷi.
Coefficient of Determination
Relationship among SST, SSR, and SSE:
SST = SSR + SSE
Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²
Coefficient of Determination
r² = SSR/SST
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Calculation for SSE
Week   TV Ads (x)   Cars Sold (y)   Predicted Sales ŷ = 10 + 5x   Residual yi - ŷi   (yi - ŷi)²
1      1            14              15                            -1                 1
2      3            24              25                            -1                 1
3      2            18              20                            -2                 4
4      1            17              15                             2                 4
5      3            27              25                             2                 4
                                                                                     SSE = 14
Calculation for SST
Week   TV Ads (x)   Cars Sold (y)   yi - ȳ   (yi - ȳ)²
1      1            14              -6        36
2      3            24               4        16
3      2            18              -2         4
4      1            17              -3         9
5      3            27               7        49
       Σx = 10      Σy = 100                  SST = 114
       x̄ = 2        ȳ = 20
SSR = SST - SSE = 114 - 14 = 100
r² = SSR/SST = 100/114 = 0.8772
The regression relationship is very strong; 87.72% of the variability in the
number of cars sold can be explained by the linear relationship between the
number of TV ads and the number of cars sold.
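
The sums of squares can be verified with a short Python sketch, using the
fitted line ŷ = 10 + 5x from above:

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
y_bar = sum(y) / len(y)                                # 20.0
y_hat = [10 + 5 * xi for xi in x]                      # predicted sales

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # 14
sst = sum((yi - y_bar) ** 2 for yi in y)               # 114
ssr = sst - sse                                        # 100

print(f"r^2 = {ssr / sst:.4f}")                        # r^2 = 0.8772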
Coefficient of Determination
Excel Value Worksheet (showing r²)
[Figure: scatter plot of Cars Sold against TV Ads with fitted trendline
y = 5x + 10 and R² = 0.8772]
Model Assumptions
1. The error term ε is a random variable with a mean or expected value of zero.
2. The variance of ε, denoted by σ², is the same for all values of x.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable.
Testing for Significance
To test for a significant regression relationship, we must conduct a
hypothesis test to determine whether the value of β1 is zero.
The two most commonly used tests are the F test and the t test.
Both tests require an estimate of σ², the variance of ε in the regression
model. The mean square error (MSE) provides this estimate of σ², and the
notation s² is also used:
s² = MSE = SSE/(n - 2), where SSE = Σ(yi - ŷi)² = Σ(yi - b0 - b1xi)²
Standard Error of the Estimate
To estimate the standard deviation σ, simply take the square root of s².
The resulting s is called the standard error of the estimate:
s = √MSE = √(SSE/(n - 2))
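
For the MCAS data (SSE = 14, n = 5), a quick Python check of the standard
error of the estimate:

import math

sse, n = 14, 5
s = math.sqrt(sse / (n - 2))   # s = sqrt(MSE) = sqrt(14/3)
print(f"s = {s:.4f}")          # s = 2.1602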
Testing for Significance: t Test
Hypotheses: H0: β1 = 0
            Ha: β1 ≠ 0
Test statistic: t = b1 / s_b1, where s_b1 = s / √Σ(xi - x̄)²
Rejection rule:
Critical value approach: reject H0 if |t| ≥ t_(α/2),
where t_(α/2) is based on a t distribution with n - 2 degrees of freedom.
Steps of Significant t Tests
1. Determine the hypotheses
2. Specify the level of significance
3. Select the test statistic
t = b1 / s_b1
4. State the rejection rule
5. Compute the value of the test statistic
6. Determine conclusion via rejection rule
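
Applied to the MCAS example, a Python sketch of these steps (α = 0.05 is
assumed here, so the critical value is t_(0.025) = 3.182 with n - 2 = 3
degrees of freedom):

import math

x = [1, 3, 2, 1, 3]
b1, sse, n = 5, 14, 5

x_bar = sum(x) / len(x)
s = math.sqrt(sse / (n - 2))                              # standard error of estimate
s_b1 = s / math.sqrt(sum((xi - x_bar) ** 2 for xi in x))  # s / sqrt(4)
t = b1 / s_b1
print(f"t = {t:.2f}")   # t = 4.63 > 3.182, so reject H0: the slope is significant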
Testing for Significance: F Test
Hypotheses: H0: β1 = 0
            Ha: β1 ≠ 0
Test statistic: F = MSR/MSE
Rejection rule:
p-value approach: reject H0 if p-value ≤ α
Critical value approach: reject H0 if F ≥ F_α,
where F_α is based on an F distribution with 1 degree of freedom in the
numerator and n - 2 degrees of freedom in the denominator.
Steps of Significant F Tests
1. Determine the hypotheses
2. Specify the level of significance
3. Select the test statistic
F = MSR/MSE
4. State the rejection rule
5. Compute the value of the test statistic
6. Determine conclusion via rejection rule
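
The matching F test for the MCAS example, again assuming α = 0.05 (critical
value F_0.05 = 10.13 with 1 and 3 degrees of freedom):

ssr, sse, n = 100, 14, 5
msr = ssr / 1           # mean square due to regression
mse = sse / (n - 2)     # mean square error
F = msr / mse
print(f"F = {F:.2f}")   # F = 21.43 > 10.13, so reject H0 (note F = t^2)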
Hypothesis Test for Significance
r is the correlation coefficient for the sample. The correlation coefficient
for the population is ρ (rho).
For a two-tailed test for significance:
H0: ρ = 0 (the correlation is not significant)
Ha: ρ ≠ 0 (the correlation is significant)
The sampling distribution for r is a t distribution with n - 2 degrees of
freedom.
Standardized test statistic: t = r / √((1 - r²)/(n - 2))
Test of Significance
The correlation between the number of times absent and final grade is
r = -0.975. There were seven pairs of data. Test the significance of this
correlation; use α = 0.01.

1. Write the null and alternative hypotheses.
H0: ρ = 0 (the correlation is not significant)
Ha: ρ ≠ 0 (the correlation is significant)

2. State the level of significance.
α = 0.01

3. Identify the sampling distribution.
A t distribution with 5 degrees of freedom.
Rejection Regions

4. Find the critical value.
From the t table with 5 degrees of freedom and α = 0.01 (two-tailed), the
critical values are ±t0 = ±4.032.

5. Find the rejection region.
t < -4.032 or t > 4.032.

6. Find the test statistic.
t = r / √((1 - r²)/(n - 2)) = -0.975 / √((1 - 0.950625)/5) = -9.811

7. Make your decision.
t = -9.811 falls in the rejection region. Reject the null hypothesis.

8. Interpret your decision.
There is a significant negative correlation between the number of times
absent and final grades.
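
A short Python check of steps 6 and 7 for this example:

import math

r, n = -0.975, 7
t = r / math.sqrt((1 - r ** 2) / (n - 2))   # standardized test statistic
print(f"t = {t:.2f}")   # t = -9.81; |t| > 4.032, so reject H0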
The Line of Regression

Regression indicates the degree to which the variation in the dependent
variable y is related to, or can be explained by, the variation in the
independent variable x.
Once you know there is a significant linear correlation, you can write an
equation describing the relationship between the x and y variables. This
equation is called the line of regression or least squares line.
The equation of a line may be written as y = mx + b, where m is the slope
of the line and b is the y-intercept.

The line of regression is: ŷ = mx + b

The slope m is: m = (n·Σxy - Σx·Σy) / (n·Σx² - (Σx)²)

The y-intercept is: b = ȳ - m·x̄

(xi, yi) = a data point
(xi, ŷi) = a point on the line with the same x-value
di = yi - ŷi = a residual
Best-fitting straight line
[Figure: scatter plot of revenue (y-axis, 180-260) against Ad $ (x-axis,
1.5-3.0) with the fitted line]
Write the equation of the line of regression with x = number of absences
and y = final grade.

       x      y      xy     x²     y²
1      8      78     624    64     6084
2      2      92     184    4      8464
3      5      90     450    25     8100
4      12     58     696    144    3364
5      15     43     645    225    1849
6      9      74     666    81     5476
7      6      81     486    36     6561
Σ      57     516    3751   579    39898

Calculate m and b:
m = (7·3751 - 57·516) / (7·579 - 57²) = -3155/804 = -3.924
b = 516/7 - (-3.924)(57/7) = 105.667

The line of regression is: ŷ = -3.924x + 105.667
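
The same column totals can be plugged into the slope and intercept formulas
in Python as a check:

n = 7
sum_x, sum_y, sum_xy, sum_x2 = 57, 516, 3751, 579

m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # -3155/804
b = sum_y / n - m * (sum_x / n)                               # y_bar - m * x_bar
print(f"y-hat = {m:.3f}x + {b:.2f}")   # y-hat = -3.924x + 105.67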


The Line of Regression
m = -3.924 and b = 105.667
The line of regression is: ŷ = -3.924x + 105.667

[Figure: scatter plot of Final Grade (y-axis, 40-95) against Absences
(x-axis, 0-16) with the regression line]

Note that the point (x̄, ȳ) = (8.143, 73.714) is on the line.

Predicting y Values
The regression line can be used to predict values of y for values of x
falling within the range of the data.

The regression equation for number of times absent and final grade is:
ŷ = -3.924x + 105.667
Use this equation to predict the expected grade for a student with
(a) 3 absences, (b) 12 absences.

(a) ŷ = -3.924(3) + 105.667 = 93.895
(b) ŷ = -3.924(12) + 105.667 = 58.579
Strength of the Association
The coefficient of determination, r², measures the strength of the
association and is the ratio of explained variation in y to the total
variation in y.

The correlation coefficient of number of times absent and final grade is
r = -0.975. The coefficient of determination is r² = (-0.975)² = 0.9506.

Interpretation: About 95% of the variation in final grades can be explained
by the number of times a student is absent. The other 5% is unexplained and
could be due to sampling error or other variables such as intelligence,
amount of time studied, etc.
ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F             p-value
Regression            SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n - 2                MSE = SSE/(n - 2)
Total                 SST              n - 1
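
A minimal Python sketch that fills in this table for the MCAS data
(SSR = 100, SSE = 14, n = 5):

ssr, sse, n = 100, 14, 5
sst = ssr + sse
msr, mse = ssr / 1, sse / (n - 2)

print(f"Regression  SS={ssr:5}  df=1  MS={msr:.2f}  F={msr / mse:.2f}")
print(f"Error       SS={sse:5}  df={n - 2}  MS={mse:.2f}")
print(f"Total       SS={sst:5}  df={n - 1}")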
Residual Analysis
Residual analysis: the analysis of the residuals used to determine whether
the assumptions made about the regression model appear to be valid. Residual
analysis is also used to identify outliers and influential observations.
Outlier: a data point or observation that does not fit the trend shown by
the remaining data.
Influential observation: an observation that has a strong influence or
effect on the regression results.
Residual plot: a graphical representation of the residuals that can be used
to determine whether the assumptions made about the regression model appear
to be valid.
Residual Analysis
If the assumptions about the error term ε appear questionable, the
hypothesis tests about the significance of the regression relationship and
the interval estimation results may not be valid. The residuals provide the
best information about ε. Much of residual analysis is based on an
examination of graphical plots.
Residual for observation i: yi - ŷi
If the assumption that the variance of ε is the same for all values of x is
valid, and the assumed regression model is an adequate representation of the
relationship between the variables, then the residual plot should give an
overall impression of a horizontal band of points.
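
As an illustration of the residual plots shown next, here is one way to draw
a residual plot against x for the MCAS data, assuming matplotlib is
available:

import matplotlib.pyplot as plt

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
residuals = [yi - (10 + 5 * xi) for xi, yi in zip(x, y)]  # yi - y-hat_i

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")   # a healthy plot hugs this line in a band
plt.xlabel("TV Ads (x)")
plt.ylabel("Residual (y - y-hat)")
plt.show()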
Residual Plot Against x
[Figure: residuals (y - ŷ) vs. x - Good Pattern: horizontal band of points]
Residual Plot Against x
[Figure: residuals (y - ŷ) vs. x - Nonconstant Variance]
Residual Plot Against x
[Figure: residuals (y - ŷ) vs. x - Model Form Not Adequate]
Standardized Residuals
Standardized residual: the value obtained by dividing the residual by its
standard deviation.
Normal probability plot: a graph of the standardized residuals plotted
against the values of the normal scores. This plot helps determine whether
the assumption that the error term has a normal probability distribution
appears to be valid.
High leverage points: observations with extreme values for the independent
variables.
Standardized Residual
Standardized residual for observation i:
(yi - ŷi) / s_(yi - ŷi)
Standard deviation of the ith residual:
s_(yi - ŷi) = s·√(1 - hi)
Leverage of observation i:
hi = 1/n + (xi - x̄)² / Σ(xi - x̄)²
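
A minimal Python sketch of these three formulas applied to the MCAS data:

import math

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
n = len(x)
x_bar = sum(x) / n
ssx = sum((xi - x_bar) ** 2 for xi in x)   # sum of (xi - x_bar)^2 = 4
s = math.sqrt(14 / (n - 2))                # standard error of the estimate

for xi, yi in zip(x, y):
    h = 1 / n + (xi - x_bar) ** 2 / ssx    # leverage h_i
    resid = yi - (10 + 5 * xi)             # y_i - y-hat_i
    std_resid = resid / (s * math.sqrt(1 - h))
    print(f"x={xi}: leverage={h:.2f}, standardized residual={std_resid:+.2f}")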
