
Regression Analysis

The objective of many investigations is to understand and explain the relationship among variables.
Frequently, one wants to know how and to what extent a certain variable (response variable) is related to a
set of other variables (explanatory variables).
Regression analysis helps us to determine the nature and the strength of relationship among variables.
Types of relationship:
i) Deterministic relationship also called functional relationship
ii) Probabilistic relationship also called statistical relationship
In a deterministic relationship, the relationship between two variables is known exactly, such as:
a) Area of a circle = πr²
b) F = k(m₁m₂/r²) (Newton's law of gravitation)
c) The relationship between dollar sales (Y) of a product sold at a fixed price and the number of units sold.
In a statistical relationship the relation between variables is not known exactly, and we have to approximate the relationship and develop models that characterize its main features. Regression analysis is concerned with developing such “approximating” models.
For example, in a chemical process the yield of product is related to the operating temperature, it may be
of interest to build a model relating yield to temperature and then use the model for prediction, process
optimization, or process control.
The word regression is used to investigate the dependence of one variable, called the dependent variable and denoted by Y, on one or more other variables, called independent variables and denoted by X's, and provides an equation to be used for estimating or predicting the average value of the dependent variable from known values of the independent variables. When we study the dependence of a variable on a single independent variable, it is called simple regression, whereas the dependence of a variable on two or more independent variables is called multiple regression. When the parameters in the model appear in linear form, we say the model is linear.
The dependent variable is also called the predictand, the response, or the regressand, whereas the independent variable is also called the predictor, the explanatory, or the regressor variable.
The regression analysis is generally classified into two kinds.
1. Linear Regression
Simple Linear Regression
Multiple Linear Regression
Curvilinear Regression
2. Nonlinear Regression
Intrinsically Linear
Intrinsically Non-Linear
Linear:- The regression model is linear if the parameters in the model appear in linear form (that is, no parameter appears as an exponent or is multiplied or divided by any other parameter); otherwise the model is non-linear.
Suppose
Y = α + β₁X₁ + β₂X₂ + ε, where α and the β's are parameters. It is a linear model.
But
Y = αX^β or Y = αβ^X is non-linear.
Non-Linear Model:- A non-linear model that can be linearized (that is, converted into a linear model) by an appropriate transformation is called intrinsically linear, and one that cannot be so transformed is called intrinsically non-linear.
e.g. Y = αX^β; apply log on both sides: Log(Y) = Log(α) + β·Log(X)
Y = αβ^X; apply log on both sides: Log(Y) = Log(α) + X·Log(β)
Regressor:- The variable that forms the basis of estimation or prediction is called the regressor. It is also called the independent, explanatory, controlled, or predictor variable, usually denoted by X.

Regressand:- The variable whose values depend upon the known values of the independent variable is called the regressand. It is also called the response, dependent, or random variable, usually denoted by Y.
In simple regression, the dependence of the response variable (Y) is investigated on only one regressor (X). If the relationship between these variables can be described by a straight line, it is termed simple linear regression.
The population simple linear regression model is defined as:

Y = β₀ + β₁X + ε        Population Regression Model

μ(Y|X) = β₀ + β₁X        Population Regression Line

where β₀ and β₁ are the population regression coefficients and εᵢ is a random error peculiar to the i-th observation. Thus, each response is expressed as the sum of a value predicted from the corresponding X, plus a random error.
The sample regression equation is an estimate of the population regression equation. Like any other
estimate, there is an uncertainty associated with it.

Ŷ = b₀ + b₁X        Sample Regression Line

where
b₀: Y intercept
b₁: slope of the regression line
b₀ and b₁ are also called regression coefficients; X is the independent variable and Y is the dependent variable.
This model is said to be simple (because there is only one independent variable), linear in the parameters, and linear in the independent variable (as X appears in the first power, not X² or X³).

How to identify the relationship between variables


In order to begin a regression analysis, a useful tool is to plot Y versus X. This plot is called a scatter plot and may suggest what type of mathematical function would be appropriate for summarizing the data. A variety of functions are useful in fitting models to data.

LEAST SQUARE LINE


After using scatter diagram to illustrate the relationship between independent and dependent variable, the
next step is to specify the mathematical formulation of the linear regression model, which provides the
basis for statistical analysis. In scatter plot the observed data points do not all fall on a straight line but
cluster about it. Many lines can be drawn through the data points; the problem is to select among them.
The method of LEAST SQUARES results in the line that minimizes the sum of squared vertical distances from the observed data points to the line (i.e. the squared random errors). Any other line has a larger sum.
A least square line is described in terms of its Y-intercept (the height at which it intercepts the Y-axis)
and its slope (the angle of the line). The line can be expressed by the following relation
Y = a + bX  or  Ŷ = b₀ + b₁X    (Estimated regression of Y on X)
where
b = S(XY)/S(XX), called the slope of the line
a = Ȳ − b·X̄, called the intercept of the line
In other words,
b₁ = S(XY)/S(XX)
b₀ = Ȳ − b₁·X̄
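A minimal computational sketch of these two formulas in Python (NumPy assumed available; the function name is ours, not from any particular package):

```python
import numpy as np

def least_squares_line(x, y):
    """Slope and intercept of the least-squares line of y on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))  # S(XY)
    sxx = np.sum((x - x.mean()) ** 2)              # S(XX)
    b1 = sxy / sxx                                 # slope
    b0 = y.mean() - b1 * x.mean()                  # intercept
    return b0, b1
```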

Example:- The following data are sparrow wing lengths (in cm) at various ages (in days) after hatching.
Wing Length (Y)   Age (X)   XY     X²     Y²      Ŷ       e=Y−Ŷ    e²
1.4 3 4.2 9 1.96 1.525 -0.125 0.015625
1.5 4 6.0 16 2.25 1.795 -0.295 0.087025
2.2 5 11 25 4.84 2.065 0.135 0.018225
2.4 6 14.4 36 5.76 2.335 0.065 0.004225
3.1 8 24.8 64 9.61 2.875 0.225 0.050625
3.2 9 28.8 81 10.24 3.145 0.055 0.003025
3.2 10 32.0 100 10.24 3.415 -0.215 0.046225
3.9 11 42.9 121 15.21 3.685 0.215 0.046225
4.1 12 49.2 144 16.81 3.955 0.145 0.021025
4.7 14 65.8 196 22.09 4.495 0.205 0.042025
4.5 15 67.5 225 20.25 4.765 -0.265 0.070225
5.2 16 83.2 256 27.04 5.035 0.165 0.027225
5.0 17 85.0 289 25.00 5.305 -0.305 0.093025
44.4 130 514.80 1562 171.3 44.395 0.005 0.525
(i):- Draw scatter plot for the data
(ii):- Fit simple linear regression and interpret the parameters
(iii):-Find Standard error of estimate, SE(b0) and SE(b1).
(iv):- Test the hypothesis that there is no linear relation between Y and X, i.e. β₁ = 0.
(v):- Test the hypothesis that β₀ = 0.95.
(vi):- Construct 90% C.I's for the regression parameters.
(vii):- Perform the analysis of variance. Calculate the coefficient of determination and interpret it.
(viii):- Test the hypothesis that the mean wing length of 13-day-old birds in the population is 4 cm. Also find a 95% C.I for the mean value of Y when X = 13.
(ix):- Test the hypothesis that the wing length of one 13-day-old bird in the population is 4.2 cm. Also construct a 95% C.I for a single value of Y when X = 13.

Solution:-

[Scatter plot: wing length (cm) versus age (days)]

X̄ = 10,  Ȳ = 3.415

S(XY) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = ΣXY − (ΣX)(ΣY)/n = 70.8
S(XX) = Σ(Xᵢ − X̄)² = ΣX² − (ΣX)²/n = 262
S(YY) = Σ(Yᵢ − Ȳ)² = ΣY² − (ΣY)²/n = 19.6569

b₁ = S(XY)/S(XX) = 0.270 cm/day
b₀ = Ȳ − b₁·X̄ = 0.715 cm

So the estimated simple linear regression equation is
Ŷ = 0.715 + 0.270 X
Interpretation of the estimated regression parameters
• The value of b₁ = 0.270 indicates that the average wing length is expected to increase by 0.270 cm with each one-day increase in age.
The observed range of age (the explanatory variable) in the experiment was 3 to 17 days (i.e. the scope of the model); therefore it would be an unreasonable extrapolation to expect this rate of increase in wing length to continue as the number of days grows. It is safe to use the results of a regression only within the range of the observed values of the independent variable (i.e. within the scope of the model).
• In the regression equation, b₀ = 0.715 is the average wing length when age = 0 days. In this example, since the scope of the model does not cover X = 0, b₀ does not have any particular meaning as a separate term in the regression equation.
NOTE: Interpolation and Extrapolation
Interpolation is making a prediction within the range of values of the predictor in the sample used to
generate the model. Interpolation is generally safe. Extrapolation is making a prediction outside the range
of values of the predictor in the sample used to generate the model. The more removed the prediction is from the range of values used to fit the model, the riskier it becomes, because there is no way to check that the relationship continues to be linear.

Standard Error of Estimate


The observed values of (X, Y) do not all fall on the regression line but scatter away from it. The degree of scatter of the observed values about the regression line is measured by what is called the standard error of estimate or the standard deviation of regression, denoted by σₑ; its estimate is Sₑ.

Sₑ² = Σ(Y − Ŷ)²/(n − 2) = (ΣY² − b₀ΣY − b₁ΣXY)/(n − 2) = 0.525/11 = 0.048
Sₑ = 0.218

SE(b₀) = Sₑ·√(1/n + X̄²/S(XX)) = 0.148        SE(b₁) = Sₑ/√S(XX) = 0.0135
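These quantities can be reproduced with a short script; a sketch assuming NumPy, using the wing-length data from the example:

```python
import numpy as np

age  = np.array([3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17], float)
wing = np.array([1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0])

n = len(age)
sxx = np.sum((age - age.mean()) ** 2)                    # S(XX) = 262
sxy = np.sum((age - age.mean()) * (wing - wing.mean()))  # S(XY) = 70.8
b1 = sxy / sxx                                           # 0.270
b0 = wing.mean() - b1 * age.mean()                       # 0.715
resid = wing - (b0 + b1 * age)                           # e = Y - Yhat
se = np.sqrt(np.sum(resid ** 2) / (n - 2))               # Se = 0.218
se_b0 = se * np.sqrt(1 / n + age.mean() ** 2 / sxx)      # SE(b0) = 0.148
se_b1 = se / np.sqrt(sxx)                                # SE(b1) = 0.0135
```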

Inference in Simple Linear Regression (From samples to population)
Generally, more is sought in regression analysis than a description of observed data. One usually wishes
to draw inferences about the relationship of the variables in the population from which the sample was
taken. To draw inferences about population values based on sample results, the following assumptions are
needed.
 Linearity
 Equal Variances for error
 Independence of errors
 Normality of errors
The slope and the intercept estimated from a single sample typically differ from the population values and vary from sample to sample. To use these estimates for inference about the population values, the sampling distributions of the two statistics are needed. When the assumptions of the linear regression model are met, the sampling distributions of b₀ and b₁ are normal with means β₀ and β₁ and standard errors

SE(b₀) = Sₑ·√(1/n + X̄²/S(XX))        SE(b₁) = Sₑ/√S(XX)

Test of hypothesis for β₁

1) Construction of hypotheses
H₀: β₁ = 0
H₁: β₁ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
t = (b₁ − β₁)/SE(b₁) = (0.270 − 0)/0.0135 = 20.03
4) Decision Rule:- Reject H₀ if t_cal ≥ tα/2(n−2) = 2.201 or t_cal ≤ −tα/2(n−2) = −2.201
5) Result:- So reject H₀ and conclude that there is a significant relationship between age and wing length.
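A sketch of the same t test in Python (SciPy assumed available for the critical value; b₁ and SE(b₁) taken from the computations above):

```python
from scipy import stats

b1, se_b1, n = 0.270, 0.0135, 13
t_cal = (b1 - 0) / se_b1                      # about 20.0
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # t_{0.025}(11) = 2.201
reject = abs(t_cal) > t_crit                  # True, so reject H0
```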

Test of hypothesis for β₀

1) Construction of hypotheses
H₀: β₀ = 0.95
H₁: β₀ ≠ 0.95
2) Level of significance
α = 5%
3) Test statistic
t = (b₀ − β₀)/SE(b₀) = (0.715 − 0.95)/0.148 = −1.588
4) Decision Rule:- Reject H₀ if t_cal ≥ tα/2(n−2) = 2.201 or t_cal ≤ −tα/2(n−2) = −2.201
5) Result:- So don't reject H₀.
Confidence intervals for regression parameters
A statistic calculated from a sample provides a point estimate of the unknown parameter. A point estimate can be thought of as the single best guess for the population value. While the estimated value from the sample is typically different from the value of the unknown population parameter, the hope is that it isn't too far away. Based on the sample estimates, it is possible to calculate a range of values that, with a designated likelihood, includes the population value. Such a range is called a confidence interval.

STAT-602 [Muhammad Imran Khan is thankful for the contributors of these notes] Page 5
90% C.I for β₁
b₁ ± tα/2(n−2)·SE(b₁) = 0.270 ± t.05(11)·(0.0135) = 0.270 ± (1.796)(0.0135)
(0.2458, 0.2942)
A 90% C.I can be interpreted as follows: if we take 100 samples of the same size under the same conditions and compute 100 C.I's for the parameter, one from each sample, then about 90 such C.I's will contain the parameter (i.e. not all of the constructed C.I's).
A confidence interval estimate of a parameter is more informative than a point estimate because it reflects the precision of the estimate.
The width of the C.I (i.e. U.L − L.L) reflects the precision of the estimate. The precision can be increased either by decreasing the confidence level or by increasing the sample size.
Confidence level    C.I                 Width
99%                 (0.2281, 0.3119)    0.0838
95%                 (0.2403, 0.2997)    0.0594
90%                 (0.2458, 0.2942)    0.0484
90% C.I for β₀
b₀ ± tα/2(n−2)·SE(b₀) = 0.715 ± (1.796)(0.148)
(0.4492, 0.9808)
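Both intervals follow the estimate ± t·SE recipe; a sketch (SciPy assumed for the t quantile):

```python
from scipy import stats

b0, se_b0, b1, se_b1, n = 0.715, 0.148, 0.270, 0.0135, 13
t90 = stats.t.ppf(1 - 0.10 / 2, df=n - 2)      # t_{0.05}(11) = 1.796
ci_b1 = (b1 - t90 * se_b1, b1 + t90 * se_b1)   # (0.2458, 0.2942)
ci_b0 = (b0 - t90 * se_b0, b0 + t90 * se_b0)   # (0.4492, 0.9808)
```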

ANALYSIS OF VARIANCE IN SIMPLE LINEAR REGRESSION


Partition of variation in dependent variable into explained and unexplained variation
Total variation=Explained variation (Variation due to X also called variation due to regression)
+ Unexplained variation (Variation due to unknown factors)
Total variation:- S(YY)=19.6569
Explained variation (variation in Y due to X, also called variation due to regression):
b₁·S(XY) = (0.27023)(70.80) = 19.1322
Unexplained variation: total variation − explained variation = 19.6569 − 19.1322 = 0.5247

The hypothesis β₁ = 0 may be tested by the analysis of variance procedure.

ANOVA TABLE
S.O.V         DF          SS         MSS=SS/df    Fcal      Ftab
Regression    1           19.1322    19.1322      401.1*    F.05(1,11)=4.84
Error         13−2=11     0.5247     0.0477
TOTAL         13−1=12     19.6569

Relation between F and t for testing β₁ = 0:
F = t²:  401.1 = (20.03)²
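A sketch of the partition and the F = t² check, with the sums of squares from the example:

```python
syy, sxy, n = 19.6569, 70.8, 13       # S(YY), S(XY)
b1 = sxy / 262.0                      # S(XY)/S(XX) = 0.270
reg_ss = b1 * sxy                     # explained SS, about 19.13
err_ss = syy - reg_ss                 # residual SS, about 0.525
f_cal = reg_ss / (err_ss / (n - 2))   # about 401 = (20.03)**2
```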
Goodness of Fit
An important part of any statistical procedure that builds models from data is establishing how well the model actually fits. This topic encompasses detecting possible violations of the required assumptions in the data being analyzed and checking how close the observed data points lie to the fitted line. A commonly used measure of the goodness of fit of a linear model is R², called the coefficient of determination. If all the observations fall on the regression line, R² is 1. If there is no linear relationship between Y and X, R² is 0. R² = 0 does not necessarily mean that there is no association between the variables; instead, it indicates that there is no linear relationship.
The co-efficient of determination tells us the proportion of variation in the dependent variable explained
by the independent variable

STAT-602 [Muhammad Imran Khan is thankful for the contributors of these notes] Page 6
R² = (Reg. SS/Total SS) × 100 = (19.1322/19.6569) × 100 = 97.33%
The value of R² indicates that about 97% of the variation in the dependent variable has been explained by the linear relationship with X; the remainder is due to other unknown factors.

Test of hypothesis for the mean value of Y, i.e. μ(Y|X)

Ŷ₁₃ = 0.715 + 0.270(13) = 4.225        SE(Ŷ₁₃) = Sₑ·√(1/n + (X₀ − X̄)²/S(XX)) = 0.073

1) Construction of hypotheses
H₀: μ(Y|13) = 4
H₁: μ(Y|13) ≠ 4
2) Level of significance
α = 5%
3) Test statistic
t = (Ŷ₁₃ − μ(Y|13))/SE(Ŷ₁₃) = (4.225 − 4)/0.073 = 3.082
4) Decision Rule:- Reject H₀ if t_cal ≥ tα/2(n−2) = 2.201 or t_cal ≤ −tα/2(n−2) = −2.201
5) Result:- So reject H₀.

Confidence interval for the mean value of Y, i.e. μ(Y|X)

Ŷ₁₃ ± tα/2(n−2)·SE(Ŷ₁₃) = 4.225 ± (2.201)(0.073)
(4.064, 4.386)

Test of hypothesis for a single value of Y

Ŷ₁₃ = 0.715 + 0.270(13) = 4.225        SE(Ŷ₁₃)₁ = Sₑ·√(1 + 1/n + (X₀ − X̄)²/S(XX)) = 0.230

1) Construction of hypotheses
H₀: Y₁₃ = 4.2
H₁: Y₁₃ ≠ 4.2
2) Level of significance
α = 5%
3) Test statistic
t = (Ŷ₁₃ − Y₁₃)/SE(Ŷ₁₃)₁ = (4.225 − 4.2)/0.230 = 0.109
4) Decision Rule:- Reject H₀ if t_cal ≥ tα/2(n−2) = 2.201 or t_cal ≤ −tα/2(n−2) = −2.201
5) Result:- So don't reject H₀.

Confidence interval for a single value of Y

Ŷ₁₃ ± tα/2(n−2)·SE(Ŷ₁₃)₁ = 4.225 ± (2.201)(0.230)
(3.719, 4.731)
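The two intervals differ only by the extra “1 +” under the square root; a sketch using the example's numbers (SciPy assumed):

```python
import math
from scipy import stats

b0, b1, se, n, x_bar, sxx = 0.715, 0.270, 0.218, 13, 10.0, 262.0
x0 = 13
y_hat = b0 + b1 * x0                                            # 4.225
se_mean   = se * math.sqrt(1/n + (x0 - x_bar)**2 / sxx)         # 0.073
se_single = se * math.sqrt(1 + 1/n + (x0 - x_bar)**2 / sxx)     # 0.230
t95 = stats.t.ppf(0.975, df=n - 2)                              # 2.201
ci_mean   = (y_hat - t95 * se_mean,   y_hat + t95 * se_mean)    # (4.06, 4.39)
pi_single = (y_hat - t95 * se_single, y_hat + t95 * se_single)  # (3.72, 4.73)
```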

Transformation to a straight line
It is easy to deal with the regression, which is linear in parameters, but in some situations the
models are non-linear. The non-linear models can be divided into two types
(1):-Intrinsically Linear (2): - Intrinsically Non-Linear models
The models that can be transformed in to linear models after applying some suitable transformation are
called intrinsically linear models and the models that can not be transformed in to linear models are called
intrinsically non-linear models.
Following are the examples of some common non-linear models with suitable transformation to convert
them into linear models:
Non-linear form        Transformation                          Linear model
1. Y = aX^b            Log(Y) = Log(a) + b·Log(X)              Y* = a* + bX*
2. Y = ab^X            Log(Y) = Log(a) + X·Log(b)              Y* = a* + b*X
3. 1/Y = a + bX        Y* = a + bX  (1/Y = Y*)                 Y* = a + bX
4. Y = a·e^(bX)        Ln(Y) = Ln(a) + bX                      Y* = a* + bX
5. Y = a + b√X         Y = a + bX*  (√X = X*)                  Y = a + bX*
6. Y = aX² + bX        Y/X = b + aX  (Y/X = Y*)                Y* = b + aX
Example:- The number (Y) of bacteria per unit volume present in a culture after X hours is given in the
following table
Y X Log(Y)=Y* XY* X2
32 0 1.50515 0 0
47 1 1.6721 1.6721 1
65 2 1.81291 3.6258 4
92 3 1.96379 5.8914 9
132 4 2.12057 8.4823 16
190 5 2.27875 11.3938 25
275 6 2.43933 14.636 36
833 21 13.7926 45.7014 91

Fit a least squares curve of the form Y = ab^X to the data. Estimate the value of Y when X = 7.

We have to estimate a model Y = ab^X, for which the transformed line takes the form:
Log(Y) = Log(a) + X·Log(b)
Y* = a* + b*X

b* = S(XY*)/S(XX) = 4.33/28 = 0.154
a* = Ȳ* − b*·X̄ = 1.51

The regression equation is
Log(Y) = 1.51 + 0.154 X

Now
Log(a) = 1.51,  Log(b) = 0.154
Antilog[Log(a)] = Antilog(1.51) = 32.36
Antilog[Log(b)] = Antilog(0.154) = 1.43

The estimated model is: Ŷ = (32.36)(1.43)^X

The predicted growth 7 hours from now (X = 7) is: Ŷ = (32.36)(1.43)⁷ = 395.70
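A sketch of the same fit in Python (NumPy assumed; np.polyfit returns the slope first for a degree-1 fit). With unrounded coefficients the 7-hour prediction comes out near 387; the 395.70 above reflects rounding a and b to 32.36 and 1.43:

```python
import numpy as np

x = np.arange(7, dtype=float)                    # 0..6 hours
y = np.array([32, 47, 65, 92, 132, 190, 275], float)
b_star, a_star = np.polyfit(x, np.log10(y), 1)   # 0.154, 1.51
a, b = 10 ** a_star, 10 ** b_star                # about 32.1 and 1.43
y7 = a * b ** 7                                  # about 387
```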

Multiple Linear Regression

Multiple linear regression describes the dependence of the mean value of the response variable (Y) on given values of two or more independent variables (X's).
There are many applications where several explanatory variables affect the dependent variable, for example:
1) The yield of a crop depends upon the fertility of the land, the dose of fertilizer applied, the quantity of seed, etc.
2) The grade point average of students depends on aptitude, mental ability, hours devoted to study, and the type and nature of grading by teachers.
3) The systolic blood pressure of a person depends upon one's weight, age, etc.

If there are only two independent variables, the multiple regression model is:
Y = α + β₁X₁ + β₂X₂ + ε        Population Regression Model

μ(Y|X₁,X₂) = α + β₁X₁ + β₂X₂        Population Regression Line

μ̂(Y|X₁,X₂) = Ŷ = a + b₁X₁ + b₂X₂        Sample Regression Line

where
X₁ and X₂ are the independent variables and Y is the dependent variable;
a: Y intercept;
b₁ and b₂ are also called partial regression coefficients.
a, b₁, and b₂ can be estimated from sample information as:

b₁ = [S(X₂,X₂)·S(X₁,Y) − S(X₁,X₂)·S(X₂,Y)] / [S(X₁,X₁)·S(X₂,X₂) − {S(X₁,X₂)}²]

b₂ = [S(X₁,X₁)·S(X₂,Y) − S(X₁,X₂)·S(X₁,Y)] / [S(X₁,X₁)·S(X₂,X₂) − {S(X₁,X₂)}²]

a = b₀ = Ȳ − b₁X̄₁ − b₂X̄₂
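A sketch of these closed-form estimates in Python (two regressors only; NumPy assumed, and the helper name is ours):

```python
import numpy as np

def fit_two_regressors(x1, x2, y):
    """Closed-form least squares for Y = b0 + b1*X1 + b2*X2."""
    s = lambda u, v: np.sum((u - u.mean()) * (v - v.mean()))  # S(u,v)
    d = s(x1, x1) * s(x2, x2) - s(x1, x2) ** 2                # D
    b1 = (s(x2, x2) * s(x1, y) - s(x1, x2) * s(x2, y)) / d
    b2 = (s(x1, x1) * s(x2, y) - s(x1, x2) * s(x1, y)) / d
    b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return b0, b1, b2
```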
Interpretation of regression coefficients:
• a is the mean value of Y when X₁ = X₂ = 0.
• b₁ is the average change (increase or decrease) in the response variable Y for a one-unit increase in the explanatory variable X₁ when the effect of X₂ is held constant.
• b₂ measures the average change in Y for a one-unit increase in X₂ when the effect of X₁ is held constant.

EXAMPLE: The following data represent the performance of a chemical process as a function of several
controllable process variables:
CO2 Product (Y)   Solvent Total (X1)   Hydrogen Consumption (X2)   Y²   X1²   X2²   X1Y   X2Y   X1X2
36.98 2227.25 2.06 1367.52 4960643 4.2436 82364 76.179 4588.1
13.74 434.90 1.33 188.79 189138 1.7689 5976 18.274 578.4
10.08 481.19 0.97 101.61 231544 0.9409 4850 9.778 466.8
8.53 247.14 0.62 72.76 61078 0.3844 2108 5.289 153.2
36.42 1645.89 0.22 1326.42 2708954 0.0484 59943 8.012 362.1
26.59 907.59 0.76 707.03 823720 0.5776 24133 20.208 689.8
19.07 608.05 1.71 363.66 369725 2.9241 11596 32.610 1039.8
5.96 380.55 3.93 35.52 144818 15.4449 2268 23.423 1495.6
15.52 213.40 1.97 240.87 45540 3.8809 3312 30.574 420.4
56.61 2043.36 5.08 3204.69 4175320 25.8064 115675 287.579 10380.3
229.50 9189.32 18.65 7608.87 13710479 56.0201 312224 511.926 20174.4

Ŷ          e = Y − Ŷ     e²
47.3928 -10.41 108.42633
13.30172 0.44 0.1920925
13.68672 -3.61 13.008438
8.901963 -0.37 0.1383564
34.23853 2.18 4.7588268
21.29525 5.29 28.034412
16.99981 2.07 4.2857005
15.69707 -9.74 94.810435
10.04365 5.48 29.990376
47.9425 8.67 75.125472
229.50 0.00 358.77
1. Fit a multiple linear regression relating CO2 product to total solvent and hydrogen consumption
and calculate the value of R2
2. Test the significance of Regression
3. Test the significance of partial regression coefficients and construct confidence intervals
4. Can we conclude that total solvent and hydrogen consumption are a sufficient set of independent variables for explaining the variability in CO2 product?

[Scatterplot matrix of Y, X1, and X2]

X̄₁ = 918.93,  X̄₂ = 1.865,  Ȳ = 22.95

S(X₁Y) = ΣX₁Y − (ΣX₁)(ΣY)/n = 312224 − (9189.32)(229.5)/10 = 101329.106
S(X₂Y) = ΣX₂Y − (ΣX₂)(ΣY)/n = 511.926 − (18.65)(229.5)/10 = 83.91
S(X₁X₁) = ΣX₁² − (ΣX₁)²/n = 5266118.8
S(X₂X₂) = ΣX₂² − (ΣX₂)²/n = 21.24
S(X₁X₂) = ΣX₁X₂ − (ΣX₁)(ΣX₂)/n = 3036.32
S(YY) = ΣY² − (ΣY)²/n = 2341.84

D = S(X₁,X₁)·S(X₂,X₂) − [S(X₁,X₂)]² = 102633124.2

b₁ = [S(X₂,X₂)·S(X₁,Y) − S(X₁,X₂)·S(X₂,Y)] / D = 1897452.6/102633124.2 = 0.0185

b₂ = [S(X₁,X₁)·S(X₂,Y) − S(X₁,X₂)·S(X₁,Y)] / D = 134212437.4/102633124.2 = 1.31

b₀ = Ȳ − b₁X̄₁ − b₂X̄₂ = 3.52

Fitted regression line: Ŷ = 3.52 + 0.0185 X₁ + 1.31 X₂
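The same coefficients can be checked against a general least-squares solver; a sketch with the data above (NumPy assumed):

```python
import numpy as np

y  = np.array([36.98, 13.74, 10.08, 8.53, 36.42, 26.59, 19.07, 5.96, 15.52, 56.61])
x1 = np.array([2227.25, 434.90, 481.19, 247.14, 1645.89, 907.59, 608.05, 380.55, 213.40, 2043.36])
x2 = np.array([2.06, 1.33, 0.97, 0.62, 0.22, 0.76, 1.71, 3.93, 1.97, 5.08])
X = np.column_stack([np.ones_like(x1), x1, x2])        # columns [1, X1, X2]
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)   # 3.52, 0.0185, 1.31
```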

ANALYSIS OF VARIANCE IN MULTIPLE LINEAR REGRESSION


The hypothesis β₁ = β₂ = 0 may be tested by the analysis of variance procedure.
Total SS = S(Y,Y) = 2341.84
Reg. SS = b₁·S(X₁,Y) + b₂·S(X₂,Y) = (0.0185)(101329.106) + (1.31)(83.91) = 1983.07

ANOVA TABLE
S.O.V         DF    SS         MSS=SS/df    Fcal      Ftab
Regression    2     1983.07    991.54       19.35*    F.05(2,7)=4.74
Error         7     358.77     51.25
TOTAL         9     2341.84
Coefficient of Determination
The coefficient of determination tells us the proportion of variation in the dependent variable explained by the independent variables:
R² = (Reg. SS/Total SS) × 100 = (1983.07/2341.84) × 100 = 84.7%
The value of R² indicates that about 85% of the variation in the dependent variable has been explained by the linear relationship with X₁ and X₂; the remainder is due to other unknown factors.

Test of hypothesis about the significance of the partial regression coefficients:

Test of hypothesis for β₁

1) Construction of hypotheses
H₀: β₁ = 0
H₁: β₁ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
t = (b₁ − β₁)/SE(b₁) = (0.0185 − 0)/0.003257 = 5.68
where SE(b₁) = Sₑ·√[S(X₂,X₂)/D] = 7.16·√(21.24/102633124.2) = 0.003257
4) Decision Rule:- Reject H₀ if |t_cal| ≥ tα/2(n−3) = t.025(7) = 2.365 (error df = n − 3 = 7, as in the ANOVA table)
5) Result:- So reject H₀ and conclude that there is a significant relationship between CO2 Product and Solvent Total.

95% C.I for β₁

b₁ ± tα/2(7)·SE(b₁)
= 0.0185 ± (2.365)(0.003257) = (0.011, 0.026)

Test of hypothesis for β₂
1) Construction of hypotheses
H₀: β₂ = 0
H₁: β₂ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
t = (b₂ − β₂)/SE(b₂) = (1.31 − 0)/1.622 = 0.81
where SE(b₂) = Sₑ·√[S(X₁,X₁)/D] = 7.16·√(5266118.8/102633124.2) = 1.622
4) Decision Rule:- Reject H₀ if |t_cal| ≥ tα/2(n−3) = t.025(7) = 2.365
5) Result:- So don't reject H₀ and conclude that there is no significant relationship between CO2 Product and Hydrogen Consumption.

95% C.I for β₂

b₂ ± tα/2(7)·SE(b₂)
= 1.31 ± (2.365)(1.622)
= (−2.53, 5.15)

Relative importance of independent variables

Standardized regression coefficients are useful for measuring the relative importance of the independent variables because they are unit-free quantities:

b₁* = b₁·√[S(X₁,X₁)/S(YY)] = 0.0185·√(5266118.8/2341.84) = 0.88
b₂* = b₂·√[S(X₂,X₂)/S(YY)] = 1.31·√(21.24/2341.84) = 0.12

So Solvent Total (X₁) is a more important variable than Hydrogen Consumption (X₂) in predicting the CO2 Product.
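A sketch of the standardization, using the sums of squares from this example:

```python
import math

b1, b2 = 0.0185, 1.31
s11, s22, syy = 5266118.8, 21.24, 2341.84   # S(X1,X1), S(X2,X2), S(YY)
b1_star = b1 * math.sqrt(s11 / syy)         # about 0.88
b2_star = b2 * math.sqrt(s22 / syy)         # about 0.12
```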

Polynomial Regression
Example:- The data below give the time (in weeks) [X] and the corresponding yield (in kg) [Y] of cotton per plot in the specified period.
Put X₁ = X and X₂ = X².

Y     X₁    X₂    X₁²   X₂²    X₁X₂   X₁Y    X₂Y    Y²     Ŷ       e=Y−Ŷ    e²


100 1 1 1 1 1 100 100 10000 95.08 4.92 24.189
125 2 4 4 16 8 250 500 15625 118.59 6.41 41.076
118 3 9 9 81 27 354 1062 13924 135.91 -17.91 320.790
135 4 16 16 256 64 540 2160 18225 147.04 -12.04 144.983
160 5 25 25 625 125 800 4000 25600 151.98 8.02 64.291
170 6 36 36 1296 216 1020 6120 28900 150.73 19.27 371.204
148 7 49 49 2401 343 1036 7252 21904 143.30 4.70 22.133
120 8 64 64 4096 512 960 7680 14400 129.67 -9.67 93.474
100 9 81 81 6561 729 900 8100 10000 109.85 -9.85 97.052
90 10 100 100 10000 1000 900 9000 8100 83.85 6.15 37.878
1266 55 385 385 25333 3025 6860 45974 166678 1266.00 0.00 1217.07
Ȳ = 126.6,  X̄₁ = 5.50,  X̄₂ = 38.50

S(X₁,X₁) = ΣX₁² − (ΣX₁)²/n = 82.50
S(X₂,X₂) = ΣX₂² − (ΣX₂)²/n = 10510.50
S(X₁,Y) = ΣX₁Y − (ΣX₁)(ΣY)/n = 6860 − (55)(1266)/10 = −103
S(X₂,Y) = ΣX₂Y − (ΣX₂)(ΣY)/n = 45974 − (385)(1266)/10 = −2767
S(X₁,X₂) = ΣX₁X₂ − (ΣX₁)(ΣX₂)/n = 907.50
S(Y,Y) = ΣY² − (ΣY)²/n = 6402.40

D = S(X₁,X₁)·S(X₂,X₂) − [S(X₁,X₂)]² = 43560

b₁ = [S(X₂,X₂)·S(X₁,Y) − S(X₁,X₂)·S(X₂,Y)] / D = 32.7932
b₂ = [S(X₁,X₁)·S(X₂,Y) − S(X₁,X₂)·S(X₁,Y)] / D = −3.0947
b₀ = Ȳ − b₁X̄₁ − b₂X̄₂ = 65.3800

Fitted regression: Ŷ = 65.3800 + 32.7932 X − 3.0947 X²

STAT-602 [Muhammad Imran Khan is thankful for the contributors of these notes] Page 15
ANALYSIS OF VARIANCE
The hypothesis β₁ = β₂ = 0 may be tested by the analysis of variance procedure.
Total SS = S(Y,Y) = 6402.4
Reg. SS = b₁·S(X₁,Y) + b₂·S(X₂,Y) = (32.7932)(−103) + (−3.0947)(−2767) = 5185.3

ANOVA TABLE
S.O.V                DF    SS        MSS=SS/df    Fcal      Ftab
Regression (X, X²)   2     5185.3    2592.7       14.91*    F.05(2,7)=4.74
Error                7     1217.1    173.9
TOTAL                9     6402.40
Test of significance of the quadratic regression

1) Construction of hypotheses
H₀: β₂ = 0
H₁: β₂ ≠ 0
2) Level of significance
α = 5%
3) Test statistic
t = (b₂ − β₂)/SE(b₂) = (−3.0947 − 0)/0.5738 = −5.39
where SE(b₂) = Sₑ·√[S(X₁,X₁)/D] = 13.19·√(82.50/43560) = 0.5738
4) Decision Rule:- Reject H₀ if |t_cal| ≥ tα/2(n−3) = t.025(7) = 2.365 (error df = n − 3 = 7, as in the ANOVA table)
5) Result:- So reject H₀ and conclude that the quadratic regression is a useful model to explain the variation in the dependent variable.

Coefficient of Determination
The coefficient of determination tells us the proportion of variation in the dependent variable explained by the independent variable:
R² = (Reg. SS/Total SS) × 100 = (5185.3/6402.40) × 100 = 81%
The 2nd degree curve is appropriate for the above data set.
The value of X at which the maximum or minimum of the quadratic regression occurs is X = −b₁/(2b₂) = 5.30
The maximum or minimum value of Y is b₀ − b₁²/(4b₂) = 152.28
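A sketch of the quadratic fit and its vertex (NumPy assumed; np.polyfit returns the highest power first):

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = np.array([100, 125, 118, 135, 160, 170, 148, 120, 100, 90], float)
c2, c1, c0 = np.polyfit(x, y, 2)   # -3.0947, 32.7932, 65.38
x_peak = -c1 / (2 * c2)            # about 5.30 weeks
y_peak = c0 - c1**2 / (4 * c2)     # about 152.3 kg
```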

Comparison of 1st degree and 2nd degree curves

[Scatter plot: yield (y) versus time (x1), with both fitted curves]
SIMPLE LINEAR REGRESSION (1st degree curve):   y = 133 − 1.25 X        Sₑ = 28.00, R² = 2.0%
CURVILINEAR REGRESSION (2nd degree curve):     y = 65.4 + 32.8 X − 3.09 X²   Sₑ = 13.19, R² = 81.0%

CORRELATION ANALYSIS
SIMPLE CORRELATION
Q.1. The following data represent the wing length and tail length of sparrows
Wing length (X)   Tail length (Y)   XY      X²       Y²
10.4 7.4 76.96 108.16 54.76
10.8 7.6 82.08 116.64 57.76
11.1 7.9 87.69 123.21 62.41
10.2 7.2 73.44 104.04 51.84
10.3 7.4 76.22 106.09 54.76
10.2 7.1 72.42 104.04 50.41
10.7 7.4 79.18 114.49 54.76
10.5 7.2 75.6 110.25 51.84
10.8 7.8 84.24 116.64 60.84
11.2 7.7 86.24 125.44 59.29
10.6 7.8 82.68 112.36 60.84
11.4 8.3 94.62 129.96 68.89
128.2     90.8      971.37    1371.31   688.40
(ΣX)      (ΣY)      (ΣXY)     (ΣX²)     (ΣY²)
(a) Find the coefficient of correlation between wing length and tail length.
(b) Test the hypothesis H₀: ρ₁₂ = 0

Solution
(a) Coefficient of correlation between wing length and tail length
X̄ = 10.68,  Ȳ = 7.57
S(XY) = ΣXY − nX̄Ȳ = 1.32
S(XX) = ΣX² − n(X̄)² = 1.72
S(YY) = ΣY² − n(Ȳ)² = 1.35
r = S(XY)/√[S(XX)·S(YY)] = 0.866
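A sketch of the same computation in Python (SciPy assumed; stats.pearsonr also returns the two-sided p-value useful for part (b)):

```python
import numpy as np
from scipy import stats

wing = np.array([10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4])
tail = np.array([7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3])
r, p = stats.pearsonr(wing, tail)   # r about 0.87
```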
(b) Test of hypothesis for ρ = 0
1) Construction of hypotheses: H₀: ρ₁₂ = 0, H₁: ρ₁₂ ≠ 0
2) Level of significance: α = 5%
3) Test statistic: t = (r₁₂ − ρ₁₂)/SE(r₁₂)
4) Calculation: t_cal = (0.866 − 0)/0.158 = 5.47, where SE(r₁₂) = √[(1 − r₁₂²)/(n − 2)] = √[(1 − 0.866²)/(12 − 2)] = 0.158
5) Critical region:- t_Tab = tα/2(n−2) = t.025(10) = 2.228
6) Conclusion:- Since t_cal > t_Tab, we reject H₀ and conclude that there is a significant linear relationship between wing and tail length.

Q.2. A random sample of 10 families had the following income and expenditure per week
Let Y=Family Expenditure and X=Family Income
Y X Y2 X2 XY
7 20 49 400 140
9 30 81 900 270
8 33 64 1089 264
11 40 121 1600 440
5 15 25 225 75
4 13 16 169 52
8 26 64 676 208
10 38 100 1444 380
9 35 81 1225 315
10 43 100 1849 430
81        293       701       9577      2574
(ΣY)      (ΣX)      (ΣY²)     (ΣX²)     (ΣXY)

X̄ = 29.3,  Ȳ = 8.1
S(XY) = ΣXY − nX̄Ȳ = 200.70
S(XX) = ΣX² − n(X̄)² = 992.10
S(YY) = ΣY² − n(Ȳ)² = 44.90
r₁₂ = S(XY)/√[S(XX)·S(YY)] = 200.70/√[(992.10)(44.90)] = 200.70/211.06 = 0.95

1) Construction of hypotheses: H₀: ρ₁₂ = 0, H₁: ρ₁₂ ≠ 0
2) Level of significance: α = 5%
3) Test statistic: t = (r₁₂ − ρ₁₂)/SE(r₁₂)
4) Calculation: t_cal = (0.95 − 0)/0.1104 = 8.60, where SE(r₁₂) = √[(1 − r₁₂²)/(n − 2)] = √[(1 − 0.95²)/(10 − 2)] = 0.1104
5) Critical region:- t_Tab = tα/2(n−2) = t.025(8) = 2.306
6) Conclusion:- Since t_cal > t_Tab, we reject H₀ and conclude that there is a significant linear relationship between family income and expenditure.

Q.3.The following data represent the city size and Expenditure.
Let X=City size Y= Expenditures

X Y X2 Y2 XY
30 65 900 4225 1950
50 77 2500 5929 3850
75 79 5625 6241 5925
100 80 10000 6400 8000
150 82 22500 6724 12300
200 90 40000 8100 18000
175 84 30625 7056 14700
120 81 14400 6561 9720
900 638 126550 51236 74445
Means:   112.5     79.75     15818.75   6404.5    9305.625
(a) Find the coefficient of correlation between city size and expenditure.
(b) Test the hypothesis H₀: ρ₁₂ = 0
Solution. (a)
X̄ = 112.50,  Ȳ = 79.75
S(XY) = ΣXY − nX̄Ȳ = 2670.00
S(XX) = ΣX² − n(X̄)² = 25300
S(YY) = ΣY² − n(Ȳ)² = 355.50
r₁₂ = S(XY)/√[S(XX)·S(YY)] = 2670.00/√[(25300)(355.50)] = 2670.00/2999.02 = 0.89

1) Construction of hypotheses: H₀: ρ₁₂ = 0, H₁: ρ₁₂ ≠ 0
2) Level of significance: α = 5%
3) Test statistic: t = (r₁₂ − ρ₁₂)/SE(r₁₂)
4) Calculation: t_cal = (0.89 − 0)/0.1861 = 4.78, where SE(r₁₂) = √[(1 − r₁₂²)/(n − 2)] = √[(1 − 0.89²)/(8 − 2)] = 0.1861
5) Critical region:- t_Tab = tα/2(n−2) = t.025(6) = 2.447
6) Conclusion:- Since t_cal > t_Tab, we reject H₀ and conclude that there is a significant linear relationship between city size and development expenditure.

PARTIAL CORRELATION
Q.1.:- Suppose that X₁ = fish length, X₂ = fish weight, X₃ = fish age, and r₁₂ = 0.60, r₁₃ = 0.70, r₂₃ = 0.65, n = 15.
(a) Find the partial correlation coefficient between X₁ and X₂ with the effect of X₃ held constant, i.e. find r₁₂.₃.
(b) Test the hypothesis H₀: ρ₁₂.₃ = 0
Solution. (a)
r₁₂.₃ = (r₁₂ − r₁₃·r₂₃)/√[(1 − r₁₃²)(1 − r₂₃²)] = [(0.60) − (0.70)(0.65)]/√[(1 − 0.70²)(1 − 0.65²)] = 0.27
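A sketch of the first-order partial correlation formula above:

```python
import math

def partial_r(r12, r13, r23):
    """Correlation of variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13**2) * (1 - r23**2))

print(partial_r(0.60, 0.70, 0.65))   # about 0.27
```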

(b) Test of hypothesis for the partial correlation coefficient

1) Construction of hypotheses: H₀: ρ₁₂.₃ = 0, H₁: ρ₁₂.₃ ≠ 0
2) Level of significance: α = 5%
3) Test statistic: t = (r₁₂.₃ − ρ₁₂.₃)/SE(r₁₂.₃)
4) Calculation: t_cal = (0.27 − 0)/0.278 = 0.97, where SE(r₁₂.₃) = √[(1 − r₁₂.₃²)/(n − 2 − k)] = √[(1 − 0.27²)/(15 − 2 − 1)] = 0.278
5) Critical region:- t_Tab = tα/2(n−2−k) = t.025(15−2−1) = t.025(12) = 2.179
6) Conclusion:- Since t_cal < t_Tab, we don't reject H₀ and conclude that there is no significant linear relationship between fish length and fish weight after controlling for fish age.

Q.2.:- Suppose that X₁ = fish length, X₂ = fish weight, X₃ = fish age, and r₁₂ = 0.60, r₁₃ = 0.70, r₂₃ = 0.65, n = 15.
(a) Find the partial correlation coefficient between X₁ and X₃ with the effect of X₂ held constant, i.e. find r₁₃.₂.
(b) Test the hypothesis H₀: ρ₁₃.₂ = 0
Solution. (a)
r₁₃.₂ = (r₁₃ − r₁₂·r₂₃)/√[(1 − r₁₂²)(1 − r₂₃²)] = [(0.70) − (0.60)(0.65)]/√[(1 − 0.60²)(1 − 0.65²)] = 0.51

(b) Test of hypothesis for the partial correlation coefficient

1) Construction of hypotheses: H₀: ρ₁₃.₂ = 0, H₁: ρ₁₃.₂ ≠ 0
2) Level of significance: α = 5%
3) Test statistic: t = (r₁₃.₂ − ρ₁₃.₂)/SE(r₁₃.₂)
4) Calculation: t_cal = (0.51 − 0)/0.25 = 2.04, where SE(r₁₃.₂) = √[(1 − r₁₃.₂²)/(n − 2 − k)] = √[(1 − 0.51²)/(15 − 2 − 1)] = 0.25
5) Critical region:- t_Tab = tα/2(n−2−k) = t.025(12) = 2.179
6) Conclusion:- Since t_cal < t_Tab, we don't reject H₀ and conclude that there is no significant linear relationship between fish length and fish age after controlling for fish weight.

MULTIPLE CORRELATION
Q.3.:- Suppose that X₁ = fish length, X₂ = fish weight, X₃ = fish age, and r₁₂ = 0.60, r₁₃ = 0.70, r₂₃ = 0.65, n = 15.
(a) Find the multiple correlation coefficient between X₁ and the joint effect of X₂ and X₃, i.e. find R₁.₂₃.
(b) Test the hypothesis H₀: ρ₁.₂₃ = 0
Solution. (a)

R₁.₂₃ = √[(r₁₂² + r₁₃² − 2r₁₂r₁₃r₂₃)/(1 − r₂₃²)] = √[((0.60)² + (0.70)² − 2(0.60)(0.70)(0.65))/(1 − (0.65)²)] = 0.73
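A sketch of R₁.₂₃ and its F statistic (with the unrounded R² the F value comes out near 6.7; the 6.85 below uses R rounded to 0.73):

```python
import math

r12, r13, r23, n, k = 0.60, 0.70, 0.65, 15, 2
R2 = (r12**2 + r13**2 - 2*r12*r13*r23) / (1 - r23**2)  # about 0.526
R = math.sqrt(R2)                                      # about 0.73
F = (n - k - 1) * R2 / (k * (1 - R2))                  # about 6.7
```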
(b) Test of hypothesis for the multiple correlation coefficient
1) Construction of hypotheses: H₀: ρ₁.₂₃ = 0, H₁: ρ₁.₂₃ ≠ 0
2) Level of significance: α = 5%
3) Test statistic: F = [(n − k − 1)·R₁.₂₃²]/[k(1 − R₁.₂₃²)]
4) Calculation: F_cal = [(15 − 2 − 1)(0.73)²]/[2(1 − (0.73)²)] = 6.85
5) Critical region:- F_Tab = F.05(k, n−k−1) = F.05(2, 12) = 3.89
6) Conclusion:- Since F_cal > F_Tab, we reject H₀ and conclude that there is a significant linear relationship between fish length and the joint effect of fish weight and fish age.

