Simple Linear Regression and Correlation: Kathmandu University Course: MATH 208, ENVE II/II

Kathmandu University
Course: MATH 208, ENVE II/II
Prepared by: Dr. Samir Shrestha
Simple Linear Regression and Correlation

Many problems in science and engineering involve exploring the relation between two or more variables.
Regression analysis is a statistical technique for modeling and investigating the relation between the
variables. For instance, in a chemical process, we might be interested in the relationship between the output of
the process, the temperature at which it occurs, and the amount of catalyst employed. Knowledge of such a
relationship would enable us to predict the output for various values of temperature and amount of catalyst.
In general, suppose that a single dependent variable $Y$ (called the response variable) depends on $k$
independent variables, say $x_1, x_2, \ldots, x_k$ (also called regressor or predictor variables). The response
variable $Y$ is a random variable, while the regressor variables $x_1, x_2, \ldots, x_k$ are measured with negligible error.
The regressor variables can be controlled by the experimenter. The mathematical relation between these
variables is called the regression equation. This regression equation is fitted to the set of data, that is,
$Y = f(x_1, x_2, \ldots, x_k)$. Polynomial functions are usually employed as the approximating function.
Simple Linear Regression
A relationship between a single regressor variable x and a response variable Y is called a simple linear
regression. The regressor variable x is controlled by the experimenter.
Suppose that the true relationship between Y and x is a straight line and the observation Y at each x is a random
variable. So, it is reasonable to consider that
$$Y = \beta_0 + \beta_1 x + \varepsilon \qquad (1)$$
where $\beta_0$ and $\beta_1$ are called the regression coefficients. These coefficients are unknown and must be
estimated from the observed data. Equation (1) is called the simple regression line. The symbol $\varepsilon$ denotes the
random error in the model (1); it is assumed to have mean zero, $E(\varepsilon) = 0$.
Then
$$E(Y \mid x) = \beta_0 + \beta_1 x \qquad (2)$$
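Estimation of $\beta_0$ and $\beta_1$
Suppose we have $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. These data are used to estimate the
unknown parameters $\beta_0$ and $\beta_1$ of equation (1) such that the sum of the squares of the errors is as small as
possible (minimum).
For the set of data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, we get from equation (1)
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$
$$\varepsilon_i = y_i - \beta_0 - \beta_1 x_i, \quad i = 1, 2, \ldots, n$$
Now, the sum of the squares of the errors $\varepsilon_i$, $i = 1, 2, \ldots, n$,
$$S = \sum_{i=1}^{n} \varepsilon_i^2$$
is to be minimized. That means
$$S = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \qquad (3)$$
is to be minimized.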
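Let $\hat\beta_0$ and $\hat\beta_1$ be the estimators of $\beta_0$ and $\beta_1$ that attain the minimum value of $S$. For $S$ to be minimum,
we must have
$$\left.\frac{\partial S}{\partial \beta_0}\right|_{\hat\beta_0, \hat\beta_1} = 0 \quad \text{and} \quad \left.\frac{\partial S}{\partial \beta_1}\right|_{\hat\beta_0, \hat\beta_1} = 0.$$
Now, $\partial S / \partial \beta_0 = 0$ gives
$$-2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$
$$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$
$$\sum_{i=1}^{n} y_i - n\hat\beta_0 - \hat\beta_1 \sum_{i=1}^{n} x_i = 0$$
$$n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \qquad (4)$$
and $\partial S / \partial \beta_1 = 0$ gives
$$-2 \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$
$$\sum_{i=1}^{n} (x_i y_i - \hat\beta_0 x_i - \hat\beta_1 x_i^2) = 0$$
$$\sum_{i=1}^{n} x_i y_i - \hat\beta_0 \sum_{i=1}^{n} x_i - \hat\beta_1 \sum_{i=1}^{n} x_i^2 = 0$$
$$\hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \qquad (5)$$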
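Equations (4) and (5) are called the least squares normal equations. Dividing equation (4) by $n$ and rearranging, we get
$$\hat\beta_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \hat\beta_1 \, \frac{1}{n} \sum_{i=1}^{n} x_i$$
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \qquad (6)$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the sample means.
Multiplying (4) by $\sum_{i=1}^{n} x_i$ and (5) by $n$, and then subtracting the first result from the second, we get
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right)}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2} \qquad (7)$$
Equations (6) and (7) give the least squares estimators of $\beta_0$ and $\beta_1$.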
Thus the least squares fitted simple linear regression line is
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x \qquad (8)$$
Equation (8) is used to predict the value of the response variable $Y$ for a given regressor value $x$.
Note: The least squares fitted simple regression line (8) passes through the center point $(\bar{x}, \bar{y})$ because of
equation (6).
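
The estimators in equations (6) and (7) can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function name `fit_line` and the five data points are hypothetical and chosen only for illustration. It computes the estimates from the raw sums and then evaluates the residuals of the normal equations (4) and (5), which should be zero up to rounding, along with the gap between the fitted line at the mean of x and the mean of y.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) using equations (6) and (7)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Equation (7): slope from the raw sums
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Equation (6): intercept from the sample means
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)

n = len(xs)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))
# Left side minus right side of the normal equations (4) and (5)
resid4 = n * b0 + b1 * sum(xs) - sum(ys)
resid5 = b0 * sum(xs) + b1 * sum_xx - sum_xy
# Fitted value at the mean of x minus the mean of y (center point check)
center_gap = b0 + b1 * (sum(xs) / n) - sum(ys) / n

print(b0, b1)
print(resid4, resid5, center_gap)
```

For this data the slope is 1.99 and the intercept is 0.05, and the three check quantities vanish up to floating-point rounding.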
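Notation: For numerical calculation

Denote
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right)$$
Then from equation (7), using the above notation, we have
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$$
Note: The least squares estimators $\hat\beta_0$ and $\hat\beta_1$ are random variables, since they are linear
combinations of the random values of the random variable $Y$.
The equation of the regression line is $\hat{y} = \hat\beta_0 + \hat\beta_1 x$,

where $\hat\beta_1 = \dfrac{S_{xy}}{S_{xx}}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.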
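
As a quick check on the notation, the sketch below computes $S_{xx}$ and $S_{xy}$ both from deviations about the means and from the raw sums, and confirms that the two forms agree. It is an illustrative Python sketch only: the function names and the four data points are assumptions, not part of the notes.

```python
def s_from_deviations(xs, ys):
    """Return (S_xx, S_xy) from deviations about the sample means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xx, s_xy


def s_from_raw_sums(xs, ys):
    """Return (S_xx, S_xy) from the raw sums (the computational form)."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xx, s_xy


# Hypothetical data, for illustration only
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 7.0, 9.0]

s_xx_dev, s_xy_dev = s_from_deviations(xs, ys)
s_xx_raw, s_xy_raw = s_from_raw_sums(xs, ys)

slope = s_xy_raw / s_xx_raw                              # S_xy / S_xx
intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)  # y-bar - slope * x-bar
print(s_xx_dev, s_xx_raw, s_xy_dev, s_xy_raw, slope, intercept)
```

Here both forms give $S_{xx} = 20$ and $S_{xy} = 28$, so the slope is 1.4 and the intercept is -2. The raw-sum form is convenient for hand calculation; the deviation form is usually less sensitive to rounding when the values are large relative to their spread.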

Unbiasedness and variance of the estimators

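(1) $E(\hat\beta_0) = \beta_0$ and $V(\hat\beta_0) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)$

(2) $E(\hat\beta_1) = \beta_1$ and $V(\hat\beta_1) = \dfrac{\sigma^2}{S_{xx}}$

Here $\sigma^2$ denotes the variance of the random error $\varepsilon$.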
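
These two properties can be illustrated by simulation. The sketch below is a rough Monte Carlo check in Python, not a proof. The true coefficients, the error standard deviation, the design points, and the choice of normally distributed errors are all assumptions made for the illustration (the notes only require the errors to have mean zero). It repeatedly generates data from model (1), refits the line, and compares the average and the spread of the estimates with the formulas above.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least squares estimates (b0, b1) via S_xy / S_xx and equation (6)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


random.seed(0)  # fixed seed so that the run is repeatable

# Assumed "true" model for the simulation (hypothetical values)
beta0, beta1, sigma = 1.0, 2.0, 1.0
xs = [float(i) for i in range(10)]  # fixed regressor values 0, 1, ..., 9
n = len(xs)
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)

b0_draws, b1_draws = [], []
for _ in range(5000):
    # One simulated sample from Y = beta0 + beta1 * x + error
    ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in xs]
    b0, b1 = fit_line(xs, ys)
    b0_draws.append(b0)
    b1_draws.append(b1)

mean_b0 = statistics.fmean(b0_draws)
mean_b1 = statistics.fmean(b1_draws)
var_b0 = statistics.variance(b0_draws)
var_b1 = statistics.variance(b1_draws)

# Theoretical variances from properties (1) and (2)
theory_var_b0 = sigma ** 2 * (1.0 / n + x_bar ** 2 / s_xx)
theory_var_b1 = sigma ** 2 / s_xx

print(mean_b0, mean_b1)
print(var_b0, theory_var_b0)
print(var_b1, theory_var_b1)
```

With enough replications the simulated means should sit close to the assumed true coefficients, and the simulated variances close to the theoretical values, up to sampling noise.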

Error Sum of Squares
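Let $y_i$: the observed value of the response variable $Y$ at $x_i$,
$\hat{y}_i$: the predicted value of the response variable $Y$ at $x_i$ using the regression line given by equation (8).
Then the difference $e_i = y_i - \hat{y}_i$ is known as the residual (error) of $Y$ at $x_i$. The sum of the squares of the residuals,
or error sum of squares, is
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
But from equation (8), we have $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$, so
$$SSE = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$$
Substituting $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ from equation (6),
$$SSE = \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x}) \right]^2 = SST - 2\hat\beta_1 S_{xy} + \hat\beta_1^2 S_{xx},$$
where $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares. Since $\hat\beta_1 S_{xx} = S_{xy}$, after simplification we get
$$SSE = SST - \hat\beta_1 S_{xy} \qquad (9)$$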
Regression Sum of Squares
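The regression sum of squares is defined by
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$$
It is related to the estimated variance of the response variable $Y$.
Then the total sum of squares is
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SSR + SSE$$
(the cross-product term vanishes because of the normal equations (4) and (5)), so that
$$SSR = SST - SSE \qquad (10)$$
$$SSR = SST - SST + \hat\beta_1 S_{xy} \quad \text{(using equation (9))}$$
$$SSR = \hat\beta_1 S_{xy} \qquad (11)$$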
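
The decomposition of the total sum of squares and the shortcut formulas (9) and (11) can be verified on a small data set. The Python sketch below is illustrative only: the function name `sums_of_squares` and the data are assumptions, not part of the notes. It computes each sum of squares directly from its definition so that the identities are checked, not assumed.

```python
def sums_of_squares(xs, ys):
    """Return SSE, SSR, SST (from their definitions) plus b1 and S_xy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
    ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
    sst = sum((y - y_bar) ** 2 for y in ys)              # total sum of squares
    return {"SSE": sse, "SSR": ssr, "SST": sst, "b1": b1, "S_xy": s_xy}


# Hypothetical data, for illustration only
result = sums_of_squares([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])

print(result["SST"], result["SSR"] + result["SSE"])              # both sides of SST = SSR + SSE
print(result["SSE"], result["SST"] - result["b1"] * result["S_xy"])  # equation (9)
print(result["SSR"], result["b1"] * result["S_xy"])              # equation (11)
```

For this data, SST = 6, SSR = 3.6, and SSE = 2.4, and each pair printed above agrees up to floating-point rounding.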
The coefficient of Determination (Measure of quality of regression line fit)
The coefficient of determination is a measure of how well the regression line given in equation (8)
represents the data. It is defined by
$$R^2 = \frac{SSR}{SST}$$
$$R^2 = \frac{SST - SSE}{SST} \quad \text{(using equation (10))}$$
$$R^2 = 1 - \frac{SSE}{SST}$$
So,
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Remarks:
The coefficient of determination $R^2$ takes values in the range $0 \le R^2 \le 1$.

The coefficient of determination represents the proportion of the total variation in $Y$ that is explained by the fitted regression
line (8). For instance, $R^2 = 0.9963$ implies that 99.63% of the total variation in $Y$ is explained by the
regression line (8). Equivalently, (100 - 99.63)% = 0.37% remains unexplained by the regression line (8).
If $R^2 = 1$, the regression line fits the observed data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, perfectly.
Correlation
Scatter Diagram: A plot of the data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, in the xy-plane, or $(x_i, y_i, z_i)$, $i = 1, 2, \ldots, n$, in
xyz-space, is called a scatter diagram.
Figure: Scatter Diagram.

Source: Introduction to Probability and Statistics for Engineers and Scientists, 4th edition, by
Sheldon M. Ross.
Correlation:
Correlation is the relation between two random variables X and Y. It is a measure of how the variables are
related. Examples include the correlation between
The rainfall and level of pollutant in a city.
Temperature and consumption of cold drinks.
Height and weight of kids of age 5.
Three different types of correlations:

Positive Correlation
Negative Correlation
No Correlation
Figure: Three types of correlations.

Source: http://www.statisticshowto.com/what-is-the-pearson-correlation-coefficient/
Correlation Coefficient
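The correlation coefficient is a measure of the correlation between two random variables X and Y.
It measures the strength and direction of the linear relation between two random variables X and
Y. For the given values of the pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, of the random variables (X, Y), the
correlation coefficient is given by
$$r = \frac{S_{xy}}{\sqrt{S_{xx} \, SST}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$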
Properties of the correlation coefficient r:
The values of r lie in the range $-1 \le r \le 1$.
If r lies in the range $0 < r \le 1$, then the correlation is said to be positive, and if $r = 1$,
then the correlation is perfectly positive.
If r lies in the range $-1 \le r < 0$, then the correlation is said to be negative, and if
$r = -1$, then the correlation is perfectly negative.
If $r = 0$, then there is no linear correlation between the random variables X and Y.
The coefficient of determination satisfies $R^2 = r^2$.
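
The last property follows from equation (11). A short derivation in the notation of these notes, taking $r = S_{xy} / \sqrt{S_{xx}\,SST}$ with $SST = \sum (y_i - \bar{y})^2$ and $\hat\beta_1 = S_{xy}/S_{xx}$, might read:

```latex
\begin{aligned}
r^2 &= \frac{S_{xy}^2}{S_{xx}\, SST}
     = \frac{S_{xy}}{S_{xx}} \cdot \frac{S_{xy}}{SST}
     = \frac{\hat\beta_1 S_{xy}}{SST}
     = \frac{SSR}{SST}
     = R^2 .
\end{aligned}
```

The sign of r is the sign of $S_{xy}$, and hence of the slope $\hat\beta_1$, so r carries the direction of the relation while $R^2$ does not.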
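Working Formula for Finding the Correlation Coefficient
$$r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right)}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2\right] \left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$$
$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \; \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$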
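
The working formula translates directly into code. The following Python sketch is illustrative only: the function name `correlation` and the small data sets are assumptions. It evaluates the second form of the working formula and tries it on a perfectly increasing, a perfectly decreasing, and a scattered data set.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r from the raw sums (working formula).

    Undefined (division by zero) if all x values or all y values are equal.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    return numerator / denominator


# Hypothetical data, for illustration only
r_up = correlation([1, 2, 3, 4], [3, 5, 7, 9])            # points on a rising line
r_down = correlation([1, 2, 3, 4], [9, 7, 5, 3])          # points on a falling line
r_mixed = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])   # scattered points

print(r_up, r_down, r_mixed, r_mixed ** 2)
```

The first two cases give r equal to 1 and -1 up to rounding. For the scattered data r is about 0.775, and its square is 0.6, the value of SSR/SST for the same five points.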
Example: A study of the amount of rainfall and the quantity of air pollution removed produced the
following data:
Daily rainfall, x    Particulate removed, y (μg/m³)
4.3                  126
4.5                  121
5.9                  116
5.6                  118
6.1                  114
5.2                  118
3.8                  132
2.1                  141
7.5                  108
a) Plot a scatter diagram.
b) Find the equation of the regression line to predict the particulate removed from the amount of
daily rainfall.
c) Estimate the amount of particulate removed when the daily rainfall is $x = 4.8$ units.
d) Find the coefficient of determination.
e) Find the correlation coefficient.
Answer:
(a) Scatter diagram:
Figure: Scatter diagram of particulate removed versus daily rainfall.
(b) Equation of the regression line:

The equation of the regression line is
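$$\hat{y} = \hat\beta_0 + \hat\beta_1 x \qquad (1)$$
where $\hat\beta_1 = \dfrac{S_{xy}}{S_{xx}}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$, with
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2$$
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right)$$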
Table for computation:
x       y       x^2       y^2       xy
4.3     126     18.49     15876     541.8
4.5     121     20.25     14641     544.5
5.9     116     34.81     13456     684.4
5.6     118     31.36     13924     660.8
6.1     114     37.21     12996     695.4
5.2     118     27.04     13924     613.6
3.8     132     14.44     17424     501.6
2.1     141     4.41      19881     296.1
7.5     108     56.25     11664     810.0
Sum     45      1094      244.26    133786    5348.2

Here $n = 9$,
$\sum x_i = 45$, $\sum y_i = 1094$, $\sum x_i^2 = 244.26$, $\sum y_i^2 = 133786$, $\sum x_i y_i = 5348.2$.
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{45}{9} = 5,$$
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1094}{9} = 121.56$$
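So,
$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2 = 244.26 - \frac{1}{9} (45)^2 = 19.26$$
and
$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right) = 5348.2 - \frac{1}{9} (45)(1094) = -121.80$$
Now,
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{-121.80}{19.26} = -6.32$$
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 121.56 - (-6.32)(5) = 153.16$$
So, the equation of the regression line (1) is given by
$$\hat{y} = 153.16 - 6.32x \qquad (2)$$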
(c) Estimate of the amount of particulate removed when the daily rainfall is $x = 4.8$ units:
When $x = 4.8$,
$$\hat{y} = 153.16 - 6.32 \times 4.8 = 122.82$$
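(d) Coefficient of determination:
$$R^2 = 1 - \frac{SSE}{SST}$$
where $SSE = SST - \hat\beta_1 S_{xy}$, so
$$R^2 = 1 - \frac{SST - \hat\beta_1 S_{xy}}{SST} = 1 - 1 + \frac{\hat\beta_1 S_{xy}}{SST} = \frac{\hat\beta_1 S_{xy}}{SST}$$
Now,
$$SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} y_i\right)^2 = 133786 - \frac{1}{9} (1094)^2 = 804.22$$
$$R^2 = \frac{\hat\beta_1 S_{xy}}{SST} = \frac{(-6.32)(-121.80)}{804.22} = 0.9572$$
Thus, $R^2 = 0.9572$ implies that 95.72% of the total variation in $Y$ is explained by the regression line (2).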
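(e) Correlation coefficient:
$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right) \left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2} \; \sqrt{n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$$
$$r = \frac{9 \times 5348.2 - 45 \times 1094}{\sqrt{9 \times 244.26 - 45^2} \; \sqrt{9 \times 133786 - 1094^2}}$$
$$r = \frac{-1096.20}{13.17 \times 85.08} = -0.9783$$
Thus, $r = -0.9783$ implies that the random variable Y is strongly negatively correlated with the random
variable X. That means the particulate removed decreases strongly as the daily rainfall increases.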
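
The whole example can be reproduced in a few lines of Python. The sketch below is an illustration, not the method used in the notes; the variable names are assumptions. It carries full floating-point precision throughout, so the last digits differ slightly from the hand calculation, which rounds the slope to -6.32 and the square roots to 13.17 and 85.08 before the final steps: unrounded arithmetic gives an intercept of about 153.18, a coefficient of determination of about 0.9578, and a correlation coefficient of about -0.9787.

```python
import math

# Data from the example: daily rainfall (x) and particulate removed (y)
rainfall = [4.3, 4.5, 5.9, 5.6, 6.1, 5.2, 3.8, 2.1, 7.5]
particulate = [126, 121, 116, 118, 114, 118, 132, 141, 108]

n = len(rainfall)
sum_x = sum(rainfall)
sum_y = sum(particulate)
sum_xx = sum(x * x for x in rainfall)
sum_yy = sum(y * y for y in particulate)
sum_xy = sum(x * y for x, y in zip(rainfall, particulate))

s_xx = sum_xx - sum_x ** 2 / n       # 244.26 - 45^2 / 9
s_xy = sum_xy - sum_x * sum_y / n    # 5348.2 - 45 * 1094 / 9
sst = sum_yy - sum_y ** 2 / n        # 133786 - 1094^2 / 9

b1 = s_xy / s_xx                     # slope
b0 = sum_y / n - b1 * sum_x / n      # intercept
y_hat = b0 + b1 * 4.8                # part (c): prediction at x = 4.8
r2 = b1 * s_xy / sst                 # part (d): coefficient of determination
r = s_xy / math.sqrt(s_xx * sst)     # part (e): correlation coefficient

print(round(s_xx, 2), round(s_xy, 2), round(sst, 2))
print(round(b1, 4), round(b0, 4), round(y_hat, 2))
print(round(r2, 4), round(r, 4))
```

The differences from the hand-computed values are rounding effects only; the conclusions of the example (a strong negative linear relation, with roughly 96% of the variation in the particulate removed explained by rainfall) are unchanged.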