
# Kathmandu University

## Simple Linear Regression and Correlation

Many problems in science and engineering involve exploring the relation between two or more variables.
Regression analysis is a statistical technique for modeling and investigating the relation between the
variables. For instance, in a chemical process, we might be interested in the relationship between the output of
the process, the temperature at which it occurs, and the amount of catalyst employed. Knowledge of such a
relationship would enable us to predict the output for various values of temperature and amount of catalyst.
In general, suppose that a single dependent variable $Y$ (called the response variable) depends on $k$
independent variables $x_1, x_2, \ldots, x_k$ (also called regressor or predictor variables). The response
variable $Y$ is a random variable, while the regressor variables $x_1, x_2, \ldots, x_k$ are measured with negligible error.
The regressor variables can be controlled by the experimenter. The mathematical relation between these
variables is called the regression equation. The regression equation is fitted to the set of data, that is,
$Y = f(x_1, x_2, \ldots, x_k)$. Polynomial functions are usually employed as the approximating function.

## Simple Linear Regression

A relationship between a single regressor variable x and a response variable Y is called a simple linear
regression. The regressor variable x is controlled by the experimenter.

Suppose that the true relationship between $Y$ and $x$ is a straight line and that the observation $Y$ at each $x$ is a random
variable. So, it is reasonable to consider that

$$Y = \beta_0 + \beta_1 x + \epsilon \tag{1}$$

where $\beta_0$ and $\beta_1$ are called the regression coefficients. These coefficients are unknown and can be
estimated from the observed data. Equation (1) is called the simple regression line. The symbol $\epsilon$ denotes the
random error in the model of equation (1); the random error is assumed to have mean zero, $E(\epsilon) = 0$.
Then

$$E(Y \mid x) = \beta_0 + \beta_1 x \tag{2}$$

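Estimation of $\beta_0$ and $\beta_1$

Suppose we have $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. These data are used to estimate the
unknown parameters $\beta_0$ and $\beta_1$ of equation (1) such that the sum of the squares of the errors is as small as
possible (minimum).

For the set of data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, we get from equation (1)

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, 2, \ldots, n$$

$$\epsilon_i = y_i - \beta_0 - \beta_1 x_i, \quad i = 1, 2, \ldots, n$$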

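Now, the sum of the squares of the errors $\epsilon_i$, $i = 1, 2, \ldots, n$,

$$S = \sum_{i=1}^{n} \epsilon_i^2$$

is to be minimized. That means

$$S = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{3}$$

is to be minimized.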
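The quantity $S$ in equation (3) can be evaluated directly for any trial pair of coefficients. The following Python sketch is only an illustration: the function name `sum_squared_errors` and the sample data are assumptions, not part of these notes. It shows that $S$ is zero when the data lie exactly on the trial line and grows when the trial line is moved away from the data.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Return S = sum of (y_i - b0 - b1 * x_i)^2 for the trial line y = b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]

s_exact = sum_squared_errors(xs, ys, 2.0, 3.0)    # trial line equals the data line
s_shifted = sum_squared_errors(xs, ys, 2.5, 3.0)  # intercept moved up by 0.5

print(s_exact)    # 0.0
print(s_shifted)  # 1.0
```

Least squares estimation looks for the pair of coefficients that makes this value as small as possible.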

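Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the estimators of $\beta_0$ and $\beta_1$ that attain the minimum value of $S$. For the minimum value of $S$,
we must have

$$\frac{\partial S}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial S}{\partial \beta_1} = 0$$

Now, for $\frac{\partial S}{\partial \beta_0} = 0$:

$$-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$

$$\sum_{i=1}^{n} y_i - n \hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0$$

$$n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \tag{4}$$

and for $\frac{\partial S}{\partial \beta_1} = 0$:

$$-2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$

$$\sum_{i=1}^{n} (x_i y_i - \hat{\beta}_0 x_i - \hat{\beta}_1 x_i^2) = 0$$

$$\sum_{i=1}^{n} x_i y_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0$$

$$\hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{5}$$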

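Equations (4) and (5) are called the least squares normal equations. We rewrite equation (4) to get

$$\hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \hat{\beta}_1 \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{6}$$

Multiplying (4) by $\sum_{i=1}^{n} x_i$ and (5) by $n$, and then subtracting (4) from (5), we get

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} \tag{7}$$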

Thus the least squares fitted simple linear regression line is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{8}$$

Equation (8) is used to predict the value of the response variable $Y$ for a given value of the regressor $x$.

Note: The least squares fitted simple regression line (8) passes through the center point $(\bar{x}, \bar{y})$ because of
equation (6).
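As a minimal sketch of equations (6) and (7), the Python code below computes the two estimates from the raw sums and then checks three consequences of the derivation: the residuals sum to zero (normal equation (4)), the residuals weighted by $x_i$ sum to zero (normal equation (5)), and the fitted line passes through the center point. The function name `fit_line` and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) computed from equations (6) and (7)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Equation (7): slope estimate.
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Equation (6): intercept estimate.
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Normal equation (4) is equivalent to: the residuals sum to zero.
sum_residuals = sum(residuals)
# Normal equation (5) is equivalent to: the x-weighted residuals sum to zero.
sum_x_residuals = sum(x * e for x, e in zip(xs, residuals))
# The fitted line passes through the center point (x_bar, y_bar).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
y_at_x_bar = b0 + b1 * x_bar

print(round(b0, 4), round(b1, 4))
```

The function divides by $\sum x_i^2 - \frac{1}{n}(\sum x_i)^2$, so it needs at least two distinct values of $x$.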

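## Notation: For numerical calculation

Denote

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i$$

Then from equation (7), using the above notation, we have $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$.

Note: The least squares estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables, since they are calculated as linear
combinations of the random values of the random variable $Y$.

The equation of the regression line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$,

where $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.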

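## Unbiasedness and variance of the estimators

(1) $E(\hat{\beta}_0) = \beta_0$ and $V(\hat{\beta}_0) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)$

(2) $E(\hat{\beta}_1) = \beta_1$ and $V(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{xx}}$

Here $\sigma^2$ denotes the variance of the random error $\epsilon$.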
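These two properties can be checked numerically with a small simulation. The sketch below is an illustration under assumptions that are not in the notes: the true parameters, the fixed regressor values, and the use of independent normally distributed errors with standard deviation `sigma` are all hypothetical choices. It repeatedly generates data from model (1), refits the line, and compares the average and the sample variance of the estimates with the formulas above.

```python
import random


def fit_line(xs, ys):
    """Least squares estimates (b0, b1) using S_xy / S_xx and equation (6)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


random.seed(0)                          # fixed seed so the run is reproducible
beta0, beta1, sigma = 1.0, 2.0, 0.5     # assumed true parameters (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]     # fixed regressor values (hypothetical)
n = len(xs)
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)

reps = 5000
b0_draws, b1_draws = [], []
for _ in range(reps):
    # Generate one data set from Y = beta0 + beta1 * x + error.
    ys = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in xs]
    b0, b1 = fit_line(xs, ys)
    b0_draws.append(b0)
    b1_draws.append(b1)

mean_b0 = sum(b0_draws) / reps
mean_b1 = sum(b1_draws) / reps
var_b0 = sum((b - mean_b0) ** 2 for b in b0_draws) / (reps - 1)
var_b1 = sum((b - mean_b1) ** 2 for b in b1_draws) / (reps - 1)

# Theoretical variances from properties (1) and (2).
theory_var_b0 = sigma ** 2 * (1 / n + x_bar ** 2 / s_xx)
theory_var_b1 = sigma ** 2 / s_xx

print(mean_b0, mean_b1)
print(var_b0, theory_var_b0)
print(var_b1, theory_var_b1)
```

The simulated averages should be close to the assumed true values, and the simulated variances close to the theoretical ones, with agreement improving as `reps` grows.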

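Let $y_i$ be the observed value of the response variable $Y$ at $x_i$, and let $\hat{y}_i$ be the predicted value of the response variable $Y$ at $x_i$ using the regression line given by equation (8).

Then the difference $e_i = y_i - \hat{y}_i$ is known as the residual (error) of $Y$ at $x_i$. The sum of the squares of the residuals,
or error sum of squares, is

$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

But from equation (8) we have $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, so, using $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ from equation (6),

$$SS_E = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}) \right]^2$$

and, after simplification, we get

$$SS_E = SS_T - \hat{\beta}_1 S_{xy} \tag{9}$$

where $SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares.

Then,

$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SS_R + SS_E$$

where $SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the regression sum of squares. Hence

$$SS_R = SS_T - SS_E \tag{10}$$

$$SS_R = SS_T - SS_T + \hat{\beta}_1 S_{xy} \quad \text{(using equation (9))}$$

$$SS_R = \hat{\beta}_1 S_{xy} \tag{11}$$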
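The identities (9), (10), and (11) are easy to confirm numerically. The following Python sketch computes the three sums of squares directly from their definitions on a small hypothetical data set; the function name `sums_of_squares` and the data are assumptions made for illustration.

```python
def sums_of_squares(xs, ys):
    """Return (b1, s_xy, ss_e, ss_r, ss_t) for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    ss_e = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
    ss_r = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares
    ss_t = sum((y - y_bar) ** 2 for y in ys)              # total sum of squares
    return b1, s_xy, ss_e, ss_r, ss_t


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, s_xy, ss_e, ss_r, ss_t = sums_of_squares(xs, ys)

print(ss_t, ss_r + ss_e)     # equation (10): the two values agree
print(ss_e, ss_t - b1 * s_xy)  # equation (9)
print(ss_r, b1 * s_xy)       # equation (11)
```

Each printed pair agrees up to floating-point rounding.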

## The coefficient of Determination (Measure of quality of regression line fit)

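The coefficient of determination is a measure of how well the regression line given in equation (8)
represents the data. The coefficient of determination is the quantity defined by

$$R^2 = \frac{SS_R}{SS_T}$$

$$R^2 = \frac{SS_T - SS_E}{SS_T} \quad \text{(using equation (10))}$$

$$R^2 = 1 - \frac{SS_E}{SS_T}$$

So,

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$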

Remarks:

- The coefficient of determination $R^2$ takes values in the range $0 \le R^2 \le 1$.
- The coefficient of determination represents the proportion of the total variation in $Y$ that is explained by the fitted regression
line (8). For instance, $R^2 = 0.9963$ implies that 99.63% of the total variation in $Y$ is explained by the
regression line (8). Equivalently, (100 - 99.63)% = 0.37% remains unexplained by the regression line (8).
- If $R^2 = 1$, the regression line fits the observed data $(x_i, y_i)$, $i = 1, 2, \ldots, n$, perfectly.
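To illustrate these remarks, the sketch below computes the coefficient of determination as the ratio of the regression sum of squares to the total sum of squares, with the regression sum of squares taken as the slope estimate times $S_{xy}$. The function name `r_squared` and both data sets are hypothetical. The first data set lies exactly on a line, the second does not.

```python
def r_squared(xs, ys):
    """Coefficient of determination, computed as (b1 * S_xy) / SS_T."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_t = sum((y - y_bar) ** 2 for y in ys)
    b1 = s_xy / s_xx
    # Assumes the y values are not all equal, otherwise ss_t is zero.
    return b1 * s_xy / ss_t


# Hypothetical data lying exactly on the line y = 1 + 2x.
perfect = r_squared([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
# Hypothetical data scattered around a line.
noisy = r_squared([1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 6.0, 9.0])

print(perfect)  # 1.0
print(noisy)    # approximately 0.9
```

For the second data set, about 90% of the variation in the response is explained by the fitted line and about 10% is left unexplained.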

## Correlation

Scatter Diagram: A plot of the data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, in the xy-plane, or of $(x_i, y_i, z_i)$, $i = 1, 2, \ldots, n$, in
xyz-space, is called a scatter diagram.

Figure: Scatter Diagram.

Source: Introduction to Probability and Statistics for Engineers and Scientists, 4th edition, by
Sheldon M. Ross.

Correlation:
Correlation is the relation between two random variables $X$ and $Y$. It is a measure of how things are
related. Examples include the correlation between

- the rainfall and the level of pollutant in a city,
- temperature and the consumption of cold drinks,
- the height and weight of children of age 5.

## Three different types of correlations:

- Positive Correlation
- Negative Correlation
- No Correlation

## Figure: Three types of correlations.

Source: http://www.statisticshowto.com/what-is-the-pearson-correlation-coefficient/

Correlation Coefficient
The correlation coefficient is a measure of the correlation between two random variables $X$ and $Y$.
It measures the strength and direction of the linear relation between $X$ and
$Y$. For given values $(x_i, y_i)$, $i = 1, 2, \ldots, n$, of the pair of random variables $(X, Y)$, the
correlation coefficient is given by

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} \, SS_T}}$$

Properties of the correlation coefficient $r$:

- The values of $r$ lie in the range $-1 \le r \le 1$.
- If $r$ lies in the range $0 < r \le 1$, then the correlation is said to be positive, and if $r = 1$,
then the correlation is perfectly positive.
- If $r$ lies in the range $-1 \le r < 0$, then the correlation is said to be negative, and if
$r = -1$, then the correlation is perfectly negative.
- If $r = 0$, then there is no linear correlation between the random variables $X$ and $Y$.
- The coefficient of determination satisfies $R^2 = r^2$.

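## Working Formula for Finding the Correlation Coefficient

$$r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2} \sqrt{\sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2}}$$

$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \sqrt{n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2}}$$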
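The working formula translates directly into code. The Python sketch below uses the second form, with the factor $n$ multiplied through; the function name `correlation` and the three small data sets are illustrative assumptions. The data sets are chosen to show a perfectly positive, a perfectly negative, and an intermediate correlation.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r from the working formula based on raw sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    # Assumes neither variable is constant, otherwise the denominator is zero.
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


xs = [1.0, 2.0, 3.0, 4.0]
r_up = correlation(xs, [3.0, 5.0, 7.0, 9.0])    # points on a rising line
r_down = correlation(xs, [9.0, 7.0, 5.0, 3.0])  # points on a falling line
r_mid = correlation(xs, [3.0, 6.0, 6.0, 9.0])   # points scattered around a rising line

print(r_up, r_down, r_mid)
print(r_mid ** 2)  # equals the coefficient of determination of the fitted line
```

The first two values are 1 and -1 up to rounding, and the square of the third is about 0.9.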

Example: A study of the amount of rainfall and the quantity of air pollution removed produced the
following data:

| Daily rainfall, $x$ | Particulate removed, $y$ (μg/m³) |
|---|---|
| 4.3 | 126 |
| 4.5 | 121 |
| 5.9 | 116 |
| 5.6 | 118 |
| 6.1 | 114 |
| 5.2 | 118 |
| 3.8 | 132 |
| 2.1 | 141 |
| 7.5 | 108 |

a) Plot a scatter diagram.
b) Find the equation of the regression line to predict the particulate removed from the amount of
daily rainfall.
c) Estimate the amount of particulate removed when the daily rainfall is $x = 4.8$ units.
d) Find the coefficient of determination.
e) Find the correlation coefficient.

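## (b) Equation of the regression line:

The equation of the regression line is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{1}$$

where $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, with

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2$$

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i$$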

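## Table for computation:

| $x$ | $y$ | $x^2$ | $y^2$ | $xy$ |
|---|---|---|---|---|
| 4.3 | 126 | 18.49 | 15876 | 541.8 |
| 4.5 | 121 | 20.25 | 14641 | 544.5 |
| 5.9 | 116 | 34.81 | 13456 | 684.4 |
| 5.6 | 118 | 31.36 | 13924 | 660.8 |
| 6.1 | 114 | 37.21 | 12996 | 695.4 |
| 5.2 | 118 | 27.04 | 13924 | 613.6 |
| 3.8 | 132 | 14.44 | 17424 | 501.6 |
| 2.1 | 141 | 4.41 | 19881 | 296.1 |
| 7.5 | 108 | 56.25 | 11664 | 810 |
| $\sum x_i = 45$ | $\sum y_i = 1094$ | $\sum x_i^2 = 244.26$ | $\sum y_i^2 = 133786$ | $\sum x_i y_i = 5348.2$ |

Here $n = 9$,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{45}{9} = 5$$

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1094}{9} = 121.56$$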

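So,

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 = 244.26 - \frac{1}{9} (45)^2 = 19.26$$

and

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i = 5348.2 - \frac{1}{9} (45)(1094) = -121.80$$

Now,

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-121.80}{19.26} = -6.32$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 121.56 - (-6.32)(5) = 153.16$$

Thus the fitted regression line is

$$\hat{y} = 153.16 - 6.32x \tag{2}$$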

## (c) Estimate the amount of particulate removed when the daily rainfall is $x = 4.8$ units:

When $x = 4.8$,

$$\hat{y} = 153.16 - 6.32(4.8) = 122.82$$

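## (d) Coefficient of determination:

$$R^2 = 1 - \frac{SS_E}{SS_T}$$

where $SS_E = SS_T - \hat{\beta}_1 S_{xy}$. So

$$R^2 = 1 - \frac{SS_T - \hat{\beta}_1 S_{xy}}{SS_T}$$

$$R^2 = 1 - 1 + \frac{\hat{\beta}_1 S_{xy}}{SS_T}$$

$$R^2 = \frac{\hat{\beta}_1 S_{xy}}{SS_T}$$

Now,

$$SS_T = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2 = 133786 - \frac{1}{9} (1094)^2 = 804.22$$

$$R^2 = \frac{\hat{\beta}_1 S_{xy}}{SS_T} = \frac{(-6.32)(-121.80)}{804.22}$$

$$R^2 = 0.9572$$

Thus, $R^2 = 0.9572$ implies that 95.72% of the total variation in $Y$ is explained by the regression line (2).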
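## (e) Correlation coefficient:

$$r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \sqrt{n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2}}$$

$$r = \frac{9 (5348.2) - (45)(1094)}{\sqrt{9 (244.26) - 45^2} \sqrt{9 (133786) - 1094^2}}$$

$$r = \frac{-1096.20}{(13.17)(85.08)}$$

$$r = -0.9783$$

Thus, $r = -0.9783$ implies that the random variable $Y$ is strongly negatively correlated with the random
variable $X$. That means the particulate removed decreases strongly as the daily rainfall increases.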
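The hand computation for parts (b) to (e) can be cross-checked with a short script. The Python sketch below uses the data of the example and the same sums as the computation table; the variable names are illustrative. Because the script keeps full precision while the hand computation rounds intermediate values, the results agree to about two decimal places and may differ slightly in the third.

```python
import math

# Data from the example: daily rainfall x and particulate removed y.
rainfall = [4.3, 4.5, 5.9, 5.6, 6.1, 5.2, 3.8, 2.1, 7.5]
removed = [126, 121, 116, 118, 114, 118, 132, 141, 108]

n = len(rainfall)
sum_x = sum(rainfall)
sum_y = sum(removed)
sum_xx = sum(x * x for x in rainfall)
sum_yy = sum(y * y for y in removed)
sum_xy = sum(x * y for x, y in zip(rainfall, removed))

s_xx = sum_xx - sum_x ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n
ss_t = sum_yy - sum_y ** 2 / n

b1 = s_xy / s_xx                   # slope estimate
b0 = sum_y / n - b1 * sum_x / n    # intercept estimate
y_hat = b0 + b1 * 4.8              # part (c): prediction at x = 4.8
r2 = b1 * s_xy / ss_t              # part (d): coefficient of determination
r = s_xy / math.sqrt(s_xx * ss_t)  # part (e): correlation coefficient

print(f"S_xx = {s_xx:.2f}, S_xy = {s_xy:.2f}, SS_T = {ss_t:.2f}")
print(f"b1 = {b1:.3f}, b0 = {b0:.3f}")
print(f"y_hat(4.8) = {y_hat:.2f}")
print(f"R^2 = {r2:.4f}, r = {r:.4f}")
```

The negative sign of the slope and of the correlation coefficient confirms that the particulate removed decreases as the daily rainfall increases.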
