
Regression Analysis

The objective of many investigations is to understand and explain the relationship among variables. Frequently, one wants to know how, and to what extent, a certain variable (the response variable) is related to a set of other variables (the explanatory variables). Regression analysis helps us determine the nature and the strength of the relationship among variables.

Types of relationship:

i) Deterministic relationship also called functional relationship

ii) Probabilistic relationship also called statistical relationship

In a deterministic relationship, the relationship between two variables is known exactly, such as:

a) Area of a circle: $A = \pi r^2$

b) $F = k\,\dfrac{m_1 m_2}{r^2}$ (Newton's law of gravity)

c) The relationship between dollar sales (Y) of a product sold at a fixed price and the number of units sold.

In a statistical relationship, the relation between variables is not known exactly, so we have to approximate it and develop models that characterize its main features. Regression analysis is concerned with developing such "approximating" models. For example, in business research the sales of a product are related to the advertising expenditure on that product, and it is usually required to build a model relating sales to advertising expenditure.

The word regression refers to investigating the dependence of one variable, called the dependent variable and denoted by Y, on one or more other variables, called independent variables and denoted by X's. It provides an equation for estimating or predicting the average value of the dependent variable from known values of the independent variables. When we study the dependence of a variable on a single independent variable, it is called simple regression, whereas the dependence of a variable on two or more independent variables is called multiple regression.


Regressor: The variable that forms the basis of estimation or prediction is called the regressor. It is also called the independent, explanatory, controlled, or predictor variable, and is usually denoted by X.

Regressand: The variable whose values depend on the known values of the independent variable is called the regressand. It is also called the response, dependent, or random variable, and is usually denoted by Y.

In simple regression, the dependence of response variable (Y) is investigated on only one regressor (X). If the relationship of these variables can be described by a straight line, it is termed as simple linear regression. The population simple linear regression model is defined as:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad \text{(Population Regression Model)}$$

$$E(Y \mid X) = \beta_0 + \beta_1 X \qquad \text{(Population Regression Line)}$$

where $\beta_0$ and $\beta_1$ are the population regression coefficients and $\varepsilon_i$ is a random error peculiar to the i-th observation. Thus, each response is expressed as the sum of a value predicted from the corresponding X, plus a random error. The sample regression equation is an estimate of the population regression equation. Like any other estimate, there is an uncertainty associated with it.

$$\hat{Y} = b_0 + b_1 X \qquad \text{(Sample Regression Line)}$$

where

$b_0$: Y-intercept
$b_1$: slope of the regression line

$b_0$ and $b_1$ are also called regression coefficients. X is the independent variable and Y is the dependent variable. This model is said to be simple (because there is only one independent variable), linear in the parameters, and linear in the independent variable (X appears in the first power, not as $X^2$ or $X^3$).

How to identify the relationship between variables

A useful first step in regression analysis is to plot Y versus X. This plot is called a scatter plot, and it may suggest what type of mathematical function would be appropriate for summarizing the data.

A variety of functions are useful in fitting models to data.

LEAST SQUARE LINE

A least squares line is described in terms of its Y-intercept (the height at which it intercepts the Y-axis) and its slope (the inclination of the line). The line can be expressed by the following relation:

$$\hat{Y} = a + bX \quad \text{or} \quad \hat{Y} = b_0 + b_1 X \qquad \text{(Estimated regression of Y on X)}$$

where

$$b = \frac{S(XY)}{S(XX)}, \quad \text{called the slope of the line}$$

$$a = \bar{Y} - b\bar{X}, \quad \text{called the intercept of the line}$$

In other words,

$$b_1 = \frac{S_{XY}}{S_X^2}, \qquad b_0 = \bar{Y} - b_1\bar{X}$$
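The slope and intercept formulas above can be motivated by a short derivation. This is a sketch of the standard least squares argument, which the notes themselves do not spell out; the symbol SSE (sum of squared errors) is introduced here for illustration only.

```latex
\begin{align*}
\text{SSE}(b_0, b_1) &= \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \\[4pt]
\frac{\partial\,\text{SSE}}{\partial b_0} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0
  \;\Rightarrow\; b_0 = \bar{Y} - b_1 \bar{X} \\[4pt]
\frac{\partial\,\text{SSE}}{\partial b_1} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} X_i \,(Y_i - b_0 - b_1 X_i) = 0 \\[4pt]
\text{Substituting } b_0:\quad
  &\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})
   = b_1 \sum_{i=1}^{n} (X_i - \bar{X})^2
  \;\Rightarrow\; b_1 = \frac{S(XY)}{S(XX)}
\end{align*}
```

The substitution step uses the first condition (the residuals sum to zero), which allows $X_i$ to be replaced by $X_i - \bar{X}$ in the second condition.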

Example: The following data give sparrow wing length (Y, in cm) at various ages (X, in days after hatching).

| Wing length Y (cm) | Age X (days) | XY | $X^2$ | $Y^2$ | $\hat{Y}$ | $e = Y - \hat{Y}$ | $e^2$ |
|---|---|---|---|---|---|---|---|
| 1.4 | 3 | 4.2 | 9 | 1.96 | 1.525 | -0.125 | 0.015625 |
| 1.5 | 4 | 6.0 | 16 | 2.25 | 1.795 | -0.295 | 0.087025 |
| 2.2 | 5 | 11.0 | 25 | 4.84 | 2.065 | 0.135 | 0.018225 |
| 2.4 | 6 | 14.4 | 36 | 5.76 | 2.335 | 0.065 | 0.004225 |
| 3.1 | 8 | 24.8 | 64 | 9.61 | 2.875 | 0.225 | 0.050625 |
| 3.2 | 9 | 28.8 | 81 | 10.24 | 3.145 | 0.055 | 0.003025 |
| 3.2 | 10 | 32.0 | 100 | 10.24 | 3.415 | -0.215 | 0.046225 |
| 3.9 | 11 | 42.9 | 121 | 15.21 | 3.685 | 0.215 | 0.046225 |
| 4.1 | 12 | 49.2 | 144 | 16.81 | 3.955 | 0.145 | 0.021025 |
| 4.7 | 14 | 65.8 | 196 | 22.09 | 4.495 | 0.205 | 0.042025 |
| 4.5 | 15 | 67.5 | 225 | 20.25 | 4.765 | -0.265 | 0.070225 |
| 5.2 | 16 | 83.2 | 256 | 27.04 | 5.035 | 0.165 | 0.027225 |
| 5.0 | 17 | 85.0 | 289 | 25.00 | 5.305 | -0.305 | 0.093025 |
| Total: 44.4 | 130 | 514.80 | 1562 | 171.3 | 44.395 | 0.005 | 0.525 |

(i) Draw a scatter plot for the data.

(ii) Fit a simple linear regression and interpret the parameters.

(iii) Calculate the coefficient of determination and interpret it.

(iv) Estimate the value of Y when X = 13.

(v) Estimate and interpret the simple correlation coefficient.

Solution:-

[Figure: scatter plot "Wing length vs. days"; vertical axis: wing length (cm), horizontal axis: age (days)]

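$$\bar{X} = 10, \qquad \bar{Y} = 3.415$$

$$S(XY) = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 70.8$$

$$S(XX) = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} = 262$$

$$S(YY) = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = 19.6569$$

$$b_1 = \frac{S(XY)}{S(XX)} = 0.270 \text{ cm/day}$$

$$b_0 = \bar{Y} - b_1\bar{X} = 0.715 \text{ cm}$$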
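The hand calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `fit_line` and the variable names are illustrative choices, not part of the original notes. Because the code does not round intermediate values, it gives an intercept of about 0.713 cm, while the notes obtain 0.715 cm from the rounded values of the mean and the slope.

```python
# Sparrow data from the example: age in days (X) and wing length in cm (Y).
age = [3, 4, 5, 6, 8, 9, 10, 11, 12, 14, 15, 16, 17]
wing = [1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0]


def fit_line(x, y):
    """Return (b0, b1, s_xy, s_xx) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of products and squares, S(XY) and S(XX).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1, s_xy, s_xx


b0, b1, s_xy, s_xx = fit_line(age, wing)

# Residuals e = Y - Y_hat and their sum of squares (the unexplained variation).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(age, wing)]
sse = sum(e ** 2 for e in residuals)

print(f"S(XY) = {s_xy:.2f}, S(XX) = {s_xx:.0f}")
print(f"b1 = {b1:.3f} cm/day, b0 = {b0:.3f} cm")
print(f"sum of squared residuals = {sse:.4f}")
```

The sum of squared residuals printed at the end corresponds to the total of the $e^2$ column in the data table.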

So the estimated simple linear regression equation is $\hat{Y} = 0.715 + 0.270X$.

Interpretation of estimated regression parameters

The value $b_1 = 0.270$ indicates that the average wing length is expected to increase by 0.270 cm with each one-day increase in age.

The observed range of age (the explanatory variable) in the experiment was 3 to 17 days (i.e., the scope of the model); therefore it would be an unreasonable extrapolation to expect this rate of increase in wing length to continue beyond that range. It is safe to use the results of the regression only within the range of the observed values of the independent variable (i.e., within the scope of the model).

In the regression equation, $b_0 = 0.715$ is the average wing length when age = 0 days. In this example the scope of the model does not cover X = 0, so $b_0$ does not have any particular meaning as a separate term in the regression equation.

NOTE: Interpolation and Extrapolation

Interpolation is making a prediction within the range of values of the predictor in the sample used to generate the model. Interpolation is generally safe.

Extrapolation is making a prediction outside the range of values of the predictor in the sample used to generate the model. The further the prediction is from the range of values used to fit the model, the riskier it becomes, because there is no way to check that the relationship continues to be linear.

Total variation: S(YY) = 19.6569

Explained variation (variation in Y due to X, also called variation due to regression): $b_1 S(XY) = \dfrac{70.80}{262}(70.80) = 19.1322$. This uses the unrounded slope; with $b_1$ rounded to 0.270 the product is 19.116.

Unexplained variation: total variation - explained variation = 19.6569 - 19.1322 = 0.5247

Goodness of Fit

An important part of any statistical procedure that builds models from data is establishing how well the model actually fits. This topic encompasses detecting possible violations of the required assumptions in the data being analyzed and checking how close the observed data points are to the fitted line.

A commonly used measure of the goodness of fit of a linear model is $R^2$, called the coefficient of determination. If all the observations fall on the regression line, $R^2$ is 1. If there is no linear relationship between Y and X, $R^2$ is 0. $R^2 = 0$ does not necessarily mean that there is no association between the variables; it indicates only that there is no linear relationship.

The coefficient of determination tells us the proportion of variation in the dependent variable explained by the independent variable.

$$R^2 = \frac{\text{Reg SS}}{\text{Total SS}} \times 100 = \frac{19.1322}{19.6569} \times 100 = 97.33\%$$

The value of $R^2$ indicates that about 97% of the variation in the dependent variable is explained by the linear relationship with X; the remainder is due to other, unknown factors.

Finding the value of Y when X = 13

$$\hat{Y}_{13} = 0.715 + 0.270(13) = 4.225 \text{ cm}$$
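The variation breakdown, $R^2$, and the prediction at X = 13 can be reproduced from the column totals of the sparrow table. The sketch below is illustrative: the `predict` helper and its range check are assumptions added here to reflect the notes' warning about extrapolation, not something the notes define. The correlation coefficient asked for in part (v) is computed with the formula $r = S_{XY} / \sqrt{S_{XX} S_{YY}}$ from the correlation section.

```python
# Column totals taken from the sparrow table (n = 13 observations).
n = 13
sum_x, sum_y = 130.0, 44.4
sum_xy, sum_xx, sum_yy = 514.80, 1562.0, 171.3

# Corrected sums: S(XY), S(XX), S(YY).
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n

explained = s_xy ** 2 / s_xx      # b1 * S(XY), variation due to regression
unexplained = s_yy - explained    # residual variation
r_squared = explained / s_yy      # coefficient of determination
r = s_xy / (s_xx * s_yy) ** 0.5   # simple correlation coefficient

# Observed range of age in the experiment (the scope of the model).
X_MIN, X_MAX = 3, 17


def predict(x, b0=0.715, b1=0.270):
    """Predict wing length (cm) at age x (days), refusing to extrapolate."""
    if not X_MIN <= x <= X_MAX:
        raise ValueError(f"x={x} is outside the observed range {X_MIN}..{X_MAX}")
    return b0 + b1 * x


print(f"explained = {explained:.4f}, unexplained = {unexplained:.4f}")
print(f"R^2 = {100 * r_squared:.2f}%, r = {r:.4f}")
print(f"predicted wing length at 13 days = {predict(13):.3f} cm")
```

Raising an error outside 3 to 17 days is one possible design; a softer alternative would be to return the value with a warning.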

CORRELATION ANALYSIS

SIMPLE CORRELATION

Q.1. The following data represent the wing length and tail length of sparrows.

| Wing length (X) | Tail length (Y) | XY | $X^2$ | $Y^2$ |
|---|---|---|---|---|
| 10.4 | 7.4 | 76.96 | 108.16 | 54.76 |
| 10.8 | 7.6 | 82.08 | 116.64 | 57.76 |
| 11.1 | 7.9 | 87.69 | 123.21 | 62.41 |
| 10.2 | 7.2 | 73.44 | 104.04 | 51.84 |
| 10.3 | 7.4 | 76.22 | 106.09 | 54.76 |
| 10.2 | 7.1 | 72.42 | 104.04 | 50.41 |
| 10.7 | 7.4 | 79.18 | 114.49 | 54.76 |
| 10.5 | 7.2 | 75.60 | 110.25 | 51.84 |
| 10.8 | 7.8 | 84.24 | 116.64 | 60.84 |
| 11.2 | 7.7 | 86.24 | 125.44 | 59.29 |
| 10.6 | 7.8 | 82.68 | 112.36 | 60.84 |
| 11.4 | 8.3 | 94.62 | 129.96 | 68.89 |
| $\sum X$ = 128.2 | $\sum Y$ = 90.8 | $\sum XY$ = 971.37 | $\sum X^2$ = 1371.32 | $\sum Y^2$ = 688.40 |

(a) Find the coefficient of correlation between wing length and tail length.

(b) Test the hypothesis $H_0: \rho_{12} = 0$.

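Solution (a) Coefficient of correlation between wing length and tail length

$$\bar{X} = 10.68, \qquad \bar{Y} = 7.57$$

$$S_{XY} = \sum XY - n\bar{X}\bar{Y} = 1.32$$

$$S_X^2 = \sum X^2 - n(\bar{X})^2 = 1.72$$

$$S_Y^2 = \sum Y^2 - n(\bar{Y})^2 = 1.35$$

$$r = \frac{S_{XY}}{\sqrt{S_X^2\, S_Y^2}} = 0.866$$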
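The correlation coefficient can be recomputed directly from the raw wing and tail measurements. This is a minimal sketch; the helper name `pearson_r` is an illustrative choice. Note that the value 0.866 in the solution comes from intermediate quantities rounded to two decimals (1.32, 1.72, 1.35); carrying full precision gives a slightly larger value, about 0.870.

```python
import math

# Wing length (X) and tail length (Y) of the 12 sparrows in Q.1.
wing = [10.4, 10.8, 11.1, 10.2, 10.3, 10.2, 10.7, 10.5, 10.8, 11.2, 10.6, 11.4]
tail = [7.4, 7.6, 7.9, 7.2, 7.4, 7.1, 7.4, 7.2, 7.8, 7.7, 7.8, 8.3]


def pearson_r(x, y):
    """Simple correlation coefficient r = S_XY / sqrt(S_XX * S_YY)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(wing, tail)
print(f"r = {r:.3f}")
```

Either value indicates a strong positive linear association between wing length and tail length.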

PARTIAL CORRELATION

Q. Suppose that

X1 = Fish length

X2 = Fish weight

X3 = Fish age

and $r_{12} = 0.60$, $r_{13} = 0.70$, $r_{23} = 0.65$, n = 15.

(a) Find the partial correlation coefficient between X1 and X2 while the effect of X3 is kept constant.

$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} = \frac{0.60 - (0.70)(0.65)}{\sqrt{(1 - 0.70^2)(1 - 0.65^2)}} = 0.27$$

MULTIPLE CORRELATION

Q. Suppose that

X1 = Fish length

X2 = Fish weight

X3 = Fish age

and $r_{12} = 0.60$, $r_{13} = 0.70$, $r_{23} = 0.65$, n = 15.

(a) Find the multiple correlation coefficient between X1 and the joint effect of X2 and X3.

$$R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{(0.60)^2 + (0.70)^2 - 2(0.60)(0.70)(0.65)}{1 - (0.65)^2}} = 0.73$$
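Both fish-data results can be checked numerically from the three pairwise correlations. The sketch below simply evaluates the partial and multiple correlation formulas used in the two questions; the function names `partial_corr` and `multiple_corr` are illustrative and do not come from the notes.

```python
import math


def partial_corr(r12, r13, r23):
    """Partial correlation of X1 and X2 with X3 held constant (r_12.3)."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def multiple_corr(r12, r13, r23):
    """Multiple correlation of X1 with X2 and X3 jointly (R_1.23)."""
    numerator = r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23
    return math.sqrt(numerator / (1 - r23 ** 2))


# Pairwise correlations among fish length (X1), weight (X2) and age (X3).
r12, r13, r23 = 0.60, 0.70, 0.65

r12_3 = partial_corr(r12, r13, r23)
R1_23 = multiple_corr(r12, r13, r23)

print(f"r_12.3 = {r12_3:.2f}")
print(f"R_1.23 = {R1_23:.2f}")
```

The partial correlation (0.27) is much smaller than the simple correlation $r_{12} = 0.60$, which suggests that a large part of the association between length and weight is shared with age.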