
Simple Linear Regression Analysis

I. Correlation Analysis
Goal: measure the strength and direction of a linear association between two variables.

Basic concepts

Scatter diagram - a plot of the individual pairs of observations on a two-dimensional graph; used to visualize the possible underlying linear relationship.

Example: Consider the following hypothetical data (n = 20 pairs):

    x    y
    1    3.5
    2    4.5
    2    4
    2    3.5
    3    7
    3    6.5
    3    8
    3    6
    4    7.9
    4    7
    5    9.4
    6    9.3
    6    11
    6    10.5
    7    12.4
    7    11.5
    7    10
    8    15
    8    11
    8    13.7

The scatter plot is as follows:


[Scatter plot of the 20 (x, y) pairs: x-axis from 0 to 10, y-axis from 0 to 16, showing an upward linear trend.]

Linear correlation coefficient (ρ) - a measure of the strength of the linear relationship existing between two variables, say X and Y, that is independent of their respective scales of measurement.

Some characteristics of ρ:
- It can only assume values between -1 and 1.
- The sign of ρ describes the direction of the linear relationship between X and Y: if ρ is positive, the line slopes upward to the right, i.e., as X increases, the value of Y also increases; if ρ is negative, the line slopes downward to the right, and so as X increases, the value of Y decreases.
- If ρ = 0, then there is NO LINEAR RELATIONSHIP between X and Y.
- If ρ is -1 or 1, there is a perfect linear relationship between X and Y and all the points (x, y) fall on a straight line.
- A ρ that is close to 1 or -1 indicates a strong linear relationship. A strong linear relationship does not necessarily imply that X causes Y or Y causes X; it is possible that a third variable caused the change in both X and Y, producing the observed relationship.

The Pearson product moment correlation coefficient between X and Y, denoted by r, is defined as:

    r = [n·Σxy − (Σx)(Σy)] / sqrt{ [n·Σx² − (Σx)²]·[n·Σy² − (Σy)²] }

where the sums run over i = 1, ..., n.

Example: Compute r for the hypothetical data given above.

Solution:

    x      y       xy      x²     y²
    1      3.5     3.5     1      12.25
    2      4.5     9       4      20.25
    2      4       8       4      16
    2      3.5     7       4      12.25
    3      7       21      9      49
    3      6.5     19.5    9      42.25
    3      8       24      9      64
    3      6       18      9      36
    4      7.9     31.6    16     62.41
    4      7       28      16     49
    5      9.4     47      25     88.36
    6      9.3     55.8    36     86.49
    6      11      66      36     121
    6      10.5    63      36     110.25
    7      12.4    86.8    49     153.76
    7      11.5    80.5    49     132.25
    7      10      70      49     100
    8      15      120     64     225
    8      11      88      64     121
    8      13.7    109.6   64     187.69
    -------------------------------------
    Sum: 95  171.7  956.3   553    1689.21

We obtain the following values:

    n = 20
    Σx = 95         Σy = 171.7
    Σxy = 956.3
    Σx² = 553       Σy² = 1689.21

Substituting these values into the formula, we have:

    r = [20(956.3) − 95(171.7)] / sqrt{ [20(553) − 95²]·[20(1689.21) − 171.7²] } = 0.9511
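The computation of r can be reproduced in a few lines of Python (a sketch; the variable names are ours):

```python
# Pearson's r for the hypothetical data, via the sums formula.
x = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8]
y = [3.5, 4.5, 4, 3.5, 7, 6.5, 8, 6, 7.9, 7,
     9.4, 9.3, 11, 10.5, 12.4, 11.5, 10, 15, 11, 13.7]
n = len(x)
Sx, Sy = sum(x), sum(y)
Sxy = sum(a * b for a, b in zip(x, y))
Sxx = sum(a * a for a in x)
Syy = sum(b * b for b in y)
r = (n * Sxy - Sx * Sy) / ((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2)) ** 0.5
print(round(r, 4))  # 0.9511
```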

Tests of Hypotheses concerning ρ

    Null hypothesis:  Ho: ρ = ρo

    Test statistic:   t = (r − ρo)·√(n − 2) / √(1 − r²)

    Alternative Hypothesis Ha    Critical Region (i.e., Reject Ho if)
    ρ < ρo                       t < −t_α(n − 2)
    ρ > ρo                       t > t_α(n − 2)
    ρ ≠ ρo                       |t| > t_{α/2}(n − 2)

Example: Consider the hypothetical data given above. Suppose that the linear correlation coefficient between X and Y in the past was 0.9. Determine whether the correlation has significantly increased compared to the past.

a. Ho: ρ = 0.90 vs Ha: ρ > 0.90
b. α = 0.05
c. Test statistic: t = (r − ρo)·√(n − 2) / √(1 − r²)
d. Computed value:

       t = (0.9511 − 0.9)·√18 / √(1 − 0.9511²) = 0.7019

e. Decision rule: Reject Ho if t > t_α(n − 2) = t_0.05(18) = 1.734.
f. Since t = 0.7019 is not greater than t_0.05(18) = 1.734, we do not reject Ho. At the 0.05 level of significance, there is insufficient evidence to conclude that the correlation coefficient between X and Y has increased from 0.9.

NOTE: Even if two variables are highly correlated, this is not sufficient proof of causation. One variable may cause the other or vice versa, a third factor may be involved, or a rare event may have occurred.
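The test above can be sketched numerically (the critical value 1.734 is taken from a t table, as in the text):

```python
import math

# t statistic for Ho: rho = 0.90 vs Ha: rho > 0.90, using the text's formula.
r, rho0, n = 0.9511, 0.90, 20
t = (r - rho0) * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 4))  # 0.7019
print(t > 1.734)    # False: do not reject Ho at alpha = 0.05
```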

II. Simple Linear Regression Analysis

Goal: To evaluate the relative impact of a predictor on a particular outcome.

The simple linear regression model is given by the equation

    Yi = β0 + β1·Xi + εi

where

    Yi - the value of the response variable for the ith element
    Xi - the value of the explanatory variable for the ith element
    β0 - regression coefficient that gives the Y-intercept of the regression line
    β1 - regression coefficient that gives the slope of the line
    εi - random error for the ith element, where the εi are independent, normally distributed with mean 0 and variance σ² for i = 1, 2, ..., n
    n  - number of elements

Remark: The model tells us that two or more observations having the same value of X will not necessarily have the same value of Y. However, the different values of Y for a given value of X, say xi, are generated by a normal distribution whose mean is β0 + β1·xi, that is, μ_Y|xi = β0 + β1·xi. This is known as the regression equation, where the parameters β0 and β1 are interpreted as follows:
- β0 is the value of the mean of Y when X = 0.
- β1 is the amount of change in the mean of Y for every unit increase in the value of X.

The random error εi:
- May be thought of as representing the effect of factors other than X that are not explicitly stated in the model but do affect the response variable to some extent.
- Sources of random error:
  o Other variables not explicitly included in the model
  o Inherent and inevitable variation present in the response variable
  o Measurement errors
- Satisfies the following assumptions:
  o The error terms are independent of one another;
  o The error terms are normally distributed;
  o The error terms all have a mean of 0; and
  o The error terms have constant variance, σ².
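The remark above can be illustrated by simulation: repeated observations at the same x give different y values, centered at β0 + β1·x. This is a sketch with illustrative parameter values of our own choosing, not estimates from the data:

```python
import random

random.seed(42)
beta0, beta1, sigma = 2.0, 1.4, 1.0   # illustrative values only

def draw_y(x):
    # One observation from the model Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2).
    return beta0 + beta1 * x + random.gauss(0, sigma)

# Three observations at x = 5: same x, different y's.
print([round(draw_y(5), 2) for _ in range(3)])

# Their long-run average approaches the regression mean beta0 + beta1*5 = 9.0.
mean_y = sum(draw_y(5) for _ in range(100_000)) / 100_000
print(round(mean_y, 1))  # approximately 9.0
```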

Typical steps in doing a simple linear regression analysis:
1. Obtain the equation that best fits the data.
2. Evaluate the equation to determine the strength of the relationship for prediction and estimation.
3. Determine if the assumptions on the error terms are satisfied.
4. If the model fits the data adequately, use the equation for prediction and for describing the nature of the relationship between the variables.

Obtaining the equation: Method of Least Squares

The best-fitting line is selected as the one that minimizes the sum of squared deviations of the observed values of Y from their expected values. That is, we want to estimate β0 and β1 so that

    Σ(yi − ŷi)²  is smallest, where  ŷi = b0 + b1·xi.

Based on this criterion, the following formulas for b0, the estimate of β0, and b1, the estimate of β1, are obtained:

    b1 = [n·Σxy − (Σx)(Σy)] / [n·Σx² − (Σx)²]

    b0 = ȳ − b1·x̄

Thus, the estimated regression equation is given by

    ŷ = b0 + b1·x

Remarks:
- The estimated regression equation is appropriate only for the relevant range of X, i.e., for the values of X used in developing the regression model.
- If X = 0 is not included in the range of the sample data, then b0 will not have a meaningful interpretation.

Example: Consider the hypothetical data given above, where we fit a linear model of the form Yi = β0 + β1·Xi + εi. Using the method of least squares, the following values are needed to estimate β0 and β1:

    n = 20
    Σx = 95         Σy = 171.7
    Σxy = 956.3     Σx² = 553

We get the values of b0 and b1 as:

    b1 = [20(956.3) − 95(171.7)] / [20(553) − 95²] = 1.383

    b0 = ȳ − b1·x̄ = 8.585 − 1.383(4.75) = 2.016

Hence, the prediction equation is given by:

    ŷ = 2.016 + 1.383x

Interpretation: For every 1-unit increase in X, the mean of Y is estimated to increase by 1.383. Note that b0 = 2.016 has no meaningful interpretation since X = 0 is not within the range of values used in the estimation.
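The least-squares estimates can be checked directly in Python (a sketch; the variable names are ours):

```python
# Least-squares slope and intercept for the hypothetical data.
x = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8]
y = [3.5, 4.5, 4, 3.5, 7, 6.5, 8, 6, 7.9, 7,
     9.4, 9.3, 11, 10.5, 12.4, 11.5, 10, 15, 11, 13.7]
n = len(x)
Sx, Sy = sum(x), sum(y)
Sxy = sum(a * b for a, b in zip(x, y))
Sxx = sum(a * a for a in x)

b1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)   # slope
b0 = Sy / n - b1 * Sx / n                        # intercept: y-bar minus b1 * x-bar
print(round(b1, 3), round(b0, 3))  # 1.383 2.016
```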

Mean Square Error

The common variance of the error terms (the variance of Y about the regression line), denoted by σ², is estimated by:

    s² = MSE = SSE / (n − 2) = Σ(yi − ŷi)² / (n − 2)

where SSE stands for the sum of squares due to error and MSE stands for the mean square error. The MSE is the variance of the data, Y, about the estimated regression line, ŷ.
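A sketch of the MSE computation for the fitted line above (the numerical result is our computation, not taken from the text):

```python
# MSE for the fitted line yhat = b0 + b1*x on the hypothetical data.
x = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8]
y = [3.5, 4.5, 4, 3.5, 7, 6.5, 8, 6, 7.9, 7,
     9.4, 9.3, 11, 10.5, 12.4, 11.5, 10, 15, 11, 13.7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # sum of squared errors
mse = sse / (n - 2)                                            # estimate of sigma^2
print(round(mse, 3))  # about 1.141 for this data (our computation)
```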

Determining the strength of the relationship between X and Y

A (1 − α)100% confidence interval for β1 is

    ( b1 − t_{α/2}(n − 2)·s_b1 ,  b1 + t_{α/2}(n − 2)·s_b1 )

where s_b1, the estimated standard error of b1, is

    s_b1 = s / sqrt( Σx² − (Σx)²/n )

A (1 − α)100% confidence interval for β0 is

    ( b0 − t_{α/2}(n − 2)·s_b0 ,  b0 + t_{α/2}(n − 2)·s_b0 )

where

    s_b0 = s · sqrt( Σx² / [ n·(Σx² − (Σx)²/n) ] )

and s = √MSE.
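A sketch of the 95% confidence interval for β1 on this data (the critical value t_{0.025}(18) = 2.101 is from a t table; the interval endpoints are our computation):

```python
import math

x = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8]
y = [3.5, 4.5, 4, 3.5, 7, 6.5, 8, 6, 7.9, 7,
     9.4, 9.3, 11, 10.5, 12.4, 11.5, 10, 15, 11, 13.7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

sb1 = math.sqrt(mse / Sxx)   # standard error of the slope
t_crit = 2.101               # t_{0.025}(18), from a t table
lo, hi = b1 - t_crit * sb1, b1 + t_crit * sb1
print(round(lo, 3), round(hi, 3))  # roughly (1.161, 1.606), our computation
```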

Test of Hypothesis concerning β1

    Null hypothesis:  Ho: β1 = 0

    Test statistic:   t = b1 / s_b1

    Alternative Hypothesis Ha    Critical Region (i.e., Reject Ho if)
    β1 < 0                       t < −t_α(n − 2)
    β1 > 0                       t > t_α(n − 2)
    β1 ≠ 0                       |t| > t_{α/2}(n − 2)
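The slope test can be sketched for this data (critical value t_{0.05}(18) = 1.734, as earlier; the computed t is ours):

```python
import math

x = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8]
y = [3.5, 4.5, 4, 3.5, 7, 6.5, 8, 6, 7.9, 7,
     9.4, 9.3, 11, 10.5, 12.4, 11.5, 10, 15, 11, 13.7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
sb1 = math.sqrt(mse / Sxx)

t = b1 / sb1               # test statistic for Ho: beta1 = 0 vs Ha: beta1 > 0
print(round(t, 2))         # about 13.06 (our computation), well above 1.734
print(t > 1.734)           # True: reject Ho; the slope is significantly positive
```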

Coefficient of Determination (R²)

The proportion of the variability in the observed values of the response variable that can be explained by the explanatory variable through their linear relationship. The realized value of the coefficient of determination will be between 0 and 1; in simple linear regression it equals r², the square of the Pearson correlation coefficient. If a model has perfect predictability, then R² = 1; if a model has no predictive capability, then R² = 0.

Interpretation: R² × 100% of the variability in the response variable, Y, can be explained by the explanatory variable, X, through the simple linear regression model.
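R² can be computed either from the sums of squares or as r²; in simple linear regression the two coincide. A sketch for this data (the numerical value is our computation):

```python
import math

x = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8]
y = [3.5, 4.5, 4, 3.5, 7, 6.5, 8, 6, 7.9, 7,
     9.4, 9.3, 11, 10.5, 12.4, 11.5, 10, 15, 11, 13.7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((a - xbar) ** 2 for a in x)
Syy = sum((b - ybar) ** 2 for b in y)
Sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r2 = 1 - sse / Syy                 # proportion of variability explained
r = Sxy / math.sqrt(Sxx * Syy)     # Pearson correlation
print(round(r2, 4))                # 0.9046
print(abs(r2 - r ** 2) < 1e-9)     # True: R^2 = r^2 in simple linear regression
```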

Residual (di)

The difference between the observed and predicted values of the response variable, that is, di = yi − ŷi. If the variances of the error terms are indeed constant, then the plot of the residuals versus X should tend to form a horizontal band, i.e., the spread of the residuals should not increase or decrease with the values of the independent variable.
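Residuals are easy to compute directly; least-squares residuals also always sum to (essentially) zero, which is a useful sanity check. A sketch:

```python
# Residuals d_i = y_i - yhat_i for the fitted line on the hypothetical data.
x = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8]
y = [3.5, 4.5, 4, 3.5, 7, 6.5, 8, 6, 7.9, 7,
     9.4, 9.3, 11, 10.5, 12.4, 11.5, 10, 15, 11, 13.7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

d = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(di, 2) for di in d[:5]])   # first few residuals
print(abs(sum(d)) < 1e-9)               # True: residuals sum to zero
```

Plotting d against x (with any plotting tool) should then show the horizontal band described above if the constant-variance assumption holds.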