
LINEAR REGRESSION

K.F. Turkman

Contents

1 Introduction
2 Straight line
   2.1 Examining the regression equation
   2.2 Some distributional theory
   2.3 Confidence intervals and tests of hypotheses regarding $(\beta_0, \beta_1)$
   2.4 Predicted future value of y
   2.5 Straight line regression in matrix terms
3 Generalization to multivariate (multiple) regression
   3.1 Precision of the regression equation
   3.2 Is our model correct?
   3.3 $R^2$ when there are repeated observations
   3.4 Correlation coefficients
   3.5 Partial correlation coefficient
   3.6 Use of qualitative variables in the regression equation
4 Selecting the best regression equation
   4.1 Extra sums of squares and partial F-tests
   4.2 Methods of selecting the best regression
5 Examination of residuals
   5.1 Testing the independence of residuals
   5.2 Checking for normality
   5.3 Plots of residuals
   5.4 Outliers and tests for influential observations
6 Further Topics
   6.1 Transformations
   6.2 Unequal variances
   6.3 Ill-conditioned regression, collinearity and ridge regression
   6.4 Generalized linear models (GLIM)
   6.5 Nonlinear models

 


Approximate duration of the course: 12 hours

References:

1. A. Sen and M. Srivastava (1990). Regression Analysis. Springer-Verlag.

2. N. Draper and H. Smith (1998). Applied Regression Analysis. Wiley.

3. V. K. Rohatgi (1976). An Introduction to Probability Theory and Mathematical Statistics. J. Wiley and Sons.

Recommended software: STATISTICA


1 Introduction

A common question in experimental science is to examine how some sets of variables affect others. Some relations are deterministic and easy to interpret; others are too complicated to grasp or to describe in simple terms, possibly having a random component. In these cases, we approximate the actual relationships by simple functions or random processes using relatively simple empirical methods. Among all the methods available for approximating such complex relationships, linear regression is possibly the most widely used. A common feature of this methodology is to assume a functional, parametric relationship between the variables in question, typically linear in unknown parameters which are to be estimated from the available data. Two sets of variables can be distinguished at this stage: predictor variables and response variables. Predictor variables are those that can either be set to a desired value (controlled) or else take values that can be observed without error. Our objective is to find out how changes in the predictor variables affect the values of the response variables. Other names frequently attached to these variables in different books by different authors are the following:

Predictor variables = Input variables = X-variables = Regressors = Independent variables

Response variable = Output variable = Y-variable = Dependent variable

We shall be concerned with relationships of the form

Response variable = Linear model function in terms of input variables + random error.

In the simplest case, when we have data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$, the linear function of the form

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n, \tag{1}$$

can be used to relate $y$ to $x$. We will also write this model in generic terms as

$$y = \beta_0 + \beta_1 x + \epsilon.$$

Here, $\epsilon$ is a random quantity measuring how far any individual $y$ may fall off the regression line; equivalently, it measures the variation in $y$ not explained by $x$. We also assume that the input variable $x$ is either controlled or measured without error, and thus is not a random variable. (As long as any measurement error in $x$ is smaller than the measurement error in $y$, this assumption is fairly robust.) If the relation between $y$ and $x$ is more complex than the linear relationship given in (1), then models of the form

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \epsilon \tag{2}$$

can be used. Note that we say the model is linear in the sense that it is linear in the parameters. For example,

$$y = \beta_0 x_1 + \beta_1 x_2^2 + \beta_3 x_1 x_2 + \epsilon$$

is linear, whereas

$$y = \alpha_0 + \alpha_1 x_1^{\beta} x_2^{\gamma} + \epsilon$$

is not.
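To make the "linear in the parameters" point concrete, here is a minimal sketch (an addition to these notes, using Python rather than the recommended STATISTICA, with simulated data and arbitrarily chosen coefficients): because each term is a known function of the x's, the model can be fitted by ordinary least squares on a constructed design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
# Simulate from y = 2*x1 + 0.5*x2^2 - 1.0*x1*x2 + noise (illustrative coefficients).
y = 2.0 * x1 + 0.5 * x2**2 - 1.0 * x1 * x2 + rng.normal(0, 1, n)

# Linear in the parameters: treat x1, x2^2 and x1*x2 as columns of a design matrix
# and solve an ordinary linear least squares problem.
X = np.column_stack([x1, x2**2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # close to [2.0, 0.5, -1.0]
```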

2 Straight line

Suppose that we observe the data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$ and we think that the model (1),

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \ldots, n, \tag{3}$$

is the right model. Here $\beta_0, \beta_1$ are fixed but unknown model parameters to be estimated from data. One way of obtaining estimators $b_0, b_1$ of $\beta_0, \beta_1$ is by minimizing

$$S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{4}$$

in terms of $\beta_0, \beta_1$, where $(b_0, b_1)$ is the value of $(\beta_0, \beta_1)$ corresponding to the minimal value of $S$. Here $S$ is called the sum of squares of errors, and this


method is called the least squares method. Under certain general conditions, the estimators $(b_0, b_1)$ obtained this way also turn out to be the minimum variance unbiased estimators as well as the maximum likelihood estimators. We can determine $(b_0, b_1)$ by differentiating $S$ with respect to $(\beta_0, \beta_1)$, setting the derivatives to 0 and solving for $(\beta_0, \beta_1)$, giving

$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0,$$
$$\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0, \tag{5}$$

resulting in

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \tag{6}$$

$$b_0 = \bar{y} - b_1 \bar{x}. \tag{7}$$

Here $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample average, and the equations in (5) are called the normal equations. We will call $\hat{y}_i = b_0 + b_1 x_i$ the fitted value of $y$ at $x = x_i$ and $\hat{\epsilon}_i = y_i - \hat{y}_i$ the residual for the $i$th observation. Note that

$$\hat{y}_i = \bar{y} + b_1 (x_i - \bar{x}),$$

so that

$$\sum_{i=1}^{n} \hat{\epsilon}_i = \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0.$$
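As a quick numerical illustration (an addition to these notes, using Python/NumPy rather than the recommended STATISTICA, on simulated data), the estimates (6) and (7) and the zero-sum property of the residuals can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, n)
y = 1.5 + 2.0 * x + rng.normal(0, 1, n)      # simulated data: true beta0 = 1.5, beta1 = 2.0

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - ybar)) / Sxx   # equation (6)
b0 = ybar - b1 * xbar                        # equation (7)

y_hat = b0 + b1 * x    # fitted values
resid = y - y_hat      # residuals
print(b0, b1)          # close to the true parameters
print(resid.sum())     # essentially zero, as shown above
```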

2.1 Examining the regression equation

So far, we have made no assumption involving the probability structure of $\epsilon$, and hence of $y$. We now give the basic assumptions regarding the model (1). In model (1), we assume:

1. $\epsilon_i$, $i = 1, \ldots, n$, are identically distributed, uncorrelated random variables with mean $E(\epsilon_i) = 0$ and variance $V(\epsilon_i) = \sigma^2$, so that (assuming the $x$ variables are measured without error or controlled) the $y_i$ are random variables with $E(y_i) = \beta_0 + \beta_1 x_i$ and $V(y_i) = \sigma^2$. (In fact, the correct notation for $E(y_i)$ is $E(Y \mid X = x_i)$.) Note that if both the independent variable $X$ and the response variable $Y$ are random variables, then the random variable $f(X)$ that minimizes
$$E[(Y - f(X))^2 \mid X]$$
is given by
$$E(Y \mid X).$$
If further $(X, Y)$ have a joint normal distribution, then $E(Y \mid X = x)$ is a linear function of the form
$$E(Y \mid X = x) = \beta_0 + \beta_1 x.$$
Hence, $\hat{y}_i = b_0 + b_1 x_i = \hat{E}(Y \mid X = x_i)$ can and should be seen as the estimator of this conditional mean.

2. $\epsilon_i$ and $\epsilon_j$, for all $i \neq j$, are uncorrelated; hence $y_i, y_j$ are also uncorrelated.

3. $\epsilon_i \sim N(0, \sigma^2)$, and, being uncorrelated, they are therefore independent. Hence
$$Y \mid X = x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2),$$
and the $y_i$ are independent but not identically distributed random variables. (With some abuse of notation, let $y_i = (Y \mid X = x_i) \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.) Note that in the regression model we specify only the conditional distribution of $Y$ given $X = x_i$; full inference on $(Y, X)$ would require the specification of the joint distribution of $(Y, X)$.

2.2 Some distributional theory

While examining the regression equation, we will need to test various hypotheses which, in general, will depend on the distributional properties of sums of squares of independent normal variables and their ratios. In this section, we give a brief summary of the distributional results for these quadratic forms.

Normal density: A random variable $X$ has a normal distribution with mean $\mu$ and variance $\sigma^2$ if it has the density
$$f(x) = \frac{1}{\sigma (2\pi)^{1/2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right],$$
for $-\infty < x < \infty$. $Z = (X - \mu)/\sigma$ transforms $X$ to a standard normal variate with mean 0 and variance 1.

(Central) t-distribution: $X$ has a t-distribution with $v$ degrees of freedom (denoted $t(v)$) if it has the density
$$f_v(t) = \frac{\Gamma((v+1)/2)}{(v\pi)^{1/2}\,\Gamma(v/2)} \left(1 + \frac{t^2}{v}\right)^{-(v+1)/2},$$
for $-\infty < t < \infty$. Here, $\Gamma(q) = \int_0^{\infty} e^{-x} x^{q-1}\,dx$ is the gamma function. In general, the t-distribution looks like a normal distribution with heavier tails. As $v \to \infty$, the t-distribution tends to the normal distribution, and in fact $t(\infty) = N(0, 1)$. For all practical purposes, when $v > 30$ they are equal.

(Central) F-distribution: $X$ has an F-distribution with $m$ and $n$ degrees of freedom ($F_{m,n}$) if it has the density
$$f_{m,n}(x) = \frac{\Gamma((m+n)/2)\,(m/n)^{m/2}}{\Gamma(m/2)\,\Gamma(n/2)}\; \frac{x^{m/2-1}}{(1 + mx/n)^{(m+n)/2}},$$
for $x \geq 0$. If $X$ has an $F_{1,n}$ distribution, then it is equivalent to the square of a random variable with a $t(n)$ distribution, that is, $F_{1,n} = t_n^2$. Another characteristic is that $F_{m,n} = 1/F_{n,m}$.

$\chi^2$-distribution: $X$ is said to have a $\chi^2$ distribution with $n$ degrees of freedom ($\chi^2(n)$) if it has the density function
$$f(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}}\, e^{-x/2}\, x^{n/2-1},$$
for $0 < x < \infty$.

How do these distributions appear in regression analysis? As we will see, most of the tests of hypotheses, as well as the estimators of model parameters, will depend on sums of squares of independent, normally distributed random variables and their ratios. These sums usually have a $\chi^2$ distribution, whereas the ratio of independent random variables with $\chi^2$ distributions has an F distribution. Here is a summary of the distributional results we need. For details, see Sen and Srivastava (1990) or any good statistics book such as Rohatgi (1976).

1. If $X_1, \ldots, X_n$ are independent, normally distributed random variables with means $(\mu_1, \mu_2, \ldots, \mu_n)$ and common variance $\sigma^2$, then $Z^T A Z / \sigma^2$, where $Z^T = (X_1 - \mu_1, \ldots, X_n - \mu_n)$ and $A$ is any symmetric idempotent matrix with $r = \mathrm{tr}(A)$, has a (central) $\chi^2$ distribution with $r$ degrees of freedom. (The sum of the diagonal elements of a symmetric matrix $A$ is called the trace of the matrix and is denoted by $\mathrm{tr}(A)$.)

2. In particular,
$$\sum_{i=1}^{n} \frac{(X_i - \mu_i)^2}{\sigma^2}$$
has a $\chi^2(n)$ distribution, whereas, if $\mu_i = \mu$ is constant and is estimated by $\bar{X}$, then
$$\sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2}$$
has a $\chi^2(n-1)$ distribution. (This is due to the loss of one degree of freedom in estimating the mean $\mu$ by $\bar{X}$.)

3. The ratio of two independent $\chi^2$ random variables, each divided by its respective degrees of freedom, has an F distribution. That is, if $X \sim \chi^2(m)$, $Y \sim \chi^2(n)$ and $X, Y$ are independent, then
$$F_{m,n} = \frac{X/m}{Y/n}$$
has an F distribution with $m, n$ degrees of freedom.

4. If $X \sim N(\mu, \sigma^2)$ and $Y \sim \chi^2(n)$, and if further $X$ and $Y$ are independent, then
$$t = \frac{(X - \mu)/\sigma}{\sqrt{Y/n}}$$
has a t distribution with $n$ degrees of freedom. Thus we immediately see that
$$t^2 = \frac{(X - \mu)^2/\sigma^2}{Y/n}$$
has an F distribution with $1, n$ degrees of freedom. This distribution appears when we want to look at the distribution of a quantity of the form $(X - \mu)$ when $X$ has a normal distribution but $\sigma$ is not known and is substituted by the empirical standard deviation.

This is all the distributional theory we need to deal with inference on the regression equation. Most of the effort in proving results on the distributional properties of estimators and tests of hypotheses falls on showing the independence of various quadratic forms.
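As an optional numerical check (added here, not part of the original notes, and assuming SciPy is available), the relation $F_{1,n} = t_n^2$ stated above can be verified by comparing quantiles:

```python
from scipy.stats import t, f

n = 12          # degrees of freedom, chosen arbitrarily for the check
alpha = 0.05

# P(|T| <= c) = 1 - alpha is equivalent to P(T^2 <= c^2) = 1 - alpha, so the
# (1 - alpha) quantile of F(1, n) equals the squared (1 - alpha/2) quantile of t(n).
t_quantile = t.ppf(1 - alpha / 2, n)
f_quantile = f.ppf(1 - alpha, 1, n)
print(t_quantile ** 2, f_quantile)   # the two values agree
```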


2.3 Confidence intervals and tests of hypotheses regarding $(\beta_0, \beta_1)$

A simple calculation shows that
$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\, y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \tag{8}$$
hence
$$V(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}. \tag{9}$$

In general, $\sigma^2$ is not known (this is usually the case), and then a suitable estimator can replace it. If the assumed model (1) is correct, then it is known that, under the normality assumption on the residuals,
$$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
is the minimum variance unbiased estimator of $\sigma^2$. Note that the maximum likelihood estimator of $\sigma^2$ is $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, but it is biased. Hence, under the assumption that the $\epsilon_i$ are normal, we can construct the usual $100(1-\alpha)\%$ confidence interval for $\beta_1$:
$$b_1 \pm t(n-2, 1 - \tfrac{1}{2}\alpha)\, \frac{s}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{1/2}}.$$

Here, $t(n-2, 1 - \tfrac{1}{2}\alpha)$ is the $1 - \tfrac{1}{2}\alpha$ percentage point of a t-distribution with $n-2$ degrees of freedom. (It is left to the reader to verify that
$$\frac{b_1 - \beta_1}{\mathrm{s.e.}(b_1)}$$
has a t-distribution with $n-2$ degrees of freedom. Here s.e. stands for the standard error.) The test of the hypotheses

$$H_0: \beta_1 = \beta_1^* \qquad \text{vs.} \qquad H_1: \beta_1 \neq \beta_1^*$$
can be performed by calculating the test statistic
$$t = \frac{(b_1 - \beta_1^*)}{s} \left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{1/2},$$
and comparing |t| with table value t(n 2, 1 1/2α). Standard error of b 0 can similarly be calculated:

s.e(b 0 ) =

n

x) 2 1/2

2

i=1 x i

n n

i=1 (x i

σ,

hence (1 α)100% confidence interval for β 0 is given by

and the test

b 0 ± t(n 2, 1 1/2α)

n

x) 2 1/2

2

i=1 x i

n n

i=1 (x i

0

H 0 : β 0 = β

,

v.s.

H 1 : β 0 = β

0

s,

can be performed by comparing the absolute value of

t = (b 0 β

0

)

s

with t(n 2, 1 1/2α).

n

x) 2 1/2

2

i=1 x i

n n

i=1 (x i
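The interval and tests above translate directly into a few lines of code. The following sketch (an added illustration on simulated data, using Python/SciPy rather than the course software) computes the $100(1-\alpha)\%$ confidence interval for $\beta_1$ and the t statistic for $H_0: \beta_1 = 0$:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(2)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 0.8 * x + rng.normal(0, 1, n)     # simulated data

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar

resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)           # minimum variance unbiased estimator of sigma^2
se_b1 = np.sqrt(s2 / Sxx)                   # standard error of b1

alpha = 0.05
tcrit = t_dist.ppf(1 - alpha / 2, n - 2)    # t(n-2, 1 - alpha/2)
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)   # confidence interval for beta1

t_stat = b1 / se_b1                         # test of H0: beta1 = 0 (hypothesized value 0)
print(ci, abs(t_stat) > tcrit)              # reject H0 if |t| exceeds the table value
```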

2.4 Predicted future value of y

Suppose that we want to predict the future value $y_k$ of the response variable $y$ at the observed value $x_k$. The expected predicted value is $\hat{y}_k = \hat{E}(Y \mid X = x_k) = b_0 + b_1 x_k$, that is, the estimator of the mean of $Y$ conditional on $X = x_k$. Substituting the expressions (6) and (7) for $b_1$ and $b_0$ and simplifying, we get $\hat{y}_k = \bar{y} + b_1(x_k - \bar{x})$. One can easily check that $b_1$ and $\bar{y}$ are uncorrelated, so that $\mathrm{Cov}(b_1, \bar{y}) = 0$, and hence

$$V(\hat{y}_k) = V(\hat{E}(Y \mid X = x_k)) = V(\bar{y}) + (x_k - \bar{x})^2 V(b_1) = \frac{\sigma^2}{n} + \frac{(x_k - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\, \sigma^2. \tag{10}$$

Now, a future observation $y_k$ varies around its mean $E(Y \mid X = x_k)$ with variance $\sigma^2$, hence

$$V(y_k - \hat{y}_k) = \sigma^2 \left(1 + \frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right). \tag{11}$$

Hence, a $100(1-\alpha)\%$ confidence interval for the future observation $y_k$ is given by

$$\hat{y}_k \pm t(n-2, 1 - \tfrac{1}{2}\alpha)\, s \left[1 + \frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]^{1/2}. \tag{12}$$
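Continuing the same kind of added illustration (simulated data, Python), the prediction interval (12) for a future observation at a chosen $x_k$ can be computed as follows:

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(3)
n = 40
x = rng.uniform(0, 10, n)
y = 2.0 + 1.2 * x + rng.normal(0, 1.5, n)   # simulated data

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar
s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

x_k = 7.5                                   # hypothetical point at which to predict a future y
y_hat_k = b0 + b1 * x_k
# Half-width of (12): combines the new-observation variance, the 1/n term and the leverage term.
half = t_dist.ppf(0.975, n - 2) * s * np.sqrt(1 + 1 / n + (x_k - xbar) ** 2 / Sxx)
print(y_hat_k - half, y_hat_k + half)       # 95% interval for the future observation
```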

2.5 Straight line regression in matrix terms

Suppose we have the observations $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$. Let

$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \qquad
b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}, \qquad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

Then, the model (1) can be written as
$$Y = X\beta + \epsilon.$$
Note that
$$\epsilon^T \epsilon = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2, \tag{13}$$
hence the least squares estimates are obtained by minimizing $\epsilon^T \epsilon$, and the normal equations in (5) are given in matrix form by
$$X^T X b = X^T Y, \tag{14}$$
and hence
$$b = (X^T X)^{-1} X^T Y, \tag{15}$$
provided that the inverse of the matrix $X^T X$ exists. As we will often see, the matrix $X^T X$ and its inverse $(X^T X)^{-1}$ are the backbone of multiple regression analysis. Note that
$$V(b) = (X^T X)^{-1} \sigma^2, \tag{16}$$
hence
$$\mathrm{Cov}(b_0, b_1) = \mathrm{Cov}(\bar{y} - b_1 \bar{x},\, b_1) = \frac{-\bar{x}\,\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
Letting $a_k = (1, x_k)$, we can write $\hat{E}(Y \mid X = x_k) = b_0 + b_1 x_k = a_k b$, and
$$V(\hat{E}(Y \mid X = x_k)) = V(b_0) + 2 x_k \mathrm{Cov}(b_0, b_1) + x_k^2 V(b_1) = a_k V(b)\, a_k^T = a_k (X^T X)^{-1} a_k^T \sigma^2. \tag{17}$$
Here, $V(b)$ is the covariance matrix of $b$.
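The matrix formulas (15)-(17) translate almost literally into code. The sketch below (an added illustration on simulated data, not part of the original notes) computes $b$, the estimated covariance matrix of $b$, and the estimated variance of the fitted conditional mean at a point $x_k$; for numerical stability one would normally use a solver rather than an explicit inverse, but the inverse is kept here to mirror the formulas:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
x = rng.uniform(0, 5, n)
y = -1.0 + 3.0 * x + rng.normal(0, 1, n)    # simulated data

X = np.column_stack([np.ones(n), x])        # design matrix with a column of ones
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                       # equation (15): b = (X'X)^{-1} X'Y

resid = y - X @ b
s2 = resid @ resid / (n - 2)                # estimate of sigma^2
V_b = XtX_inv * s2                          # estimated V(b), equation (16)

a_k = np.array([1.0, 2.5])                  # a_k = (1, x_k) for a chosen x_k
var_fit = a_k @ V_b @ a_k                   # estimated variance in (17)
print(b, V_b, var_fit)
```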

3 Generalization to multivariate (multiple) regression

Suppose that we have $p$ independent variables $(x_1, x_2, \ldots, x_p)$ and we want to know the effect of these variables on the response variable $y$ through the linear relationship
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i, \tag{18}$$

for $i = 1, 2, \ldots, n$. We can write this model in matrix terms as
$$Y = X\beta + \epsilon, \tag{19}$$
where
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}, \qquad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \qquad
b = \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{pmatrix}, \qquad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}.$$

For this model, we again assume:

1. $E(\epsilon_j) = 0$ and $V(\epsilon_j) = \sigma^2$, for every $j$;

2. $\epsilon_i, \epsilon_j$ are uncorrelated for $i \neq j$;

3. the $\epsilon_i$ have a normal distribution, and hence $\epsilon \sim N(0, \Sigma)$, where $\Sigma = I\sigma^2$, $I$ being the identity matrix.

The generalization of the straight-line case gives similar results:
$$b = (X^T X)^{-1} X^T Y, \qquad V(b) = (X^T X)^{-1} \sigma^2,$$
and
$$V(\hat{E}(Y \mid X = x_k)) = a_k (X^T X)^{-1} a_k^T \sigma^2.$$
Hence, tests of hypotheses as well as confidence intervals on the individual parameters $\beta_0, \ldots, \beta_p$ and on a future observation $Y_k$ can easily be constructed, based on the individual standard errors of the $b_i$. (Students are strongly urged to construct these confidence intervals and tests of hypotheses, which are standard exercises in basic statistics.)
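As a sketch of how this looks in practice (an added illustration with simulated data and $p = 3$ predictors, in Python rather than the course software), the coefficient vector and the individual standard errors of the $b_i$ can be obtained as follows:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
Xdata = rng.normal(size=(n, p))                      # simulated predictors
beta_true = np.array([1.0, 2.0, -1.0, 0.5])          # (beta0, beta1, beta2, beta3)
X = np.column_stack([np.ones(n), Xdata])
y = X @ beta_true + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)            # b = (X'X)^{-1} X'Y
resid = y - X @ b
s2 = resid @ resid / (n - p - 1)                      # residual degrees of freedom: n - p - 1
V_b = np.linalg.inv(X.T @ X) * s2                     # estimated covariance matrix of b
se_b = np.sqrt(np.diag(V_b))                          # standard errors of b0, ..., bp
print(b)
print(se_b)
```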

3.1 Precision of the regression equation

So far, we have looked at the problem of inference on the individual parameters of the model (18),
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i, \tag{20}$$
but there are several more important questions to ask:

1. Is the model we use correct?

2. If it is the correct model, how significant is it, in the sense of how much the independent variables $(x_1, \ldots, x_p)$ contribute to explaining the variation in the response variable $y$?

3. How can we reach a more parsimonious model by excluding those independent variables which do not contribute significantly to explaining the variation in $y$?

We start by answering the second question. Let us assume that the model in (18) is the correct model. Then the second question can be formulated as testing the hypotheses
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \qquad \text{vs.} \qquad H_1: \text{not all are } 0. \tag{21}$$

If we do not reject the null hypothesis, then the model is not statistically different from the model
$$y = \beta_0 + \epsilon,$$
which means that, whatever the variation in the independent variables, $E(Y)$ remains constant, indicating that the independent variables do not contribute anything to explaining $y$. However, before going any further to see how such a test can be performed, we note that the test
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \qquad \text{vs.} \qquad H_1: \text{not all are } 0$$
is not equivalent to testing the $p$ separate hypotheses
$$H_0^{(i)}: \beta_i = 0 \qquad \text{vs.} \qquad H_1^{(i)}: \beta_i \neq 0.$$
Let $\alpha_i = \alpha$ be the type one error in testing the hypothesis $H_0^{(i)}$, that is,
$$\alpha = P(\text{rejecting } H_0^{(i)} \text{ when it is true}).$$
Suppose that we perform these $p$ tests independently. Then the probability of rejecting at least one of the $p$ hypotheses when all are true is $1 - (1 - \alpha)^p$, which is the type one error for the composite hypothesis (carried out independently). Note that this type of error increases as $p$ increases; for example, with $\alpha = 0.05$ and $p = 10$, the probability of at least one false rejection is $1 - 0.95^{10} \approx 0.40$. Hence, individual hypotheses cannot substitute for a composite hypothesis without increasing the type one error. Now let us see how we can perform the test of the composite hypothesis (21). We can write

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i), \tag{22}$$

and we can show that
$$\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = 0.$$

Hence
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \tag{23}$$
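The decomposition (23) holds for any least squares fit that includes an intercept, and it is easy to verify numerically; the following sketch (added illustration, simulated data) checks both the vanishing cross term and the identity itself:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept plus two predictors
y = X @ np.array([1.0, 0.5, -0.8]) + rng.normal(0, 1, n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)      # sum of squares due to the regression
sse = np.sum((y - y_hat) ** 2)             # residual sum of squares
cross = np.sum((y_hat - y.mean()) * (y - y_hat))

print(np.isclose(cross, 0.0, atol=1e-8))   # the cross term in (22) vanishes
print(np.isclose(sst, ssr + sse))          # equation (23)
```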