
DS-GA 1002 Lecture notes 12 Fall 2016

Linear regression

1 Linear models
In statistics, regression consists of learning a function relating a certain quantity of interest
$y$, the response or dependent variable, to several observed variables $x_1, x_2, \ldots, x_p$,
known as covariates, features or independent variables. To ease notation, we store the
features in a vector $\vec{x} \in \mathbb{R}^p$. For example, we might be interested in estimating the price of
a house based on its size, the number of rooms, the year it was built, etc.
The assumption in regression is that the response is generated by applying a function $h$
to the features and then perturbing the result with some unknown noise $z$,

\[
y = h(\vec{x}) + z. \tag{1}
\]

The aim is to learn $h$ from $n$ examples of responses and their corresponding features,

\[
\left( y^{(1)}, \vec{x}^{(1)} \right), \left( y^{(2)}, \vec{x}^{(2)} \right), \ldots, \left( y^{(n)}, \vec{x}^{(n)} \right). \tag{2}
\]
In linear regression, we assume that $h$ is a linear function, so that the response is well modeled
as a linear combination of the predictors:

\[
y^{(i)} = \vec{x}^{(i)\,T} \vec{\beta} + z^{(i)}, \qquad 1 \leq i \leq n, \tag{3}
\]

where $z^{(i)}$ is an unknown perturbation. The function is parametrized by a vector of weights
$\vec{\beta} \in \mathbb{R}^p$, which we would like to learn from the data.
Combining the equations in (3) we obtain a matrix representation of the linear-regression
problem,

\[
\begin{bmatrix}
y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)}
\end{bmatrix}
=
\begin{bmatrix}
\vec{x}^{(1)}_1 & \vec{x}^{(1)}_2 & \cdots & \vec{x}^{(1)}_p \\
\vec{x}^{(2)}_1 & \vec{x}^{(2)}_2 & \cdots & \vec{x}^{(2)}_p \\
\vdots & \vdots & \ddots & \vdots \\
\vec{x}^{(n)}_1 & \vec{x}^{(n)}_2 & \cdots & \vec{x}^{(n)}_p
\end{bmatrix}
\begin{bmatrix}
\vec{\beta}_1 \\ \vec{\beta}_2 \\ \vdots \\ \vec{\beta}_p
\end{bmatrix}
+
\begin{bmatrix}
z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)}
\end{bmatrix}. \tag{4}
\]

Equivalently,

\[
\vec{y} = X \vec{\beta} + \vec{z}, \tag{5}
\]

where $X$ is an $n \times p$ matrix containing the features, $\vec{y}$ contains the response and $\vec{z} \in \mathbb{R}^n$
represents the noise.
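To make the matrix form concrete, here is a minimal Python sketch that generates synthetic data according to (5); the dimensions, weights and noise level are arbitrary choices for illustration, not values used in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                        # number of examples and features (illustrative)
X = rng.normal(size=(n, p))          # n x p matrix of features
beta = np.array([2.0, -1.0, 0.5])    # weight vector, unknown in practice
z = 0.1 * rng.normal(size=n)         # unknown noise

y = X @ beta + z                     # response, as in (5): y = X beta + z
```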
Example 1.1 (Linear model for GDP). We consider the problem of building a linear model
to predict the gross domestic product (GDP) of a state in the US from its population and
unemployment rate. We have available the following data:

                    Population   Unemployment rate (%)   GDP (USD millions)
    California      38 332 521                     5.5            2 448 467
    Minnesota        5 420 380                     4.0              334 780
    Oregon           3 930 065                     5.5              228 120
    Nevada           2 790 136                     5.8              141 204
    Idaho            1 612 136                     3.8               65 202
    Alaska             735 132                     6.9               54 256
    South Carolina   4 774 839                     4.9                  ???

The GDP is the response, whereas the population and the unemployment rate are the fea-
tures. Our goal is to fit a linear model to the data so that we can predict the GDP of South
Carolina.
Since the features are on different scales we normalize them so that the columns in the
regression matrix have unit norm. We also normalize the response, although this is not
strictly necessary. We could also apply some additional preprocessing such as centering the
data.

\[
\vec{y} :=
\begin{bmatrix}
0.984 \\ 0.135 \\ 0.092 \\ 0.057 \\ 0.026 \\ 0.022
\end{bmatrix},
\qquad
X :=
\begin{bmatrix}
0.982 & 0.419 \\
0.139 & 0.305 \\
0.101 & 0.419 \\
0.071 & 0.442 \\
0.041 & 0.290 \\
0.019 & 0.526
\end{bmatrix}. \tag{6}
\]

If we find a $\vec{\beta} \in \mathbb{R}^2$ such that $\vec{y} \approx X \vec{\beta}$, then we can estimate the GDP of South Carolina by
computing $\vec{x}_{\text{sc}}^{\,T} \vec{\beta}$, where

\[
\vec{x}_{\text{sc}} :=
\begin{bmatrix}
0.122 \\ 0.373
\end{bmatrix} \tag{7}
\]

corresponds to the population and unemployment rate of South Carolina divided by the constants
used to normalize the columns of $X$.
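As a rough sketch of this preprocessing step in Python (the raw numbers are taken from the table above; small discrepancies with (6) and (7) are due to rounding):

```python
import numpy as np

# Rows: California, Minnesota, Oregon, Nevada, Idaho, Alaska
population   = np.array([38332521, 5420380, 3930065, 2790136, 1612136, 735132], dtype=float)
unemployment = np.array([5.5, 4.0, 5.5, 5.8, 3.8, 6.9])
gdp          = np.array([2448467, 334780, 228120, 141204, 65202, 54256], dtype=float)

X_raw = np.column_stack([population, unemployment])
col_norms = np.linalg.norm(X_raw, axis=0)   # constants used to normalize X
X = X_raw / col_norms                       # columns of X now have unit norm
y = gdp / np.linalg.norm(gdp)               # normalized response, cf. (6)

# Features of South Carolina divided by the same constants, cf. (7)
x_sc = np.array([4774839, 4.9]) / col_norms
```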

[Figure 1 appears here: a scatter plot of the data (x, y) together with the least-squares fit.]

Figure 1: Linear model learnt via least-squares fitting for a simple example where there is just
one feature (p = 1) and 40 examples (n = 40).

2 Least-squares estimation
To calibrate the linear regression model, we need to estimate the weight vector so that it
yields a good fit to the data. We can evaluate the fit for a specific choice of $\vec{\beta} \in \mathbb{R}^p$ using
the sum of the squares of the error,

\[
\sum_{i=1}^{n} \left( y^{(i)} - \vec{x}^{(i)\,T} \vec{\beta} \right)^2 = \left\| \vec{y} - X \vec{\beta} \right\|_2^2. \tag{8}
\]

The least-squares estimate $\vec{\beta}_{\text{LS}}$ is the vector of weights that minimizes this cost function,

\[
\vec{\beta}_{\text{LS}} := \arg\min_{\vec{\beta}} \left\| \vec{y} - X \vec{\beta} \right\|_2. \tag{9}
\]

The least-squares cost function is convenient from a computational point of view, since it is convex
and can be minimized efficiently (in fact, as we will see in a moment, it has a closed-form
solution). In addition, it has intuitive geometric and probabilistic interpretations. Figure 1
shows the linear model learnt using least squares in a simple example where there is just one
feature (p = 1) and 40 examples (n = 40).
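A minimal sketch of an experiment like the one in Figure 1, with a single feature and 40 examples; the true slope, noise level and random seed are illustrative assumptions rather than the values used to produce the figure.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 40
x = rng.uniform(0.0, 1.2, size=n)         # single feature (p = 1)
y = 0.9 * x + 0.1 * rng.normal(size=n)    # assumed linear model plus noise

X = x.reshape(-1, 1)                      # n x 1 regression matrix
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("least-squares slope:", beta_ls[0])
```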

Example 2.1 (Linear model for GDP (cont.)). The least-squares estimate for the regression
coefficients in the linear GDP model is

\[
\vec{\beta}_{\text{LS}} =
\begin{bmatrix}
1.010 \\ -0.019
\end{bmatrix}. \tag{10}
\]

The model essentially learns that the GDP is roughly proportional to the population and
that the unemployment rate does not help when fitting the data linearly. We now compare the
fit provided by the linear model to the original data, as well as its prediction of the GDP of
South Carolina:

                     GDP (USD millions)   Estimate (USD millions)
    California                2 448 467                  2 446 186
    Minnesota                   334 780                    334 584
    Oregon                      228 120                    233 460
    Nevada                      141 204                    159 088
    Idaho                        65 202                     90 345
    Alaska                       54 256                     23 050
    South Carolina              199 256                    289 903
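The fit and the prediction can be reproduced with a few lines of Python from the rounded quantities in (6) and (7); because of the rounding, the output only approximates (10) and the estimates in the table.

```python
import numpy as np

# Normalized response and features from (6), and South Carolina from (7)
y = np.array([0.984, 0.135, 0.092, 0.057, 0.026, 0.022])
X = np.array([[0.982, 0.419],
              [0.139, 0.305],
              [0.101, 0.419],
              [0.071, 0.442],
              [0.041, 0.290],
              [0.019, 0.526]])
x_sc = np.array([0.122, 0.373])

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)                               # roughly [1.01, -0.02], cf. (10)

# Undo the normalization of the response to express the fit in USD millions
gdp = np.array([2448467, 334780, 228120, 141204, 65202, 54256], dtype=float)
scale = np.linalg.norm(gdp)
print((X @ beta_ls) * scale)                 # fitted GDP for the six states
print((x_sc @ beta_ls) * scale)              # predicted GDP of South Carolina
```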

2.1 Geometric interpretation

The following theorem, proved in Section A of the appendix, shows that the least-squares problem has a
closed-form solution.

Theorem 2.2 (Least-squares solution). For $p \leq n$, if $X$ is full rank then the solution to the
least-squares problem (9) is

\[
\vec{\beta}_{\text{LS}} := \left( X^T X \right)^{-1} X^T \vec{y}. \tag{11}
\]

A corollary to this result provides a geometric interpretation for the least-squares estimate
of $\vec{y}$: it is obtained by projecting the response onto the column space of the matrix formed
by the predictors.

Corollary 2.3. For $p \leq n$, if $X$ is full rank then $X \vec{\beta}_{\text{LS}}$ is the projection of $\vec{y}$ onto the column
space of $X$.

We provide a formal proof in Section B of the appendix, but the result is very intuitive. Any
vector of the form $X \vec{\beta}$ is in the span of the columns of $X$. By definition, the least-squares
estimate is the closest vector to $\vec{y}$ that can be represented in this way, so it is the projection
of $\vec{y}$ onto the column space of $X$. This is illustrated in Figure 2.

Figure 2: Illustration of Corollary 2.3. The least-squares solution is a projection of the data onto
the subspace spanned by the columns of $X$, denoted by $X_1$ and $X_2$.
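A quick numerical check of Corollary 2.3 on randomly generated data (the dimensions are arbitrary): the least-squares fit $X \vec{\beta}_{\text{LS}}$ coincides with the orthogonal projection of $\vec{y}$ onto the column space of $X$, computed here from an orthonormal basis obtained via the SVD.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))     # full rank with probability one
y = rng.normal(size=n)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # columns of U span col(X)
projection = U @ (U.T @ y)                         # projection of y onto col(X)

print(np.allclose(X @ beta_ls, projection))        # True
```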

2.2 Probabilistic interpretation


If we model the noise in (5) as a realization of a random vector $\vec{Z}$ whose entries are
independent Gaussian random variables with mean zero and a certain variance $\sigma^2$, then
we can interpret the least-squares estimate as a maximum-likelihood estimate. Under that
assumption, the data are a realization of the random vector

\[
\vec{Y} := X \vec{\beta} + \vec{Z}, \tag{12}
\]

which is a Gaussian random vector with independent entries, mean $X \vec{\beta}$ and covariance matrix $\sigma^2 I$. The joint
pdf of $\vec{Y}$ is equal to

\[
\begin{aligned}
f_{\vec{Y}}(\vec{a}) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( \vec{a}_i - \left( X \vec{\beta} \right)_i \right)^2 \right) \qquad (13) \\
&= \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| \vec{a} - X \vec{\beta} \right\|_2^2 \right). \qquad (14)
\end{aligned}
\]

The likelihood is the probability density function of $\vec{Y}$ evaluated at the observed data $\vec{y}$ and
interpreted as a function of the weight vector $\vec{\beta}$,

\[
L_{\vec{y}}\left( \vec{\beta} \right) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \left\| \vec{y} - X \vec{\beta} \right\|_2^2 \right). \tag{15}
\]

To find the ML estimate, we maximize the log-likelihood. We conclude that it is given by
the solution to the least-squares problem, since

\[
\begin{aligned}
\vec{\beta}_{\text{ML}} &= \arg\max_{\vec{\beta}} L_{\vec{y}}\left( \vec{\beta} \right) \qquad (16) \\
&= \arg\max_{\vec{\beta}} \log L_{\vec{y}}\left( \vec{\beta} \right) \qquad (17) \\
&= \arg\min_{\vec{\beta}} \left\| \vec{y} - X \vec{\beta} \right\|_2^2 \qquad (18) \\
&= \vec{\beta}_{\text{LS}}. \qquad (19)
\end{aligned}
\]
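As a sanity check of (16)-(19), the following sketch numerically minimizes the negative Gaussian log-likelihood and compares the result with the least-squares solution; the data, noise level and random seed are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, p, sigma = 50, 3, 0.5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + sigma * rng.normal(size=n)

def neg_log_likelihood(beta):
    # Up to additive constants, -log L_y(beta) = ||y - X beta||^2 / (2 sigma^2)
    return np.sum((y - X @ beta) ** 2) / (2 * sigma ** 2)

beta_ml = minimize(neg_log_likelihood, np.zeros(p)).x
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ml, beta_ls, atol=1e-4))    # True: the ML and LS estimates agree
```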

3 Overfitting
Imagine that a friend tells you:

"I found a cool way to predict the temperature in New York: it's just a linear combination of
the temperature in every other state. I fit the model on data from the last month and a half
and it's perfect!"

Your friend is not lying, but the problem is that she is using roughly as many data points to fit
the linear model as there are parameters. If $n \leq p$ we can find
a $\vec{\beta}$ such that $\vec{y} = X \vec{\beta}$ exactly, even if $\vec{y}$ and $X$ have nothing to do with each other! This is
called overfitting and is usually caused by using a model that is too flexible with respect to
the amount of data available.
To evaluate whether a model suffers from overfitting we separate the data into a training set
and a test set. The training set is used to fit the model and the test set is used to evaluate
the error. A model that overfits the training set will have a very low error when evaluated
on the training examples, but will not generalize well to the test examples.
Figure 3 shows the result of evaluating the training error and the test error of a linear model
with $p = 50$ parameters fitted from $n$ training examples. The training and test data are
generated by fixing a vector of weights $\vec{\beta}$ and then computing

\[
\vec{y}_{\text{train}} = X_{\text{train}} \vec{\beta} + \vec{z}_{\text{train}}, \tag{20}
\]
\[
\vec{y}_{\text{test}} = X_{\text{test}} \vec{\beta}, \tag{21}
\]

where the entries of $X_{\text{train}}$, $X_{\text{test}}$, $\vec{z}_{\text{train}}$ and $\vec{\beta}$ are sampled independently at random from a
Gaussian distribution with zero mean and unit variance.


Figure 3: Relative $\ell_2$-norm error in estimating the response achieved using least-squares regression
for different values of $n$ (the number of training data). The training error is plotted in blue, whereas
the test error is plotted in red. The green line indicates the training error of the true model used
to generate the data.

The training and test errors are defined as

\[
\text{error}_{\text{train}} = \frac{\left\| X_{\text{train}} \vec{\beta}_{\text{LS}} - \vec{y}_{\text{train}} \right\|_2}{\left\| \vec{y}_{\text{train}} \right\|_2}, \tag{22}
\]
\[
\text{error}_{\text{test}} = \frac{\left\| X_{\text{test}} \vec{\beta}_{\text{LS}} - \vec{y}_{\text{test}} \right\|_2}{\left\| \vec{y}_{\text{test}} \right\|_2}. \tag{23}
\]

Note that even the true $\vec{\beta}$ does not achieve zero training error because of the presence of
the noise, but the test error is actually zero if we manage to estimate $\vec{\beta}$ exactly.
The training error of the linear model grows with n. This makes sense as the model has
to fit more data using the same number of parameters. When n is close to p := 50, the
fitted model is much better than the true model at replicating the training data (the error
of the true model is shown in green). This is a sign of overfitting: the model is adapting to
the noise and not learning the true linear structure. Indeed, in that regime the test error is
extremely high. At larger n, the training error rises to the level achieved by the true linear
model and the test error decreases, indicating that we are learning the underlying model.
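The experiment behind Figure 3 can be sketched in a few lines of Python under the assumptions stated above (standard Gaussian entries, noiseless test responses); the particular values of n and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 50
beta = rng.normal(size=p)                        # fixed vector of weights

for n in [50, 100, 200, 300, 400, 500]:
    X_train = rng.normal(size=(n, p))
    y_train = X_train @ beta + rng.normal(size=n)        # (20)
    X_test = rng.normal(size=(n, p))
    y_test = X_test @ beta                               # (21), noiseless

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    err_train = np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train)  # (22)
    err_test = np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test)      # (23)
    print(f"n = {n:3d}   training error = {err_train:.3f}   test error = {err_test:.3f}")
```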

4 Global warming
In this section we describe an application of linear regression to climate data. In particular,
we analyze temperature data taken at a weather station in Oxford over 150 years.[1] Our
objective is not to perform prediction, but rather to determine whether temperatures have
risen or decreased during the last 150 years in Oxford.
In order to separate the temperature into different components that account for seasonal
effects, we use a simple linear model with three predictors and an intercept,

\[
\vec{y}_t \approx \vec{\beta}_0 + \vec{\beta}_1 \cos\left( \frac{2 \pi t}{12} \right) + \vec{\beta}_2 \sin\left( \frac{2 \pi t}{12} \right) + \vec{\beta}_3\, t, \tag{24}
\]

where $1 \leq t \leq n$ denotes the time in months ($n$ equals 12 times 150). The corresponding
matrix of predictors is

\[
X :=
\begin{bmatrix}
1 & \cos\left( \frac{2 \pi t_1}{12} \right) & \sin\left( \frac{2 \pi t_1}{12} \right) & t_1 \\
1 & \cos\left( \frac{2 \pi t_2}{12} \right) & \sin\left( \frac{2 \pi t_2}{12} \right) & t_2 \\
\vdots & \vdots & \vdots & \vdots \\
1 & \cos\left( \frac{2 \pi t_n}{12} \right) & \sin\left( \frac{2 \pi t_n}{12} \right) & t_n
\end{bmatrix}. \tag{25}
\]

The intercept $\vec{\beta}_0$ represents the mean temperature, $\vec{\beta}_1$ and $\vec{\beta}_2$ account for periodic yearly
fluctuations and $\vec{\beta}_3$ is the overall trend. If $\vec{\beta}_3$ is positive then the model indicates that tem-
peratures are increasing; if it is negative then it indicates that temperatures are decreasing.

The results of fitting the linear model are shown in Figures 4 and 5. The fitted model
indicates that both the maximum and minimum temperatures have an increasing trend of
about 0.8 degrees Celsius per 100 years (around 1.4 degrees Fahrenheit).
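A sketch of how the design matrix (25) could be assembled and the model fitted in Python; it assumes the monthly temperatures have already been loaded into an array (parsing of the Met Office file is omitted), and the synthetic data in the usage example below is purely illustrative.

```python
import numpy as np

def fit_seasonal_trend(temps):
    """Fit the model (24): intercept, yearly cosine and sine terms, and a linear trend."""
    n = len(temps)
    t = np.arange(1, n + 1)                   # time in months, 1 <= t <= n
    X = np.column_stack([
        np.ones(n),                           # intercept, beta_0
        np.cos(2 * np.pi * t / 12),           # yearly fluctuation, beta_1
        np.sin(2 * np.pi * t / 12),           # yearly fluctuation, beta_2
        t,                                    # overall trend, beta_3
    ])
    beta_ls, *_ = np.linalg.lstsq(X, temps, rcond=None)
    return beta_ls

# Usage example on synthetic data standing in for the Oxford record:
# a seasonal oscillation, a slow warming trend and noise.
rng = np.random.default_rng(4)
t = np.arange(1, 12 * 150 + 1)
temps = 10 + 6 * np.cos(2 * np.pi * t / 12) + 0.0007 * t + rng.normal(size=t.size)

beta = fit_seasonal_trend(temps)
print("estimated trend (degrees per 100 years):", beta[3] * 12 * 100)
```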

[1] The data is available at http://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt.

[Figure 4 appears here: six panels of temperature in degrees Celsius. The left column shows the maximum temperature and the right column the minimum temperature, for the full record (1860-2000) and for the periods 1900-1905 and 1960-1965, with the data and the fitted model overlaid.]

Figure 4: Temperature data together with the linear model described by (24) for both maximum
and minimum temperatures.

[Figure 5 appears here: maximum (left) and minimum (right) temperature data from 1860 to 2000, in degrees Celsius, with the fitted trend overlaid. The estimated trends are +0.75 C per 100 years (maximum) and +0.88 C per 100 years (minimum).]


Figure 5: Temperature trend obtained by fitting the model described by (24) for both maximum
and minimum temperatures.

A Proof of Theorem 2.2


Let $X = U \Sigma V^T$ be the singular-value decomposition (SVD) of $X$. Under the conditions of
the theorem, $\left( X^T X \right)^{-1} X^T \vec{y} = V \Sigma^{-1} U^T \vec{y}$. We begin by separating $\vec{y}$ into two components,

\[
\vec{y} = U U^T \vec{y} + \left( I - U U^T \right) \vec{y}, \tag{26}
\]

where $U U^T \vec{y}$ is the projection of $\vec{y}$ onto the column space of $X$. Note that $\left( I - U U^T \right) \vec{y}$ is
orthogonal to the column space of $X$ and consequently to both $U U^T \vec{y}$ and $X \vec{\beta}$ for any $\vec{\beta}$.
By Pythagoras's theorem,

\[
\left\| \vec{y} - X \vec{\beta} \right\|_2^2 = \left\| \left( I - U U^T \right) \vec{y} \right\|_2^2 + \left\| U U^T \vec{y} - X \vec{\beta} \right\|_2^2. \tag{27}
\]

The minimum value of this cost function that can be achieved by optimizing over $\vec{\beta}$ is
$\left\| \left( I - U U^T \right) \vec{y} \right\|_2^2$. This can be achieved by solving the system of equations

\[
U U^T \vec{y} = X \vec{\beta} = U \Sigma V^T \vec{\beta}. \tag{28}
\]

Since $U^T U = I$ because $p \leq n$, multiplying both sides of the equality by $U^T$ yields the equivalent
system

\[
U^T \vec{y} = \Sigma V^T \vec{\beta}. \tag{29}
\]

Since $X$ is full rank, $\Sigma$ and $V$ are square and invertible (and by definition of the SVD
$V^{-1} = V^T$), so

\[
\vec{\beta}_{\text{LS}} = V \Sigma^{-1} U^T \vec{y} \tag{30}
\]

is the unique solution to the system and consequently also of the least-squares problem.
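A numerical check of (30) on random data (sizes arbitrary): the SVD-based formula $V \Sigma^{-1} U^T \vec{y}$ agrees with the output of a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)           # V Sigma^{-1} U^T y, as in (30)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_svd, beta_ls))       # True
```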

B Proof of Corollary 2.3


Let $X = U \Sigma V^T$ be the singular-value decomposition of $X$. Since $X$ is full rank and $p \leq n$,
we have $U^T U = I$, $V^T V = I$ and $\Sigma$ is a square invertible matrix, which implies

\[
\begin{aligned}
X \vec{\beta}_{\text{LS}} &= X \left( X^T X \right)^{-1} X^T \vec{y} \qquad (31) \\
&= U \Sigma V^T \left( V \Sigma U^T U \Sigma V^T \right)^{-1} V \Sigma U^T \vec{y} \qquad (32) \\
&= U U^T \vec{y}. \qquad (33)
\end{aligned}
\]

