Linear and Logistic Regression

Marta Arias
marias@lsi.upc.edu

Dept. LSI, UPC

Fall 2012
Linear regression
Simple case: R^2

Here is the idea:

1. We have a bunch of points in R^2, $\{(x^i, y^i)\}$.
2. We want to fit a line $y = ax + b$ that describes the trend.
3. We define a cost function that computes the total squared error of our predictions w.r.t. the observed values $y^i$, namely $J(a, b) = \sum_i (a x^i + b - y^i)^2$, which we want to minimize.
4. See $J$ as a function of $a$ and $b$: compute both partial derivatives, set them equal to zero, and solve for $a$ and $b$.
5. The coefficients you get give you the minimum squared error.
6. This can be done for specific points, or in general to obtain closed-form formulas.
7. A more general version works in R^n.
Linear regression
Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x^i) - y^i)^2$.

$$
\begin{aligned}
\frac{\partial J(a, b)}{\partial a}
&= \frac{\partial \sum_i (h(x^i) - y^i)^2}{\partial a} \\
&= \sum_i \frac{\partial (a x^i + b - y^i)^2}{\partial a} \\
&= \sum_i 2 (a x^i + b - y^i) \frac{\partial (a x^i + b - y^i)}{\partial a} \\
&= 2 \sum_i (a x^i + b - y^i) \frac{\partial (a x^i)}{\partial a} \\
&= 2 \sum_i (a x^i + b - y^i)\, x^i
\end{aligned}
$$
Linear regression
Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x^i) - y^i)^2$.

$$
\begin{aligned}
\frac{\partial J(a, b)}{\partial b}
&= \frac{\partial \sum_i (h(x^i) - y^i)^2}{\partial b} \\
&= \sum_i \frac{\partial (a x^i + b - y^i)^2}{\partial b} \\
&= \sum_i 2 (a x^i + b - y^i) \frac{\partial (a x^i + b - y^i)}{\partial b} \\
&= 2 \sum_i (a x^i + b - y^i) \frac{\partial b}{\partial b} \\
&= 2 \sum_i (a x^i + b - y^i)
\end{aligned}
$$
Linear regression
Simple case: R^2

Normal equations
Given $\{(x^i, y^i)\}_i$, solve for $a$, $b$:

$$\sum_i (a x^i + b)\, x^i = \sum_i x^i y^i$$
$$\sum_i (a x^i + b) = \sum_i y^i$$
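As a small illustration (not part of the original slides), these two normal equations can be solved directly in Octave; the data points below are made up for the example:

% Synthetic points (made up), roughly following y = 2x + 1
x = [1; 2; 3; 4; 5];
y = [3.1; 4.9; 7.2; 9.0; 10.8];

% Normal equations in matrix form:
%   a*sum(x.^2) + b*sum(x) = sum(x.*y)
%   a*sum(x)    + b*m      = sum(y)
m = length(x);
A = [sum(x.^2), sum(x); sum(x), m];
v = [sum(x.*y); sum(y)];
ab = A \ v;    % ab(1) is the slope a, ab(2) is the intercept b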
Linear regression
General case: R^n

- Now each example is $\mathbf{x}^i = \langle x_0^i, x_1^i, x_2^i, \dots, x_n^i \rangle$, where $x_0^i = 1$ for all $i$.
- The parameters to estimate are $\mathbf{a} = \langle a_0, \dots, a_n \rangle^T$ (notice $\mathbf{a}$ is defined as a column vector).
- For $j = 0, \dots, n$, we have $\frac{\partial J(\mathbf{a})}{\partial a_j} = \sum_i \left( \sum_{k=0}^{n} a_k x_k^i - y^i \right) x_j^i$.

Normal equations
Given $\{(\mathbf{x}^i, y^i)\}_i$, solve for $a_0, a_1, \dots, a_n$:

$$\sum_i \left( \sum_{k=0}^{n} a_k x_k^i \right) x_j^i = \sum_i x_j^i y^i \qquad \text{(for each } j = 0, \dots, n\text{)}$$
Linear regression
General case: R^n

- Remember $\mathbf{a} = \langle a_0, a_1, a_2, \dots, a_n \rangle^T$.
- Let $\mathbf{y} = \langle y^1, y^2, \dots, y^m \rangle^T$ (notice $\mathbf{y}$ is defined as a column vector).
- Let
$$\mathbf{X} = \begin{pmatrix} \mathbf{x}^1 \\ \mathbf{x}^2 \\ \vdots \\ \mathbf{x}^m \end{pmatrix} = \begin{pmatrix} x_0^1 & x_1^1 & \dots & x_n^1 \\ x_0^2 & x_1^2 & \dots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_0^m & x_1^m & \dots & x_n^m \end{pmatrix} \qquad \text{where all } x_0^i = 1$$

Now the normal equation $\sum_i \left( \sum_{k=0}^{n} a_k x_k^i \right) x_j^i = \sum_i x_j^i y^i$ can be rewritten as

$$\sum_i x_j^i \left( \sum_{k=0}^{n} a_k x_k^i \right) = \sum_i x_j^i (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T \mathbf{y}$$

where $\mathbf{X}_j$ is the $j$-th column of $\mathbf{X}$ (note that $\sum_i x_j^i (\mathbf{x}^i \mathbf{a})$ is exactly the $j$-th entry of $\mathbf{X}^T \mathbf{X} \mathbf{a}$, i.e. $\mathbf{X}_j^T \mathbf{X} \mathbf{a}$).
Linear regression
General case: R^n

We have $\sum_i x_j^i (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T \mathbf{y}$ for each $j = 0, \dots, n$. Compactly:

$$\mathbf{X}^T \mathbf{X} \mathbf{a} = \mathbf{X}^T \mathbf{y}$$

which can be solved as

$$\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

How to compute the parameters in GNU Octave (http://www.gnu.org/software/octave/)
Given X of size m x (n+1) (assuming the original data matrix has been prepended an all-1s column) and given the label vector y, you can solve the least squares regression problem with the single command

pinv(X' * X) * X' * y

which is equivalent to X \ y using the built-in operator '\'.
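As a quick sanity check (not in the original slides), the closed-form solution and Octave's backslash operator can be compared on a small randomly generated problem:

% Random least-squares problem, just to illustrate the two equivalent solvers
m = 50; n = 3;
X = [ones(m, 1), randn(m, n)];        % data matrix with prepended all-1s column
a_true = [2; -1; 0.5; 3];
y = X * a_true + 0.1 * randn(m, 1);   % noisy labels

a1 = pinv(X' * X) * X' * y;           % normal-equations solution
a2 = X \ y;                           % built-in least-squares solver
norm(a1 - a2)                         % should be numerically close to zero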
Linear regression
Practical example with Octave

We have a dataset with data for 20 cities; for each city we have
information on:
- Nr. of inhabitants
- Percentage of families with incomes below 5000 USD
- Percentage of unemployed
- Number of murders per 10^6 inhabitants per annum
We wish to perform regression analysis on the number of
murders based on the other 3 features.
Linear regression
Practical example with Octave

Octave code:

load data.txt
n = size(data, 2)
m = size(data, 1)
X = [ ones(m, 1) data(:, 1:n-1) ]
y = data(:, n)
a = pinv(X' * X) * X' * y

Result:

a =
  -3.6765e+01
   7.6294e-07
   1.1922e+00
   4.7198e+00

So we see that the variable that has the most impact is the percentage of unemployed.
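As an optional follow-up (not in the original slides), the fitted values and residuals of this model can be inspected using the variables already defined above:

y_hat = X * a;        % fitted number of murders for each city
res = y - y_hat;      % residuals
RSS = res' * res      % residual sum of squares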
Linear Regression
What if n is too large?

Computing $\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ may not be feasible if $n$ is large, since it involves the inverse of a matrix of size $n \times n$ (or $(n+1) \times (n+1)$ if we added the extra "all 1s" column).

Gradient descent: an iterative optimization solution

Start with any parameters $\mathbf{a}$, and update $\mathbf{a}$ iteratively in order to minimize $J(\mathbf{a})$. Gradient descent tells us that $J(\mathbf{a})$ should decrease fastest if we follow the direction of the negative gradient of the cost function $J(\mathbf{a})$:

$$\mathbf{a} = \mathbf{a} - \alpha \nabla J(\mathbf{a})$$

where $\alpha$ is a positive, real-valued parameter dictating how large each step is, and $\nabla J(\mathbf{a}) = \left\langle \frac{\partial J(\mathbf{a})}{\partial a_0}, \frac{\partial J(\mathbf{a})}{\partial a_1}, \dots, \frac{\partial J(\mathbf{a})}{\partial a_n} \right\rangle^T$.
Gradient descent
Algorithm, I

Pseudocode: given J , α
I Initialize a to a random non-zero vector
I Repeat until convergence
I for all j = 0, .., n, do aj0 = aj − α ∂J∂a(a)
j
I for all j = 0, .., n, do aj = aj0
I Output a

Should be careful with ..


I setting α small enough so that algorithm converges, but
not too small because it may need innecessarily too many
iterations
I perform feature scaling so that all features are “on the same
range” (this is necessary because they share the same α in
the updates)
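The Octave snippets later in these slides call a helper named studentize for this normalization step. It is not a built-in Octave function; here is a minimal sketch of what such a helper could look like, assuming it standardizes every column to zero mean and unit standard deviation (e.g. saved as studentize.m):

function Xs = studentize(X)
  % Standardize each column of X: subtract its mean, divide by its standard
  % deviation (assumed behaviour; not part of the original slides).
  mu = mean(X);
  sigma = std(X);
  sigma(sigma == 0) = 1;   % avoid division by zero for constant columns
  Xs = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sigma, size(X, 1), 1);
end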
Gradient descent
Algorithm, II

- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- example $\mathbf{x} = \langle x_0, x_1, \dots, x_n \rangle$
- $h_\mathbf{a}(\mathbf{x}) = a_0 x_0 + a_1 x_1 + \dots + a_n x_n = \sum_{j=0}^{n} a_j x_j = \mathbf{x}\mathbf{a}$
- $J(\mathbf{a}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\mathbf{a}(\mathbf{x}^i) - y^i)^2$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x_j^i (h_\mathbf{a}(\mathbf{x}^i) - y^i) = \frac{1}{m} \mathbf{X}_j^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$
- Initialize $\mathbf{a} = \langle 1, \dots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence
  - $\mathbf{a} = \mathbf{a} - \frac{\alpha}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- Output $\mathbf{a}$
Linear regression
Practical example with Octave

Octave code:

% X is the original m x n data matrix; y, alpha and m are given
X = studentize(X);            % normalize X
X = [ones(m, 1) X];           % prepend all-1s column (X is now m x (n+1))
a = ones(size(X, 2), 1);      % initial value for the parameter vector
for t = 1:100                 % repeat 100 times
  D = X*a - y;
  a = a - alpha / m * X' * D;
  J(t) = 1/2/m * D' * D;      % store consecutive values of J over iterations t
end
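Since J(t) is stored at every iteration, a quick way to check that alpha was chosen well (not part of the original slides) is to plot the cost over the iterations and verify that it decreases:

plot(1:length(J), J, '-o');   % J should decrease monotonically if alpha is small enough
xlabel('iteration t');
ylabel('J(a)');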
Logistic regression
What if $y^i \in \{0, 1\}$ instead of a continuous real value?

Binary classification
Now datasets are of the form $\{(\mathbf{x}^1, 1), (\mathbf{x}^2, 0), \dots\}$. In this case, linear regression will not do a good job in classifying examples as positive ($y^i = 1$) or negative ($y^i = 0$).
Logistic regression
Hypothesis space
- $h_\mathbf{a}(\mathbf{x}) = g\left(\sum_{j=0}^{n} a_j x_j\right) = g(\mathbf{x}\mathbf{a})$
- $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (a.k.a. logistic function; see the Octave sketch below)
- $0 \le g(z) \le 1$ for all $z \in \mathbb{R}$
- $\lim_{z \to -\infty} g(z) = 0$ and $\lim_{z \to +\infty} g(z) = 1$
- $g(z) \ge 0.5$ iff $z \ge 0$
- Given an example $\mathbf{x}$, predict positive iff $h_\mathbf{a}(\mathbf{x}) \ge 0.5$ iff $g(\mathbf{x}\mathbf{a}) \ge 0.5$ iff $\mathbf{x}\mathbf{a} \ge 0$
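The code at the end of these slides uses a sigmoid helper, which is not a built-in Octave function. A minimal sketch, assuming it simply applies the logistic function elementwise (e.g. saved as sigmoid.m):

function g = sigmoid(z)
  % Elementwise logistic function g(z) = 1 / (1 + e^(-z)).
  g = 1 ./ (1 + exp(-z));
end

With it, the prediction rule above becomes, for a whole data matrix X:

predictions = sigmoid(X * a) >= 0.5;   % 1 iff x*a >= 0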
Logistic regression
Least squares minimization for logistic regression

Let us assume that
- $P(y = 1 \mid \mathbf{x}; \mathbf{a}) = h_\mathbf{a}(\mathbf{x})$, and so
- $P(y = 0 \mid \mathbf{x}; \mathbf{a}) = 1 - h_\mathbf{a}(\mathbf{x})$

Given $m$ training examples $\{(\mathbf{x}^i, y^i)\}_i$ where $y^i \in \{0, 1\}$, we compute the likelihood (assuming independence of the training examples):

$$L(\mathbf{a}) = \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \prod_i h_\mathbf{a}(\mathbf{x}^i)^{y^i} \, (1 - h_\mathbf{a}(\mathbf{x}^i))^{1 - y^i}$$

Our strategy will be to maximize the log likelihood.
Logistic regression
We will run gradient ascent to maximize the log likelihood, using:
- for any function $f(x)$: $\frac{\partial \log f(x)}{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}$
- for the sigmoid function $g(x)$:

$$
\begin{aligned}
\frac{\partial g(x)}{\partial x} &= \frac{\partial}{\partial x} \frac{1}{1 + e^{-x}} \\
&= -\frac{1}{(1 + e^{-x})^2} \frac{\partial e^{-x}}{\partial x} \\
&= \frac{1}{(1 + e^{-x})^2} \, e^{-x} \\
&= \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right) \\
&= g(x)(1 - g(x))
\end{aligned}
$$
Logistic regression
Maximizing the log likelihood

$$
\begin{aligned}
\log L(\mathbf{a}) &= \log \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \sum_i \log p(y^i \mid \mathbf{x}^i; \mathbf{a}) \\
&= \sum_i \log \left( h_\mathbf{a}(\mathbf{x}^i)^{y^i} \, (1 - h_\mathbf{a}(\mathbf{x}^i))^{1 - y^i} \right) \\
&= \sum_i y^i \log h_\mathbf{a}(\mathbf{x}^i) + (1 - y^i) \log (1 - h_\mathbf{a}(\mathbf{x}^i))
\end{aligned}
$$
Logistic regression
Computing partial derivatives

$$
\begin{aligned}
\frac{\partial \log L(\mathbf{a})}{\partial a_j}
&= \sum_i \frac{\partial \, y^i \log h_\mathbf{a}(\mathbf{x}^i)}{\partial a_j} + \frac{\partial \, (1 - y^i) \log (1 - h_\mathbf{a}(\mathbf{x}^i))}{\partial a_j} \\
&= \sum_i y^i \frac{\partial \log g(\mathbf{x}^i \mathbf{a})}{\partial a_j} + (1 - y^i) \frac{\partial \log (1 - g(\mathbf{x}^i \mathbf{a}))}{\partial a_j} \\
&= \sum_i \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} \frac{\partial g(\mathbf{x}^i \mathbf{a})}{\partial a_j} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \frac{\partial g(\mathbf{x}^i \mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \right) \frac{\partial g(\mathbf{x}^i \mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \right) g(\mathbf{x}^i \mathbf{a}) (1 - g(\mathbf{x}^i \mathbf{a})) \frac{\partial \, \mathbf{x}^i \mathbf{a}}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \right) g(\mathbf{x}^i \mathbf{a}) (1 - g(\mathbf{x}^i \mathbf{a})) \, x_j^i \\
&= \sum_i (y^i - g(\mathbf{x}^i \mathbf{a})) \, x_j^i \\
&= \sum_i (y^i - h_\mathbf{a}(\mathbf{x}^i)) \, x_j^i
\end{aligned}
$$
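As an optional sanity check (not part of the original slides), the closed-form derivative above can be compared against a finite-difference approximation of the log likelihood in Octave, assuming the sigmoid helper sketched earlier:

% Finite-difference check of d(log L)/d(a_j) on a tiny random problem
m = 20; n = 2;
X = [ones(m, 1), randn(m, n)];
y = double(rand(m, 1) > 0.5);          % arbitrary 0/1 labels, only for the check
a = 0.1 * randn(n + 1, 1);

logL = @(a) sum(y .* log(sigmoid(X*a)) + (1 - y) .* log(1 - sigmoid(X*a)));
grad = X' * (y - sigmoid(X*a));        % closed-form gradient derived above

j = 2; delta = 1e-6;
e = zeros(n + 1, 1); e(j) = delta;
num_grad_j = (logL(a + e) - logL(a - e)) / (2 * delta);
abs(num_grad_j - grad(j))              % should be very close to zero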
Gradient ascent for logistic regression
Algorithm, I

Pseudocode: given $\alpha$, $\{(\mathbf{x}^i, y^i)\}_{i=1}^{m}$
- Initialize $\mathbf{a} = \langle 1, \dots, 1 \rangle^T$
- Perform feature scaling on the examples' attributes
- Repeat until convergence
  - for each $j = 0, \dots, n$: $a_j' = a_j + \alpha \sum_i (y^i - h_\mathbf{a}(\mathbf{x}^i)) \, x_j^i$
  - for each $j = 0, \dots, n$: $a_j = a_j'$
- Output $\mathbf{a}$
Gradient ascent for logistic regression
Algorithm, II
- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- $g$ is the sigmoid function; $\mathbf{g}$ is its generalization to vectors: $\mathbf{g}(\langle z_1, \dots, z_k \rangle) = \langle g(z_1), \dots, g(z_k) \rangle$
- $h_\mathbf{a}(\mathbf{x}) = g\left(\sum_{j=0}^{n} a_j x_j\right) = g(\mathbf{x}\mathbf{a})$
- $J(\mathbf{a}) = \frac{1}{m} \sum_i y^i \log h_\mathbf{a}(\mathbf{x}^i) + (1 - y^i) \log (1 - h_\mathbf{a}(\mathbf{x}^i))$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x_j^i (y^i - h_\mathbf{a}(\mathbf{x}^i)) = \frac{1}{m} \mathbf{X}_j^T (\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a}))$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T (\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a}))$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$
- Initialize $\mathbf{a} = \langle 1, \dots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence
  - $\mathbf{a} = \mathbf{a} + \frac{\alpha}{m} \mathbf{X}^T (\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a}))$
- Output $\mathbf{a}$
Logistic regression
Practical example with Octave

Octave code:

% X is the original m x n data matrix; y (0/1 labels), alpha and m are given
X = studentize(X);            % normalize X
X = [ones(m, 1) X];           % prepend all-1s column (X is now m x (n+1))
a = ones(size(X, 2), 1);      % initial value for the parameter vector
for t = 1:100                 % repeat 100 times
  D = y - sigmoid(X*a);
  a = a + alpha / m * X' * D;
  % store consecutive values of the log likelihood J over iterations t
  G = sigmoid(X*a);
  J(t) = 1/m * (log(G)'*y + log(1-G)'*(1-y));
end
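Once a is learned, the model can be used to classify (not part of the original slides); for example, its accuracy on the training examples is:

p = sigmoid(X*a) >= 0.5;      % predicted labels (1 iff x*a >= 0)
accuracy = mean(p == y)       % fraction of correctly classified examples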
