Linear and Logistic Regression

Marta Arias
marias@lsi.upc.edu

Dept. LSI, UPC

Fall 2012
Linear regression
Simple case: R^2

Here is the idea:

1. We have a bunch of points in R^2, $\{(x^i, y^i)\}$.
2. We want to fit a line $y = ax + b$ that describes the trend.
3. We define a cost function that computes the total squared error of our predictions w.r.t. the observed values $y^i$, namely $J(a, b) = \sum_i (a x^i + b - y^i)^2$, which we want to minimize.
4. See $J$ as a function of $a$ and $b$: compute both partial derivatives, set them equal to zero, and solve for $a$ and $b$.
5. The coefficients you get give you the minimum squared error.
6. This can be done for specific points, or in general to obtain closed-form formulas.
7. A more general version works in R^n.
Linear regression
Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x^i) - y^i)^2$.

$$
\begin{aligned}
\frac{\partial J(a, b)}{\partial a}
&= \frac{\partial \sum_i (h(x^i) - y^i)^2}{\partial a} \\
&= \sum_i \frac{\partial (a x^i + b - y^i)^2}{\partial a} \\
&= \sum_i 2 (a x^i + b - y^i) \frac{\partial (a x^i + b - y^i)}{\partial a} \\
&= 2 \sum_i (a x^i + b - y^i) \frac{\partial (a x^i)}{\partial a} \\
&= 2 \sum_i (a x^i + b - y^i)\, x^i
\end{aligned}
$$
Linear regression
Simple case: R^2

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x^i) - y^i)^2$.

$$
\begin{aligned}
\frac{\partial J(a, b)}{\partial b}
&= \frac{\partial \sum_i (h(x^i) - y^i)^2}{\partial b} \\
&= \sum_i \frac{\partial (a x^i + b - y^i)^2}{\partial b} \\
&= \sum_i 2 (a x^i + b - y^i) \frac{\partial (a x^i + b - y^i)}{\partial b} \\
&= 2 \sum_i (a x^i + b - y^i) \frac{\partial b}{\partial b} \\
&= 2 \sum_i (a x^i + b - y^i)
\end{aligned}
$$
Linear regression
Simple case: R^2

Normal equations
Given $\{(x^i, y^i)\}_i$, solve for $a$, $b$:

$$\sum_i (a x^i + b)\, x^i = \sum_i x^i y^i$$
$$\sum_i (a x^i + b) = \sum_i y^i$$
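As a small illustration (not part of the original slides), these two normal equations can be solved directly in Octave; the data points below are made up for the example:

% Synthetic points (made up), roughly following y = 2x + 1
x = [1; 2; 3; 4; 5];
y = [3.1; 4.9; 7.2; 9.0; 10.8];

% Normal equations in matrix form:
%   a*sum(x.^2) + b*sum(x) = sum(x.*y)
%   a*sum(x)    + b*m      = sum(y)
m = length(x);
A = [sum(x.^2), sum(x); sum(x), m];
v = [sum(x.*y); sum(y)];
ab = A \ v;    % ab(1) is the slope a, ab(2) is the intercept b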
Linear regression
General case: R^n

- Now each example is $\mathbf{x}^i = \langle x_0^i, x_1^i, x_2^i, \dots, x_n^i \rangle$, where $x_0^i = 1$ for all $i$.
- The parameters to estimate are $\mathbf{a} = \langle a_0, \dots, a_n \rangle^T$ (notice $\mathbf{a}$ is defined as a column vector).
- For $j = 0, \dots, n$, we have $\frac{\partial J(\mathbf{a})}{\partial a_j} = \sum_i \left( \sum_{k=0}^{n} a_k x_k^i - y^i \right) x_j^i$.

Normal equations
Given $\{(\mathbf{x}^i, y^i)\}_i$, solve for $a_0, a_1, \dots, a_n$:

$$\sum_i \left( \sum_{k=0}^{n} a_k x_k^i \right) x_j^i = \sum_i x_j^i y^i \qquad \text{(for each } j = 0, \dots, n\text{)}$$
Linear regression
General case: R^n

- Remember $\mathbf{a} = \langle a_0, a_1, a_2, \dots, a_n \rangle^T$.
- Let $\mathbf{y} = \langle y^1, y^2, \dots, y^m \rangle^T$ (notice $\mathbf{y}$ is defined as a column vector).
- Let
$$\mathbf{X} = \begin{pmatrix} \mathbf{x}^1 \\ \mathbf{x}^2 \\ \vdots \\ \mathbf{x}^m \end{pmatrix} = \begin{pmatrix} x_0^1 & x_1^1 & \dots & x_n^1 \\ x_0^2 & x_1^2 & \dots & x_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ x_0^m & x_1^m & \dots & x_n^m \end{pmatrix} \qquad \text{where all } x_0^i = 1$$

Now the normal equation $\sum_i \left( \sum_{k=0}^{n} a_k x_k^i \right) x_j^i = \sum_i x_j^i y^i$ can be rewritten as

$$\sum_i x_j^i \left( \sum_{k=0}^{n} a_k x_k^i \right) = \sum_i x_j^i (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T \mathbf{y}$$

where $\mathbf{X}_j$ is the $j$-th column of $\mathbf{X}$ (note that $\sum_i x_j^i (\mathbf{x}^i \mathbf{a})$ is exactly the $j$-th entry of $\mathbf{X}^T \mathbf{X} \mathbf{a}$, i.e. $\mathbf{X}_j^T \mathbf{X} \mathbf{a}$).
Linear regression
General case: R^n

We have $\sum_i x_j^i (\mathbf{x}^i \mathbf{a}) = \mathbf{X}_j^T \mathbf{y}$ for each $j = 0, \dots, n$. Compactly:

$$\mathbf{X}^T \mathbf{X} \mathbf{a} = \mathbf{X}^T \mathbf{y}$$

which can be solved as

$$\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

How to compute the parameters in GNU Octave (http://www.gnu.org/software/octave/)
Given X of size m x (n+1) (assuming the original data matrix has been prepended an all-1s column) and given the label vector y, you can solve the least squares regression problem with the single command

pinv(X' * X) * X' * y

which is equivalent to X \ y using the built-in operator '\'.
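As a quick sanity check (not in the original slides), the closed-form solution and Octave's backslash operator can be compared on a small randomly generated problem:

% Random least-squares problem, just to illustrate the two equivalent solvers
m = 50; n = 3;
X = [ones(m, 1), randn(m, n)];        % data matrix with prepended all-1s column
a_true = [2; -1; 0.5; 3];
y = X * a_true + 0.1 * randn(m, 1);   % noisy labels

a1 = pinv(X' * X) * X' * y;           % normal-equations solution
a2 = X \ y;                           % built-in least-squares solver
norm(a1 - a2)                         % should be numerically close to zero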
Linear regression
Practical example with Octave

We have a dataset with data for 20 cities; for each city we have
information on:
- Nr. of inhabitants
- Percentage of families with incomes below 5000 USD
- Percentage of unemployed
- Number of murders per 10^6 inhabitants per annum
We wish to perform regression analysis on the number of
murders based on the other 3 features.
Linear regression
Practical example with Octave

Octave code:

load data.txt
n = size(data, 2)
m = size(data, 1)
X = [ ones(m, 1) data(:, 1:n-1) ]
y = data(:, n)
a = pinv(X' * X) * X' * y

Result:

a =
  -3.6765e+01
   7.6294e-07
   1.1922e+00
   4.7198e+00

So we see that the variable that has the most impact is the percentage of unemployed.
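As an optional follow-up (not in the original slides), the fitted values and residuals of this model can be inspected using the variables already defined above:

y_hat = X * a;        % fitted number of murders for each city
res = y - y_hat;      % residuals
RSS = res' * res      % residual sum of squares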
Linear Regression
What if n is too large?

Computing $\mathbf{a} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ may not be feasible if $n$ is large, since it involves the inverse of a matrix of size $n \times n$ (or $(n+1) \times (n+1)$ if we added the extra "all 1s" column).

Gradient descent: an iterative optimization solution

Start with any parameters $\mathbf{a}$, and update $\mathbf{a}$ iteratively in order to minimize $J(\mathbf{a})$. Gradient descent tells us that $J(\mathbf{a})$ should decrease fastest if we follow the direction of the negative gradient of the cost function $J(\mathbf{a})$:

$$\mathbf{a} = \mathbf{a} - \alpha \nabla J(\mathbf{a})$$

where $\alpha$ is a positive, real-valued parameter dictating how large each step is, and $\nabla J(\mathbf{a}) = \left\langle \frac{\partial J(\mathbf{a})}{\partial a_0}, \frac{\partial J(\mathbf{a})}{\partial a_1}, \dots, \frac{\partial J(\mathbf{a})}{\partial a_n} \right\rangle^T$.
Gradient descent
Algorithm, I

Pseudocode: given J , α
I Initialize a to a random non-zero vector
I Repeat until convergence
I for all j = 0, .., n, do aj0 = aj − α ∂J∂a(a)
j
I for all j = 0, .., n, do aj = aj0
I Output a

Should be careful with ..


I setting α small enough so that algorithm converges, but
not too small because it may need innecessarily too many
iterations
I perform feature scaling so that all features are “on the same
range” (this is necessary because they share the same α in
the updates)
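The Octave snippets later in these slides call a helper named studentize for this normalization step. It is not a built-in Octave function; here is a minimal sketch of what such a helper could look like, assuming it standardizes every column to zero mean and unit standard deviation (e.g. saved as studentize.m):

function Xs = studentize(X)
  % Standardize each column of X: subtract its mean, divide by its standard
  % deviation (assumed behaviour; not part of the original slides).
  mu = mean(X);
  sigma = std(X);
  sigma(sigma == 0) = 1;   % avoid division by zero for constant columns
  Xs = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sigma, size(X, 1), 1);
end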
Gradient descent
Algorithm, II

- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- example $\mathbf{x} = \langle x_0, x_1, \dots, x_n \rangle$
- $h_\mathbf{a}(\mathbf{x}) = a_0 x_0 + a_1 x_1 + \dots + a_n x_n = \sum_{j=0}^{n} a_j x_j = \mathbf{x}\mathbf{a}$
- $J(\mathbf{a}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\mathbf{a}(\mathbf{x}^i) - y^i)^2$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x_j^i (h_\mathbf{a}(\mathbf{x}^i) - y^i) = \frac{1}{m} \mathbf{X}_j^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$
- Initialize $\mathbf{a} = \langle 1, \dots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence
  - $\mathbf{a} = \mathbf{a} - \frac{\alpha}{m} \mathbf{X}^T (\mathbf{X}\mathbf{a} - \mathbf{y})$
- Output $\mathbf{a}$
Linear regression
Practical example with Octave

Octave code:

% X is the original m x n data matrix; y, alpha and m are given
X = studentize(X);            % normalize X
X = [ones(m, 1) X];           % prepend all-1s column (X is now m x (n+1))
a = ones(size(X, 2), 1);      % initial value for the parameter vector
for t = 1:100                 % repeat 100 times
  D = X*a - y;
  a = a - alpha / m * X' * D;
  J(t) = 1/2/m * D' * D;      % store consecutive values of J over iterations t
end
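Since J(t) is stored at every iteration, a quick way to check that alpha was chosen well (not part of the original slides) is to plot the cost over the iterations and verify that it decreases:

plot(1:length(J), J, '-o');   % J should decrease monotonically if alpha is small enough
xlabel('iteration t');
ylabel('J(a)');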
Logistic regression
What if $y^i \in \{0, 1\}$ instead of a continuous real value?

Binary classification
Now datasets are of the form $\{(\mathbf{x}^1, 1), (\mathbf{x}^2, 0), \dots\}$. In this case, linear regression will not do a good job in classifying examples as positive ($y^i = 1$) or negative ($y^i = 0$).
Logistic regression
Hypothesis space
- $h_\mathbf{a}(\mathbf{x}) = g\left(\sum_{j=0}^{n} a_j x_j\right) = g(\mathbf{x}\mathbf{a})$
- $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (a.k.a. logistic function; see the Octave sketch below)
- $0 \le g(z) \le 1$ for all $z \in \mathbb{R}$
- $\lim_{z \to -\infty} g(z) = 0$ and $\lim_{z \to +\infty} g(z) = 1$
- $g(z) \ge 0.5$ iff $z \ge 0$
- Given an example $\mathbf{x}$, predict positive iff $h_\mathbf{a}(\mathbf{x}) \ge 0.5$ iff $g(\mathbf{x}\mathbf{a}) \ge 0.5$ iff $\mathbf{x}\mathbf{a} \ge 0$
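The code at the end of these slides uses a sigmoid helper, which is not a built-in Octave function. A minimal sketch, assuming it simply applies the logistic function elementwise (e.g. saved as sigmoid.m):

function g = sigmoid(z)
  % Elementwise logistic function g(z) = 1 / (1 + e^(-z)).
  g = 1 ./ (1 + exp(-z));
end

With it, the prediction rule above becomes, for a whole data matrix X:

predictions = sigmoid(X * a) >= 0.5;   % 1 iff x*a >= 0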
Logistic regression
Least squares minimization for logistic regression

Let us assume that
- $P(y = 1 \mid \mathbf{x}; \mathbf{a}) = h_\mathbf{a}(\mathbf{x})$, and so
- $P(y = 0 \mid \mathbf{x}; \mathbf{a}) = 1 - h_\mathbf{a}(\mathbf{x})$

Given $m$ training examples $\{(\mathbf{x}^i, y^i)\}_i$ where $y^i \in \{0, 1\}$, we compute the likelihood (assuming independence of the training examples):

$$L(\mathbf{a}) = \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \prod_i h_\mathbf{a}(\mathbf{x}^i)^{y^i} \, (1 - h_\mathbf{a}(\mathbf{x}^i))^{1 - y^i}$$

Our strategy will be to maximize the log likelihood.
Logistic regression
We will run gradient ascent to maximize the log likelihood, using:
- for any function $f(x)$: $\frac{\partial \log f(x)}{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}$
- for the sigmoid function $g(x)$:

$$
\begin{aligned}
\frac{\partial g(x)}{\partial x} &= \frac{\partial}{\partial x} \frac{1}{1 + e^{-x}} \\
&= -\frac{1}{(1 + e^{-x})^2} \frac{\partial e^{-x}}{\partial x} \\
&= \frac{1}{(1 + e^{-x})^2} \, e^{-x} \\
&= \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right) \\
&= g(x)(1 - g(x))
\end{aligned}
$$
Logistic regression
Maximizing the log likelihood

$$
\begin{aligned}
\log L(\mathbf{a}) &= \log \prod_i p(y^i \mid \mathbf{x}^i; \mathbf{a}) = \sum_i \log p(y^i \mid \mathbf{x}^i; \mathbf{a}) \\
&= \sum_i \log \left( h_\mathbf{a}(\mathbf{x}^i)^{y^i} \, (1 - h_\mathbf{a}(\mathbf{x}^i))^{1 - y^i} \right) \\
&= \sum_i y^i \log h_\mathbf{a}(\mathbf{x}^i) + (1 - y^i) \log (1 - h_\mathbf{a}(\mathbf{x}^i))
\end{aligned}
$$
Logistic regression
Computing partial derivatives

$$
\begin{aligned}
\frac{\partial \log L(\mathbf{a})}{\partial a_j}
&= \sum_i \frac{\partial \, y^i \log h_\mathbf{a}(\mathbf{x}^i)}{\partial a_j} + \frac{\partial \, (1 - y^i) \log (1 - h_\mathbf{a}(\mathbf{x}^i))}{\partial a_j} \\
&= \sum_i y^i \frac{\partial \log g(\mathbf{x}^i \mathbf{a})}{\partial a_j} + (1 - y^i) \frac{\partial \log (1 - g(\mathbf{x}^i \mathbf{a}))}{\partial a_j} \\
&= \sum_i \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} \frac{\partial g(\mathbf{x}^i \mathbf{a})}{\partial a_j} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \frac{\partial g(\mathbf{x}^i \mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \right) \frac{\partial g(\mathbf{x}^i \mathbf{a})}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \right) g(\mathbf{x}^i \mathbf{a}) (1 - g(\mathbf{x}^i \mathbf{a})) \frac{\partial \, \mathbf{x}^i \mathbf{a}}{\partial a_j} \\
&= \sum_i \left( \frac{y^i}{g(\mathbf{x}^i \mathbf{a})} - \frac{1 - y^i}{1 - g(\mathbf{x}^i \mathbf{a})} \right) g(\mathbf{x}^i \mathbf{a}) (1 - g(\mathbf{x}^i \mathbf{a})) \, x_j^i \\
&= \sum_i (y^i - g(\mathbf{x}^i \mathbf{a})) \, x_j^i \\
&= \sum_i (y^i - h_\mathbf{a}(\mathbf{x}^i)) \, x_j^i
\end{aligned}
$$
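As an optional sanity check (not part of the original slides), the closed-form derivative above can be compared against a finite-difference approximation of the log likelihood in Octave, assuming the sigmoid helper sketched earlier:

% Finite-difference check of d(log L)/d(a_j) on a tiny random problem
m = 20; n = 2;
X = [ones(m, 1), randn(m, n)];
y = double(rand(m, 1) > 0.5);          % arbitrary 0/1 labels, only for the check
a = 0.1 * randn(n + 1, 1);

logL = @(a) sum(y .* log(sigmoid(X*a)) + (1 - y) .* log(1 - sigmoid(X*a)));
grad = X' * (y - sigmoid(X*a));        % closed-form gradient derived above

j = 2; delta = 1e-6;
e = zeros(n + 1, 1); e(j) = delta;
num_grad_j = (logL(a + e) - logL(a - e)) / (2 * delta);
abs(num_grad_j - grad(j))              % should be very close to zero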
Gradient ascent for logistic regression
Algorithm, I

Pseudocode: given $\alpha$, $\{(\mathbf{x}^i, y^i)\}_{i=1}^{m}$
- Initialize $\mathbf{a} = \langle 1, \dots, 1 \rangle^T$
- Perform feature scaling on the examples' attributes
- Repeat until convergence
  - for each $j = 0, \dots, n$: $a_j' = a_j + \alpha \sum_i (y^i - h_\mathbf{a}(\mathbf{x}^i)) \, x_j^i$
  - for each $j = 0, \dots, n$: $a_j = a_j'$
- Output $\mathbf{a}$
Gradient ascent for logistic regression
Algorithm, II
- $m$ examples $\{(\mathbf{x}^i, y^i)\}_i$
- $g$ is the sigmoid function; $\mathbf{g}$ is its generalization to vectors: $\mathbf{g}(\langle z_1, \dots, z_k \rangle) = \langle g(z_1), \dots, g(z_k) \rangle$
- $h_\mathbf{a}(\mathbf{x}) = g\left(\sum_{j=0}^{n} a_j x_j\right) = g(\mathbf{x}\mathbf{a})$
- $J(\mathbf{a}) = \frac{1}{m} \sum_i y^i \log h_\mathbf{a}(\mathbf{x}^i) + (1 - y^i) \log (1 - h_\mathbf{a}(\mathbf{x}^i))$
- $\frac{\partial J(\mathbf{a})}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x_j^i (y^i - h_\mathbf{a}(\mathbf{x}^i)) = \frac{1}{m} \mathbf{X}_j^T (\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a}))$
- $\nabla J(\mathbf{a}) = \frac{1}{m} \mathbf{X}^T (\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a}))$

Pseudocode: given $\alpha$, $\mathbf{X}$, $\mathbf{y}$
- Initialize $\mathbf{a} = \langle 1, \dots, 1 \rangle^T$
- Normalize $\mathbf{X}$
- Repeat until convergence
  - $\mathbf{a} = \mathbf{a} + \frac{\alpha}{m} \mathbf{X}^T (\mathbf{y} - \mathbf{g}(\mathbf{X}\mathbf{a}))$
- Output $\mathbf{a}$
Logistic regression
Practical example with Octave

Octave code:

% X is the original m x n data matrix; y (0/1 labels), alpha and m are given
X = studentize(X);            % normalize X
X = [ones(m, 1) X];           % prepend all-1s column (X is now m x (n+1))
a = ones(size(X, 2), 1);      % initial value for the parameter vector
for t = 1:100                 % repeat 100 times
  D = y - sigmoid(X*a);
  a = a + alpha / m * X' * D;
  % store consecutive values of the log likelihood J over iterations t
  G = sigmoid(X*a);
  J(t) = 1/m * (log(G)'*y + log(1-G)'*(1-y));
end
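Once a is learned, the model can be used to classify (not part of the original slides); for example, its accuracy on the training examples is:

p = sigmoid(X*a) >= 0.5;      % predicted labels (1 iff x*a >= 0)
accuracy = mean(p == y)       % fraction of correctly classified examples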
