Marta Arias
marias@lsi.upc.edu
Fall 2012
Linear regression
Simple case: $\mathbb{R}^2$

Let $h(x) = ax + b$, and $J(a,b) = \sum_i (h(x^i) - y^i)^2$.

\begin{align*}
\frac{\partial J(a,b)}{\partial a}
  &= \frac{\partial \sum_i (h(x^i) - y^i)^2}{\partial a} \\
  &= \sum_i \frac{\partial (ax^i + b - y^i)^2}{\partial a} \\
  &= \sum_i 2(ax^i + b - y^i) \frac{\partial (ax^i + b - y^i)}{\partial a} \\
  &= 2 \sum_i (ax^i + b - y^i) \frac{\partial (ax^i)}{\partial a} \\
  &= 2 \sum_i (ax^i + b - y^i)\, x^i
\end{align*}
Linear regression
Simple case: $\mathbb{R}^2$

Let $h(x) = ax + b$, and $J(a,b) = \sum_i (h(x^i) - y^i)^2$.

\begin{align*}
\frac{\partial J(a,b)}{\partial b}
  &= \frac{\partial \sum_i (h(x^i) - y^i)^2}{\partial b} \\
  &= \sum_i \frac{\partial (ax^i + b - y^i)^2}{\partial b} \\
  &= \sum_i 2(ax^i + b - y^i) \frac{\partial (ax^i + b - y^i)}{\partial b} \\
  &= 2 \sum_i (ax^i + b - y^i) \frac{\partial (b)}{\partial b} \\
  &= 2 \sum_i (ax^i + b - y^i)
\end{align*}
Linear regression
Simple case: $\mathbb{R}^2$

Normal equations
Setting both partial derivatives to zero: given $\{(x^i, y^i)\}_i$, solve for $a, b$:
\begin{align*}
\sum_i (ax^i + b)\, x^i &= \sum_i x^i y^i \\
\sum_i (ax^i + b) &= \sum_i y^i
\end{align*}
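Not on the slides: a minimal Octave sketch that solves this 2-by-2 linear system directly, assuming x and y are column vectors holding the data:

% normal equations for the simple case, in matrix form:
% [sum(x.^2) sum(x); sum(x) length(x)] * [a; b] = [sum(x.*y); sum(y)]
A = [sum(x.^2), sum(x); sum(x), length(x)];
r = [sum(x.*y); sum(y)];
ab = A \ r;     % ab(1) is the slope a, ab(2) is the intercept b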
Linear regression
General case: $\mathbb{R}^n$

- Now, each $x^i = \langle x_0^i, x_1^i, x_2^i, \ldots, x_n^i \rangle$, where $x_0^i = 1$ for all $i$
- Parameters to estimate are $a = \langle a_0, \ldots, a_n \rangle^T$ (notice $a$ is defined as a column vector)
- For $j = 0, \ldots, n$, we have
  \[ \frac{\partial J(a)}{\partial a_j} = \sum_i \Big( \sum_{k=0}^{n} a_k x_k^i - y^i \Big) x_j^i \]

Normal equations
Given $\{(x^i, y^i)\}_i$, solve for $a_0, a_1, \ldots, a_n$:
\[ \sum_i \Big( \sum_{k=0}^{n} a_k x_k^i \Big) x_j^i = \sum_i x_j^i y^i \qquad \text{(for each } j = 0, \ldots, n\text{)} \]
Linear regression
General case: $\mathbb{R}^n$

\begin{align*}
X^T X a &= X^T y \\
a &= (X^T X)^{-1} X^T y
\end{align*}
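These are the normal equations from the previous slide rewritten with the examples stacked as the rows of $X$. As a sketch of the step the slides skip (standard least-squares algebra):
\begin{align*}
J(a) &= (Xa - y)^T (Xa - y) \\
\nabla J(a) &= 2\, X^T (Xa - y) = 0 \quad\Longrightarrow\quad X^T X a = X^T y
\end{align*}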
How to compute parameters in GNU Octave
Given X of size m × (n + 1) (assuming an all-1s column has been prepended to the original data matrix) and given label vector y, you can solve the least squares regression problem with the single command

pinv(X' * X) * X' * y

This is equivalent to X \ y using the built-in operator '\'. Octave is available at http://www.gnu.org/software/octave/.
Linear regression
Practical example with Octave

We have a dataset with data for 20 cities; for each city we have information on:
- Nr. of inhabitants
- Percentage of families' incomes below 5000 USD
- Percentage of unemployed
- Number of murders per $10^6$ inhabitants per annum

We wish to perform regression analysis on the number of murders based on the other 3 features.
Linear regression
Practical example with Octave

Octave code:

load data.txt
n = size(data, 2)
m = size(data, 1)
X = [ ones(m, 1) data(:,1:n-1) ]
y = data(:,n)
a = pinv(X'*X) * X' * y

Result:

a =

  -3.6765e+01
   7.6294e-07
   1.1922e+00
   4.7198e+00

So we see that the variable that has the most impact is the percentage of unemployed.
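As a quick usage illustration (not from the slides; the feature values below are hypothetical), the fitted model can be applied to a new city:

% hypothetical city: 500000 inhabitants, 20% low-income families,
% 10% unemployed; remember the leading 1 for the intercept
xnew = [1, 500000, 20, 10];
murders = xnew * a    % roughly 34.7 murders per 10^6 inhabitants per annum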
Linear regression
What if n is too large?

When n is large, solving the normal equations (which requires inverting the (n+1) × (n+1) matrix $X^T X$) becomes expensive; gradient descent minimizes $J$ iteratively instead:
\[ a = a - \alpha \nabla J(a) \]

Pseudocode: given $J$, $\alpha$
- Initialize a to a random non-zero vector
- Repeat until convergence
  - for all $j = 0, \ldots, n$, do $a_j' = a_j - \alpha \frac{\partial J(a)}{\partial a_j}$
  - for all $j = 0, \ldots, n$, do $a_j = a_j'$
- Output a
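For the vectorized update on the next slide, the cost is taken with the customary $\frac{1}{2m}$ scaling (this matches the $J(t)$ computed in the Octave code further below; rescaling $J$ by a constant does not change its minimizer):
\[ J(a) = \frac{1}{2m} (Xa - y)^T (Xa - y), \qquad \nabla J(a) = \frac{1}{m} X^T (Xa - y) \]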
Gradient descent
Algorithm, II

- $\nabla J(a) = \frac{1}{m} X^T (Xa - y)$

Pseudocode: given $\alpha$, X, y
- Initialize $a = \langle 1, \ldots, 1 \rangle^T$
- Normalize X
- Repeat until convergence
  - $a = a - \frac{\alpha}{m} X^T (Xa - y)$
- Output a
Linear regression
Practical example with Octave

Octave code:

% X is the original m x n data matrix (no all-1s column yet)
a = ones(n+1, 1);     % initial parameter vector (n+1 entries, one per column of X below)
X = studentize(X);    % normalize X (user-defined helper, see sketch below)
X = [ones(m, 1) X];   % prepend all-1s column
for t = 1:100         % repeat 100 times
  D = X*a - y;
  a = a - alpha / m * X' * D;
  % we store consecutive values of J over time t
  J(t) = 1/2/m * D' * D;
end
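studentize is not a built-in Octave function; it is presumably a course helper that standardizes each feature column. A minimal sketch under that assumption:

function Xs = studentize(X)
  % subtract each column's mean and divide by its standard deviation,
  % so every feature has zero mean and unit variance
  m = size(X, 1);
  Xs = (X - repmat(mean(X), m, 1)) ./ repmat(std(X), m, 1);
end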
Logistic regression
What if $y^i \in \{0, 1\}$ instead of a continuous real value?

Binary classification
Now, datasets are of the form $\{(x^1, 1), (x^2, 0), \ldots\}$. In this case, linear regression will not do a good job in classifying examples as positive ($y^i = 1$) or negative ($y^i = 0$).
Logistic regression
Hypothesis space

- $h_a(x) = g\big(\sum_{j=0}^{n} a_j x_j\big) = g(xa)$
- $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (a.k.a. logistic function)
- $0 \leq g(z) \leq 1$, for all $z \in \mathbb{R}$
- $\lim_{z \to -\infty} g(z) = 0$ and $\lim_{z \to +\infty} g(z) = 1$
- $g(z) \geq 0.5$ iff $z \geq 0$
- Given example x:
  - predict positive iff $h_a(x) \geq 0.5$ iff $g(xa) \geq 0.5$ iff $xa \geq 0$
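The Octave code later in these notes calls sigmoid, which is not a built-in Octave function; a minimal elementwise implementation (an assumption about the course's helper) would be:

function g = sigmoid(z)
  % elementwise logistic function 1/(1 + e^-z);
  % works on scalars, vectors and matrices alike
  g = 1 ./ (1 + exp(-z));
end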
Logistic regression
Least squares minimization for logistic regression

\begin{align*}
\frac{\partial g(x)}{\partial x}
  &= \frac{\partial}{\partial x} \frac{1}{1 + e^{-x}} \\
  &= -\frac{1}{(1 + e^{-x})^2} \frac{\partial e^{-x}}{\partial x} \\
  &= \frac{1}{(1 + e^{-x})^2}\, e^{-x} \\
  &= \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right) \\
  &= g(x)\,(1 - g(x))
\end{align*}
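Not from the slides: a quick numerical sanity check of this identity in Octave, comparing the analytic form against a central finite difference at an arbitrary point:

z = 0.7; h = 1e-6;
g = @(z) 1 ./ (1 + exp(-z));
(g(z+h) - g(z-h)) / (2*h)   % finite-difference estimate of g'(z)
g(z) * (1 - g(z))           % analytic form; the two should agree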
Logistic regression
Maximizing the log likelihood

Modeling $p(y = 1 \mid x; a) = h_a(x)$, so that $p(y^i \mid x^i; a) = h_a(x^i)^{y^i} (1 - h_a(x^i))^{1 - y^i}$:
\begin{align*}
\log L(a) &= \log \prod_i p(y^i \mid x^i; a) = \sum_i \log p(y^i \mid x^i; a) \\
  &= \sum_i \log \left[ h_a(x^i)^{y^i} (1 - h_a(x^i))^{1 - y^i} \right] \\
  &= \sum_i y^i \log h_a(x^i) + (1 - y^i) \log (1 - h_a(x^i))
\end{align*}
Logistic regression
Computing partial derivatives

Using $g'(z) = g(z)(1 - g(z))$ from before:
\begin{align*}
\frac{\partial \log L(a)}{\partial a_j}
  &= \sum_i \left( \frac{y^i}{g(x^i a)} - \frac{1 - y^i}{1 - g(x^i a)} \right) g(x^i a)\,(1 - g(x^i a))\, x_j^i \\
  &= \sum_i (y^i - g(x^i a))\, x_j^i \\
  &= \sum_i (y^i - h_a(x^i))\, x_j^i
\end{align*}
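In matrix form (a small bridging step to the update rule on the next slide, with $g$ applied elementwise):
\[ \nabla \log L(a) = X^T \big( y - g(Xa) \big) \]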
Gradient ascent for logistic regression
Algorithm, I

Pseudocode: given $\alpha$, X, y
- Initialize $a = \langle 1, \ldots, 1 \rangle^T$
- Normalize X
- Repeat until convergence
  - $a = a + \frac{\alpha}{m} X^T (y - g(Xa))$
- Output a
Logistic regression
Practical example with Octave

Octave code:

% X is the original m x n data matrix (no all-1s column yet)
a = ones(n+1, 1);     % initial parameter vector (n+1 entries, one per column of X below)
X = studentize(X);    % normalize X (user-defined helper, as before)
X = [ones(m, 1) X];   % prepend all-1s column
for t = 1:100         % repeat 100 times
  D = y - sigmoid(X*a);
  a = a + alpha / m * X' * D;
  % we store consecutive values of J over time t
  G = sigmoid(X*a);
  J(t) = 1/m * (log(G)'*y + log(1-G)'*(1-y));
end
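A hypothetical follow-up (not from the slides): after the loop, classify the training examples with the learned parameters and check convergence.

pred = sigmoid(X*a) >= 0.5;   % predicted labels in {0,1}
accuracy = mean(pred == y)    % fraction of training examples classified correctly
plot(J)                       % the log likelihood J(t) should increase toward 0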