
Logistic Regression

compiled by Alvin Wan from Professor Jitendra Malik's lecture

1 Univariate Derivation for Logistic Regression

Let us consider the single-variable case, when

\Pr(x \mid C_1) \sim \mathcal{N}(\mu_1, \sigma^2), \quad \Pr(C_1) = \pi_1
\Pr(x \mid C_2) \sim \mathcal{N}(\mu_2, \sigma^2), \quad \Pr(C_2) = \pi_2

We find that

\Pr(C_1 \mid x) = \frac{1}{1 + \exp(-z)}

where z = \lambda x + \beta, with parameters \lambda and \beta. We won't prove the univariate case for \Pr(C_1 \mid x)
here; instead, we'll prove it for the more general multivariate case below.
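The univariate result can be checked numerically. The sketch below is not part of the original notes; the particular means, variance, and priors are arbitrary choices. It computes \Pr(C_1 \mid x) directly by Bayes' rule and compares it against the sigmoid, with \lambda = (\mu_1 - \mu_2)/\sigma^2 and \beta = (\mu_2^2 - \mu_1^2)/(2\sigma^2) + \log(\pi_1/\pi_2), the univariate specialization of the multivariate derivation in Section 2.

```python
import math

# Arbitrary example parameters (not from the notes).
mu1, mu2, sigma = 1.0, -1.0, 0.8
pi1, pi2 = 0.3, 0.7

def gauss(x, mu, sigma):
    # Univariate Gaussian density N(mu, sigma^2).
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posterior(x):
    # Pr(C1 | x) computed directly from Bayes' rule.
    num = gauss(x, mu1, sigma) * pi1
    return num / (num + gauss(x, mu2, sigma) * pi2)

# Parameters of z = lam * x + beta, read off from the derivation.
lam = (mu1 - mu2) / sigma ** 2
beta = (mu2 ** 2 - mu1 ** 2) / (2 * sigma ** 2) + math.log(pi1 / pi2)

for x in [-2.0, 0.0, 0.5, 3.0]:
    assert abs(posterior(x) - 1 / (1 + math.exp(-(lam * x + beta)))) < 1e-12
```

The check passes for any choice of means, shared variance, and priors, which is exactly what the derivation promises.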

2 Multivariate Derivation for Logistic Regression

2.1 Logit

Before we begin the derivation, here is the reason for the so-called logit expression. Begin
with some probability p.

A linear model for the probability itself doesn't make sense, since we only have 0 \le p \le 1
for any probability distribution. Instead, take the odds, which are the probability of success
over the probability of failure.

\frac{p}{1 - p}

This is better, because our range of values is now bounded below by 0. However, this isn't
symmetric, so we take the log of this value to get the log-odds, or the logit.

\mathrm{logit}(p) = \log \frac{p}{1 - p}

This is the reason we consider the logit expression for logistic regression. We will now
begin deriving our result for \Pr(C_1 \mid x).
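As a quick sketch (assuming nothing beyond the definitions above), the logit maps (0, 1) onto all of the real line, is antisymmetric about p = 0.5, and is inverted by the sigmoid:

```python
import math

def logit(p):
    # Log-odds of a probability p in (0, 1).
    return math.log(p / (1 - p))

def sigmoid(z):
    # Inverse of the logit; maps any real z back into (0, 1).
    return 1 / (1 + math.exp(-z))

assert abs(logit(0.5)) < 1e-12               # even odds give log-odds of 0
assert abs(logit(0.9) + logit(0.1)) < 1e-12  # symmetric about p = 0.5
for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    assert abs(sigmoid(logit(p)) - p) < 1e-12
```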

2.2 Computing Logit

We'll begin by applying Bayes' rule.

\Pr(C_1 \mid X) = \frac{\Pr(X \mid C_1) \Pr(C_1)}{\Pr(X)}

Then, take the logit of this probability, so that we have the following.

\mathrm{logit}(\Pr(C_1 \mid X)) = \log \frac{\Pr(C_1 \mid X)}{1 - \Pr(C_1 \mid X)}

Note that we're only considering two classes C_1 and C_2, so \Pr(C_1 \mid X) + \Pr(C_2 \mid X) = 1. We
can thus substitute \Pr(C_2 \mid X) for 1 - \Pr(C_1 \mid X).

= \log \frac{\Pr(C_1 \mid X)}{\Pr(C_2 \mid X)}
= \log \frac{\Pr(X \mid C_1) \Pr(C_1) / \Pr(X)}{\Pr(X \mid C_2) \Pr(C_2) / \Pr(X)}
= \log \frac{\Pr(X \mid C_1) \Pr(C_1)}{\Pr(X \mid C_2) \Pr(C_2)}
= \log \frac{\Pr(X \mid C_1)}{\Pr(X \mid C_2)} + \log \frac{\Pr(C_1)}{\Pr(C_2)}

Let us now plug in the Gaussian pdf. We find that the normalizing coefficients cancel, since
both class-conditional densities share the same covariance \Sigma. Since we take the log of these
quantities, we are left with only the terms in the exponents.

= -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_2)^T \Sigma^{-1} (x - \mu_2) + \log \frac{\pi_1}{\pi_2}
= -\frac{1}{2} x^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} x^T \Sigma^{-1} x - \mu_2^T \Sigma^{-1} x + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \log \frac{\pi_1}{\pi_2}

We note that the \frac{1}{2} x^T \Sigma^{-1} x terms cancel out.

= \mu_1^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} x + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \log \frac{\pi_1}{\pi_2}

We can rearrange the terms to get the following expression.

= \mu_1^T \Sigma^{-1} x - \mu_2^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \log \frac{\pi_1}{\pi_2}
= (\mu_1 - \mu_2)^T \Sigma^{-1} x + \left( -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \log \frac{\pi_1}{\pi_2} \right)
= \lambda^T x + \beta
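This result admits a numerical sanity check (not from the lecture; the means, covariance, and priors below are arbitrary choices): with a shared covariance \Sigma, the posterior computed directly by Bayes' rule matches the sigmoid of \lambda^T x + \beta, where \lambda = \Sigma^{-1}(\mu_1 - \mu_2) and \beta = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \log(\pi_1/\pi_2).

```python
import numpy as np

# Arbitrary 2-D example parameters (not from the notes).
mu1, mu2 = np.array([1.0, 0.5]), np.array([-0.5, 1.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
pi1, pi2 = 0.4, 0.6
Sinv = np.linalg.inv(Sigma)

def gauss_pdf(x, mu):
    # Multivariate Gaussian density with the shared covariance Sigma.
    d = x - mu
    norm = 1 / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * d @ Sinv @ d)

# lambda and beta read off from the derivation above.
lam = Sinv @ (mu1 - mu2)
beta = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi1 / pi2)

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    num = gauss_pdf(x, mu1) * pi1
    post = num / (num + gauss_pdf(x, mu2) * pi2)  # Bayes' rule directly
    assert abs(post - 1 / (1 + np.exp(-(lam @ x + beta)))) < 1e-12
```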

2.3 Deriving Pr(C1 |X)

To assess the probability of some output, we will take the inverse of the logit.

\mathrm{logit}(\Pr(C_1 \mid X)) = \lambda^T x + \beta
\log \frac{\Pr(C_1 \mid X)}{1 - \Pr(C_1 \mid X)} = \lambda^T x + \beta
\log \frac{1 - \Pr(C_1 \mid X)}{\Pr(C_1 \mid X)} = -(\lambda^T x + \beta)
\frac{1 - \Pr(C_1 \mid X)}{\Pr(C_1 \mid X)} = e^{-(\lambda^T x + \beta)}
1 - \Pr(C_1 \mid X) = \Pr(C_1 \mid X) e^{-(\lambda^T x + \beta)}
1 = (1 + e^{-(\lambda^T x + \beta)}) \Pr(C_1 \mid X)
\Pr(C_1 \mid X) = \frac{1}{1 + e^{-(\lambda^T x + \beta)}}

We find that many different class-conditional distributions (e.g., Poisson distributions) will
give the same result. We have now found two different forms. We will call the first form
\sigma(x).

\sigma(x) = \Pr(C_1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}
\log \frac{\Pr(C_1 \mid x)}{1 - \Pr(C_1 \mid x)} = \theta^T x

3 Multivariate Derivation for Logistic Classification


However, we're currently interested in a classifier and not regression. So, we need to convert
this real-valued continuous output into a label. Let us define a new random variable Y = 1
to represent class 1 and Y = 0 for class 0. We consider the following.

\Pr(Y = 1 \mid x) = \sigma(x)


\Pr(Y = 0 \mid x) = 1 - \sigma(x)

We can combine both into a single expression; you can convince yourself that the two are
equivalent by plugging in Y = 0 and Y = 1.

\Pr(y \mid x) = \sigma(x)^y (1 - \sigma(x))^{1 - y}
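Plugging in is quick to do in code as well; this two-line check uses a helper name, bernoulli, that is ours rather than the notes':

```python
def bernoulli(y, s):
    # Combined expression s^y (1 - s)^(1 - y) for a label y in {0, 1}.
    return s ** y * (1 - s) ** (1 - y)

s = 0.73  # an arbitrary value standing in for sigma(x)
assert bernoulli(1, s) == s      # y = 1 recovers sigma(x)
assert bernoulli(0, s) == 1 - s  # y = 0 recovers 1 - sigma(x)
```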

3.1 Maximum Likelihood Estimate

We will now compute the likelihood. Before we begin, we will compute \frac{\partial \mu_i}{\partial \theta}, where \mu_i = \sigma(\theta^T x_i).

\frac{\partial \mu_i}{\partial \theta} = \frac{\partial}{\partial \theta} \frac{1}{1 + e^{-\theta^T x_i}}
= -(1 + e^{-\theta^T x_i})^{-2} (-x_i^T e^{-\theta^T x_i})
= \frac{x_i^T e^{-\theta^T x_i}}{1 + e^{-\theta^T x_i}} \cdot \frac{1}{1 + e^{-\theta^T x_i}}
= x_i^T (1 - \mu_i) \mu_i
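The derivative can be sanity-checked by finite differences. The sketch below (not from the notes; the particular values are arbitrary) does so in one dimension, where \theta and x_i are scalars:

```python
import math

def mu(theta, x):
    # The sigmoid mu_i = sigma(theta * x) in one dimension.
    return 1 / (1 + math.exp(-theta * x))

theta, x, h = 0.7, 1.3, 1e-6
# Central finite difference approximates d(mu)/d(theta).
numeric = (mu(theta + h, x) - mu(theta - h, x)) / (2 * h)
# The closed form derived above: x (1 - mu) mu.
analytic = x * (1 - mu(theta, x)) * mu(theta, x)
assert abs(numeric - analytic) < 1e-8
```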

Assume all x_i are drawn independently, and take the product of their probabilities. We will let
D = \{x_1, x_2, \ldots, x_n\} and assume that all priors are identical.

\Pr(D \mid \theta) = L(\theta \mid D) = \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}

As usual, the log-likelihood is easier to work with.

\ell(\theta \mid D) = \log L(\theta \mid D) = \sum_i y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)

Now, compute the gradient with respect to \theta, keeping in mind that the \mu_i are functions of \theta.

\nabla_\theta \ell = \sum_i \frac{y_i}{\mu_i} \frac{\partial \mu_i}{\partial \theta} - \frac{1 - y_i}{1 - \mu_i} \frac{\partial \mu_i}{\partial \theta}

Plugging in what we have above, we have the following:

= \sum_i \left( \frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i} \right) x_i^T (1 - \mu_i) \mu_i
= \sum_i \left( \frac{y_i (1 - \mu_i) - \mu_i (1 - y_i)}{\mu_i (1 - \mu_i)} \right) x_i^T (1 - \mu_i) \mu_i
= \sum_i (y_i (1 - \mu_i) - \mu_i (1 - y_i)) x_i^T
= \sum_i (y_i - y_i \mu_i - \mu_i + \mu_i y_i) x_i^T
= \sum_i (y_i - \mu_i) x_i^T
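We can verify this closed-form gradient against a numerical gradient of the log-likelihood. The synthetic data below is an arbitrary choice, not from the notes:

```python
import numpy as np

# Tiny synthetic data set: 8 points in 3 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=8)
theta = rng.normal(size=3)

def log_likelihood(theta):
    # l(theta | D) = sum_i y_i log mu_i + (1 - y_i) log(1 - mu_i)
    mu = 1 / (1 + np.exp(-X @ theta))
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

mu = 1 / (1 + np.exp(-X @ theta))
grad = (y - mu) @ X  # closed form: sum_i (y_i - mu_i) x_i

# Compare each coordinate against a central finite difference.
h = 1e-6
for j in range(3):
    e = np.zeros(3)
    e[j] = h
    numeric = (log_likelihood(theta + e) - log_likelihood(theta - e)) / (2 * h)
    assert abs(grad[j] - numeric) < 1e-6
```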

We can take the general stochastic gradient update equation \theta^{(t+1)} = \theta^{(t)} + \eta \nabla \ell_i, with
learning rate \eta, and plug in our gradient.

\theta^{(t+1)} = \theta^{(t)} + \eta (y_i - \mu_i) x_i

With online updates in the perceptron algorithm, we take \hat{y}_i^{(t)} = \mathrm{sgn}(\theta^{(t)T} x_i), and then we have the
following:

\theta^{(t+1)} = \theta^{(t)} + \eta (y_i - \hat{y}_i^{(t)}) x_i

For least squares linear regression, we then have the following.

\theta^{(t+1)} = \theta^{(t)} + \eta (y_i - \theta^{(t)T} x_i) x_i

We know that this iterative algorithm converges to the correct solution: it converges to a
local optimum, and for a strictly convex objective, the local optimum is the global optimum.
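Putting the pieces together, here is a minimal training sketch of the update \theta \leftarrow \theta + \eta (y_i - \mu_i) x_i. It is not from the notes; the step size \eta, the epoch count, and the synthetic data are all arbitrary choices.

```python
import numpy as np

# Nearly separable synthetic data with a bias column appended to X.
rng = np.random.default_rng(2)
n = 200
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])
true_theta = np.array([2.0, -1.0, 0.5])
y = (X @ true_theta + 0.1 * rng.normal(size=n) > 0).astype(float)

theta = np.zeros(3)
eta = 0.1  # arbitrary learning rate
for epoch in range(50):
    # One stochastic pass over the data per epoch, in random order.
    for i in rng.permutation(n):
        mu_i = 1 / (1 + np.exp(-theta @ X[i]))
        theta += eta * (y[i] - mu_i) * X[i]

# Classify by thresholding the logit at 0 (i.e., Pr(C1 | x) at 1/2).
preds = (X @ theta > 0).astype(float)
accuracy = np.mean(preds == y)
assert accuracy > 0.9
```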
