
CSE 788.04: Topics in Machine Learning                          Lecture Date: April 11th, 2012

Lecture 6: Bayesian Logistic Regression

Lecturer: Brian Kulis                                           Scribe: Ziqi Huang

1 Logistic Regression

Logistic Regression is an approach to learning functions of the form $f : X \to Y$, or $P(Y \mid X)$, in the case where $Y$ is discrete-valued and $X = \langle X_1, \ldots, X_n \rangle$ is any vector containing discrete or continuous variables. For a two-class classification problem, the posterior probability of $Y$ can be written as follows:
\[
P(Y=1 \mid X) = \frac{1}{1 + \exp\!\left(-\sum_{i=1}^{n} \theta_i X_i\right)} = \sigma(\theta^T X) \tag{1}
\]
and
\[
P(Y=0 \mid X) = \frac{\exp\!\left(-\sum_{i=1}^{n} \theta_i X_i\right)}{1 + \exp\!\left(-\sum_{i=1}^{n} \theta_i X_i\right)} = 1 - \sigma(\theta^T X), \tag{2}
\]
where $\sigma(\cdot)$ is the logistic sigmoid function defined by
\[
\sigma(a) = \frac{1}{1 + \exp(-a)}, \tag{3}
\]
which is plotted in Figure 1. (Note we are implicitly redefining the data $X$ to add an extra dimension holding a constant 1, as in linear regression, and then redefining $\theta$ appropriately, so that $\theta^T X$ includes a bias term.)

[Figure 1: Plot of the logistic sigmoid function.]

The term "sigmoid" means S-shaped. This type of function is sometimes also called a squashing function because it maps the whole real axis into a finite interval. It satisfies the symmetry property
\[
\sigma(-a) = 1 - \sigma(a). \tag{4}
\]
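To make equations (1)-(4) concrete, here is a short Python sketch (not part of the original notes; the weight vector and data point are illustrative placeholders) that evaluates the two class posteriors for a given weight vector:

import numpy as np

# A minimal sketch of equations (1)-(4); theta and x below are illustrative
# placeholders, and x is assumed to already contain the extra constant-1
# entry mentioned above.

def sigmoid(a):
    # logistic sigmoid, eq. (3); note sigmoid(-a) == 1 - sigmoid(a), eq. (4)
    return 1.0 / (1.0 + np.exp(-a))

def class_posteriors(theta, x):
    p1 = sigmoid(theta @ x)   # P(Y = 1 | X), eq. (1)
    return p1, 1.0 - p1       # and P(Y = 0 | X), eq. (2)

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.7])    # the leading 1.0 plays the role of the bias dimension
print(class_posteriors(theta, x))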

Interestingly, the parametric form of $P(Y \mid X)$ used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier.

1.1 Form of $P(Y \mid X)$ for Gaussian Naive Bayes Classifier

We derive the form of $P(Y \mid X)$ entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier. Consider a GNB based on the following modeling assumptions:

- $Y$ is Boolean, governed by a Bernoulli distribution with parameter $\pi = P(Y = 1)$.
- $X = \langle X_1, \ldots, X_n \rangle$, where each $X_i$ is a continuous random variable.
- For each $X_i$, $P(X_i \mid Y = y_k)$ is a Gaussian distribution of the form $N(\mu_{ik}, \sigma_i)$ (in many cases, this will simply be $N(\mu_k, \sigma)$).
- For all $i$ and $j \neq i$, $X_i$ and $X_j$ are conditionally independent given $Y$.

Note here we are assuming the standard deviations $\sigma_i$ vary from feature to feature, but do not depend on $Y$. We now derive the parametric form of $P(Y \mid X)$ that follows from this set of GNB assumptions. In general, Bayes rule allows us to write
\[
P(Y=1 \mid X) = \frac{P(Y=1)\, P(X \mid Y=1)}{P(Y=1)\, P(X \mid Y=1) + P(Y=0)\, P(X \mid Y=0)}. \tag{5}
\]

Dividing both the numerator and denominator by the numerator yields:
\begin{align}
P(Y=1 \mid X) &= \frac{1}{1 + \frac{P(Y=0)\,P(X \mid Y=0)}{P(Y=1)\,P(X \mid Y=1)}} \tag{6} \\
&= \frac{1}{1 + \exp\!\left(\ln \frac{P(Y=0)\,P(X \mid Y=0)}{P(Y=1)\,P(X \mid Y=1)}\right)} \tag{7} \\
&= \frac{1}{1 + \exp\!\left(\ln \frac{P(Y=0)}{P(Y=1)} + \sum_i \ln \frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\right)} \tag{8} \\
&= \frac{1}{1 + \exp\!\left(\ln \frac{1-\pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\right)} \tag{9}
\end{align}
Here the sum over $i$ in (8) follows from the Naive Bayes conditional independence assumption, and the final step expresses $P(Y=0)$ and $P(Y=1)$ in terms of the Bernoulli parameter $\pi$.


Now consider just the summation in the denominator of equation (9). Given our assumption that $P(X_i \mid Y = y_k)$ is Gaussian, we can expand this term as follows:
\begin{align}
\sum_i \ln \frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
&= \sum_i \ln \frac{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(X_i - \mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(X_i - \mu_{i1})^2}{2\sigma_i^2}\right)} \tag{10} \\
&= \sum_i \ln \exp\!\left(\frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}\right) \tag{11} \\
&= \sum_i \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} \tag{12} \\
&= \sum_i \frac{(X_i^2 - 2X_i\mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2X_i\mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2} \tag{13} \\
&= \sum_i \frac{2X_i(\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \tag{14} \\
&= \sum_i \left(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right) \tag{15}
\end{align}
Note this expression is a linear weighted sum of the $X_i$'s. Substituting expression (15) back into equation (9), we have
\[
P(Y=1 \mid X) = \frac{1}{1 + \exp\!\left(\ln \frac{1-\pi}{\pi} + \sum_i \left(\frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)\right)} \tag{16}
\]
Or equivalently,
\[
P(Y=1 \mid X) = \frac{1}{1 + \exp\!\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)} \tag{17}
\]
where the weights $w_1, \ldots, w_n$ are given by
\[
w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} \tag{18}
\]
and
\[
w_0 = \ln \frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}. \tag{19}
\]
Then, defining $\theta_i = -w_i$ (and folding the bias $\theta_0 = -w_0$ into the vector by appending a constant 1 to $X$, as before), we recover exactly the logistic form of equations (1) and (2):
\[
P(Y=1 \mid X) = \sigma(\theta^T X) \tag{20}
\]
and also we have
\[
P(Y=0 \mid X) = 1 - \sigma(\theta^T X). \tag{21}
\]

To summarize, the logistic form arises naturally from a generative model. However, since the number of parameters in a generative model is often larger than the number of parameters in the logistic regression model, one often prefers working directly with the logistic regression model to find the parameters $W$. This is a discriminative approach to classification, as we directly model the probabilities over the class labels.
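As an illustration of equations (16)-(21), the following Python sketch (not part of the original notes; all parameter values are made up) checks numerically that the GNB posterior computed by Bayes rule coincides with the logistic form built from the weights above:

import numpy as np

# Numerical check of eqs. (16)-(21): for a Gaussian Naive Bayes model with
# class-independent variances, the posterior P(Y=1|x) from Bayes' rule matches
# the logistic form sigma(theta0 + theta^T x). All parameters are synthetic.

rng = np.random.default_rng(0)
n = 3                                    # number of features
pi = 0.4                                 # P(Y = 1)
mu0 = rng.normal(size=n)                 # class-0 means mu_{i0}
mu1 = rng.normal(size=n)                 # class-1 means mu_{i1}
sigma2 = rng.uniform(0.5, 2.0, size=n)   # per-feature variances sigma_i^2

def gnb_posterior(x):
    # P(Y=1|x) via Bayes' rule with Gaussian class-conditionals, eq. (9)
    log_ratio = np.log((1 - pi) / pi) + np.sum(
        ((x - mu1) ** 2 - (x - mu0) ** 2) / (2 * sigma2))
    return 1.0 / (1.0 + np.exp(log_ratio))

# Logistic weights implied by the GNB parameters (eqs. (18)-(19), with the
# sign flip theta = -w so that P(Y=1|x) = sigma(theta0 + theta^T x)).
theta = (mu1 - mu0) / sigma2
theta0 = np.log(pi / (1 - pi)) + np.sum((mu0 ** 2 - mu1 ** 2) / (2 * sigma2))

def logistic_posterior(x):
    return 1.0 / (1.0 + np.exp(-(theta0 + theta @ x)))

x = rng.normal(size=n)
print(gnb_posterior(x), logistic_posterior(x))   # the two values agree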


2 Estimating Parameters for Logistic Regression

One reasonable approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood. We choose parameters
\[
W \leftarrow \arg\max_W \prod_i P(Y_i \mid X_i, W)
\]
where $W = \langle \theta_0, \theta_1, \ldots, \theta_n \rangle$ is the vector of parameters to be estimated, $Y_i$ denotes the observed value of $Y$ in the $i$-th training example, and $X_i$ denotes the observed value of $X$ in the $i$-th training example. Equivalently, we can work with the log of the conditional likelihood:
\[
W \leftarrow \arg\max_W \sum_i \ln P(Y_i \mid X_i, W)
\]

And
\begin{align}
\sum_{i=1}^{N} \ln P(Y_i \mid X_i, W)
&= \sum_{i=1}^{N} \Big[ Y_i \ln P(Y_i = 1 \mid X_i, W) + (1 - Y_i) \ln P(Y_i = 0 \mid X_i, W) \Big] \tag{22} \\
&= \sum_{i=1}^{N} \Big[ Y_i \ln \sigma(W^T X_i) + (1 - Y_i) \ln\big(1 - \sigma(W^T X_i)\big) \Big] \tag{23} \\
&= \sum_{i=1}^{N} \left[ Y_i \ln \frac{\sigma(W^T X_i)}{1 - \sigma(W^T X_i)} + \ln\big(1 - \sigma(W^T X_i)\big) \right] \tag{24}
\end{align}

As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form
\begin{align}
E(W) &= -\ln p(Y \mid W) \tag{25} \\
&= -\sum_{n=1}^{N} \Big[ Y_n \ln \hat{y}_n + (1 - Y_n) \ln(1 - \hat{y}_n) \Big] \tag{26}
\end{align}
where $\hat{y}_n = \sigma(W^T X_n)$ denotes the predicted probability $P(Y_n = 1 \mid X_n, W)$.

Unfortunately, there is no closed form solution to maximizing the likelihood with respect to $W$. Note also that maximum likelihood can severely overfit when the training data are linearly separable: the magnitude of $W$ can be driven to infinity, with $\sigma$ approaching a step function. This singularity can be avoided by inclusion of a prior and finding a MAP solution for $W$, or equivalently by adding a regularization term to the error function.
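Since there is no closed form, the parameters are found numerically. As a point of reference, a minimal batch gradient ascent sketch on the conditional log likelihood (22)-(24) is shown below (illustrative only, not from the original notes; the learning rate and iteration count are arbitrary choices):

import numpy as np

# A minimal sketch of maximizing the conditional log likelihood by gradient
# ascent. X is assumed to already contain a leading column of 1s for the bias.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_mle(X, Y, lr=0.1, n_iters=2000):
    W = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y_hat = sigmoid(X @ W)       # P(Y = 1 | X_n, W) for every example
        grad = X.T @ (Y - y_hat)     # gradient of the log likelihood
        W += lr * grad / len(Y)
    return W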

2.1 Iterative reweighted least squares

In the case of the linear regression models, the maximum likelihood solution, on the assumption of a Gaussian noise model, leads to a closed-form solution. For logistic regression, there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. However, the departure from a quadratic form is not substantial. To be precise, the error function is convex, as we shall see shortly, and hence has a unique minimum. Furthermore, the error function can be minimized by an efficient iterative technique based on the Newton-Raphson iterative optimization scheme, which uses a local quadratic approximation to the log likelihood function:
\[
W^{(new)} = W^{(old)} - \big[\nabla\nabla E(W^{(old)})\big]^{-1} \nabla E(W^{(old)}). \tag{27}
\]
To see how this works, first apply the update to the sum-of-squares error of linear regression. Then we can derive
\begin{align}
\nabla E(W) &= \sum_{n=1}^{N} (W^T X_n - Y_n)\, X_n \tag{28} \\
&= X^T X W - X^T Y \tag{29}
\end{align}


\begin{align}
\nabla\nabla E(W) &= \sum_{n=1}^{N} X_n X_n^T \tag{30} \\
&= X^T X \tag{31}
\end{align}
Plugging into equation (27), we can derive
\begin{align}
W^{(new)} &= W^{(old)} - (X^T X)^{-1}\big\{X^T X W^{(old)} - X^T Y\big\} \tag{32} \\
&= (X^T X)^{-1} X^T Y, \tag{33}
\end{align}
which is the standard least-squares solution; because the error function is quadratic, Newton-Raphson reaches it in a single step.

Now let us apply the Newton-Raphson update to the cross-entropy error function (26) for the logistic regression model.
\begin{align}
\nabla E(W) &= \sum_{n=1}^{N} (\hat{y}_n - Y_n)\, X_n \tag{34} \\
&= X^T (\hat{Y} - Y) \tag{35}
\end{align}
\begin{align}
H = \nabla\nabla E(W) &= \sum_{n=1}^{N} \hat{y}_n (1 - \hat{y}_n)\, X_n X_n^T \tag{36} \\
&= X^T R X \tag{37}
\end{align}
where $\hat{Y}$ is the vector with elements $\hat{y}_n = \sigma(W^T X_n)$ and $R$ is the $N \times N$ diagonal matrix with elements $R_{nn} = \hat{y}_n(1 - \hat{y}_n)$. Then we can derive
\begin{align}
W^{(new)} &= W^{(old)} - (X^T R X)^{-1} X^T (\hat{Y} - Y) \tag{38} \\
&= (X^T R X)^{-1}\big\{X^T R X W^{(old)} - X^T (\hat{Y} - Y)\big\} \tag{39} \\
&= (X^T R X)^{-1} X^T R z \tag{40}
\end{align}
where $z = X W^{(old)} - R^{-1}(\hat{Y} - Y)$. Because $R$ depends on $W$, the update must be applied iteratively, recomputing $R$ and $z$ at each step; each step has the form of a weighted least-squares problem, hence the name iterative reweighted least squares (IRLS).
2.2 Regularization in Logistic Regression

Overfitting the training data is a problem that can arise in Logistic Regression, especially when the data is very high dimensional and the training data is sparse. One approach to reducing overfitting is regularization, in which we create a modified, penalized log likelihood function that penalizes large values of $W$. One approach is to use the penalized log likelihood
\[
W \leftarrow \arg\max_W \sum_i \ln P(Y_i \mid X_i, W) - \frac{\lambda}{2}\|W\|^2, \tag{41}
\]
which adds a penalty proportional to the squared magnitude of $W$. Here $\lambda$ is a constant that determines the strength of this penalty term.
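In the gradient ascent sketch given earlier, the only change required by (41) is an extra $-\lambda W$ term in the gradient. A minimal, illustrative modification (in practice the bias weight $\theta_0$ is often left unpenalized, which is omitted here for brevity):

import numpy as np

# Sketch of penalized-likelihood (MAP) fitting per eq. (41); same setup as the
# earlier gradient ascent sketch, with an added -lam * W term in the gradient.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_map(X, Y, lam=1.0, lr=0.1, n_iters=2000):
    W = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (Y - sigmoid(X @ W)) - lam * W   # penalized gradient
        W += lr * grad / len(Y)
    return W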

3 The Bayesian Setting

We now make the following assumptions:

- Prior: $P(W) = N(W \mid 0, \sigma_0^2 I)$.
- Likelihood: $P(Y = 1 \mid X, W) = \sigma(W^T X)$.
- Posterior: $P(W \mid \vec{Y}, X) \propto P(\vec{Y} \mid X, W)\, P(W)$.

Then for a new data point $X_{new}$, we can derive the predictive distribution:
\[
P(Y_{new} \mid X, \vec{Y}, X_{new}) = \int P(Y_{new} \mid W, X_{new})\, P(W \mid \vec{Y}, X)\, dW \tag{42}
\]
Since $P(Y_{new} \mid W, X_{new})$ is a logistic sigmoid in $W$ while the posterior $P(W \mid \vec{Y}, X)$ is proportional to a product of sigmoids with a Gaussian prior, there is no closed form for $P(Y_{new} \mid X, \vec{Y}, X_{new})$. There are several approaches to approximating the predictive distribution: the Laplace approximation, variational methods, and Monte Carlo sampling are three of the main ones. Below we focus on the Laplace approximation.
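Of the three, Monte Carlo sampling is the most direct to write down: given any tractable approximation to the posterior (for instance the Gaussian produced by the Laplace approximation below), the integral in (42) can be estimated by averaging $\sigma(W^T X_{new})$ over posterior samples. A minimal illustrative sketch, in which the approximate posterior mean and covariance are placeholder inputs rather than quantities computed here:

import numpy as np

# Monte Carlo approximation of the predictive integral (42), assuming a
# Gaussian approximation N(w_map, S_N) to P(W | Y, X) is already available.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_mc(x_new, w_map, S_N, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    W_samples = rng.multivariate_normal(w_map, S_N, size=n_samples)
    return sigmoid(W_samples @ x_new).mean()   # approximate P(Y_new = 1 | x_new)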

4 The Laplace Approximation

In this section, we introduce a framework called the Laplace approximation, which aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Suppose we have a function $f(x)$ for which $\int \exp(N f(x))\, dx$ has no closed form. In the Laplace method the goal is to find a Gaussian approximation which is centred on a mode of the integrand. The first step is therefore to find a mode of $f(x)$, in other words a point $x_0$ such that $f'(x_0) = 0$. A Gaussian distribution has the property that its logarithm is a quadratic function of the variables, so we consider a Taylor expansion of $f(x)$ around $x_0$:
\[
f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 \tag{43}
\]
Since $f'(x_0) = 0$, and $f''(x_0) = -|f''(x_0)|$ at a mode,
\[
f(x) \approx f(x_0) - \frac{|f''(x_0)|}{2!}(x - x_0)^2. \tag{44}
\]
Therefore,
\begin{align}
\int \exp(N f(x))\, dx &\approx \int \exp\!\left(N\Big[f(x_0) - \frac{|f''(x_0)|}{2!}(x - x_0)^2\Big]\right) dx \tag{45} \\
&= \exp(N f(x_0)) \int \exp\!\left(-\frac{N |f''(x_0)|}{2!}(x - x_0)^2\right) dx \tag{46} \\
&= \exp(N f(x_0)) \sqrt{\frac{2\pi}{N |f''(x_0)|}}. \tag{47}
\end{align}
So we get an approximate closed form solution for the integral. Note that the approximation is accurate to order $O(1/N)$. Finally, given a distribution $p(x)$, the Laplace approximation replaces it with a Gaussian whose mean is a mode $x_0$ of $p$ and whose precision is $A = -\frac{d^2}{dx^2} \ln p(x)\big|_{x=x_0}$.
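A one-dimensional Python sketch of this procedure (the example density is an arbitrary illustrative choice, and the mode and curvature are found numerically rather than analytically):

import numpy as np
from scipy.optimize import minimize_scalar

# 1-D Laplace approximation: locate a mode x0 of an unnormalized density, then
# use the curvature of its log at x0 as the precision of the approximating
# Gaussian. Example density: p(x) proportional to exp(-x^2/2) * sigma(4x + 2).

def log_p(x):
    return -0.5 * x**2 - np.log1p(np.exp(-(4.0 * x + 2.0)))

# Step 1: find a mode x0, i.e. a point where the derivative of log p vanishes.
x0 = minimize_scalar(lambda x: -log_p(x)).x

# Step 2: precision A = -(d^2/dx^2) log p(x) at x0, here via finite differences.
h = 1e-4
A = -(log_p(x0 + h) - 2.0 * log_p(x0) + log_p(x0 - h)) / h**2

print(f"Gaussian approximation: mean {x0:.4f}, variance {1.0 / A:.4f}")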

4.1 Example: $P(W \mid \vec{Y}, X) \propto P(\vec{Y} \mid W, X)\, P(W)$

As we introduced before, for the predictive distribution
\begin{align}
P(Y_{new}=1 \mid X, \vec{Y}, X_{new}) &= \int P(Y_{new}=1 \mid W, X_{new})\, P(W \mid \vec{Y}, X)\, dW \tag{48} \\
&= \int \sigma(W^T X_{new})\, P(W \mid \vec{Y}, X)\, dW \tag{49}
\end{align}
we cannot get a closed form solution. We will use the Laplace approximation to approximate the posterior $P(W \mid \vec{Y}, X)$ by a Gaussian $N(W \mid W_{MAP}, S_N)$, and we further exploit the close similarity between the logistic sigmoid $\sigma(a)$ and the probit function $\Phi(\lambda a)$ to obtain an approximate solution for the predictive distribution. Since $\sigma(W^T X_{new})$ depends on $W$ only through the scalar $a = W^T X_{new}$, we can write
\[
\sigma(W^T X_{new}) = \int \delta(a - W^T X_{new})\, \sigma(a)\, da \tag{50}
\]
and hence
\[
\int \sigma(W^T X_{new})\, P(W \mid \vec{Y}, X)\, dW = \int \sigma(a)\, p(a)\, da \tag{51}
\]
where
\[
p(a) = \int \delta(a - W^T X_{new})\, P(W \mid \vec{Y}, X)\, dW.
\]
Because $P(W \mid \vec{Y}, X)$ is approximated by a Gaussian and $a$ is a linear function of $W$, the marginal $p(a)$ is also Gaussian, $p(a) = N(a \mid \mu_a, \sigma_a^2)$. Then we can derive
\[
\int \sigma(a)\, N(a \mid \mu_a, \sigma_a^2)\, da \approx \sigma\big(\kappa(\sigma_a^2)\, \mu_a\big) \tag{52}
\]
where $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$, and
\begin{align}
\mu_a &= \int p(a)\, a\, da \tag{53} \\
&= \int P(W \mid \vec{Y}, X)\, W^T X_{new}\, dW \tag{54} \\
&= W_{MAP}^T X_{new} \tag{55}
\end{align}
and also we can derive
\begin{align}
\sigma_a^2 &= \int p(a)\,(a^2 - \mu_a^2)\, da \tag{56} \\
&= \int P(W \mid \vec{Y}, X)\,\big((W^T X_{new})^2 - (W_{MAP}^T X_{new})^2\big)\, dW \tag{57} \\
&= X_{new}^T S_N X_{new}. \tag{58}
\end{align}
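Putting the pieces together, the approximate predictive probability is $\sigma(\kappa(\sigma_a^2)\,\mu_a)$. A short illustrative sketch, assuming $W_{MAP}$ and $S_N$ have already been obtained from the Laplace approximation and are simply passed in:

import numpy as np

# Approximate predictive distribution from eqs. (52)-(58), with w_map and S_N
# treated as precomputed inputs (mode and covariance of the Laplace posterior).

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_predictive(x_new, w_map, S_N):
    mu_a = w_map @ x_new                               # eq. (55)
    var_a = x_new @ S_N @ x_new                        # eq. (58)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)   # kappa(sigma_a^2) from (52)
    return sigmoid(kappa * mu_a)                       # approximate P(Y_new = 1 | x_new)

In practice, $W_{MAP}$ can be found by IRLS on the penalized objective (41), and $S_N$ is the inverse Hessian of the negative log posterior evaluated at $W_{MAP}$.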
