1 Logistic Regression
Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X), in the case where Y is discrete-valued and X = ⟨X_1, ..., X_n⟩ is any vector containing discrete or continuous variables. For a two-class classification problem, the posterior probability of Y can be written as follows:

P(Y = 1|X) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{n} w_i X_i\right)} = \sigma(W^T X)    (1)

P(Y = 0|X) = 1 - P(Y = 1|X) = \frac{\exp\left(-\sum_{i=1}^{n} w_i X_i\right)}{1 + \exp\left(-\sum_{i=1}^{n} w_i X_i\right)}    (2)

which is plotted in Figure 1. (Note we are implicitly redefining the data X to include an extra dimension X_0 that is always 1, as in linear regression, and then redefining W appropriately, so that the bias term w_0 is absorbed into the sum.)

Figure 1: Plot of the logistic sigmoid function.

The term sigmoid means S-shaped: the function

\sigma(a) = \frac{1}{1 + \exp(-a)}    (3)

is sometimes also called a squashing function because it maps the whole real axis into a finite interval. It satisfies the following symmetry property:

\sigma(-a) = 1 - \sigma(a)    (4)
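As a concrete illustration, the following minimal sketch (assuming NumPy; the weights and the input vector are invented) computes the two class posteriors of equations (1)-(2), with the constant feature X_0 = 1 prepended so that the bias is absorbed into the weight vector as described above:

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid, equation (3): maps the real line into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights and input; w[0] is the bias, paired with the constant X_0 = 1.
w = np.array([-0.5, 1.2, -0.7])    # w_0, w_1, w_2
x = np.array([1.0, 0.3, 2.0])      # X_0 = 1, X_1, X_2

p_y1 = sigmoid(w @ x)              # P(Y = 1 | X), equation (1)
p_y0 = 1.0 - p_y1                  # P(Y = 0 | X), equation (2)
print(p_y1, p_y0)
```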
Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier.
1.1 Form of P(Y|X) Implied by Gaussian Naive Bayes
We derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier. Consider a GNB based on the following modeling assumptions: Y is Boolean, governed by a Bernoulli distribution with parameter π = P(Y = 1); X = ⟨X_1, ..., X_n⟩, where each X_i is a continuous random variable; for each X_i, P(X_i|Y = y_k) is a Gaussian distribution of the form N(μ_ik, σ_i) (in many cases, this will simply be N(μ_k, σ)); and for all i and j ≠ i, X_i and X_j are conditionally independent given Y. Note here we are assuming the standard deviations σ_i vary from attribute to attribute, but do not depend on Y.

We now derive the parametric form of P(Y|X) that follows from this set of GNB assumptions. In general, Bayes rule allows us to write

P(Y = 1|X) = \frac{P(Y = 1)\, P(X|Y = 1)}{P(Y = 1)\, P(X|Y = 1) + P(Y = 0)\, P(X|Y = 0)}    (5)
Dividing both the numerator and denominator by the numerator yields:

P(Y = 1|X) = \frac{1}{1 + \frac{P(Y=0)\,P(X|Y=0)}{P(Y=1)\,P(X|Y=1)}}    (6)

= \frac{1}{1 + \exp\left(\ln \frac{P(Y=0)\,P(X|Y=0)}{P(Y=1)\,P(X|Y=1)}\right)}    (7)

= \frac{1}{1 + \exp\left(\ln \frac{P(Y=0)}{P(Y=1)} + \sum_i \ln \frac{P(X_i|Y=0)}{P(X_i|Y=1)}\right)}    (8)

= \frac{1}{1 + \exp\left(\ln \frac{1-\pi}{\pi} + \sum_i \ln \frac{P(X_i|Y=0)}{P(X_i|Y=1)}\right)}    (9)

Note that step (8) uses the conditional independence assumption, and the final step expresses P(Y=0) and P(Y=1) in terms of the Bernoulli parameter π.
Now consider just the summation in the denominator of equation (9). Given our assumption that P(X_i|Y = y_k) is Gaussian, we can expand this term as follows:

\sum_i \ln \frac{P(X_i|Y=0)}{P(X_i|Y=1)} = \sum_i \ln \frac{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(X_i - \mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(X_i - \mu_{i1})^2}{2\sigma_i^2}\right)}    (10)

= \sum_i \ln \exp\left(\frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}\right)    (11)

= \sum_i \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2}    (12)

= \sum_i \frac{(X_i^2 - 2X_i\mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2X_i\mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2}    (13)

= \sum_i \frac{2X_i(\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}    (14)

= \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)    (15)
Note this expression is a linear weighted sum of the X_i's. Substituting expression (15) back into equation (9), we have

P(Y = 1|X) = \frac{1}{1 + \exp\left(\ln\frac{1-\pi}{\pi} + \sum_i \left(\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\, X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}\right)\right)}    (16)

Or equivalently,

P(Y = 1|X) = \frac{1}{1 + \exp\left(w_0 + \sum_{i=1}^{n} w_i X_i\right)}    (17)

where the weights w_1, ..., w_n are given by

w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}    (18)

and

w_0 = \ln\frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}    (19)

Noting that 1/(1 + e^a) = \sigma(-a), and absorbing w_0 into W via the constant feature X_0 = 1 with the sign of the exponent folded into the weights, we can then derive

P(Y = 1|X) = \sigma(W^T X)    (20)

and also

P(Y = 0|X) = 1 - \sigma(W^T X)    (21)
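The correspondence in equations (16)-(19) can be checked numerically. The following is a minimal sketch (assuming NumPy; the GNB parameters and the test input are invented for illustration) that builds the logistic weights from π, μ_ik and σ_i and compares the posterior of equation (17) with the posterior obtained directly from Bayes rule (5):

```python
import numpy as np

def gaussian(x, mu, sigma):
    # Univariate Gaussian density N(x | mu, sigma).
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Invented GNB parameters for n = 2 features.
pi = 0.4                        # P(Y = 1)
mu0 = np.array([1.0, -2.0])     # means given Y = 0
mu1 = np.array([0.5,  1.5])     # means given Y = 1
sigma = np.array([0.8, 1.3])    # per-attribute std devs (shared across classes)

x = np.array([0.7, 0.2])        # a test input

# Posterior directly from Bayes rule, equation (5), with the naive Bayes factorization.
lik0 = np.prod(gaussian(x, mu0, sigma))
lik1 = np.prod(gaussian(x, mu1, sigma))
p1_bayes = pi * lik1 / (pi * lik1 + (1 - pi) * lik0)

# Logistic weights implied by the GNB assumptions, equations (18)-(19).
w = (mu0 - mu1) / sigma ** 2
w0 = np.log((1 - pi) / pi) + np.sum((mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2))

# Posterior in the logistic form of equation (17).
p1_logistic = 1.0 / (1.0 + np.exp(w0 + w @ x))

print(p1_bayes, p1_logistic)    # the two values agree (up to rounding)
```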
To summarize, the logistic form arises naturally from a generative model. However, since the number of parameters in a generative model is often larger than the number of parameters in the logistic regression model, one often prefers working directly with the logistic regression model to find the parameters W. This is a discriminative approach to classification, as we directly model the probabilities over the class labels.
2 Maximum Likelihood Estimation

One reasonable approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood. That is, we choose parameters W such that

W \leftarrow \arg\max_W \prod_i P(Y_i|X_i, W)    (22)

where W = ⟨w_0, w_1, ..., w_n⟩ is the vector of parameters to be estimated, Y_i denotes the observed value of Y in the ith training example, and X_i denotes the observed value of X in the ith training example. Equivalently, we can work with the log of the conditional likelihood:

W \leftarrow \arg\max_W \sum_i \ln P(Y_i|X_i, W)    (23)
Expanding this sum over the N training examples,

\sum_{i=1}^{N} \ln P(Y_i|X_i, W) = \sum_{i=1}^{N} \left[ Y_i \ln P(Y_i = 1|X_i, W) + (1 - Y_i) \ln P(Y_i = 0|X_i, W) \right]

= \sum_{i=1}^{N} \left[ Y_i \ln \sigma(W^T X_i) + (1 - Y_i) \ln\left(1 - \sigma(W^T X_i)\right) \right]    (24)
As usual, we can define an error function by taking the negative logarithm of the likelihood, which gives the cross-entropy error function in the form

E(W) = -\ln p(Y|W)    (25)

= -\sum_{n=1}^{N} \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]    (26)

where \hat{y}_n = \sigma(W^T x_n) denotes the predicted probability that the nth training example has label 1.

Unfortunately, there is no closed-form solution to maximizing the likelihood with respect to W. Note also that maximum likelihood can exhibit severe overfitting for data sets that are linearly separable: the magnitude of W is driven to infinity and the sigmoid becomes arbitrarily steep, assigning probability 1 to every training label. This singularity can be avoided by inclusion of a prior and finding a MAP solution for W, or equivalently by adding a regularization term to the error function.
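Although no closed-form maximizer exists, the cross-entropy error (26) can be minimized numerically. The following is a minimal gradient-descent sketch (assuming NumPy; the data set, step size and iteration count are invented for illustration) using the gradient ∇E(W) = X^T(Ŷ − Y), which is derived in the next subsection:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Invented, non-separable training set: a constant feature plus one real feature.
X = np.array([[1.0, -2.0],
              [1.0, -1.0],
              [1.0, -0.5],
              [1.0,  0.5],
              [1.0,  1.0],
              [1.0,  2.0]])
Y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

W = np.zeros(X.shape[1])
eta = 0.1                             # arbitrary step size
for _ in range(1000):
    Y_hat = sigmoid(X @ W)            # predicted P(Y = 1 | X_i, W)
    grad = X.T @ (Y_hat - Y)          # gradient of the cross-entropy error (26)
    W -= eta * grad                   # gradient descent on E(W)

Y_hat = sigmoid(X @ W)
E = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
print(W, E)                           # fitted weights and final cross-entropy error
```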
2.1 Newton-Raphson and Iteratively Reweighted Least Squares
In the case of the linear regression model, the maximum likelihood solution, on the assumption of a Gaussian noise model, leads to a closed-form solution. For logistic regression, there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid function. However, the departure from a quadratic form is not substantial. To be precise, the error function is convex, as we shall see shortly, and hence has a unique minimum. Furthermore, the error function can be minimized by an efficient iterative technique based on the Newton-Raphson iterative optimization scheme, which uses a local quadratic approximation to the log likelihood function. The Newton-Raphson update takes the form

W^{(new)} = W^{(old)} - H^{-1} \nabla E(W)    (27)

where H is the Hessian matrix of second derivatives of E(W) with respect to the components of W.

To build intuition, we first apply this update to the linear regression model with the sum-of-squares error function. Writing X for the design matrix whose nth row is x_n^T and Y for the vector of targets, we can derive

\nabla E(W) = \sum_{n=1}^{N} (W^T x_n - y_n)\, x_n    (28)

= X^T X W - X^T Y    (29)

H = \nabla\nabla E(W) = \sum_{n=1}^{N} x_n x_n^T    (30)

= X^T X    (31)

so the Newton-Raphson update becomes

W^{(new)} = W^{(old)} - (X^T X)^{-1}\left( X^T X W^{(old)} - X^T Y \right)    (32)

= (X^T X)^{-1} X^T Y    (33)

which is the standard least-squares solution. Because the error function is exactly quadratic in this case, the update reaches the solution in a single step.
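As a quick numerical check of equations (32)-(33), the sketch below (assuming NumPy, with invented data) verifies that a single Newton-Raphson step from an arbitrary starting point reproduces the closed-form least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])   # design matrix with constant feature
Y = 2.0 * X[:, 1] - 1.0 + 0.1 * rng.normal(size=20)       # noisy linear targets

W_old = rng.normal(size=2)                                 # arbitrary starting point
grad = X.T @ X @ W_old - X.T @ Y                           # equation (29)
H = X.T @ X                                                # equation (31)
W_new = W_old - np.linalg.solve(H, grad)                   # Newton step, equation (32)

W_ls = np.linalg.solve(X.T @ X, X.T @ Y)                   # closed-form solution, equation (33)
print(np.allclose(W_new, W_ls))                            # True: one step suffices
```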
Now let us apply the Newton-Raphson update to the cross-entropy error function (26) for the logistic regression model. The gradient and Hessian of this error function are

\nabla E(W) = \sum_{n=1}^{N} (\hat{y}_n - y_n)\, x_n    (34)

= X^T (\hat{Y} - Y)    (35)

H = \nabla\nabla E(W) = \sum_{n=1}^{N} \hat{y}_n (1 - \hat{y}_n)\, x_n x_n^T    (36)

= X^T R X    (37)

where R is the N × N diagonal matrix with elements R_{nn} = \hat{y}_n (1 - \hat{y}_n). Then we can derive

W^{(new)} = W^{(old)} - (X^T R X)^{-1} X^T (\hat{Y} - Y)    (38)

= (X^T R X)^{-1}\left\{ X^T R X W^{(old)} - X^T (\hat{Y} - Y) \right\}    (39)

= (X^T R X)^{-1} X^T R z    (40)

where z = X W^{(old)} - R^{-1}(\hat{Y} - Y). Because the weighting matrix R depends on W, the update must be applied iteratively, recomputing R from the current weights each time; this scheme is known as iteratively reweighted least squares (IRLS), since each step solves a weighted least-squares problem with effective target vector z.
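The updates (34)-(40) translate directly into code. The following is a minimal IRLS sketch (assuming NumPy; the data are invented, and the small constant added to R is an implementation safeguard against ŷ_n reaching 0 or 1 exactly, not part of the derivation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Invented, non-separable data: constant feature plus one real feature.
X = np.array([[1.0, -2.0],
              [1.0, -0.5],
              [1.0,  0.3],
              [1.0,  0.6],
              [1.0,  1.5]])
Y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

W = np.zeros(X.shape[1])
for _ in range(10):                           # Newton-Raphson / IRLS iterations
    Y_hat = sigmoid(X @ W)
    R = np.diag(Y_hat * (1 - Y_hat) + 1e-10)  # diagonal weighting matrix R, equation (37)
    grad = X.T @ (Y_hat - Y)                  # gradient, equation (35)
    H = X.T @ R @ X                           # Hessian, equation (37)
    W = W - np.linalg.solve(H, grad)          # update, equation (38)

print(W)
```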
2.2 Regularization
Overfitting the training data is a problem that can arise in Logistic Regression, especially when the data are very high dimensional and the training data are sparse. One approach to reducing overfitting is regularization, in which we create a modified "penalized" log likelihood function that penalizes large values of W. One choice is the penalized log likelihood

W \leftarrow \arg\max_W \sum_i \ln P(Y_i|X_i, W) - \frac{\lambda}{2}\, \|W\|^2    (41)

which adds a penalty proportional to the squared magnitude of W. Here λ is a constant that determines the strength of this penalty term.
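In terms of optimization, the penalty simply adds λW to the gradient of the error function (and λI to its Hessian). The sketch below (assuming NumPy; the data and the value of λ are invented, and for brevity the bias weight is penalized along with the rest) repeats the earlier gradient-descent sketch with the regularized gradient; the weights now remain finite even though the data are linearly separable:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Linearly separable toy data: without regularization, |W| would grow without bound.
X = np.array([[1.0, -2.0],
              [1.0, -0.5],
              [1.0,  0.5],
              [1.0,  2.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])

lam = 0.1                                  # regularization strength (arbitrary choice)
W = np.zeros(X.shape[1])
eta = 0.1
for _ in range(2000):
    Y_hat = sigmoid(X @ W)
    grad = X.T @ (Y_hat - Y) + lam * W     # gradient of the penalized error, cf. (41)
    W -= eta * grad

print(W)                                   # finite MAP-style weights
```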
3 Bayesian Logistic Regression

In a Bayesian treatment of logistic regression we place a prior P(W) over the weights. The likelihood has the logistic form

P(Y|X, W) \propto \sigma(W^T X)

and the posterior over the weights is therefore

P(W|Y, X) \propto P(Y|X, W)\, P(W) \propto \sigma(W^T X)\, P(W)

Then, for a new data point X_new, we can derive the predictive distribution:

P(Y_{new}|X, Y, X_{new}) = \int P(Y_{new}|W, X_{new})\, P(W|Y, X)\, dW    (42)
Since P(Y_{new}|W, X_{new}) is a logistic sigmoid in W while P(W|Y, X) is (approximately) a Normal distribution, the integral of their product has no closed form, and hence neither does P(Y_{new}|X, Y, X_{new}). There are several approaches to approximating the predictive distribution: the Laplace approximation, variational methods, and Monte Carlo sampling are three of the main ones. A small Monte Carlo sketch is given below for comparison; the rest of these notes focuses on the Laplace approximation.
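The Monte Carlo approach is the simplest: draw weight vectors from (an approximation to) the posterior and average the resulting sigmoid predictions. The sketch below (assuming NumPy; the Gaussian used as a stand-in for P(W|Y, X) and the new input are invented) illustrates the idea:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up Gaussian stand-in for the posterior over W (e.g. from a Laplace fit).
W_map = np.array([0.2, 1.1])
S_N = np.array([[0.30, 0.05],
                [0.05, 0.20]])

x_new = np.array([1.0, 0.8])                 # new input (constant feature prepended)

rng = np.random.default_rng(0)
W_samples = rng.multivariate_normal(W_map, S_N, size=5000)
p_new = np.mean(sigmoid(W_samples @ x_new))  # Monte Carlo estimate of P(Y_new = 1 | ...)
print(p_new)
```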
4 The Laplace Approximation

In this section, we introduce a framework called the Laplace approximation, which aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Suppose we have a function f(x) for which the integral ∫ exp(N f(x)) dx has no closed form. In the Laplace method the goal is to find a Gaussian approximation centred on a mode of f(x). The first step is therefore to find a mode, in other words a point x_0 such that f'(x_0) = 0. A Gaussian distribution has the property that its logarithm is a quadratic function of the variables, so we consider a Taylor expansion of f(x) around x_0:

f(x) \simeq f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2    (43)

Since f'(x_0) = 0, and x_0 is a maximum so that f''(x_0) = -|f''(x_0)|,

f(x) \simeq f(x_0) - \frac{|f''(x_0)|}{2!}(x - x_0)^2    (44)

Therefore,

\int \exp(N f(x))\, dx \simeq \int \exp\left( N\left[ f(x_0) - \frac{|f''(x_0)|}{2!}(x - x_0)^2 \right] \right) dx    (45)

= \exp(N f(x_0)) \int \exp\left( -\frac{N |f''(x_0)|}{2!}(x - x_0)^2 \right) dx    (46)

= \exp(N f(x_0)) \sqrt{\frac{2\pi}{N |f''(x_0)|}}    (47)

So we obtain an approximate closed form for ∫ exp(N f(x)) dx; the approximation is accurate to order O(1/N). Finally, given a distribution p(x), the Laplace approximation forms a Gaussian approximation with mean x_0 and precision -\frac{d^2}{dx^2}\ln p(x)\big|_{x=x_0}, where x_0 is a mode of p.
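As a small worked example, the sketch below (assuming NumPy; the unnormalized density is an invented Gamma-shaped example) fits a Laplace approximation to p(x) ∝ x³ exp(−2x) by locating the mode on a grid, estimating (ln p)''(x_0) by finite differences, and taking the negative reciprocal as the variance of the approximating Gaussian:

```python
import numpy as np

def log_p(x):
    # Unnormalized log density: ln p(x) = 3 ln x - 2 x (a made-up example).
    return 3.0 * np.log(x) - 2.0 * x

# Locate the mode on a grid (for this example the exact mode is x_0 = 1.5).
xs = np.linspace(0.01, 10.0, 100000)
x0 = xs[np.argmax(log_p(xs))]

# Second derivative of ln p at the mode, via central finite differences.
h = 1e-4
d2 = (log_p(x0 + h) - 2.0 * log_p(x0) + log_p(x0 - h)) / h ** 2

precision = -d2                        # precision of the Gaussian approximation
mean, var = x0, 1.0 / precision
print(mean, var)                       # approximately 1.5 and 0.75 (the exact values)
```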
4.1 The Predictive Distribution
As we introduced before, the predictive distribution

P(Y_{new}|X, Y, X_{new}) = \int P(Y_{new}|W, X_{new})\, P(W|Y, X)\, dW    (48)

= \int \sigma(W^T X_{new})\, P(W|Y, X)\, dW    (49)

has no closed-form solution. We will use the Laplace approximation to approximate the posterior P(W|Y, X) by a Gaussian N(W|W_{MAP}, S_N), where W_{MAP} is the posterior mode and S_N is the covariance produced by the Laplace approximation. We further approximate the logistic sigmoid by a probit function Φ(a) and thereby get an approximate solution for the predictive distribution.

Since σ(W^T X_{new}) depends on W only through the scalar a = W^T X_{new}, we can write

\sigma(W^T X_{new}) = \int \delta(a - W^T X_{new})\, \sigma(a)\, da    (50)

and hence

\int \sigma(W^T X_{new})\, P(W|Y, X)\, dW = \int \sigma(a)\, P(a)\, da    (51)

where

P(a) = \int \delta(a - W^T X_{new})\, P(W|Y, X)\, dW

Because P(W|Y, X) is approximated as a Gaussian, we know that the marginal distribution P(a) will also be Gaussian, P(a) = N(a|\mu_a, \sigma_a^2). Using the close similarity between the logistic sigmoid and the probit function, together with the fact that the convolution of a probit with a Gaussian is another probit, we can then derive

\int \sigma(a)\, N(a|\mu_a, \sigma_a^2)\, da \simeq \sigma\left(\kappa(\sigma_a^2)\, \mu_a\right)    (52)

where

\kappa(\sigma^2) = \left(1 + \frac{\pi \sigma^2}{8}\right)^{-1/2}

\mu_a = W_{MAP}^T X_{new}

\sigma_a^2 = X_{new}^T S_N X_{new}
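The sketch below (assuming NumPy; it reuses the invented posterior moments from the Monte Carlo example above) evaluates the approximation (52) and compares it against a brute-force numerical evaluation of ∫σ(a)N(a|μ_a, σ_a²)da; the two values typically agree to two or three decimal places:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Made-up Laplace-approximation moments and a new input X_new.
W_map = np.array([0.2, 1.1])
S_N = np.array([[0.30, 0.05],
                [0.05, 0.20]])
x_new = np.array([1.0, 0.8])

mu_a = W_map @ x_new                         # mu_a = W_MAP^T X_new
var_a = x_new @ S_N @ x_new                  # sigma_a^2 = X_new^T S_N X_new

# Closed-form approximation, equation (52).
kappa = 1.0 / np.sqrt(1.0 + np.pi * var_a / 8.0)
p_approx = sigmoid(kappa * mu_a)

# Brute-force numerical check of the one-dimensional integral.
sd = np.sqrt(var_a)
a = np.linspace(mu_a - 10 * sd, mu_a + 10 * sd, 20001)
gauss = np.exp(-(a - mu_a) ** 2 / (2 * var_a)) / np.sqrt(2 * np.pi * var_a)
p_numeric = np.sum(sigmoid(a) * gauss) * (a[1] - a[0])

print(p_approx, p_numeric)                   # the two values agree closely
```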