Bayesian Classification With Gaussian Processes
Christopher K.I. Williams, Member, IEEE Computer Society, and David Barber
Abstract—We consider the problem of assigning an input vector x to one of m classes by predicting P(c|x) for c = 1, …, m. For a two-class problem, the probability of class one given x is estimated by σ(y(x)), where σ(y) = 1/(1 + e^{-y}). A Gaussian process prior is placed on y(x), and is combined with the training data to obtain predictions for new x points. We provide a Bayesian treatment, integrating over uncertainty in y and in the parameters that control the Gaussian process prior; the necessary integration over y is carried out using Laplace's approximation. The method is generalized to multiclass problems (m > 2) using the softmax function. We demonstrate the effectiveness of the method on a number of datasets.

Index Terms—Gaussian processes, classification problems, parameter uncertainty, Markov chain Monte Carlo, hybrid Monte Carlo, Bayesian classification.
——————————————————————
1 INTRODUCTION
For regression with Gaussian noise of variance σ², the predictive mean and variance at a new input x* are

    ŷ(x*) = k^T(x*) (K + σ²I)^{-1} t,    (2)

    σ²_ŷ(x*) = C(x*, x*) − k^T(x*) (K + σ²I)^{-1} k(x*),    (3)

where [K]_{ij} = C(x_i, x_j) and k(x*) = (C(x*, x_1), …, C(x*, x_n))^T (see, e.g., [25]).

2.1 Parameterizing the Covariance Function
There are many reasonable choices for the covariance function. Formally, we are required to specify functions which will generate a non-negative definite covariance matrix for any set of points (x_1, …, x_k). From a modeling point of view, we wish to specify covariances so that points with nearby inputs will give rise to similar predictions. We find that the following covariance function works well:

    C(x, x′) = v_0 exp{ −(1/2) Σ_{l=1}^d w_l (x_l − x′_l)² } + v_1,    (4)

where x_l is the lth component of x and θ = (log v_0, log v_1, log w_1, …, log w_d) is the vector of parameters that are needed to define the covariance function. Note that θ is analogous to the hyperparameters in a neural network. We define the parameters to be the log of the variables in (4) since these are positive scale-parameters. This covariance function can be obtained from a network of Gaussian radial basis functions in the limit of an infinite number of hidden units.

… posterior distribution over the parameters is obtained. These calculations are facilitated by the fact that the log likelihood l = log P(D|θ) can be calculated analytically as

    l = −(1/2) log|K̃| − (1/2) t^T K̃^{-1} t − (n/2) log 2π,    (5)

where K̃ = K + σ²I and |K̃| denotes the determinant of K̃. It is also possible to express analytically the partial derivatives of the log likelihood with respect to the parameters:

    ∂l/∂θ_i = −(1/2) tr( K̃^{-1} ∂K̃/∂θ_i ) + (1/2) t^T K̃^{-1} (∂K̃/∂θ_i) K̃^{-1} t    (6)

(see, e.g., [11]).

Given l and its derivatives with respect to θ, it is straightforward to feed this information to an optimization package in order to obtain a local maximum of the likelihood.

In general, one may be concerned about making point estimates when the number of parameters is large relative to the number of data points, or if some of the parameters may be poorly determined, or if there may be local maxima in the likelihood surface. For these reasons, the Bayesian approach of defining a prior distribution over the parameters and then obtaining a posterior distribution once the data D has been seen is attractive. To make a prediction for a new test point x*, one simply averages over the posterior distribution P(θ|D).
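As a concrete illustration of (2), (3), and (5), here is a minimal numpy sketch; the covariance settings (v_0 = 1, v_1 = 0.1, unit w_l) and the toy data below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cov(X1, X2, v0=1.0, v1=0.1, w=None):
    """Covariance function (4): C(x, x') = v0 exp{-1/2 sum_l w_l (x_l - x'_l)^2} + v1."""
    w = np.ones(X1.shape[1]) if w is None else np.asarray(w)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * w).sum(-1)
    return v0 * np.exp(-0.5 * d2) + v1

def gp_predict(X, t, xstar, sigma2=1e-2):
    """Predictive mean (2) and variance (3) for a single test input xstar."""
    n = len(X)
    A = cov(X, X) + sigma2 * np.eye(n)                 # K + sigma^2 I
    k = cov(X, xstar[None, :])[:, 0]                   # k(x*)
    mean = k @ np.linalg.solve(A, t)                   # eq. (2)
    var = cov(xstar[None, :], xstar[None, :])[0, 0] \
          - k @ np.linalg.solve(A, k)                  # eq. (3)
    return mean, var

def log_likelihood(X, t, sigma2=1e-2):
    """Log likelihood (5): l = -1/2 log|K~| - 1/2 t^T K~^{-1} t - n/2 log 2pi."""
    n = len(X)
    Ktil = cov(X, X) + sigma2 * np.eye(n)              # K~ = K + sigma^2 I
    _, logdet = np.linalg.slogdet(Ktil)
    return -0.5 * logdet - 0.5 * t @ np.linalg.solve(Ktil, t) \
           - 0.5 * n * np.log(2.0 * np.pi)
```

With a very small noise variance, the predictive mean at a training input reproduces the corresponding target, as expected from (2).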
1344 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 20, NO. 12, DECEMBER 1998
Fig. 1. p(x) is obtained from y(x) by "squashing" it through the sigmoid function σ.
That is,

    P(y*|D) = ∫ P(y*|θ, D) P(θ|D) dθ.    (7)

For GPs, it is not possible to do this integration analytically in general, but numerical methods may be used. If θ is of sufficiently low dimension, then techniques involving grids in θ-space can be used. If θ is high dimensional, it is very difficult to locate the regions of parameter space which have high posterior density by gridding techniques or importance sampling. In this case, Markov chain Monte Carlo (MCMC) methods may be used. These work by constructing a Markov chain whose equilibrium distribution is the desired distribution P(θ|D); the integral in (7) is then approximated using samples from the Markov chain.

Two standard methods for constructing MCMC methods are the Gibbs sampler and Metropolis-Hastings algorithms (see, e.g., [5]). However, the conditional parameter distributions are not amenable to Gibbs sampling if the covariance function has the form given by (4), and the Metropolis-Hastings algorithm does not utilize the derivative information that is available, which means that it tends to have an inefficient random-walk behavior in parameter-space. Following the work of Neal [15] on the Bayesian treatment of neural networks, Williams and Rasmussen [28] and Rasmussen [17] have used the Hybrid Monte Carlo (HMC) method of Duane et al. [4] to obtain samples from P(θ|D). The HMC algorithm is described in more detail in Appendix D.

3 GAUSSIAN PROCESSES FOR TWO-CLASS CLASSIFICATION

For simplicity of exposition, we will first present our method as applied to two-class problems; the extension to multiple classes is covered in Section 4.

By using the logistic transfer function to produce an output which can be interpreted as p(x), the probability of the input x belonging to class 1, the job of specifying a prior over functions p can be transformed into that of specifying a prior over the input to the transfer function, which we shall call the activation, and denote by y (see Fig. 1). For the two-class problem we can use the logistic function p(x) = σ(y(x)), where σ(y) = 1/(1 + e^{-y}). We will denote the probability and activation corresponding to input x_i by p_i and y_i, respectively. Fundamentally, the GP approaches to classification and regression problems are similar, except that the error model, which is t ~ N(y, σ²) in the regression case, is replaced by t ~ Bern(σ(y)). The choice of v_0 in (4) will affect how "hard" the classification is, i.e., whether p(x) hovers around 0.5 or takes on the extreme values of 0 and 1.

Previous and related work to this approach is discussed in Section 3.3.

As in the regression case, there are now two problems to address:
1) making predictions with fixed parameters and
2) dealing with parameters.
We shall discuss these issues in turn.

3.1 Making Predictions With Fixed Parameters
To make predictions when using fixed parameters, we would like to compute p̂(x*) = ∫ p* P(p*|t) dp*, which requires us to find P(p*|t) = P(p(x*)|t) for a new input x*. This can be done by finding the distribution P(y*|t) (y* is the activation of p*), and then using the appropriate Jacobian to transform the distribution. Formally, the equations for obtaining P(y*|t) are identical to (1), (2), and (3). However, even if we use a GP prior so that P(y*, y) is Gaussian, the usual expression P(t|y) = Π_i p_i^{t_i} (1 − p_i)^{1−t_i} for classification data (where the ts take on values of zero or one) means that the marginalization to obtain P(y*|t) is no longer analytically tractable.

Faced with this problem, there are two routes that we can follow:
1) to use an analytic approximation to the integral in (1), (2), and (3) or
2) to use Monte Carlo methods, specifically MCMC methods, to approximate it.
Below, we consider an analytic approximation based on Laplace's approximation; some other approximations are discussed in Section 3.3.

In Laplace's approximation, the integrand P(y*, y|t, θ) is approximated by a Gaussian distribution centered at a maximum of this function with respect to y*, y, with an inverse covariance matrix given by −∇∇ log P(y*, y|t, θ). Finding a maximum can be carried out using the Newton-Raphson iterative method on y, which then allows the approximate distribution of y* to be calculated. Details of the maximization procedure can be found in Appendix A.
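The route just described can be sketched as follows: find the mode ỹ by the Newton iteration (given as (19) in Appendix A), form the Gaussian approximation to P(y*|t), and average the sigmoid over Gaussian samples to obtain p̂(x*). The one-dimensional covariance function, its settings, and the toy data are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def sqexp(X1, X2, v0=10.0, w=1.0):
    """Toy 1-D squared-exponential covariance (illustrative settings)."""
    return v0 * np.exp(-0.5 * w * (X1[:, None] - X2[None, :]) ** 2)

def find_mode(K, t, iters=100):
    """Newton-Raphson iteration (19): y_new = K (I + W K)^{-1} (W y + (t - pi))."""
    y = np.zeros(len(t))
    for _ in range(iters):
        pi = sigmoid(y)
        W = np.diag(pi * (1.0 - pi))
        y = K @ np.linalg.solve(np.eye(len(t)) + W @ K, W @ y + (t - pi))
    return y

def predict(X, t, xstar, n_samples=1000, seed=0):
    """Laplace prediction: mean of y* is k^T (t - pi~) (Appendix A); p-hat is
    the average of sigma(y*) over Gaussian samples of y*."""
    K = sqexp(X, X) + 1e-8 * np.eye(len(X))      # jitter for conditioning
    y_tilde = find_mode(K, t)
    pi = sigmoid(y_tilde)
    k = sqexp(X, np.array([xstar]))[:, 0]
    kss = sqexp(np.array([xstar]), np.array([xstar]))[0, 0]
    mean = k @ (t - pi)
    Winv = np.diag(1.0 / (pi * (1.0 - pi)))
    var = kss - k @ np.linalg.solve(K + Winv, k)  # Laplace predictive variance
    ys = np.random.default_rng(seed).normal(mean, np.sqrt(max(var, 0.0)), n_samples)
    return sigmoid(ys).mean()
```

On a toy two-class problem with negative inputs in class 0 and positive inputs in class 1, the predicted probability is below 0.5 on the left and above 0.5 on the right, as expected.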
3.2 Integration Over the Parameters
To make predictions, we integrate the predicted probabilities over the posterior P(θ|t) ∝ P(t|θ)P(θ), as we saw in Section 2.2. For the regression problem, P(t|θ) can be calculated exactly using P(t|θ) = ∫ P(t|y) P(y|θ) dy, but this integral is not analytically tractable for the classification problem. Let Ψ = log P(t|y) + log P(y). Using log P(t_i|y_i) = t_i y_i − log(1 + e^{y_i}), we obtain

    Ψ = t^T y − Σ_{i=1}^n log(1 + e^{y_i}) − (1/2) y^T K^{-1} y − (1/2) log|K| − (n/2) log 2π.    (8)

By using Laplace's approximation about the maximum ỹ, we find that

    log P(t|θ) ≈ Ψ(ỹ) − (1/2) log|K^{-1} + W| + (n/2) log 2π.    (9)

We denote the right-hand side of this equation by log P_a(t|θ) (where a stands for approximate).

The integration over θ-space also cannot be done analytically, and we employ a Markov chain Monte Carlo method. Following Neal [15] and Williams and Rasmussen [28], we have used the Hybrid Monte Carlo (HMC) method of Duane et al. [4] as described in Appendix D. We use log P_a(t|θ) as an approximation for log P(t|θ), and use broad Gaussian priors on the parameters.

3.3 Previous and Related Work
Our work on Gaussian processes for regression and classification developed from the observation in [15] that a large class of neural network models converge to GPs in the limit of an infinite number of hidden units. The computational Bayesian treatment of GPs can be easier than for neural networks. In the regression case, an infinite number of weights are effectively integrated out, and one ends up dealing only with the (hyper)parameters. Results from [17] show that Gaussian processes for regression are comparable in performance to other state-of-the-art methods.

Nonparametric methods for classification problems can be seen to arise from the combination of two different strands of work. Starting from linear regression, McCullagh and Nelder [12] developed generalized linear models (GLMs). In the two-class classification context, this gives rise to logistic regression. The other strand of work was the development of nonparametric smoothing for the regression problem. Viewed as a Gaussian process prior over functions, this can be traced back at least as far as the work of Kolmogorov and Wiener in the 1940s. Gaussian process prediction is well known in the geostatistics field (see, e.g., [3]), where it is known as "kriging". Alternatively, by considering "roughness penalties" on functions, one can obtain spline methods; for recent overviews, see [25] and [8]. There is a close connection between the GP and roughness penalty views, as explored in [9]. By combining GLMs with nonparametric regression one obtains what we shall call a nonparametric GLM method for classification. Early references to this method include [21] and [16], and discussions can also be found in texts such as [8] and [25].

There are two differences between the nonparametric GLM method as it is usually described and a Bayesian treatment. Firstly, for fixed parameters the nonparametric GLM method ignores the uncertainty in y* and, hence, the need to integrate over this (as described in Section 3.1). The second difference relates to the treatment of the parameters θ. As discussed in Section 2.2, given parameters θ, one can either attempt to obtain a point estimate for the parameters or carry out an integration over the posterior. Point estimates may be obtained by maximum likelihood estimation of θ, or by cross-validation or generalized cross-validation (GCV) methods; see, e.g., [25], [8]. One problem with CV-type methods is that if the dimension of θ is large, then it can be computationally intensive to search over a region/grid in parameter-space looking for the parameters that maximize the criterion.¹ In a sense, the HMC method described above is doing a similar search, but using gradient information, and carrying out averaging over the posterior distribution of parameters. In defense of (G)CV methods, we note Wahba's comments (e.g., in [26], referring back to [24]) that these methods may be more robust against an unrealistic prior.

1. It would be possible to obtain derivatives of the CV-score with respect to θ, but this has not, to our knowledge, been used in practice.

One other difference between the kinds of nonparametric GLM models usually considered and our method is the exact nature of the prior that is used. Often, the roughness penalties used are expressed in terms of a penalty on the kth derivative of y(x), which gives rise to a power-law power spectrum for the prior on y(x). There can also be differences over the parameterization of the covariance function; for example, it is unusual to find parameters like those for ARD introduced in (4) in nonparametric GLM models. On the other hand, Wahba et al. [26] have considered a smoothing spline analysis of variance (SS-ANOVA) decomposition. In Gaussian process terms, this builds up a prior on y as a sum of priors on each of the functions in the decomposition

    y(x) = μ + Σ_a y_a(x_a) + Σ_{a,b} y_{ab}(x_a, x_b) + …    (10)

The important point is that functions involving all orders of interaction (from univariate functions, which on their own give rise to an additive model) are included in this sum, up to the full interaction term, which is the only one that we are using. From a Bayesian point of view, the question as to the kinds of priors that are appropriate is an interesting modeling issue.

There has also been some recent work which is related to the method presented in this paper. In Section 3.1, we mentioned that it is necessary to approximate the integral in (1), (2), and (3) and described the use of Laplace's approximation. Following the preliminary version of this paper presented in [1], Gibbs and MacKay [7] developed an alternative analytic approximation, by using variational methods to find approximating Gaussian distributions that bound the marginal likelihood P(t|θ) above and below. These
approximate distributions are then used to predict P(y*|t, θ) and thus p̂(x*). For the parameters, Gibbs and MacKay estimated θ by maximizing their lower bound on P(t|θ).

It is also possible to use a fully MCMC treatment of the classification problem, as discussed in the recent paper of Neal [14]. His method carries out the integrations over the posterior distributions of y and θ simultaneously. It works by generating samples from P(y, θ|D) in a two-stage process. Firstly, for fixed θ, each of the n individual y_i's is updated sequentially using Gibbs sampling. This "sweep" takes time O(n²) once the matrix K^{-1} has been computed (in time O(n³)), so it actually makes sense to perform quite a few Gibbs sampling scans between each update of the parameters, as this probably makes the Markov chain mix faster. Secondly, the parameters are updated using the Hybrid Monte Carlo method. To make predictions, one averages over the predictions made by each (y, θ) sample.

4 GPS FOR MULTIPLE-CLASS CLASSIFICATION

The extension of the preceding framework to multiple classes is essentially straightforward, although notationally more complex.

Throughout we employ a one-of-m class coding scheme,² and use the multiclass analogue of the logistic function—the softmax function—to describe the class probabilities. The probability that an instance labeled by i is in class c is denoted by p_c^i, so that an upper index denotes the example number and a lower index the class label. Similarly, the activations associated with the probabilities are denoted by y_c^i. Formally, the softmax link function relates the activations and probabilities through

    p_c^i = exp(y_c^i) / Σ_{c′} exp(y_{c′}^i),

which automatically enforces the constraint Σ_c p_c^i = 1. The targets are similarly represented by t_c^i, and are specified using a one-of-m coding.

The log likelihood takes the form L = Σ_{i,c} t_c^i ln p_c^i, which for the softmax link function gives

    L = Σ_{i,c} t_c^i y_c^i − Σ_i ln Σ_{c′} exp(y_{c′}^i).    (11)

As for the two-class case, we shall assume that the GP prior operates in activation space; that is, we specify the correlations between the activations y_c^i. One important assumption we make is that our prior knowledge is restricted to correlations between the activations of a particular class. Whilst there is no difficulty in extending the framework to include interclass correlations, we have not yet encountered a situation where we felt able to specify such correlations. Formally, the activation correlations take the form

    ⟨y_c^i y_{c′}^{i′}⟩ = δ_{c,c′} K_c^{i,i′},    (12)

where K_c^{i,i′} is the (i, i′) element of the covariance matrix for the cth class. Each individual correlation matrix K_c has the form given by (4) for the two-class case. We shall use a separate set of parameters for each class. The use of m independent processes to perform the classification is redundant, but forcing the activations of one process to be, say, zero would introduce an arbitrary asymmetry into the prior.

For simplicity, we introduce the augmented vector notation

    y_+ = (y_1^1, …, y_1^n, y_1^*, y_2^1, …, y_2^n, y_2^*, …, y_m^1, …, y_m^n, y_m^*),

where, as in the two-class case, y_c^* denotes the activation corresponding to input x* for class c; this notation is also used to define t_+ and π_+. In a similar manner, we define y, t, and π by excluding the values corresponding to the test point x*. Let y* = (y_1^*, y_2^*, …, y_m^*).

With this definition of the augmented vectors, the GP prior takes the form

    P(y_+) ∝ exp( −(1/2) y_+^T K_+^{-1} y_+ ),    (13)

where, from (12), the covariance matrix K_+ is block diagonal in the matrices K_+^1, …, K_+^m. Each individual matrix K_+^c expresses the correlations of activations within class c.

As in the two-class case, to use Laplace's approximation we need to find the mode of P(y_+|t). The procedure is described in Appendix C. As for the two-class case, we make predictions for π(x*) by averaging the softmax function over the Gaussian approximation to the posterior distribution of y*. At present, we simply estimate this integral using 1,000 draws from a Gaussian random vector generator.

5 EXPERIMENTAL RESULTS

When using the Newton-Raphson algorithm, π was initialized each time with entries 1/m, and iterated until the mean relative difference of the elements of W between consecutive iterations was less than 10^{-4}.

For the HMC algorithm, the same step size ε is used for all parameters, and should be as large as possible while keeping the rejection rate low. We have used a trajectory made up of L = 20 leapfrog steps, which gave a low correlation between successive states. The priors over parameters were set to be Gaussian with a mean of −3 and a standard deviation of 3. In all our simulations, a step size ε = 0.1 produced a low rejection rate (< 5 percent). The parameters corresponding to the w_l's were initialized to −2 and that for v_0 to 0. The sampling procedure was run for 200 iterations, and the first third of the run was discarded; this "burn-in" is intended to give the parameters time to come close to their equilibrium distribution. Tests carried out using the R-CODA package³ on the examples in Section 5.1 suggested that this was indeed effective in removing

2. That is, the class is represented by a vector of length m with zero entries everywhere except for the correct component, which contains 1.
3. Available from the Comprehensive R Archive Network at http://www.ci.tuwien.ac.at.
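The softmax link, the one-of-m coding, and the log likelihood (11) of Section 4 can be sketched as follows (toy activations only; not the paper's experiments):

```python
import numpy as np

def softmax(Y):
    """Softmax link: p_c^i = exp(y_c^i) / sum_c' exp(y_c'^i); rows are examples."""
    E = np.exp(Y - Y.max(axis=1, keepdims=True))   # shift for numerical stability
    return E / E.sum(axis=1, keepdims=True)

def one_of_m(labels, m):
    """One-of-m coding: a length-m vector with a 1 in the correct component."""
    T = np.zeros((len(labels), m))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def multiclass_log_likelihood(T, Y):
    """Eq. (11): L = sum_{i,c} t_c^i y_c^i - sum_i ln sum_c' exp(y_c'^i)."""
    return (T * Y).sum() - np.log(np.exp(Y).sum(axis=1)).sum()
```

Because Σ_c t_c^i = 1 for one-of-m targets, (11) is just the sum of log probabilities of the correct classes, which gives a simple consistency check.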
TABLE 2
NUMBER OF TEST ERRORS ON THE PIMA INDIAN DIABETES TASK

  Method                                           Pima Indian diabetes
  Neural Network                                   75+
  Linear Discriminant                              67
  Logistic Regression                              66
  MARS (degree = 1)                                75
  PP regression (4 ridge functions)                75
  Gaussian Mixture                                 64
  Gaussian Process (Laplace Approximation, HMC)    68
  Gaussian Process (Laplace Approximation, MPL)    69
  Gaussian Process (Neal's method)                 68

Comparisons are taken from Ripley [18] and Ripley [20], respectively. The neural network had one hidden unit and was trained with maximum likelihood; the results were worse for nets with two or more hidden units (Ripley [18]).

TABLE 3
PERCENTAGE OF TEST ERROR FOR THE FORENSIC GLASS PROBLEM

  Method                               Forensic Glass
  Neural Network (4HU)                 23.8%
  Linear Discriminant                  36%
  MARS (degree = 1)                    32.2%
  PP regression (5 ridge functions)    35%
  Gaussian Mixture                     30.8%
  Decision Tree                        32.2%
  Gaussian Process (LA, MPL)           23.3%
  Gaussian Process (Neal's method)     31.8%

See Ripley [18] for details of the methods.
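The quantity log P_a(t|θ) of (8) and (9), which is maximized over θ in the MPL (maximum penalized likelihood) variants reported above, can be sketched as follows; the toy K and targets are illustrative assumptions, not the paper's data.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def psi(y, t, K):
    """Eq. (8): Psi = t^T y - sum_i log(1+e^{y_i}) - 1/2 y^T K^{-1} y
                      - 1/2 log|K| - n/2 log 2pi."""
    _, logdetK = np.linalg.slogdet(K)
    return (t @ y - np.sum(np.log1p(np.exp(y)))
            - 0.5 * y @ np.linalg.solve(K, y)
            - 0.5 * logdetK - 0.5 * len(t) * np.log(2.0 * np.pi))

def log_pa(t, K, iters=100):
    """Find the maximum y~ of Psi by the Newton iteration (19), then apply
    eq. (9): log P(t|theta) ~= Psi(y~) - 1/2 log|K^{-1} + W| + n/2 log 2pi."""
    n = len(t)
    y = np.zeros(n)
    for _ in range(iters):
        pi = sigmoid(y)
        W = np.diag(pi * (1.0 - pi))
        y = K @ np.linalg.solve(np.eye(n) + W @ K, W @ y + (t - pi))
    pi = sigmoid(y)
    W = np.diag(pi * (1.0 - pi))
    _, logdet = np.linalg.slogdet(np.linalg.inv(K) + W)
    return psi(y, t, K) - 0.5 * logdet + 0.5 * n * np.log(2.0 * np.pi), y
```

A useful check is that the returned ỹ really is a stationary point of Ψ, i.e., (t − π̃) − K^{-1}ỹ = 0, which is (16) of Appendix A.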
Fig. 2. Plot of the log w parameters for the Pima dataset. The circle indicates the posterior marginal mean obtained from the HMC run (after burn-in), with one standard deviation error bars. The square symbol shows the log w-parameter values found by maximizing the penalized likelihood. The variables are: 1) the number of pregnancies, 2) plasma glucose concentration, 3) diastolic blood pressure, 4) triceps skin fold thickness, 5) body mass index, 6) diabetes pedigree function, 7) age. For comparison, Wahba et al. [26], using generalized linear regression, found that variables 1, 2, 5, and 6 were the most important.

6 DISCUSSION

In this paper, we have extended the work of Williams and Rasmussen [28] to classification problems, and have demonstrated that it performs well on the datasets we have tried. We believe that the kinds of Gaussian process priors we have used are more easily interpretable than models (such as neural networks) in which the priors are on the parameterization of the function space. For example, the posterior distribution of the ARD parameters (as illustrated in Fig. 2 for the Pima Indians diabetes problem) indicates the relative importance of the various inputs. This interpretability should also facilitate the incorporation of prior knowledge into new problems.

There are quite strong similarities between GP classifiers and support-vector machines (SVMs) [23]. The SVM uses a covariance kernel, but differs from the GP approach by using a different data-fit term (the maximum margin), so that the optimal y is found using quadratic programming. The comparison of these two algorithms is an interesting direction for future research.

A problem with methods based on GPs is that they require computations (traces, determinants, and linear solutions) involving n × n matrices, where n is the number of training examples, and hence run into problems on large datasets. We have looked into methods using Bayesian numerical techniques to calculate the trace and determinant [22], [6], although we found that these techniques did not work well for the (relatively) small-size problems on which we tested our methods. Computational methods used to speed up the quadratic programming problem for SVMs may also be useful for the GP classifier problem. We are also investigating the use of different covariance functions and improvements on the approximations employed.

APPENDIX A
MAXIMIZING P(y_+|t): TWO-CLASS CASE

We describe how to find iteratively the vector ỹ_+ so that P(y_+|t) is maximized. This material is also covered in [8, Section 5.3.3] and [25, Section 9.2]. We provide it here for completeness and so that the terms in (9) are well-defined.

Let y_+ denote (y*, y), the complete set of activations. By Bayes' theorem, log P(y_+|t) = log P(t|y) + log P(y_+) − log P(t), and let Ψ_+ = log P(t|y) + log P(y_+). As P(t) does not depend on y_+ (it is just a normalizing factor), the maximum of
P(y_+|t) is found by maximizing Ψ_+ with respect to y_+. Using log P(t_i|y_i) = t_i y_i − log(1 + e^{y_i}), we obtain

    Ψ_+ = t^T y − Σ_{i=1}^n log(1 + e^{y_i}) − (1/2) y_+^T K_+^{-1} y_+ − (1/2) log|K_+| − ((n+1)/2) log 2π,    (14)

where K_+ is the covariance matrix of the GP evaluated at x_1, …, x_n, x*. Ψ is defined similarly in (8). K_+ can be partitioned in terms of an n × n matrix K, an n × 1 vector k, and a scalar k*, viz.

    K_+ = ( K     k  )
          ( k^T   k* ).    (15)

As y* only enters into (14) in the quadratic prior term and has no data point associated with it, maximizing Ψ_+ with respect to y_+ can be achieved by first maximizing Ψ with respect to y and then doing the further quadratic optimization to determine y*. To find a maximum of Ψ, we use the Newton-Raphson iteration y^new = y − (∇∇Ψ)^{-1} ∇Ψ. Differentiating (8) with respect to y, we find

    ∇Ψ = (t − π) − K^{-1} y,    (16)

    ∇∇Ψ = −K^{-1} − W,    (17)

where the "noise" matrix is given by W = diag(π_1(1 − π_1), …, π_n(1 − π_n)). This results in the iterative equation

    y^new = (K^{-1} + W)^{-1} W (y + W^{-1}(t − π)).    (18)

To avoid unnecessary inversions, it is usually more convenient to rewrite this in the form

    y^new = K (I + WK)^{-1} (Wy + (t − π)).    (19)

Note that −∇∇Ψ is always positive definite, so that the optimization problem is convex.

Given a converged solution ỹ for y, y* can easily be found using y* = k^T K^{-1} ỹ = k^T (t − π̃), as K^{-1} ỹ = t − π̃ from (16). var(y*) is given by [(K_+^{-1} + W_+)^{-1}]_{(n+1)(n+1)}, where W_+ is W with a zero appended in the (n + 1)th diagonal position. Given the mean and variance of y*, it is then easy to find p̂(x*) = ∫ p* P(p*|t) dp*, the mean of the distribution of p(x*).

Note that ỹ is not analogous to the maximum likelihood estimator for a model with a finite number of parameters. This is because the dimension of the problem grows with the number of data points. However, if we consider the "infill asymptotics" (see, e.g., [3]), where the number of data points in a bounded region increases, then a local average of the training data at any point x will provide a tightly localized estimate for p(x) and hence y(x) (this reasoning parallels more formal arguments found in [29]). Thus, we would expect the distribution P(y) to become more Gaussian with increasing data.

APPENDIX B
DERIVATIVES OF log P_a(t|θ) WRT θ

For both the HMC and MPL methods, we require the derivative of l_a = log P_a(t|θ) with respect to the components of θ, for example θ_k. This derivative will involve two terms, one due to the explicit dependence of

    l_a = Ψ(ỹ) − (1/2) log|K^{-1} + W| + (n/2) log 2π

on θ_k, and one arising because a change in θ will cause a change in ỹ. However, as ỹ is chosen so that ∇Ψ(y)|_{y=ỹ} = 0, we obtain

    ∂l_a/∂θ_k = ∂l_a/∂θ_k |_explicit − (1/2) Σ_{i=1}^n [ ∂ log|K^{-1} + W| / ∂ỹ_i ] ∂ỹ_i/∂θ_k.    (20)

The dependence of |K^{-1} + W| on ỹ arises through the dependence of W on π̃ and, hence, ỹ. By differentiating ỹ = K(t − π̃), one obtains

    ∂ỹ/∂θ_k = (I + KW)^{-1} (∂K/∂θ_k) (t − π̃),    (21)

and, hence, the required derivative can be calculated.

APPENDIX C
MAXIMIZING P(y_+|t): MULTIPLE-CLASS CASE

The GP prior and likelihood, defined by (13) and (11), define the posterior distribution of activations, P(y_+|t). As in Appendix A, we are interested in a Laplace approximation to this posterior, and therefore need to find the mode with respect to y_+. Dropping unnecessary constants, the multiclass analogue of (14) is

    Ψ_+ = Σ_{i,c} t_c^i y_c^i − Σ_i ln Σ_{c′} exp(y_{c′}^i) − (1/2) y_+^T K_+^{-1} y_+.    (22)
by introducing the matrix P which is a mn ¥ n matrix of the “inertia” so that random-walk behavior in -space can be
4 4
form ’ = diag
1 n
p 1 .. p 1 9, K , diag4 1 n
p m .. p m 99 . Using this nota- avoided. The total energy, H, of the system is the sum of the
T
kinetic energy, K = p p/2 and the potential energy, E. The
tion, we can write the noise matrix in the form of a diagonal potential energy is defined such that p(|D) µ exp(-E),
matrix and an outer product, i.e., E = - logP(t|) - log P(). We sample from the joint
4 9
W = - diag p 11 .. p 1n , .. , p 1m .. p nm + ’ ’ T . (23)
distribution for and p given by P(, p) µ exp(-E - K); the
marginal of this distribution for is the required posterior.
As in the two-class case, we note that -——Y is again posi- A sample of parameters from the posterior can therefore
tive definite, so that the optimization problem is convex. be obtained by simply ignoring the momenta.
The update equation for iterative optimization of Y Sampling from the joint distribution is achieved by two steps:
with respect to the activations y then follows the same 1) finding new points in phase space with near-identical
form as that given by (18). The advantage of the repre- energies H by simulating the dynamical system using
sentation of the noise matrix in (23) is that we can then a discretised approximation to Hamiltonian dynamics
invert matrices and find their determinants using the and
identities, 2) changing the energy H by Gibbs sampling the mo-
mentum variables.
(A + ’’ )- = A- - A- ’(In + ’ A- ’)- ’ A-
T 1 1 1 T 1 1 T 1
(24)
Hamilton’s first-order differential equations for H are
and approximated using the leapfrog method which requires
det(A + ’’ ) = det(A)det(In + ’ A- ’),
T T 1
(25) the derivatives of E with respect to . Given a Gaussian
prior on , logP() is straightforward to differentiate. The
where A = K -1
+ diag 4 p 11 .. p nm 9 . As A is block-diagonal, it derivative of logPa(t|) is also straightforward, although
implicit dependencies of y ~ (and, hence, p ~ ) on q must be
can be inverted blockwise. Thus, rather than requiring
taken into account as described in Appendix B. The calcu-
determinants and inverses of a mn ¥ mn matrix, we only
lation of the energy can be quite expensive as for each new
need to carry out expensive matrix computations on n ¥
, we need to perform the maximization required for
n matrices. The resulting update equations for y are then
Laplace’s approximation, (9). This proposed state is then
of the same form as given in (18), where the noise matrix
accepted or rejected using the Metropolis rule depending
and covariance matrices are now in their multiple class
on the final energy H* (which is not necessarily equal to the
form.
initial energy H because of the discretization).
Essentially, these are all the results needed to generalize
the method to the multiple-class problem. Although, as we
mentioned above, the time complexity of the problem does APPENDIX E
3
not scale with the m , but rather m (due to the identities in SIMULATION SET-UP FOR NEAL’S CODE
(24), (25)), calculating the function and its gradient is still rather expensive. We have experimented with several methods of mode finding for the Laplace approximation. The advantage of the Newton iteration method is its fast quadratic convergence. An integral part of each Newton step is the calculation of the inverse of a matrix M acting upon a vector, i.e., M⁻¹b. In order to speed up this particular step, we used a conjugate gradient (CG) method to solve iteratively the corresponding linear system Mz = b. As we repeatedly need to solve the system (because W changes as y is updated), it saves time not to run the CG method to convergence each time it is called. In our experiments, the CG algorithm was terminated when (1/n) Σᵢ₌₁ⁿ |rᵢ| < 10⁻⁹, where r = Mz − b.

The calculation of the derivative of log Pₐ(t|θ) with respect to θ in the multiple-class case is analogous to the two-class case described in Appendix B.

APPENDIX D
HYBRID MONTE CARLO

HMC works by creating a fictitious dynamical system in which the parameters are regarded as position variables, and augmenting these with momentum variables p. The purpose of the dynamical system is to give the parameters "inertia," so that random-walk behavior can be avoided during exploration of parameter space.

We used the fbm software available from http://www.cs.utoronto.ca/radford/fbm.software.html. For example, the commands used to run the Pima example are

    gp-spec pima1.log 7 1 - - 0.1 / 0.05:0.5 x0.2:0.5:1
    model-spec pima1.log binary
    data-spec pima1.log 7 1 2 / pima1.tr@1:200. pima1.te@1:332.
    gp-gen pima1.log fix 0.5 1
    mc-spec pima1.log repeat 4 scan-values 200 heatbath hybrid 6 0.5
    gp-mc pima1.log 500

which follow closely the example given in Neal's documentation.

The gp-spec command specifies the form of the Gaussian process, and in particular the priors on the parameters v0 and the ws (see (4)). The expression 0.05:0.5 specifies a Gamma-distribution prior on v0, and x0.2:0.5:1 specifies a hierarchical Gamma prior on the ws. Note that a "jitter" of 0.1 is also specified on the prior covariance function; this improves conditioning of the covariance matrix. The mc-spec command gives details of the MCMC updating procedure. It specifies four repetitions of 200 scans of the y values followed by six HMC updates of the parameters (using a step-size adjustment factor of 0.5). gp-mc specifies that this sequence is carried out 500 times.
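The CG solve inside each Newton step can be sketched as follows. This is a minimal Python/NumPy illustration under our own naming (it is not part of fbm), using the loose stopping rule (1/n) Σ|rᵢ| < 10⁻⁹ described above; the optional starting point z0 allows warm-starting from the previous Newton step's solution, which is what makes early termination pay off.

```python
import numpy as np

def conjugate_gradient(M, b, z0=None, tol=1e-9, max_iter=None):
    """Solve Mz = b for symmetric positive definite M.

    Stops when the mean absolute residual (1/n) * sum_i |r_i|
    drops below `tol`, mirroring the loose termination rule in
    the text (there the residual is written r = Mz - b; here we
    track its negation b - Mz, which has the same magnitude).
    """
    n = b.size
    z = np.zeros(n) if z0 is None else np.asarray(z0, dtype=float).copy()
    r = b - M @ z                  # current residual
    p = r.copy()                   # initial search direction
    rs = r @ r
    for _ in range(max_iter or 10 * n):
        if np.mean(np.abs(r)) < tol:
            break                  # loose stopping rule
        Mp = M @ p
        alpha = rs / (p @ Mp)      # exact line search along p
        z += alpha * p
        r -= alpha * Mp
        rs_new = r @ r
        p = r + (rs_new / rs) * p  # next M-conjugate direction
        rs = rs_new
    return z
```

Because W changes only moderately between Newton iterations, warm-starting from the previous z and terminating early typically needs far fewer matrix-vector products than a full solve.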
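A single HMC update of the kind described in Appendix D (leapfrog integration of the fictitious dynamics, then a Metropolis accept/reject on the total energy H = E + K) can be sketched as follows. This is a generic Python/NumPy illustration for a toy target, not the fbm implementation; the function names, step size, and trajectory length are our own.

```python
import numpy as np

def hmc_sample(E, grad_E, q0, eps=0.1, n_leapfrog=20, n_samples=3000, seed=0):
    """One-chain HMC targeting exp(-E(q)).

    Each update draws fresh momenta p (heatbath), integrates the
    dynamics with the leapfrog scheme, and accepts or rejects the
    endpoint based on the change in H = E(q) + 0.5*|p|^2.
    Returns the samples and the acceptance rate."""
    rng = np.random.default_rng(seed)
    q = np.asarray(q0, dtype=float).copy()
    samples, accepts = [], 0
    for _ in range(n_samples):
        p = rng.standard_normal(q.shape)        # fresh momentum variables
        q_new, p_new = q.copy(), p.copy()
        # leapfrog: half kick, alternating drifts and kicks, half kick
        p_new -= 0.5 * eps * grad_E(q_new)
        for _ in range(n_leapfrog - 1):
            q_new += eps * p_new
            p_new -= eps * grad_E(q_new)
        q_new += eps * p_new
        p_new -= 0.5 * eps * grad_E(q_new)
        dH = (E(q_new) + 0.5 * p_new @ p_new) - (E(q) + 0.5 * p @ p)
        if dH <= 0 or rng.random() < np.exp(-dH):
            q, accepts = q_new, accepts + 1     # accept the proposal
        samples.append(q.copy())
    return np.array(samples), accepts / n_samples
```

For a standard Gaussian target, E(q) = 0.5 q·q and grad_E(q) = q; with a small step size the leapfrog error is tiny and nearly all proposals are accepted. In the runs reported below the step size was instead tuned until the rejection rate fell to roughly the target level.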
We aimed for a rejection rate of around 5 percent. If this was exceeded, the stepsize reduction factor was reduced and the simulation run again.

ACKNOWLEDGMENTS

We thank Prof. B. Ripley for making available the Leptograpsus crabs, Pima Indian diabetes, and Forensic Glass datasets. This work was partially supported by EPSRC grant GR/J75425, Novel Developments in Learning Theory for Neural Networks, and much of the work was carried out at Aston University. The authors gratefully acknowledge the hospitality provided by the Isaac Newton Institute for Mathematical Sciences (Cambridge, UK) where this paper was written. We thank Mark Gibbs, David MacKay, and Radford Neal for helpful discussions, and the anonymous referees for their comments which helped improve the paper.

REFERENCES

[1] D. Barber and C.K.I. Williams, "Gaussian Processes for Bayesian Classification via Hybrid Monte Carlo," M.C. Mozer, M.I. Jordan, and T. Petsche, eds., Advances in Neural Information Processing Systems 9. MIT Press, 1997.
[2] M.K. Cowles and B.P. Carlin, "Markov Chain Monte Carlo Convergence Diagnostics—A Comparative Review," J. Am. Statistical Assoc., vol. 91, pp. 883-904, 1996.
[3] N.A.C. Cressie, Statistics for Spatial Data. New York, NY: Wiley, 1993.
[4] S. Duane, A.D. Kennedy, B.J. Pendleton, and D. Roweth, "Hybrid Monte Carlo," Physics Letters B, vol. 195, pp. 216-222, 1987.
[5] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin, Bayesian Data Analysis. London: Chapman and Hall, 1995.
[6] M. Gibbs and D.J.C. MacKay, "Efficient Implementation of Gaussian Processes," draft manuscript, available from http://wol.ra.phy.cam.ac.uk/mackay/homepage.html, 1997.
[7] M. Gibbs and D.J.C. MacKay, "Variational Gaussian Process Classifiers," draft manuscript, available from http://wol.ra.phy.cam.ac.uk/mackay/homepage.html, 1997.
[8] P.J. Green and B.W. Silverman, Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall, 1994.
[9] G. Kimeldorf and G. Wahba, "A Correspondence Between Bayesian Estimation of Stochastic Processes and Smoothing by Splines," Annals of Math. Statistics, vol. 41, pp. 495-502, 1970.
[10] D.J.C. MacKay, "Bayesian Methods for Backpropagation Networks," J.L. van Hemmen, E. Domany, and K. Schulten, eds., Models of Neural Networks II. Springer, 1993.
[11] K.V. Mardia and R.J. Marshall, "Maximum Likelihood Estimation for Models of Residual Covariance in Spatial Regression," Biometrika, vol. 71, no. 1, pp. 135-146, 1984.
[12] P. McCullagh and J. Nelder, Generalized Linear Models. Chapman and Hall, 1983.
[13] M. Møller, "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning," Neural Networks, vol. 6, no. 4, pp. 525-533, 1993.
[14] R.M. Neal, "Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification," Technical Report 9702, Dept. of Statistics, Univ. of Toronto, 1997. Available from http://www.cs.toronto.edu/~radford/.
[15] R.M. Neal, Bayesian Learning for Neural Networks. New York: Springer, 1996. Lecture Notes in Statistics 118.
[16] F. O'Sullivan, B.S. Yandell, and W.J. Raynor, "Automatic Smoothing of Regression Functions in Generalized Linear Models," J. Am. Statistical Assoc., vol. 81, pp. 96-103, 1986.
[17] C.E. Rasmussen, Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. PhD thesis, Dept. of Computer Science, Univ. of Toronto, 1996. Available from http://www.cs.utoronto.ca/~carl/.
[18] B. Ripley, Pattern Recognition and Neural Networks. Cambridge, UK: Cambridge Univ. Press, 1996.
[19] B.D. Ripley, "Statistical Aspects of Neural Networks," O.E. Barndorff-Nielsen, J.L. Jensen, and W.S. Kendall, eds., Networks and Chaos—Statistical and Probabilistic Aspects, pp. 40-123. Chapman and Hall, 1993.
[20] B.D. Ripley, "Flexible Non-Linear Approaches to Classification," V. Cherkassy, J.H. Friedman, and H. Wechsler, eds., From Statistics to Neural Networks, pp. 105-126. Springer, 1994.
[21] B.W. Silverman, "Density Ratios, Empirical Likelihood and Cot Death," Applied Statistics, vol. 27, no. 1, pp. 26-33, 1978.
[22] J. Skilling, "Bayesian Numerical Analysis," W.T. Grandy, Jr. and P. Milonni, eds., Physics and Probability. Cambridge Univ. Press, 1993.
[23] V.N. Vapnik, The Nature of Statistical Learning Theory. New York, NY: Springer Verlag, 1995.
[24] G. Wahba, "A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem," Annals of Statistics, vol. 13, pp. 1378-1402, 1985.
[25] G. Wahba, Spline Models for Observational Data. Soc. for Industrial and Applied Mathematics, 1990. CBMS-NSF Regional Conf. Series in Applied Mathematics.
[26] G. Wahba, C. Gu, Y. Wang, and R. Chappell, "Soft Classification, a.k.a. Risk Estimation, via Penalized Log Likelihood and Smoothing Spline Analysis of Variance," D.H. Wolpert, ed., The Mathematics of Generalization. Addison-Wesley, 1995. Proc. vol. XX in the Santa Fe Institute Studies in the Sciences of Complexity.
[27] C.K.I. Williams, "Computing With Infinite Networks," M.C. Mozer, M.I. Jordan, and T. Petsche, eds., Advances in Neural Information Processing Systems 9. MIT Press, 1997.
[28] C.K.I. Williams and C.E. Rasmussen, "Gaussian Processes for Regression," D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, eds., Advances in Neural Information Processing Systems 8, pp. 514-520. MIT Press, 1996.
[29] S.J. Yakowitz and F. Szidarovszky, "A Comparison of Kriging With Nonparametric Regression Methods," J. Multivariate Analysis, vol. 16, pp. 21-53, 1985.

Chris Williams studied physics at Cambridge, graduating in 1982, and continued on to do Part III Maths (1983). He then received an MSc in water resources at the University of Newcastle upon Tyne before going to work in Lesotho, Southern Africa, in low-cost sanitation. In 1988, he returned to academia, studying neural networks/artificial intelligence with Geoff Hinton at the University of Toronto (MSc 1990, PhD 1994). He was a member of the Neural Computing Research Group at Aston University from 1994 to 1998, and is currently a lecturer in the Division of Informatics at the University of Edinburgh. His research interests cover a wide range of theoretical and practical issues in neural networks, statistical pattern recognition, computer vision, and artificial intelligence.

David Barber completed the University of Cambridge mathematics tripos in 1990. Following a year at Heidelberg University, Germany, he received an MSc in neural networks at King's College, London. Subsequently, he went to the University of Edinburgh, completing a PhD in the theoretical physics department with a study of the statistical mechanics of machine learning. After a two-year research position at Aston University, he took up his current research position at the University of Nijmegen, Holland. His main interests include machine learning, Bayesian techniques, and statistical mechanics.