
Aalto University School of Science and Technology

Department of Biomedical Engineering and Computational Science (BECS)


Espoo, Finland July 6, 2011
Gaussian processes for Bayesian analysis
User guide for Matlab toolbox GPstuff
Version 3.1
Jarno Vanhatalo, Jaakko Riihimäki,
Jouni Hartikainen, and Aki Vehtari
If you use GPstuff, please use reference:
Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen and Aki Vehtari (2011).
Bayesian Modeling with Gaussian Processes using the MATLAB Toolbox GPstuff, submitted.
(This manual is subject to some modifications.)
contact: Aki.Vehtari@tkk.fi
Revision history
July 2010       First printing. For GPstuff 2.0.
September 2010  New functionalities: mean functions for FULL GP, derivative observations for
                squared exponential covariance in FULL GP. For GPstuff 2.1.
April 2011      New syntax. For GPstuff 3.1.
Contents

1 Introduction
  1.1 Bayesian modeling
  1.2 Gaussian process models
    1.2.1 Gaussian process prior

2 Inference and prediction
  2.1 Conditional posterior of the latent function
    2.1.1 The posterior mean and covariance
    2.1.2 Gaussian observation model
    2.1.3 Laplace approximation
    2.1.4 Expectation propagation algorithm
    2.1.5 Markov chain Monte Carlo
  2.2 Marginal likelihood given parameters
  2.3 Marginalization over parameters
    2.3.1 Maximum a posteriori estimate of parameters
    2.3.2 Grid integration
    2.3.3 Monte Carlo integration
    2.3.4 Central composite design

3 GPstuff basics: regression and classification
  3.1 Gaussian process regression: demo_regression1
    3.1.1 Constructing the model
    3.1.2 MAP estimate for the hyperparameters
    3.1.3 Marginalization over parameters with grid integration
    3.1.4 Marginalization over parameters with MCMC
  3.2 Gaussian process classification
    3.2.1 Constructing the model
    3.2.2 Inference with Laplace approximation
    3.2.3 Inference with expectation propagation
    3.2.4 Inference with MCMC

4 Sparse Gaussian processes
  4.1 Compactly supported covariance functions
  4.2 FIC and PIC sparse approximations
  4.3 DTC, SOR, VAR approximations
    4.3.1 Comparison of sparse GP models
    4.3.2 Sparse GP models with non-Gaussian likelihoods

5 Model assessment and comparison
  5.1 Marginal likelihood
  5.2 Predictive performance estimates
  5.3 Cross-validation
    5.3.1 Leave-one-out cross-validation
    5.3.2 k-fold cross-validation
  5.4 DIC
  5.5 Model assessment demos
    5.5.1 demo_modelassessment1
    5.5.2 demo_modelassessment2

6 Covariance functions
  6.1 Neural network covariance function
  6.2 Additive models
    6.2.1 Additive models demo: demo_periodic
    6.2.2 Additive models with sparse approximations
  6.3 Additive covariance functions with selected variables
  6.4 Product of covariance functions

7 Special observation models
  7.1 Robust regression with Student-t likelihood
    7.1.1 Regression with Student-t distribution
  7.2 Models for spatial epidemiology
    7.2.1 Disease mapping with Poisson likelihood: demo_spatial1
    7.2.2 Disease mapping with negative binomial likelihood
  7.3 Log-Gaussian Cox process
  7.4 Binomial observation model
  7.5 Derivative observations in GP regression
    7.5.1 GP regression with derivatives: demo_derivativeobs

8 Mean functions
  8.1 Explicit basis functions
    8.1.1 Mean functions in GPstuff: demo_regression_meanf

A Function list
  A.1 GP
  A.2 Diagnostic tools
  A.3 Distributions
  A.4 Monte Carlo
  A.5 Miscellaneous
  A.6 Optimization

B Covariance functions

C Observation models

D Priors

E Transformation of hyperparameters
Acknowledgments
The coding of the GPstuff toolbox started in 2006, based on the MCMCStuff toolbox
(1998-2006, http://www.lce.hut.fi/research/mm/mcmcstuff/), which in turn was based on
the Netlab toolbox (1996-2001, http://www.mathworks.com/matlabcentral/fileexchange/2654-netlab).
The GPstuff toolbox has been developed by the BECS Bayes group at Aalto University.
The main authors of GPstuff have been Jarno Vanhatalo, Jaakko Riihimäki, Jouni
Hartikainen and Aki Vehtari, but the package contains code written by many more
people. In the Bayesian methodology group at the Department of Biomedical Engineering
and Computational Science (BECS) at Aalto University these persons are (in
alphabetical order): Toni Auranen, Roberto Calandra, Pasi Jylänki, Tuomas Nikoskinen,
Eero Pennala, Heikki Peura, Ville Pietiläinen, Markus Siivola, Simo Särkkä and Ville
Tolvanen. People outside Aalto University are (in alphabetical order): Christopher M.
Bishop, Timothy A. Davis, Dirk-Jan Kroon, Ian T. Nabney, Radford M. Neal and Carl
E. Rasmussen. We want to thank them all for sharing their code under a free software
license.
The research leading to GPstuff was funded by Helsinki University of Technology,
Aalto University, the Academy of Finland, the Finnish Funding Agency for Technology
and Innovation (TEKES), and the Graduate School in Electronics, Telecommunications
and Automation (GETA). Jarno Vanhatalo also thanks the Finnish Foundation for
Economic and Technology Sciences KAUTE, the Finnish Cultural Foundation, and the Emil
Aaltonen Foundation for supporting his postgraduate studies, during which the code
package was prepared.
Preface
This is a manual for the software package GPstuff, which is a collection of Matlab
functions for building and analysing Bayesian models built on Gaussian processes. The
purpose of the manual is to help people to use the software in their own work and possibly
to modify and extend its features. The manual consists of two short introductory sections
on Bayesian inference and Gaussian processes, which introduce the topic and the
notation used throughout the manual. The theory is not covered extensively, and readers
not familiar with it are advised to see the references for a more complete discussion.
After the introductory part, Chapter 2 is devoted to inference techniques. Gaussian
process models lead to analytically unsolvable models, for which reason efficient
approximate methods are essential. The techniques implemented in GPstuff are
introduced at a general level, and references are given to direct the reader to more detailed
discussions of the methods.
The rest of the manual discusses the basic models implemented in the package and
demonstrates their usage. This discussion begins in Chapter 3, which considers
simple Gaussian process regression and classification problems. These examples also serve
as examples of how to use the basic functions in GPstuff, and all the rest of the
examples build on the considerations in this chapter. The rest of the chapters concentrate
on more special model constructions, such as sparse Gaussian processes, additive
models, and various observation models. Also, functions for model assessment and
comparison are discussed. All these considerations are more or less individual examples
that can be read if needed. The essential parts that everyone should know are
covered by the end of Chapter 3.
Chapter 4 discusses how various sparse approximations can be used with GPstuff.
Chapter 5 discusses the methods provided for model assessment and comparison.
Chapter 6 tells more about some covariance functions and how to combine them.
Chapter 7 reviews some special observation models implemented in GPstuff.
Chapter 8 reviews how non-zero mean functions can be used in GPstuff. Appendix
A lists the functions in the toolbox (Contents.m). Appendices B-D collect technical
details and lists of the covariance functions, observation models and prior distributions
implemented. Appendix E discusses the reparameterization used and its effect on
the implementation.
Chapter 1
Introduction
1.1 Bayesian modeling
Building a Bayesian model, denoted by M, starts with an observation model which
contains the description of the data generating process, or its approximation. The observation
model is denoted by p(D | θ, M), where θ stands for the parameters and D for
the observations. The observation model quantifies a conditional probability for the data
given the parameters, and when viewed as a function of the parameters it is called the
likelihood. If the parameter values were known, the observation model would contain all
the knowledge of the phenomenon and could be used as such. If the observations contain
randomness, sometimes called noise, one would still be uncertain of the future
observations, but could not reduce this uncertainty since everything that can be known
exactly would be encoded in the observation model. Usually, the parameter values are
not known exactly, but there is only limited knowledge of their possible values. This
prior information is formulated mathematically by the prior probability p(θ | M), which
reflects our beliefs and knowledge about the parameter values before observing data.
As opposed to the aleatory uncertainty encoded in the observation model, the epistemic
uncertainty present in the prior can be reduced by gathering more information on the
phenomenon (for a more illustrative discussion on the differences between these two
sources of uncertainty see O'Hagan (2004)). Bayesian inference is the process of
updating our prior knowledge based on new observations; in other words, it is the process
of reducing the epistemic uncertainty.
The cornerstone of Bayesian inference is Bayes' theorem, which defines the
conditional probability of the parameters after observing the data,

    p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M).   (1.1)

This is called the posterior distribution and it contains all the information about the parameters
conveyed from the data D by the model. The normalization constant

    p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ   (1.2)

is equal to the conditional probability that the data came from the model M given
our model assumptions. It is also called the marginal likelihood of the model. The
model, M, stands for all the hypotheses and assumptions that are made about the phenomenon.
It embodies the functional forms of the observation model and the prior,
which are always tied together, as well as our subjective assumptions used to define
these mathematical abstractions. Because everything is conditioned on M, it is a
redundant symbol and as such is omitted from this point on. Usually we are not able to define
the correct model, and most of the time we have only a limited ability to encode our prior
beliefs in the mathematical formulation. Still, many models turn out to be useful in
practice.
The true power of the Bayesian approach comes from the possibility to construct and
analyze hierarchical models. In hierarchical models, prior probabilities are assigned
also to the parameters of the prior. Let us write the prior as p(θ | γ), where γ denotes the
parameters of the prior distribution, the hyperparameters. By setting a hyperprior, p(γ), for
the hyperparameters we obtain a hierarchical model structure where the fixed parameter
values move further away from the data. This allows more flexible models and leads
to vaguer priors, which is beneficial if the modeler is unsure of the specific form
of the prior. In theory the hierarchy could be extended as far as one desires, but there
are practical limits on how many levels of hierarchy are reasonable (Goel and DeGroot,
1981).
The models implemented in GPstuff are hierarchical models where the parameter θ
is replaced by a latent function f(x). The observation model is built such that an individual
observation depends on the function value at a certain input location x. The latent
function f(x) is given a Gaussian process (GP) prior (Rasmussen and Williams, 2006),
whose properties are defined by a mean and a covariance function and the hyperparameters
related to them. The hierarchy is continued to a third level by giving a hyperprior
for the covariance function parameters. The assumption is that there is a functional
description of the studied phenomenon, which we are not aware of, and the observations
are noisy realizations of this underlying function. The power of this construction lies
in the flexibility and non-parametric form of the GP prior. We can use simple parametric
observation models that describe the assumed observation noise. The assumptions
about the functional form of the phenomenon are encoded in the GP prior. Since the GP is
a non-parametric model we do not need to fix the functional form of the latent function,
but we can give implicit statements about it. These statements are encoded in the mean
and covariance function, which determine, for example, the smoothness and variability
of the function.
1.2 Gaussian process models
The general form of the models in GPstuff can be written as follows:

    observation model:   y | f, φ ~ ∏_{i=1}^n p(y_i | f_i, φ)      (1.3)
    GP prior:            f(x) | θ ~ GP(m(x), k(x, x′ | θ))          (1.4)
    hyperprior:          θ, φ ~ p(θ) p(φ).                          (1.5)

Here y = [y_1, ..., y_n]ᵀ is a vector of observations (target values) at the (input) locations
X = {x_i = [x_{i,1}, ..., x_{i,d}]ᵀ}_{i=1}^n. f(x) : ℝ^d → ℝ is a latent function with value
f_i = f(x_i) at input location x_i. A boldface notation will denote a set of latent variables
in a vector f = [f_1, ..., f_n]ᵀ. Here, the inputs are real valued vectors, but in general other
inputs, such as strings or graphs, are possible as well. θ denotes the parameters in the
covariance function k(x, x′ | θ), and φ denotes the parameters in the observation model
p(y | f, φ). The notation will be slightly abused since p(y | f, φ) is also used for the
likelihood function. For simplicity of presentation the mean function is considered
zero, m(x) ≡ 0, throughout this manual. In Section 8.1.1 we describe how to use non-zero
mean functions in GPstuff. As will be seen, the class of models described by
the equations (1.3)-(1.5) is rather rich. Even though the observation model is assumed
to factorize given the latent variables f, the correlations between the observations are
incorporated into the model via the GP prior, and the marginalized observation model
p(y | θ, φ) = ∫ df p(f | θ) ∏_{i=1}^n p(y_i | f_i, φ) is no longer factorizable.
Gaussian processes are certainly not a new invention. Early examples of their usage
can be found, for example, in time series analysis and filtering (Wiener, 1949) and in
geostatistics (e.g. Matheron, 1973). GPs are still widely and actively used in these fields,
and useful overviews are provided by Cressie (1993), Grewal and Andrews (2001),
Diggle and Ribeiro (2007), and Gelfand et al. (2010). O'Hagan (1978) was one of the
first to consider Gaussian processes in a general probabilistic modeling context. He
provided a general theory of Gaussian process prediction and utilized it for a number
of regression problems. This general regression framework was later rediscovered as
an alternative for neural network models (Williams and Rasmussen, 1996; Rasmussen,
1996) and extended to problems other than regression (Neal, 1997; Williams and
Barber, 1998). This machine learning perspective is comprehensively summarized by
Rasmussen and Williams (2006).
1.2.1 Gaussian process prior
The advantage of using GPs is that we can conduct the inference directly in the function
space. The GP prior implies that any set of function values f, indexed by the input
coordinates X, has a multivariate Gaussian prior distribution

    p(f | X, θ) = N(f | 0, K_f,f),   (1.6)

where K_f,f is the covariance matrix. Notice that the distribution over functions will be
denoted by GP(·, ·), whereas the distribution over a finite set of latent variables will be
denoted by N(·, ·). The covariance matrix is constructed from a covariance function,
[K_f,f]_{i,j} = k(x_i, x_j | θ), which characterizes the correlation between different points
in the process, E[f(x_i)f(x_j)] = k(x_i, x_j | θ). The covariance function encodes the prior
assumptions on the latent function, such as the smoothness and scale of the variation,
and can be chosen freely as long as the covariance matrices it produces are symmetric
and positive semi-definite (vᵀ K_f,f v ≥ 0, ∀v ∈ ℝⁿ). An example of a stationary
covariance function is the squared exponential

    k_se(x_i, x_j | θ) = σ²_se exp( −(1/2) Σ_{k=1}^d (x_{i,k} − x_{j,k})² / l²_k ),   (1.7)

where θ = [σ²_se, l_1, ..., l_d] is a vector of parameters. Here, σ²_se is the scaling parameter
and l_k is the length-scale, which governs how fast the correlation decreases as the
distance increases in the direction k. The squared exponential is only one example of
possible covariance functions. Discussion on other common covariance functions is
given, for example, by Diggle and Ribeiro (2007), Finkenstädt et al. (2007) and Rasmussen
and Williams (2006). All the covariance functions implemented in GPstuff are
summarized in Appendix B.
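To make the covariance function concrete, the following short base-MATLAB sketch evaluates (1.7) for three two-dimensional inputs; the variable names (sigma2_se, ell, X) are ours and this is an illustration only, not GPstuff code (GPstuff builds covariance matrices with functions such as gp_trcov, see Chapter 3).

% Illustrative sketch of the squared exponential covariance (1.7); not GPstuff code.
sigma2_se = 1; ell = [1 1];            % magnitude and length-scales (assumed values)
X = [0 0; 1 0; 3 2];                   % three 2-dimensional inputs, one per row
n = size(X,1);
K = zeros(n);
for i = 1:n
  for j = 1:n
    K(i,j) = sigma2_se * exp(-0.5 * sum(((X(i,:)-X(j,:))./ell).^2));
  end
end
disp(K)                                % symmetric and positive semi-definite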
By definition, the marginal distribution of any subset of latent variables can be
constructed by simply taking the appropriate submatrix of the covariance and subvector
of the mean. Imagine that we want to predict the values f̃ at new input locations X̃.
The joint prior for the latent variables f and f̃ is

    [f; f̃] | X, X̃, θ ~ N( 0, [K_f,f, K_f,f̃; K_f̃,f, K_f̃,f̃] ),   (1.8)

where K_f,f = k(X, X | θ), K_f,f̃ = k(X, X̃ | θ) and K_f̃,f̃ = k(X̃, X̃ | θ). Here, the covariance
function k(·, ·) denotes also vector and matrix valued functions, k(x, X) : ℝ^d × ℝ^{d×n} → ℝ^{1×n}
and k(X, X) : ℝ^{d×n} × ℝ^{d×n} → ℝ^{n×n}. The marginal distribution of f̃ is
p(f̃ | X̃, θ) = N(f̃ | 0, K_f̃,f̃), like the marginal distribution of f given in (1.6).
This is illustrated in Figure 1.1. The conditional distribution of a set of latent variables
given another set of latent variables is Gaussian as well. For example, the distribution
of f̃ given f is

    f̃ | f, X, X̃, θ ~ N( K_f̃,f K_f,f⁻¹ f, K_f̃,f̃ − K_f̃,f K_f,f⁻¹ K_f,f̃ ),   (1.9)

which can be interpreted as the posterior predictive distribution of f̃ after observing
the function values at the locations X. The posterior distribution (1.9) generalizes to
a Gaussian process with mean function m(x̃ | θ) = k(x̃, X | θ) K_f,f⁻¹ f and covariance
k(x̃, x̃′ | θ) = k(x̃, x̃′ | θ) − k(x̃, X | θ) K_f,f⁻¹ k(X, x̃′ | θ), which define the posterior
distribution of the latent function f(x̃) at an arbitrary input vector x̃. The posterior GP is
illustrated in Figure 1.2.
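The conditional distribution (1.9) is easy to reproduce numerically. The sketch below, in plain MATLAB and with illustrative input locations and function values (not those of Figure 1.2), computes the conditional mean and covariance of f̃ on a grid given four exactly observed function values; se_cov is a hypothetical helper implementing a squared exponential with unit length-scale and magnitude.

% Conditional (posterior) mean and covariance of eq. (1.9); illustrative values only.
se_cov = @(a,b) exp(-0.5*(bsxfun(@minus, a, b').^2));   % squared exponential, l = 1, sigma2 = 1
x  = [0.7 1.3 2.4 3.9]';              % observed input locations
f  = [-1 1 0 -2]';                    % observed (noise-free) function values
xt = linspace(0, 5, 100)';            % prediction grid
Kff = se_cov(x, x);  Ktf = se_cov(xt, x);  Ktt = se_cov(xt, xt);
m_cond = Ktf / Kff * f;               % K_{f~,f} K_{f,f}^{-1} f
C_cond = Ktt - Ktf / Kff * Ktf';      % K_{f~,f~} - K_{f~,f} K_{f,f}^{-1} K_{f,f~}
plot(xt, m_cond, x, f, 'o')           % conditional mean passes through the observations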
[Figure 1.1: four panels over input x ∈ [0, 5] — Random draws from GP; Marginal density p(f(x_i) | θ); Marginal of [f1, f2]ᵀ; Marginal of [f1, f3]ᵀ; Marginal of [f1, f4]ᵀ.]

Figure 1.1: An illustration of a Gaussian process. The upper left figure presents three
functions drawn randomly from a zero-mean GP with squared exponential covariance
function. The hyperparameters are l = 1 and σ² = 1, and the grey shading represents the
central 95% probability interval. The upper right subfigure presents the marginal
distribution for a single function value. The lower subfigures present three marginal
distributions between two function values at distinct input locations, shown in the upper
left subfigure by a dashed line. It can be seen that the correlation between the function
values f(x_i) and f(x_j) is the greater the closer x_i and x_j are to each other.
[Figure 1.2: four panels matching those of Figure 1.1 — Random draws from GP; Marginal density p(f(x_i) | θ); Marginals of [f1, f2]ᵀ, [f1, f3]ᵀ and [f1, f4]ᵀ.]

Figure 1.2: A conditional (posterior) GP p(f̃ | f, θ). The observations f = [f(0.7) = 1,
f(1.3) = 1, f(2.4) = 0, f(3.9) = 2]ᵀ are plotted with circles in the upper left subfigure,
and the prior GP is illustrated in Figure 1.1. When comparing the subfigures to the
equivalent ones in Figure 1.1 we can see a clear distinction between the marginal and
the conditional GP. Here, all the function samples travel through the observations, the
mean is no longer zero, and the covariance is non-stationary.
Chapter 2
Inference and prediction
As was stated in the previous section, the essential task in Bayesian inference is to solve
the posterior distribution of the quantities of interest. In an ideal situation the posterior
distribution could be solved analytically, but most of the time we need to approximate
it. GPstuff is built so that the first inference step is to form (either analytically or
approximately) the conditional posterior of the latent variables given the parameters

    p(f | D, θ, φ) = p(y | f, φ) p(f | X, θ) / ∫ p(y | f, φ) p(f | X, θ) df,   (2.1)

where the integral over f in the denominator is difficult for non-Gaussian likelihoods.
Section 2.1 discusses the approximation of the posterior of f, first in general terms, and
then presents the exact posterior in the case of a Gaussian likelihood and the Laplace,
expectation propagation (EP) and Markov chain Monte Carlo (MCMC) approximations
in the case of a non-Gaussian likelihood. The approximations use the unnormalized posterior

    p(f | D, θ, φ) ∝ p(y | f, φ) p(f | X, θ),   (2.2)

and the normalization is computed from the Gaussian approximation itself (Laplace,
EP) or is not needed at all (MCMC).
Section 2.3 treats the problem of marginalizing over the parameters to obtain the
marginal posterior distribution for the latent variables

    p(f | D) = ∫ p(f | D, θ, φ) p(θ, φ | D) dθ dφ.   (2.3)

The posterior predictive distributions can be obtained similarly by first evaluating
the conditional posterior predictive distribution, for example p(f̃ | D, θ, φ, x̃), and then
marginalizing over the parameters. The joint predictive distribution p(ỹ | D, θ, φ, X̃)
would require integration over the possibly high dimensional distribution p(f̃ | D, θ, φ, X̃),
but usually we are interested only in the predictive distribution for each ỹ_i separately.
Since the observation model is factorizable, this requires only one-dimensional integrals
(except for multioutput models)

    p(ỹ_i | D, x̃_i, θ, φ) = ∫ p(ỹ_i | f̃_i, φ) p(f̃_i | D, x̃_i, θ, φ) df̃_i.   (2.4)

For many observation models, which do not have free parameters φ, this integral
reduces to marginalization over f̃_i only.
2.1 Conditional posterior of the latent function
2.1.1 The posterior mean and covariance
If the parameters are considered fixed, the GP's marginalization and conditioning
properties can be exploited in the prediction. Assume that we have found the conditional
posterior distribution p(f | D, θ, φ), which in general is not Gaussian. We can
then evaluate the posterior predictive mean simply by using the expression of the conditional
mean E_{f̃|f,θ,φ}[f(x̃)] = k(x̃, X) K_f,f⁻¹ f (see equation (1.9) and the text below
it). Since this holds for any set of latent variables f̃, we obtain a parametric posterior
mean function

    m_p(x̃ | D, θ, φ) = ∫ E_{f̃|f,θ,φ}[f(x̃)] p(f | D, θ, φ) df = k(x̃, X | θ) K_f,f⁻¹ E_{f|D,θ,φ}[f].   (2.5)

The posterior predictive covariance between any set of latent variables f̃ can be evaluated
from

    Cov_{f̃|D,θ,φ}[f̃] = E_{f|D,θ,φ}[ Cov_{f̃|f}[f̃] ] + Cov_{f|D,θ,φ}[ E_{f̃|f}[f̃] ],   (2.6)

where the first term simplifies to the conditional covariance in equation (1.9) and the
second term can be written as k(x̃, X) K_f,f⁻¹ Cov_{f|D,θ,φ}[f] K_f,f⁻¹ k(X, x̃′). Plugging
these into the equation and simplifying gives us the posterior covariance function

    k_p(x̃, x̃′ | D, θ, φ) = k(x̃, x̃′ | θ) − k(x̃, X | θ) ( K_f,f⁻¹ − K_f,f⁻¹ Cov_{f|D,θ,φ}[f] K_f,f⁻¹ ) k(X, x̃′ | θ).   (2.7)

From this on, the posterior predictive mean and covariance will be denoted m_p(x̃) and
k_p(x̃, x̃′).
Even if the exact posterior p(f̃ | D, θ, φ) is not available in closed form, we can
still approximate its posterior mean and covariance functions if we can approximate
E_{f|D,θ,φ}[f] and Cov_{f|D,θ,φ}[f]. A common practice is to approximate the posterior p(f | D, θ, φ)
either with MCMC (e.g. Neal, 1997, 1998; Diggle et al., 1998; Kuss and Rasmussen,
2005; Christensen et al., 2006) or by giving an analytic approximation to it
(e.g. Williams and Barber, 1998; Gibbs and MacKay, 2000; Minka, 2001; Csató and
Opper, 2002; Rue et al., 2009). The analytic approximations considered here assume
a Gaussian form, in which case it is natural to approximate the predictive distribution
with a Gaussian as well. In this case the equations (2.5) and (2.7) give its mean and
covariance. The Gaussian approximation can be justified if the conditional posterior
is unimodal, which it is, for example, if the likelihood is log-concave (this can easily
be seen by evaluating the Hessian of the posterior p(f | D, θ)), and there is enough
data so that the posterior will be close to Gaussian. A pragmatic justification for using
the Gaussian approximation is that many times it suffices to approximate well the mean
and variance of the latent variables. These, on the other hand, fully define a Gaussian
distribution, and one can approximate the integrals over f by using the Gaussian form
for their conditional posterior. Detailed considerations on the approximation error and
the asymptotic properties of the Gaussian approximation are presented, for example,
by Rue et al. (2009) and Vanhatalo et al. (2010).
2.1.2 Gaussian observation model
A special case of an observation model for which the conditional posterior of the latent
variables can be evaluated analytically is the Gaussian distribution, y_i ~ N(f_i, σ²),
where the parameter φ represents the noise variance σ². Since both the likelihood and
the prior are Gaussian functions of the latent variable, we can integrate analytically
over f to obtain

    p(y | X, θ, σ²) = N(y | 0, K_f,f + σ²I).   (2.8)

Setting this in the denominator of the equation (2.1) gives a Gaussian distribution also
for the conditional posterior of the latent variables

    f | D, θ, σ² ~ N( K_f,f (K_f,f + σ²I)⁻¹ y, K_f,f − K_f,f (K_f,f + σ²I)⁻¹ K_f,f ).   (2.9)

Since the conditional posterior of f is Gaussian, the posterior process, or distribution
p(f̃ | D), is also Gaussian. By placing the mean and covariance from (2.9) in the
equations (2.5) and (2.7) we obtain the predictive distribution

    f̃ | D, θ, σ² ~ GP( m_p(x̃), k_p(x̃, x̃′) ),   (2.10)

where the mean is m_p(x̃) = k(x̃, X)(K_f,f + σ²I)⁻¹ y and the covariance k_p(x̃, x̃′) = k(x̃, x̃′)
− k(x̃, X)(K_f,f + σ²I)⁻¹ k(X, x̃′). The predictive distribution for new observations ỹ
can be obtained by integrating p(ỹ | D, θ, σ²) = ∫ p(ỹ | f̃, σ²) p(f̃ | D, θ, σ²) df̃. The result
is, again, Gaussian with mean E_{f̃|D,θ}[f̃] and covariance Cov_{f̃|D,θ}[f̃] + σ²I.
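The predictive equations (2.8)-(2.10) can be written directly in a few lines of plain MATLAB. The sketch below uses a toy one-dimensional data set and a squared exponential covariance; all names are illustrative and the same quantities are obtained in GPstuff with gp_pred (Chapter 3).

% Exact GP regression predictions, eq. (2.10); illustrative sketch, not GPstuff code.
se = @(a,b,s2,l) s2*exp(-0.5*(bsxfun(@minus, a, b').^2)/l^2);
x  = (0:0.5:5)';  y = sin(x) + 0.1*randn(size(x));   % toy data
xt = linspace(0, 5, 200)';                           % prediction inputs
s2 = 1; l = 1; sn2 = 0.01;                           % magnitude, length-scale, noise variance
Kff = se(x, x, s2, l);  Ktf = se(xt, x, s2, l);
L = chol(Kff + sn2*eye(numel(x)), 'lower');
alpha = L'\(L\y);                                    % (K_f,f + sn2*I)^{-1} y
Eft   = Ktf*alpha;                                   % predictive mean m_p(xt)
v     = L\Ktf';
Varft = s2 - sum(v.^2, 1)';                          % diagonal of k_p(xt, xt)
Vary  = Varft + sn2;                                 % predictive variance of new observations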
2.1.3 Laplace approximation
In the Laplace approximation, the mean is approximated by the posterior mode of f and
the covariance by the curvature of the log posterior at the mode. The approximation
is constructed from the second order Taylor expansion of log p(f | y, θ, φ) around the
mode f̂, which gives a Gaussian approximation to the conditional posterior

    p(f | D, θ, φ) ≈ q(f | D, θ, φ) = N(f | f̂, Σ),   (2.11)

where f̂ = arg max_f p(f | D, θ, φ) and Σ⁻¹ is the Hessian of the negative log conditional
posterior at the mode (Gelman et al., 2004; Rasmussen and Williams, 2006):

    Σ⁻¹ = −∇∇ log p(f | D, θ, φ)|_{f=f̂} = K_f,f⁻¹ + W,   (2.12)

where W is a diagonal matrix with entries W_ii = −∇_{f_i}∇_{f_i} log p(y_i | f_i, φ)|_{f_i=f̂_i}. Note
that actually the Taylor expansion is made for the unnormalized posterior and the
normalization term is computed from the Gaussian approximation.
Here, the approximation scheme is called the Laplace method following Williams
and Barber (1998), but essentially the same approximation is named the Gaussian
approximation by Rue et al. (2009) in their integrated nested Laplace approximation (INLA)
scheme for Gaussian Markov random field models.
The posterior mean of f(x̃) can be approximated from the equation (2.5) by replacing
the posterior mean E_{f|D,φ}[f] by f̂. The posterior covariance is approximated similarly
by using (K_f,f⁻¹ + W)⁻¹ in the place of Cov_{f|D,φ}[f]. After some rearrangements
and using K_f,f⁻¹ f̂ = ∇ log p(y | f)|_{f=f̂}, the approximate posterior predictive distribution
is

    f̃ | D, θ, φ ~ GP( m_p(x̃), k_p(x̃, x̃′) ),   (2.13)

where the mean and covariance are m_p(x̃) = k(x̃, X) ∇ log p(y | f)|_{f=f̂} and
k_p(x̃, x̃′) = k(x̃, x̃′) − k(x̃, X)(K_f,f + W⁻¹)⁻¹ k(X, x̃′). The approximate conditional predictive
densities of new observations ỹ_i can now be evaluated, for example, with quadrature
integration over each f̃_i separately,

    p(ỹ_i | D, θ, φ) ≈ ∫ p(ỹ_i | f̃_i, φ) q(f̃_i | D, θ, φ) df̃_i.   (2.14)
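The mode f̂ in (2.11)-(2.12) is found with a Newton iteration. The sketch below shows the idea for a Poisson observation model with log link (log p(y_i | f_i) = y_i f_i − exp(f_i) − log y_i!), using the numerically stable formulation of Rasmussen and Williams (2006); it is an illustration with toy data, not the GPstuff implementation.

% Newton iteration for the Laplace approximation mode; illustrative sketch only.
x = (1:20)'/4;
K = exp(-0.5*(bsxfun(@minus, x, x').^2));     % squared exponential prior covariance
y = round(exp(sin(x)));                       % toy count data
n = numel(y);
f = zeros(n,1);                               % start from the prior mean
for iter = 1:50
  g  = y - exp(f);                            % gradient of log p(y | f)
  W  = exp(f);                                % W_ii = -d^2 log p(y_i | f_i)/df_i^2
  sW = sqrt(W);
  B  = eye(n) + (sW*sW').*K;                  % B = I + W^{1/2} K W^{1/2}
  L  = chol(B, 'lower');
  b  = W.*f + g;
  a  = b - sW.*(L'\(L\(sW.*(K*b))));
  fnew = K*a;                                 % equals (K^{-1} + W)^{-1} b
  if max(abs(fnew - f)) < 1e-6, f = fnew; break; end
  f = fnew;
end
% At convergence f is the mode and (K^{-1} + W)^{-1} the approximate covariance.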
2.1.4 Expectation propagation algorithm
The Laplace method constructs a Gaussian approximation at the posterior mode and
approximates the posterior covariance via the curvature of the log density at that point.
The expectation propagation (EP) algorithm (Minka, 2001), for its part, tries to minimize
the Kullback-Leibler divergence from the true posterior to its approximation

    q(f | D, θ, φ) = (1/Z_EP) p(f | θ) ∏_{i=1}^n t_i(f_i | Z̃_i, μ̃_i, σ̃²_i),   (2.15)

where the likelihood terms have been replaced by site functions t_i(f_i | Z̃_i, μ̃_i, σ̃²_i) =
Z̃_i N(f_i | μ̃_i, σ̃²_i) and the normalizing constant by Z_EP. The EP algorithm updates the site
parameters Z̃_i, μ̃_i and σ̃²_i sequentially. At each iteration, first the i'th site is removed from
the i'th marginal posterior to obtain a cavity distribution q_{-i}(f_i) = q(f_i | D, θ, φ)/t_i(f_i).
The second step is to find a Gaussian q̂(f_i) which minimizes the Kullback-Leibler divergence
from the cavity distribution multiplied by the exact likelihood for that site, that
is, q̂(f_i) = arg min_q KL( q_{-i}(f_i) p(y_i | f_i) || q(f_i) ). This is equivalent to matching the
first and second moments of the two distributions (Seeger, 2008). The site terms Z̃_i are
scaling parameters which ensure that Z_EP ≈ p(D | θ, φ). After the moments are solved,
the parameters of the local approximation t_i are updated so that the new marginal posterior
q_{-i}(f_i) t_i(f_i) matches the moments of q̂(f_i). Finally, the parameters of the
approximate posterior (2.15) are updated to give

    q(f | D, θ, φ) = N( f | K_f,f (K_f,f + Σ̃)⁻¹ μ̃, K_f,f − K_f,f (K_f,f + Σ̃)⁻¹ K_f,f ),   (2.16)

where Σ̃ = diag[σ̃²_1, ..., σ̃²_n] and μ̃ = [μ̃_1, ..., μ̃_n]ᵀ. The iterations are continued until
convergence. The predictive mean and covariance are again obtained from equations
(2.5) and (2.7) analogously to the Laplace approximation.
From the equations (2.9), (2.11), and (2.16) it can be seen that the Laplace and EP
approximations are similar to the exact solution with the Gaussian observation model.
The diagonal matrices W⁻¹ and Σ̃ correspond to the noise variance σ²I and, thus,
the two approximations can be seen as Gaussian approximations to the likelihood
(Nickisch and Rasmussen, 2008).
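Once the site parameters have converged, forming the approximate posterior (2.16) is a pure linear-algebra step. The sketch below does this for two latent variables with made-up site means and variances (mu_s, s2_s) and prior covariance K; in practice these values come from the moment-matching iterations described above.

% Forming the EP approximate posterior (2.16) from given site parameters; toy values.
K    = [1.0 0.5; 0.5 1.0];                    % prior covariance K_f,f
mu_s = [0.3; -0.2];  s2_s = [0.8; 1.2];       % site means and variances (assumed)
S    = diag(s2_s);                            % Sigma_tilde
A    = K / (K + S);                           % K_f,f (K_f,f + Sigma_tilde)^{-1}
m_q  = A * mu_s;                              % mean of q(f | D)
C_q  = K - A * K;                             % covariance of q(f | D)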
2.1.5 Markov chain Monte Carlo
The accuracy of the approximations considered so far is limited by the Gaussian form
of the approximating function. An approach which gives the exact solution in the limit
of infinite computational time is Monte Carlo integration (Robert and Casella, 2004).
This is based on sampling from the unnormalized posterior p(f | D, θ, φ) ∝ p(y | f, φ) p(f | X, θ)
and using the samples to represent the posterior distribution. The posterior marginals
can be visualized with histograms and posterior statistics approximated with sample
means. For example, the posterior expectation of f using M samples is

    E_{f|D,θ,φ}[f] ≈ (1/M) Σ_{i=1}^M f^{(i)},   (2.17)

where f^{(i)} is the i'th sample from the conditional posterior. The problem with Monte
Carlo methods is how to draw samples from arbitrary distributions. The challenge can
be overcome with Markov chain Monte Carlo methods (Gilks et al., 1996), where one
constructs a Markov chain whose stationary distribution is the posterior distribution
p(f | D, θ, φ) and uses the Markov chain samples to obtain Monte Carlo estimates. A
rather efficient sampling algorithm is hybrid Monte Carlo (HMC) (Duane et al., 1987;
Neal, 1996), which utilizes the gradient information of the posterior distribution to
direct the sampling towards interesting regions. A significant improvement in the mixing
of the sample chain of the latent variables can be obtained by using the variable
transformation discussed in (Christensen et al., 2006; Vanhatalo and Vehtari, 2007). After
obtaining a posterior sample of the latent variables, we can sample from the posterior
predictive distribution of any set f̃ simply by drawing, for each f^{(i)}, one f̃^{(i)} from
p(f̃ | f^{(i)}, θ), which is given in equation (1.9). Similarly, we can obtain a sample of ỹ
by drawing one ỹ^{(i)} for each f̃^{(i)} from p(ỹ | f̃^{(i)}, φ).

[Figure 2.1: (a) Disease mapping, (b) Classification.]

Figure 2.1: Illustration of the Laplace approximation (solid line), EP (dashed line) and
MCMC (histogram) for the conditional posterior of a latent variable p(f_i | D, θ) in two
applications. On the left, a disease mapping problem with Poisson observation model
(used in Vanhatalo et al., 2010) where the Gaussian approximation works well. On the
right, a classification problem with probit likelihood (used in Vanhatalo and Vehtari,
2010) where the posterior is skewed and the Gaussian approximation is quite poor.
2.2 Marginal likelihood given parameters
In Bayesian inference we integrate over the unknown latent values to get the marginal
likelihood. In the Gaussian case this is straightforward since it has an analytic solution (see
equation (2.8)), and the log marginal likelihood is

    log p(D | θ, σ²) = −(n/2) log(2π) − (1/2) log|K_f,f + σ²I| − (1/2) yᵀ(K_f,f + σ²I)⁻¹ y.   (2.18)

If the observation model is not Gaussian, the marginal likelihood needs to be approximated.
The Laplace approximation to the marginal likelihood (the denominator in
equation (2.1)) is constructed, for example, by writing

    p(D | θ, φ) = ∫ p(y | f, φ) p(f | X, θ) df = ∫ exp(g(f)) df,   (2.19)

after which a second order Taylor expansion of g(f) is done around f̂. This gives a
Gaussian integral over f multiplied by a constant, and results in the approximation

    log p(D | θ, φ) ≈ log q(D | θ, φ) = −(1/2) f̂ᵀ K_f,f⁻¹ f̂ + log p(y | f̂, φ) − (1/2) log|B|,   (2.20)

where |B| = |I + W^{1/2} K_f,f W^{1/2}|. Rue et al. (2009) use the same approximation
when they define p(θ | D) ∝ p(y, f, θ, φ)/q(f | D, θ, φ)|_{f=f̂}, where the denominator is
the Laplace approximation in equation (2.11) (see also Tierney and Kadane, 1986).
EP's marginal likelihood approximation is the normalization constant

    Z_EP = ∫ p(f | X, θ) ∏_{i=1}^n Z̃_i N(f_i | μ̃_i, σ̃²_i) df   (2.21)

in equation (2.15). This is a Gaussian integral multiplied by a constant ∏_{i=1}^n Z̃_i, giving

    log Z_EP = −(1/2) log|K_f,f + Σ̃| − (1/2) μ̃ᵀ(K_f,f + Σ̃)⁻¹ μ̃ + C_EP,   (2.22)

where C_EP collects the terms that are not explicit functions of θ or φ (there is an implicit
dependence through the iterative algorithm, though).
Note that in the actual implementation the Laplace and EP marginal likelihood approximations
are obtained as a by-product when computing the conditional posterior of f.
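For the Gaussian case, (2.18) is cheap to evaluate with a Cholesky factorization, using log|K_f,f + σ²I| = 2 Σ_i log L_ii. The sketch below uses toy data and illustrative variable names; it is not the GPstuff implementation.

% Log marginal likelihood (2.18) for the Gaussian observation model; illustrative sketch.
x  = (0:0.5:5)';  y = sin(x) + 0.1*randn(size(x));   % toy data
K  = exp(-0.5*(bsxfun(@minus, x, x').^2));           % prior covariance (SE, l = 1, s2 = 1)
s2 = 0.01;                                           % noise variance
n  = numel(y);
L  = chol(K + s2*eye(n), 'lower');
alpha = L'\(L\y);
logml = -0.5*(y'*alpha) - sum(log(diag(L))) - 0.5*n*log(2*pi);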
2.3 Marginalization over parameters
The previous section treated methods to evaluate exactly (the Gaussian case) or approx-
imately (Laplace and EP approximations) the log marginal likelihood given parameters.
In this section we describe approaches for estimating parameters or integrating numer-
ically over them.
2.3.1 Maximum a posteriori estimate of parameters
In the full Bayesian approach we should integrate over all unknowns. Given that we have
integrated over the latent variables, it often happens that the posterior of the parameters
is peaked or the predictions are insensitive to small changes in the parameter values. In such
a case we can approximate the integral over p(θ, φ | D) with the maximum a posteriori (MAP)
estimate

    [θ̂, φ̂] = arg max_{θ,φ} p(θ, φ | D) = arg min_{θ,φ} [ −log p(D | θ, φ) − log p(θ, φ) ].   (2.23)

In this approximation, the parameter values are given a point mass one at the posterior
mode, and the latent function marginal is approximated as p(f | D) ≈ p(f | D, θ̂, φ̂).
The log marginal likelihood, and thus also the log posterior, is differentiable with
respect to the parameters, which allows gradient-based optimization. The gradients of
the Laplace-approximated log marginal likelihood (2.20) can be solved analytically,
too. In EP the parameters C_EP, Σ̃ and μ̃ can be considered constants when differentiating
the function with respect to the parameters (Seeger, 2005), for which reason
gradient-based optimization is possible also with EP.
The advantage of the MAP estimate is that it is relatively easy and fast to evaluate.
According to our experience, good optimization algorithms usually need at most tens
of optimization steps to find the mode. However, the MAP estimate may underestimate
the uncertainty in the parameters.
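The following sketch illustrates the idea of (2.23) for the Gaussian model of Section 2.1.2: minimize the negative log marginal likelihood (2.18) over log-transformed parameters with a generic optimizer (with flat priors on the log parameters this coincides with the MAP estimate). Everything here is illustrative; GPstuff does the same with gp_optim (Section 3.1.2) and analytic gradients. Save the sketch as map_sketch.m to run it.

function map_sketch()
% Illustrative MAP/ML estimation of SE covariance parameters; not GPstuff code.
x  = (0:0.25:5)';  y = sin(x) + 0.1*randn(size(x));  % toy data
w0 = log([1 1 0.1]);                                 % log([length-scale, magnitude, noise var])
w_hat = fminsearch(@(w) negml(w, x, y), w0);
disp(exp(w_hat))
end

function e = negml(w, x, y)
% negative log marginal likelihood (2.18) for a squared exponential GP
l = exp(w(1)); s2 = exp(w(2)); sn2 = exp(w(3));
K = s2 * exp(-0.5*(bsxfun(@minus, x, x').^2) / l^2);
L = chol(K + sn2*eye(numel(x)), 'lower');
a = L'\(L\y);
e = 0.5*(y'*a) + sum(log(diag(L))) + 0.5*numel(y)*log(2*pi);
end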
2.3.2 Grid integration
Grid integration is based on a weighted sum of the integrand evaluated at grid points,

    p(f | D) ≈ Σ_{i=1}^M p(f | D, ϑ_i) p(ϑ_i | D) Δ_i.   (2.24)

Here ϑ = [θᵀ, φᵀ]ᵀ and Δ_i denotes the area weight assigned to an evaluation point ϑ_i.
The first step in exploring log p(ϑ | D) is to find its posterior mode as described in
the previous section. After this we evaluate the negative Hessian of log p(ϑ | D) at the
mode, which would be the inverse covariance matrix for ϑ if the density were Gaussian.
If P is the inverse Hessian (the approximate covariance) with eigendecomposition P =
UCUᵀ, then ϑ can be defined as

    ϑ(z) = ϑ̂ + UC^{1/2} z,   (2.25)

where z is a standardized variable. If p(ϑ | D) were a Gaussian density, then z would be
zero mean normally distributed. This re-parametrization corrects for scale and rotation
and simplifies the integration (Rue et al., 2009). The exploration of log p(ϑ | D) is
started from the mode ϑ̂ and continued so that the bulk of the posterior mass is included
in the integration. The grid points are set evenly along the directions z, and thus the
area weights Δ_i are equal. This is illustrated in Figure 2.2(a) (see also Rue et al., 2009;
Vanhatalo et al., 2010).
The grid search is feasible only for a small number of parameters since the number
of grid points grows exponentially with the dimension of the parameter space, d. For
example, the number of the nearest neighbors of the mode increases as O(3^d), which
results in 728 grid points already for d = 6. If also the second neighbors are included,
the number of grid points increases as O(5^d), resulting in 15624 grid points for six
parameters.
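The re-parametrization (2.25) is straightforward to implement once the mode and the approximate covariance are available. The sketch below uses made-up values for the mode w_hat and covariance P of two log-transformed parameters; in GPstuff the whole procedure is carried out by gp_ia (Section 3.1.3).

% Generating grid points with eq. (2.25); w_hat and P are illustrative values.
w_hat = [3.1; -1.5];                      % posterior mode of the (log) parameters
P = [0.02 0.01; 0.01 0.03];               % approximate covariance (inverse Hessian)
[U, C] = eig(P);                          % P = U C U^T
[z1, z2] = meshgrid(-3:1:3);              % regular grid in the z-space (d = 2)
Z  = [z1(:) z2(:)]';
TH = bsxfun(@plus, w_hat, U*sqrt(C)*Z);   % candidate parameter values, one per column
% Points with negligible posterior density relative to the mode would be discarded
% before forming the sum (2.24) with equal area weights.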
2.3.3 Monte Carlo integration
Monte Carlo integration works better than grid integration in large parameter spaces
since its error decreases at a rate that is independent of the dimension (Robert and
Casella, 2004). There are two options for finding a Monte Carlo estimate for the marginal
posterior p(f | D). The first option is to sample only the parameters from their marginal
posterior p(ϑ | D) or from its approximation (see Figure 2.2(b)). In this case, the posterior
marginals are approximated with mixture distributions as in the grid integration.
The other option is to run a full MCMC for all the parameters in the model. That is,
we sample both the parameters and the latent variables and estimate the needed posterior
statistics by sample estimates or by histograms (Neal, 1997; Diggle et al., 1998).
The full MCMC is performed by alternate sampling from the conditional posteriors
p(f | D, ϑ) and p(ϑ | D, f). Sampling both the parameters and the latent variables is usually
awfully slow since there is a strong correlation between them (Vanhatalo and Vehtari,
2007; Vanhatalo et al., 2010). Sampling from the (approximate) marginal, p(ϑ | D), is a
much easier task since the parameter space is smaller.
The parameters can be sampled from their marginal posterior (or its approximation)
either with HMC, slice sampling (SLS) (Neal, 2003) or via importance sampling
(Geweke, 1989). In importance sampling, we use a normal or Student-t proposal distribution
with mean ϑ̂ and covariance P approximated with the negative Hessian of the
log posterior, and approximate the integral with

    p(f | D) ≈ (1/Σ_{i=1}^M w_i) Σ_{i=1}^M q(f | D, ϑ_i) w_i,   (2.26)

where w_i = q(ϑ^{(i)})/g(ϑ^{(i)}) are the importance weights. Importance sampling is adequate
only if the importance weights do not vary substantially. Thus, the goodness of
the Monte Carlo integration can be monitored using the importance weights. The worst
scenario occurs when the importance weights are small with high probability and with
small probability get very large values (that is, the tails of q are much wider than those
of g). Problems can be detected by monitoring the cumulative normalized weights and
the estimate of the effective sample size M_eff = 1/Σ_{i=1}^M w̃²_i, where w̃_i = w_i/Σ_j w_j
(Geweke, 1989; Gelman et al., 2004; Vehtari and Lampinen, 2002). In some situations
the naive Gaussian or Student-t proposal distribution is not adequate since the
posterior distribution q(ϑ | D) may be non-symmetric or the covariance estimate P is
poor. In these situations, we use the scaled Student-t proposal distribution proposed
by Geweke (1989). In this approach, the scale of the proposal distribution is adjusted
along each main direction defined by P so that the importance weights are maximized.
Although Monte Carlo integration is more efficient than grid integration for high
dimensional problems, it also has its downside. For most examples, a few hundred
independent samples are enough for reasonable posterior summaries (Gelman et al., 2004),
which seems achievable. The problem, however, is that we are not able to draw independent
samples from the posterior. Even with a careful tuning of the Markov chain samplers
the autocorrelation is usually so large that the required sample size is in the thousands,
which makes Monte Carlo integration computationally very demanding compared to,
for example, the MAP estimate.
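The weight diagnostics mentioned above are simple to compute. In the sketch below the target and proposal densities are toy Gaussians and the variable names are illustrative.

% Importance weights and effective sample size M_eff = 1/sum(w_tilde.^2); toy example.
th   = randn(1000,1);                                 % M = 1000 draws from the proposal g = N(0,1)
logg = -0.5*th.^2 - 0.5*log(2*pi);                    % proposal log density
logq = -0.5*(th/1.5).^2 - log(1.5) - 0.5*log(2*pi);   % target log density, N(0, 1.5^2)
logw = logq - logg;
logw = logw - max(logw);                              % avoid overflow before exponentiation
w    = exp(logw);
wtil = w / sum(w);                                    % normalized importance weights
M_eff = 1 / sum(wtil.^2);                             % small M_eff indicates unreliable weights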
2.3.4 Central composite design
Rue et al. (2009) suggest a central composite design (CCD) for choosing the representative
points from the posterior of the parameters when the dimensionality of the
parameters, d, is moderate or high. In this setting, the integration is considered as a
quadratic design problem in a d-dimensional space with the aim of finding points that
allow one to estimate the curvature of the posterior distribution around the mode. The
design used by Rue et al. (2009) and Vanhatalo et al. (2010) is the fractional factorial
design (Sanchez and Sanchez, 2005) augmented with a center point and a group of 2d
star points. In this setting, the design points are all on the surface of a d-dimensional
sphere and the star points consist of two points along each axis (2d points in total), as
illustrated in Figure 2.2(c). The integration is then a finite sum with special weights,
summarized by the equation (2.24).
The design points are searched after transforming ϑ into the z-space, in which it is assumed
to be a standard Gaussian variable. The integration weights can then be determined
from the statistics of a standard Gaussian variable, E[zᵀz] = d, E[z] = 0 and E[1] = 1,
which result in

    Δ = [ (n_p − 1) exp(−d f₀²/2) (f₀² − 1) ]⁻¹   (2.27)

for the points on the sphere and Δ₀ = 1 for the central point (see the appendix of
Vanhatalo et al. (2010) for a more detailed derivation).
2.3. MARGINALIZATION OVER PARAMETERS 15
log(lengthscale)
l
o
g
(
m
a
g
n
i
t
u
d
e
)
z
1
z
2
2.8 3 3.2 3.4
3
2.5
2
1.5
1
0.5
(a) Grid based.
log(lengthscale)
l
o
g
(
m
a
g
n
i
t
u
d
e
)
2.8 3 3.2 3.4
3
2.5
2
1.5
1
0.5
(b) Monte Carlo
log(lengthscale)
l
o
g
(
m
a
g
n
i
t
u
d
e
)
2.8 3 3.2 3.4
3
2.5
2
1.5
1
0.5
(c) Central composite design
Figure 2.2: Illustration of the grid based, Monte Carlo and central composite design
integration. Contours showthe posterior density q(log()[D) and the integration points
are marked with dots. The left gure shows also the vectors z along which the points
are searched in the grid integration and central composite design. The integration is
conducted over q(log()[D) rather than q([D) since the former is closer to Gaussian.
Reproduced from (Vanhatalo et al., 2010)
The CCD integration speeds up the computations considerably compared to the grid
search or Monte Carlo integration, since the number of the design points grows very
moderately; for example, for d = 6 one needs only 45 points. The accuracy of the CCD
is between the MAP estimate and the full integration with the grid search or Monte
Carlo. Rue et al. (2009) report good results with this integration scheme, and it has
worked well in our experiments as well. CCD tries to incorporate the posterior variance
of the parameters in the inference and seems to give a good approximation in high
dimensions. Since CCD is based on the assumption that the posterior of the parameters
is (close to) Gaussian, the densities p(ϑ_i | D) at the points on the circumference should
be monitored in order to detect serious discrepancies from this assumption. These
densities are identical if the posterior is Gaussian, and thereby great variability in their
values indicates that CCD has failed. The posterior of the parameters may be far from a
Gaussian distribution, but for a suitable transformation the approximation may work well.
For example, the covariance function parameters should be transformed through the
logarithm, as discussed in Section 3.1.
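As a numerical illustration of (2.27), the sketch below computes the sphere-point weight for d = 6, where the 45 design points stated above are consistent with a 32-point fractional factorial part plus 2d star points and one central point; the value of the scaling parameter f0 is an assumption made here for illustration only.

% CCD integration weights, eq. (2.27); f0 is an assumed design parameter.
d   = 6;
n_p = 32 + 2*d + 1;                                       % 45 design points for d = 6
f0  = 1.1;                                                % sphere scaling (assumed)
delta0 = 1;                                               % weight of the central point
delta  = 1 / ((n_p - 1) * exp(-d*f0^2/2) * (f0^2 - 1));   % weight of each point on the sphere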
Chapter 3
GPstuff basics: regression and
classification
In this chapter, we illustrate the use of the GPstuff package with two classical examples,
regression and classification. The regression task has a Gaussian observation model and
forms an important special case of the GP models, since it is the only model where we
are able to marginalize over the latent variables analytically. Thus, it serves as a good
starting point for demonstrating the use of the software package. The classification
problem is a usual textbook example of a task with a non-Gaussian observation model,
where we have to utilize the approximate methods discussed in the previous chapter.
This chapter serves also as a general introduction to the GPstuff software package.
In the next few sections, we will introduce and discuss many of the functionalities of
the package that will be present in the more advanced models as well.
3.1 Gaussian process regression: demo_regression1
The demonstration program demo_regression1 considers a simple regression problem,
where we wish to infer a latent function f(x) given measurements corrupted by
i.i.d. Gaussian noise, y_i = f(x_i) + ε_i, where ε_i ~ N(0, σ²). In this particular
example the input vectors x_i are two-dimensional. This results in the overall model

    y | f, σ² ~ N(f, σ²I),              (3.1)
    f(x) | θ ~ GP(0, k(x, x′ | θ)),     (3.2)
    θ, σ² ~ p(θ) p(σ²).                 (3.3)

We will show how to construct the model with a squared exponential covariance function
and how to conduct the inference.
3.1.1 Constructing the model
In GPstuff, the model is constructed in the following three steps:

- create structures that define the likelihood and covariance function
- define priors for the parameters
- create a GP structure, which stores all of the above

gp =
    type: 'FULL'
    lik: [1x1 struct]
    cf: {[1x1 struct]}
    infer_params: 'covariance+likelihood'
    jitterSigma2: 0
lik =
    type: 'Gaussian'
    sigma2: 0.0400
    p: [1x1 struct]
    fh: [1x1 struct]
gpcf =
    type: 'gpcf_sexp'
    lengthScale: [1.1000 1.2000]
    magnSigma2: 0.0400
    p: [1x1 struct]
    fh: [1x1 struct]

Figure 3.1: The GP, likelihood and covariance function structures in demo_regression1.

These three steps are done as follows:
lik = lik_gaussian('sigma2', 0.2^2);
gpcf = gpcf_sexp('lengthScale', [1.1 1.2], 'magnSigma2', 0.2^2);
pn = prior_logunif();
lik = lik_gaussian(lik, 'sigma2_prior', pn);
pl = prior_t();
pm = prior_sqrtunif();
gpcf = gpcf_sexp(gpcf, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
gp = gp_set('type', 'FULL', 'lik', lik, 'cf', {gpcf});
Here lik_gaussian initializes the likelihood function corresponding to the Gaussian
observation model (lik_gaussian provides the code for both the likelihood function
and the observation model, but for naming simplicity the lik prefix is used) and its
parameter values. The function gpcf_sexp initializes the covariance function and its
parameter values. lik_gaussian returns the structure lik and gpcf_sexp returns
gpcf, which contain all the information needed in the evaluations (function handles,
parameter values etc.). The next five lines create the prior structures for the parameters
of the likelihood and the covariance function, which are set in the likelihood and
covariance function structures. The prior for the noise variance is the usual uninformative
log-uniform prior. The priors for the length-scale and magnitude are Student-t distributions,
which work as weakly informative priors (Gelman, 2006) (note that the prior is set for the
square root of magnSigma2, i.e., the magnitude). The last line creates the GP structure by
giving it the likelihood and covariance function and setting the type to 'FULL', which
means that the model is a regular GP without sparse approximations (see Chapter 4 for
other types). Type 'FULL' is the default and could be left out in this case.
In the GPstuff toolbox, the Gaussian process structure is the fundamental unit
around which all the other blocks of the model are collected. This is illustrated in
Figure 3.1. In the gp structure, type defines the type of the model, lik contains the
likelihood structure, cf contains the covariance function structure, infer_params
defines which parameters are inferred (covariance, likelihood etc.) and will be discussed
in detail in Chapter 4, and jitterSigma2 is a small constant which is
added to the diagonal of the covariance matrix to make it numerically more stable. If
there were more than one covariance function, they would be handled additively; see
Section 6.2 for details. The likelihood and covariance structures are similar to the GP
structure. The likelihood structure contains the type of the likelihood, its parameters
(here sigma2), a prior structure (p) and a structure holding function handles for internal
use (fh). The covariance structure contains the type of the covariance function, its parameters
(here lengthScale and magnSigma2), a prior structure (p) and a structure holding
function handles for internal use (fh).
Using the constructed GP structure, we can evaluate basic summaries such as
covariance matrices, make predictions with the present parameter values, etc. For example,
the covariance matrices K_f,f and C = K_f,f + σ²_noise I for three two-dimensional
input vectors can be evaluated as follows:
example_x = [-1 -1 ; 0 0 ; 1 1];
[K, C] = gp_trcov(gp, example_x)
K =
0.0400 0.0054 0.0000
0.0054 0.0400 0.0054
0.0000 0.0054 0.0400
C =
0.0800 0.0054 0.0000
0.0054 0.0800 0.0054
0.0000 0.0054 0.0800
Here K is K_f,f and C is K_f,f + σ²_noise I.
3.1.2 MAP estimate for the hyperparameters
The MAP estimate for the parameters can be obtained with the function gp_optim, which
works as a wrapper for the usual gradient-based optimization functions. It can be used as
follows:
opt=optimset('TolFun',1e-3,'TolX',1e-3,'Display','iter');
gp=gp_optim(gp,x,y,'optimf',@fminscg,'opt',opt);
gp_optim takes a GP structure, the training inputs x and training targets y (which are defined
in demo_regression1) and options, and returns a GP structure with optimized
hyperparameter values. By default gp_optim uses the fminscg function, but it
can also use, for example, fminlbfgs or fminunc. Optimization options are set
with the optimset function. All the estimated parameter values can be easily checked
using the function gp_pak, which packs all the parameter values from all the covariance
function structures into one vector, usually using a log transformation (other
transformations are also possible). The second output argument of gp_pak lists the labels
for the parameters:

[w,s] = gp_pak(gp);
disp(exp(w)), disp(s)

It is also possible to set the parameter vector of the model to desired values using
gp_unpak:

gp = gp_unpak(gp,w);

It is possible to control which parameters are optimized in two ways. 1) If any parameter
is given an empty prior it is considered fixed. 2) It is possible to use gp_set to set
the option infer_params, which controls which groups of parameters (covariance,
likelihood or inducing) are inferred.
To make predictions for new locations x̃, given the training data (x, y), we can use
the gp_pred function, which returns the posterior predictive mean and variance for
each f(x̃) (see equation (2.10)). This is illustrated below, where we create a regular
grid on which the posterior mean and variance are computed. The posterior mean m_p(x̃)
and the training data points are shown in Figure 3.2.

[xt1,xt2]=meshgrid(-1.8:0.1:1.8,-1.8:0.1:1.8);
xt=[xt1(:) xt2(:)];
[Eft_map, Varft_map] = gp_pred(gp, x, y, xt);

[Figure 3.2: (a) the predictive mean surface and training data; (b) the marginal posterior predictive distributions p(f_i | D).]

Figure 3.2: The predictive mean surface, training data, and the marginal posterior for
two latent variables in demo_regression1. Histograms show the MCMC solution
and the grid integration solution is drawn with a line.

[Figure 3.3: histograms of the marginal posteriors of length-scale 1, length-scale 2, magnitude and noise variance.]

Figure 3.3: The posterior distribution of the hyperparameters together with the MAP
estimate (black cross). The results are from demo_regression1.
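As a hedged follow-up to the code above (not part of the demo itself), the predictive mean returned by gp_pred is a column vector with one value per row of xt, so it can be reshaped back onto the grid for plotting:

Zt = reshape(Eft_map, size(xt1));            % back to the 2-D grid defined by meshgrid
mesh(xt1, xt2, Zt)
hold on, plot3(x(:,1), x(:,2), y, 'o'), hold off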
3.1.3 Marginalization over parameters with grid integration
To integrate over the parameters we can use any method described in the section 2.3.
The grid integration is performed with the following line:
[gp_array, P_TH, th, Eft_ia, Varft_ia, fx_ia, x_ia] = ...
    gp_ia(gp, x, y, xt, 'int_method', 'grid');
gp_ia returns an array of GPs (gp_array) for the parameter values th ([ϑ_i]_{i=1}^M) with
weights P_TH ([p(ϑ_i | D)Δ_i]_{i=1}^M). Since we use the grid method, the weights are
proportional to the marginal posterior and Δ_i ≡ 1 ∀i (see Section 2.3.2). The last four
outputs are returned only if the prediction locations xt (representing X̃) are given:
Eft_ia and Varft_ia contain the predictive mean and variance at the prediction
locations, and the last two output arguments can be used to plot the predictive distribution
p(f̃_i | D) as demonstrated in Figure 3.2, where x_ia contains a regular grid of values
f̃_i and fx_ia contains p(f̃_i | D) at those values.
gp_ia optimizes the hyperparameters to their posterior mode, evaluates the Hessian
P⁻¹ with finite differences, and makes a grid search starting from the mode.
3.1.4 Marginalization over parameters with MCMC
The main function for conducting Markov chain sampling is gp_mc, which loops through all the specified samplers in turn and saves the sampled parameters in a record structure. In later sections we will discuss models where the latent variables are also sampled, but here we concentrate on the covariance function parameters. Sampling and predictions can be made as follows:
hmc_opt = hmc2_opt;
[rfull,g,opt] = gp_mc(gp, x, y, 'nsamples', 400, 'repeat', 5, 'hmc_opt', hmc_opt);
rfull = thin(rfull, 50, 2);
[Eft_mc, Varft_mc] = gpmc_preds(rfull, x, y, xt);
The gp_mc function generates nsamples Markov chain samples. Between consecutive saved samples, repeat (here 5) sampler iterations are done. At each iteration gp_mc runs the actual samplers. For example, giving the option hmc_opt tells gp_mc to run the hybrid Monte Carlo sampler with the sampling options stored in the structure hmc_opt. The default sampling options for HMC are set by the hmc2_opt function. The function thin removes the burn-in from the sample chain (here 50) and thins the chain with a user-defined parameter (here 2). This way we can decrease the autocorrelation between the remaining samples. GPstuff also provides other diagnostic tools for Markov chains, and in a detailed MCMC analysis we should use several starting points and monitor the convergence carefully (Gelman et al., 2004; Robert and Casella, 2004). Figure 3.3 shows an example of approximating the posterior distribution of the parameters with Monte Carlo sampling.
The function gpmc_preds returns the conditional predictive mean and variance for each sampled parameter value. These are E_{p(f̃ | X̃, D, θ^{(s)})}[f̃] and Var_{p(f̃ | X̃, D, θ^{(s)})}[f̃], s = 1, ..., M, where M is the number of samples. The marginal predictive mean and variance can be computed directly with gp_pred.
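A minimal sketch (using the variables from the sampling example above) of how the per-sample outputs of gpmc_preds can be combined into marginal predictive moments; this is simply a mixture mean together with the law of total variance, not a GPstuff function.

% Marginal predictive mean and variance from the M per-sample columns
Eft_marg   = mean(Eft_mc, 2);
Varft_marg = mean(Varft_mc, 2) + var(Eft_mc, 0, 2);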
3.2 Gaussian process classification
We will now consider binary GP classification (see demo_classific) with observations y_i ∈ {−1, 1}, i = 1, ..., n, associated with inputs X = {x_i}_{i=1}^n.
The observations are considered to be drawn from a Bernoulli distribution with success probability p(y_i = 1 | x_i). The probability is related to the latent function via a sigmoid function that transforms it onto the unit interval. GPstuff provides probit and logit transformations, which lead to the observation models

p_probit(y_i | f(x_i)) = Φ(y_i f(x_i)) = ∫_{−∞}^{y_i f(x_i)} N(z | 0, 1) dz,  (3.4)

p_logit(y_i | f(x_i)) = 1 / (1 + exp(−y_i f(x_i))).  (3.5)
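As a small side illustration (not part of the GPstuff code), the two links in (3.4) and (3.5) can be written as plain MATLAB anonymous functions; y is a label in {−1, +1} and f a latent value, both hypothetical here.

% Probit link via the complementary error function, Phi(x) = 0.5*erfc(-x/sqrt(2))
p_probit = @(y, f) 0.5 * erfc(-(y .* f) / sqrt(2));   % equation (3.4)
p_logit  = @(y, f) 1 ./ (1 + exp(-y .* f));           % equation (3.5)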
Since the likelihood is not Gaussian, we need to use the approximate inference methods discussed in Section 2.1. We will conduct the inference with the Laplace approximation, EP and full MCMC. With the two analytic approximations we optimize the parameters to their MAP estimates and use those point estimates for prediction. In the MCMC approach, we sample both the parameters and the latent variables. For a detailed discussion of the differences between the MCMC, Laplace and EP approaches in the classification setting, see Kuss and Rasmussen (2005) and Nickisch and Rasmussen (2008).
3.2.1 Constructing the model
The model construction for classification closely follows the steps presented in the previous section. The model is constructed as follows:
lik = lik_probit();
gpcf = gpcf_sexp('lengthScale', [0.9 0.9], 'magnSigma2', 10);
gp = gp_set('lik', lik, 'cf', {gpcf}, 'jitterSigma2', 1e-9);
The above lines first initialize the likelihood function, the covariance function and the GP structure. The default uniform priors for the length-scale and magnitude are used. The model construction and the inference with the logit likelihood would be similar with lik_logit. A small jitter value is added to the diagonal of the training covariance to make certain matrix operations more stable.
3.2.2 Inference with Laplace approximation
The MAP estimate for the parameters can be found using gp_optim as
gp = gp_set(gp, 'latent_method', 'Laplace');
gp = gp_optim(gp, x, y, 'opt', opt);
The first line defines which inference method is used for the latent variables. It initializes the Laplace algorithm and sets the needed fields in the GP structure. The default latent method is Laplace, so this line could be omitted. The next line uses the default fminscg optimization function with the same options as above.
gp_pred can be used to obtain the mean and variance for the latent variables and
new observations together with the predictive probability for a test observation.
[Eft_la, Varft_la, Eyt_la, Varyt_la, pyt_la] = ...
    gp_pred(gp, x, y, xt, 'yt', ones(size(xt,1),1));
The first four input arguments are the same as in Section 3.1, and the fifth and sixth arguments form a parameter-value pair. The parameter yt tells that we give test observations related to xt as optional inputs for gp_pred. Here we want to evaluate the probability of observing class 1 and thus give a vector of ones as test observations. The predictive densities of the test observations are returned in the fifth output argument pyt_la. The training data together with the predicted class probability contours are visualized in Figure 3.4(a).
Figure 3.4: The class probability contours for the Laplace approximation (a), expectation propagation (b), and the MCMC solution (c). The strongest line in the middle is the 50% probability contour (the decision boundary) and the thinner lines are the 2.5%, 25%, 75% and 97.5% probability contours. The decision boundary is approximated similarly by all the methods, but there are larger differences elsewhere.

Figure 3.5: The marginal posterior for two latent variables in demo_classific1. The histogram is the MCMC solution, the dashed line is the Laplace approximation and the solid line the EP estimate. The left figure is for the latent variable at location [1.0, 0.2] and the right figure at [0.9, 0.9].
3.2.3 Inference with expectation propagation
The EP approximation is used similarly to the Laplace approximation. We only need to set the latent method to EP:
gp = gp_set(gp, 'latent_method', 'EP');
gp = gp_optim(gp, x, y, 'opt', opt);
[Eft_ep, Varft_ep, Eyt_ep, Varyt_ep, pyt_ep] = ...
    gp_pred(gp, x, y, xt, 'yt', ones(size(xt,1),1));
The EP solution for the predicted class probabilities is visualized in Figure 3.4(b), and Figure 3.5 shows the marginal predictive posterior for two latent variables.
3.2.4 Inference with MCMC
The MCMC solution with a non-Gaussian likelihood is found similarly to the Gaussian model discussed earlier. The major difference is that now we also need to sample the latent variables. We set the latent method to MCMC:
gp = gp_set(gp, 'latent_method', 'MCMC', 'jitterSigma2', 1e-4);
For MCMC we need to add a larger jitter value to the diagonal to avoid numerical problems. By default, sampling from the latent value distribution p(f | θ, D) is done using @scaled_mh, which implements the scaled Metropolis-Hastings algorithm (Neal, 1998). Another sampler provided by GPstuff is the scaled HMC, @scaled_hmc (Vanhatalo and Vehtari, 2007). Below, the actual sampling is performed similarly to the Gaussian likelihood case.
% Set the parameters for MCMC
hmc_opt = hmc2_opt;
hmc_opt.steps = 10;
hmc_opt.stepadj = 0.05;
hmc_opt.nsamples = 1;
latent_opt.display = 0;
latent_opt.repeat = 20;
latent_opt.sample_latent_scale = 0.5;
hmc2('state', sum(100*clock))
[r,g,opt] = gp_mc(gp, x, y, 'hmc_opt', hmc_opt, 'latent_opt', latent_opt, ...
                  'nsamples', 1, 'repeat', 15);
% Re-set some of the sampling options
hmc_opt.steps = 4;
hmc_opt.stepadj = 0.05;
latent_opt.repeat = 5;
hmc2('state', sum(100*clock));
% Sample
[rgp,g,opt] = gp_mc(gp, x, y, 'nsamples', 400, 'hmc_opt', hmc_opt, ...
                    'latent_opt', latent_opt, 'record', r);
% Make predictions
[Efs_mc, Varfs_mc, Eys_mc, Varys_mc, Pys_mc] = ...
    gpmc_preds(rgp, x, y, xt, 'yt', ones(size(xt,1),1));
Pyt_mc = mean(Pys_mc, 2);
The HMC options are set in the hmc_opt structure in a similar manner as in the regression example. Since we are also sampling the latent variables, we need to give options for their sampler as well; these are set in the latent_opt structure. The options specific to gp_mc are given with the parameter-value pairs 'nsamples', 1, 'repeat', 15. The above lines also demonstrate how the sampling can be continued from an old sample chain. The first call to gp_mc returns a record structure with only one sample. This record is then given as an optional parameter to gp_mc in the second call, and the sampling is continued from the previously sampled parameter values. The sampling options are also modified between the two successive sampling phases, and the modified options are given to gp_mc. The line hmc2('state', sum(100*clock)); re-sets the state of the random number generators in the HMC sampler. The last two lines evaluate the predictive statistics similarly to the EP and Laplace approximations. However, now the statistics are matrices whose columns contain the result for one MCMC sample each. The gp_mc function handles the sampling so that it first samples the latent variables from p(f | θ, D) using the scaled Metropolis-Hastings sampler, after which it samples the hyperparameters from p(θ | f, D). This is repeated until nsamples samples are drawn.
The MCMC solution for the predicted class probabilities is visualized in Figure 3.4(c), and Figure 3.5 shows the marginal predictive posterior for two latent variables. It can be seen that there are some differences between the three approximations. Here MCMC is the most accurate, then comes EP, and Laplace is the least accurate. However, the inference times line up in the opposite order. The difference between the approximations is not always this large. For example, in spatial epidemiology with a Poisson observation model (see Section 7.2), the Laplace and EP approximations work, in our experience, practically as well as MCMC.
Chapter 4
Sparse Gaussian processes
The main challenges in using Gaussian process models are the rapidly increasing computational time and memory requirements. The evaluation of the inverse and determinant of the covariance matrix in the log marginal likelihood (or its approximation) and its gradient scales as O(n^3) in time, which restricts the implementation of GP models to moderate-size data sets. For this reason, a number of sparse Gaussian process approximations have been introduced in the literature.
4.1 Compactly supported covariance functions
A compactly supported (CS) covariance function gives zero correlation between data points whose distance exceeds a certain threshold, leading to a sparse covariance matrix. The challenge in constructing CS covariance functions is to guarantee their positive definiteness. A globally supported covariance function cannot be cut arbitrarily to obtain a compact support, since the resulting function would not, in general, be positive definite. Sansò and Schuh (1987) provide one of the early implementations of spatial prediction with CS covariance functions. Their functions are built by self-convolving finite-support symmetric kernels (such as a linear spline). These are, however, special functions for one or two dimensions. Wu (1995) introduced radial basis functions with compact support and Wendland (1995) developed them further. Later, for example, Gaspari and Cohn (1999), Gneiting (1999, 2002), and Buhmann (2001) have worked more on the subject. The CS functions implemented in GPstuff are Wendland's piecewise polynomials k_{pp,q} (Wendland, 2005), such as
k_{pp,2} = (σ²_{pp}/3) (1 − r)_+^{j+2} [ (j² + 4j + 3) r² + (3j + 6) r + 3 ],  (4.1)

where j = ⌊d/2⌋ + 3 and r² = Σ_{k=1}^{d} (x_{i,k} − x_{j,k})²/l_k². These functions correspond to
processes that are q times mean square differentiable and are positive definite up to an input dimension d. Thus, the degree of the polynomial has to be increased alongside the input dimension. The dependence of CS covariance functions on the input dimension is fundamental: there are no radial compactly supported functions that are positive definite on R^d for every d, but they are always restricted to a finite number of dimensions (see e.g. Wendland, 1995, theorem 9.2).
The key idea of using CS covariance functions is that, roughly speaking, one uses
only the nonzero elements of the covariance matrix in the calculations. This may speed
up the calculations substantially, since in some situations only a fraction of the elements of the covariance matrix are non-zero. In practice, efficient sparse matrix routines are needed (Davis, 2006). These are nowadays a standard utility in many statistical computing packages, such as MATLAB or R, or are available as an additional package for them. CS covariance functions have been rather widely studied in geostatistics applications. The early works concentrated on their theoretical properties and aimed to approximate known globally supported covariance functions (Gneiting, 2002; Furrer et al., 2006; Moreaux, 2008). There the computational speed-up is obtained using efficient linear solvers for the prediction equation f̃ = K_{f̃,f}(K_{f,f} + σ²I)^{-1} y. The parameters are fitted to either the empirical covariance or a globally supported covariance function. Kaufman et al. (2008) study the maximum likelihood estimates for tapered covariance functions (i.e. products of globally supported and CS covariance functions), where the magnitude can be solved analytically and the length-scale is optimized using a line search in one dimension. The benefits from a sparse covariance matrix have been immediate, since the problems collapse to solving sparse linear systems. However, utilizing the gradient of the log posterior of the parameters needs some extra sparse matrix tools.
The problematic part is the trace in the derivative of the log marginal likelihood, for example,

∂/∂θ log p(y | X, θ, σ²) = (1/2) yᵀ(K_{f,f} + σ²I)^{-1} [∂(K_{f,f} + σ²I)/∂θ] (K_{f,f} + σ²I)^{-1} y
  − (1/2) tr[ (K_{f,f} + σ²I)^{-1} ∂(K_{f,f} + σ²I)/∂θ ].  (4.2)
The trace would require the full inverse of the covariance matrix if evaluated naively. Luckily, Takahashi et al. (1973) introduced an algorithm with which we can evaluate a sparsified version of the inverse of a sparse matrix. This can be utilized in the gradient evaluations as described by Vanhatalo and Vehtari (2008). The same problem was considered by Storkey (1999), who used covariance matrices of Toeplitz form, which are fast to handle due to their banded structure. However, constructing Toeplitz covariance matrices is not possible in two or higher dimensions without approximations. The EP algorithm also requires special considerations with CS covariance functions. The posterior covariance in EP (2.16) does not remain sparse, and thereby it has to be expressed implicitly during the updates. This issue is discussed in (Vanhatalo et al., 2010; Vanhatalo and Vehtari, 2010).
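A small dense illustration (hypothetical variables K, dK, s2 and n, not GPstuff code) of why a sparsified inverse is enough for the trace term in (4.2): because the derivative matrix dK of a CS covariance shares the sparsity pattern of K, only the elements of the inverse on that pattern contribute to the trace.

% Dense computation, for small n only -- the point is the identity, not efficiency
C   = K + s2*eye(n);                 % noisy covariance matrix
iC  = inv(C);                        % full inverse (what we want to avoid for large n)
tr1 = trace(C \ dK);                 % trace term of the gradient (4.2)
tr2 = full(sum(sum(iC .* dK')));     % same value; uses iC only where dK is nonzero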
Compactly supported covariance functions in GPstuff
The demo demo_regression_ppcs contains a regression example with fairly large data: US annual precipitation summaries from the year 1995 for 5776 observation stations. The data was previously used by Vanhatalo and Vehtari (2008) and can be downloaded from http://www.image.ucar.edu/Data/. The model is the same as described in Section 3.1, but now we use the piecewise polynomial covariance function k_{pp,2} (gpcf_ppcs2). GPstuff utilizes the sparse matrix routines from SuiteSparse written by Tim Davis (http://www.cise.ufl.edu/research/sparse/SuiteSparse/), and this package should be installed before using the compactly supported covariance functions.
The user interface of GPstuff makes no difference between globally and compactly supported covariance functions, but the code is optimized so that it uses sparse matrix routines whenever the covariance matrix is sparse. Thus, we can construct the model,
Figure 4.1: The nonzero elements of K_{f,f} with the k_{pp,2} function, and the posterior predictive mean of the latent function in the US precipitation data set. Panel (a): the nonzero elements of K_{f,f}; panel (b): the posterior predictive mean surface.
find the MAP estimate for the parameters and predict to new input locations in a familiar way:
pn = prior_t('nu', 4, 's2', 0.3);
lik = lik_gaussian('sigma2', 1, 'sigma2_prior', pn);
pl2 = prior_gamma('sh', 5, 'is', 1);
pm2 = prior_sqrtt('nu', 1, 's2', 150);
gpcf2 = gpcf_ppcs2('nin', nin, 'lengthScale', [1 2], 'magnSigma2', 3, ...
    'lengthScale_prior', pl2, 'magnSigma2_prior', pm2);
gp = gp_set('lik', lik, 'cf', {gpcf2}, 'jitterSigma2', 1e-8);
gp = gp_optim(gp, x, y, 'opt', opt);
Eft = gp_pred(gp, x, y, xt);
With this data the covariance matrix is rather sparse since only about 5% of its elements
are non-zero. The following lines show how to evaluate the sparsity of the covariance
matrix and how to plot the non-zero structure of the matrix. The structure of the co-
variance matrix is plotted after the AMD permutation (Davis, 2006) in Figure 4.1.
K = gp_trcov(gp, x);
nnz(K) / prod(size(K))
p = amd(K);
spy(K(p,p), 'k')
In Section 7.2.2 we discuss a demo with a non-Gaussian likelihood and a compactly supported covariance function.
4.2 FIC and PIC sparse approximations
Snelson and Ghahramani (2006) proposed a sparse pseudo-input Gaussian process (SPGP), which Quiñonero-Candela and Rasmussen (2005) later named the fully independent training conditional (FITC). The original idea in SPGP was that the sparse approximation is used only in the training phase and predictions are conducted using the exact covariance matrix, which is where the word training in the name comes from. If the approximation is used also for the predictions, the word training drops out, leading to FIC. In this case, FIC can be seen as a non-stationary covariance function on its own (Snelson, 2007). The partially independent conditional (PIC) sparse approximation is an extension of FIC (Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2007), and they are both treated here following Quiñonero-Candela and Rasmussen
(2005). The approximations are based on introducing an additional set of latent variables u = {u_i}_{i=1}^m, called inducing variables. These correspond to a set of input locations X_u, called inducing inputs. The latent function prior is approximated as

q(f | X, X_u, θ) = ∫ q(f | X, X_u, u, θ) p(u | X_u, θ) du,  (4.3)

where q(f | X, X_u, u, θ) is an inducing conditional. The above decomposition leads to the exact prior if the true conditional f | X, X_u, u, θ ~ N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − K_{f,u} K_{u,u}^{-1} K_{u,f}) is used. However, in the FIC framework the latent variables are assumed to be conditionally independent given u, in which case the inducing conditional factorizes, q(f | X, X_u, u, θ) = Π_i q_i(f_i | X, X_u, u, θ). In PIC the latent variables are divided into blocks which are conditionally independent of each other given u, but the latent variables within a block have a multivariate normal distribution with the original covariance. The approximate conditionals of FIC and PIC can be summarized as
q(f | X, X_u, u, θ, M) = N(f | K_{f,u} K_{u,u}^{-1} u, mask[ K_{f,f} − K_{f,u} K_{u,u}^{-1} K_{u,f} | M ]),  (4.4)

where the function Λ = mask(· | M), with a matrix M of ones and zeros, returns a matrix Λ of the size of M with elements Λ_{ij} = [·]_{ij} if M_{ij} = 1 and Λ_{ij} = 0 otherwise. An approximation with M = I corresponds to FIC and an approximation where M is block diagonal corresponds to PIC. The inducing inputs are given a zero-mean Gaussian prior u | θ, X_u ~ N(0, K_{u,u}), so that the approximate prior over the latent variables is

q(f | X, X_u, θ, M) = N(f | 0, K_{f,u} K_{u,u}^{-1} K_{u,f} + Λ).  (4.5)
The matrix K_{f,u} K_{u,u}^{-1} K_{u,f} is of rank m and Λ is a rank n (block) diagonal matrix. The prior covariance above can be seen as a non-stationary covariance function of its own, where the inducing inputs X_u and the matrix M are free parameters similar to the parameters θ, and can be optimized alongside them (Snelson and Ghahramani, 2006; Lawrence, 2007).
The computational savings are obtained by using the Woodbury-Sherman-Morrison lemma (e.g. Harville, 1997) to invert the covariance matrix in (4.5) as

(K_{f,u} K_{u,u}^{-1} K_{u,f} + Λ)^{-1} = Λ^{-1} − V Vᵀ,  (4.6)

where V = Λ^{-1} K_{f,u} chol[(K_{u,u} + K_{u,f} Λ^{-1} K_{f,u})^{-1}]. There is a similar result also for the determinant. With FIC the computational time is dominated by the matrix multiplications, which need O(m²n) time. With PIC the cost depends also on the sizes of the blocks in Λ. If the blocks were of equal size b × b, the time for the inversion of Λ would be O((n/b) b³) = O(n b²). With blocks at most the size of the number of inducing inputs, that is b = m, the computational costs of PIC and FIC are similar. PIC approaches FIC in the limit of block size one and the exact GP in the limit of block size n (Snelson, 2007).
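The following is a minimal numerical sketch of the Woodbury identity (4.6), with hypothetical dense variables K_fu, K_uf = K_fu', K_uu and a diagonal Lambda; it is not how GPstuff organizes the computation, but it shows where the O(m²n) cost comes from.

iLambda = diag(1 ./ diag(Lambda));   % Lambda^{-1}, cheap for a diagonal matrix
A  = K_uu + K_uf * iLambda * K_fu;   % m-by-m matrix
V  = iLambda * K_fu / chol(A);       % n-by-m, so that V*V' = Lambda^{-1} K_fu A^{-1} K_uf Lambda^{-1}
iC = iLambda - V * V';               % equals inv(K_fu*inv(K_uu)*K_uf + Lambda); formed densely only for illustration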
FIC sparse approximation in GPstuff
The same data that was discussed in Section 3.1 is analyzed with sparse approximations in the demo demo_regression_sparse1. The sparse approximation is always a property of the GP structure, and we can construct the model similarly to the full GP models:
lik = lik_gaussian('sigma2', 0.2^2);
gpcf = gpcf_sexp('lengthScale', [1 1], 'magnSigma2', 0.2^2);
[u1,u2] = meshgrid(linspace(-1.8,1.8,6), linspace(-1.8,1.8,6));
X_u = [u1(:) u2(:)];
gp_fic = gp_set('type', 'FIC', 'lik', lik, 'cf', {gpcf}, ...
                'X_u', X_u, 'jitterSigma2', 1e-4);

Figure 4.2: The posterior predictive mean of the latent function in the demo_sparseApprox data set obtained with the FIC (a), PIC (b), variational (c) and DTC/SOR (d) sparse approximations. The red crosses show the optimized inducing inputs and the block areas for PIC are colored underneath the latent surface.
The difference is that we have to define the type of the sparse approximation, here 'FIC', and set the inducing inputs X_u in the GP structure. The posterior predictive mean and the inducing inputs are shown in Figure 4.2.
Since the inducing inputs are considered extra parameters common to all of the covariance functions (there may be more than one covariance function in additive models), they are set in the GP structure instead of the covariance function structure. If we want to optimize the inducing inputs alongside the parameters, we need to use gp_set to set the option infer_params to include inducing, as sketched below. Sometimes, for example in spatial problems, it is better to fix the inducing inputs (see Vanhatalo et al., 2010), or it may be more efficient to optimize the parameters and inducing inputs separately, so that we iterate the separate optimization steps until convergence.
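A minimal sketch of the first option; the exact value string passed to infer_params is an assumption and should be checked against gp_set's help text.

% Optimize covariance, likelihood and inducing inputs jointly
gp_fic = gp_set(gp_fic, 'infer_params', 'covariance+likelihood+inducing');
gp_fic = gp_optim(gp_fic, x, y, 'opt', opt);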
PIC sparse approximation in GPstuff
In PIC, in addition to defining the inducing inputs, we need to appoint every data point to a block. The block structure is common to all covariance functions, similarly to the inducing inputs, for which reason the block information is stored in the GP structure. With this data set we divide the two-dimensional input space into 16 equally sized square blocks and appoint the training data to these according to the input coordinates. This and the initialization of the GP structure are done as follows:
% Initialize the inducing inputs in a regular grid over the input space
[u1,u2] = meshgrid(linspace(-1.8,1.8,6), linspace(-1.8,1.8,6));
X_u = [u1(:) u2(:)];
% Initialize test points
[xt1,xt2] = meshgrid(-1.8:0.1:1.8, -1.8:0.1:1.8);
xt = [xt1(:) xt2(:)];
% Set the data points into clusters. Here we construct two cell arrays.
% trindex contains the block index vectors for the training data, that is,
% x(trindex{i},:) and y(trindex{i},:) belong to the ith block.
% tstindex contains the block index vectors for the test data, that is,
% the test inputs xt(tstindex{i},:) belong to the ith block.
b1 = [-1.7 -0.8 0.1 1 1.9];
trindex = {}; tstindex = {};
for i1 = 1:4
  for i2 = 1:4
    ind = 1:size(x,1);
    ind = ind(:, b1(i1) <= x(ind,1) & x(ind,1) < b1(i1+1));
    ind = ind(:, b1(i2) <= x(ind,2) & x(ind,2) < b1(i2+1));
    trindex{4*(i1-1)+i2} = ind;
    ind2 = 1:size(xt,1);
    ind2 = ind2(:, b1(i1) <= xt(ind2,1) & xt(ind2,1) < b1(i1+1));
    ind2 = ind2(:, b1(i2) <= xt(ind2,2) & xt(ind2,2) < b1(i2+1));
    tstindex{4*(i1-1)+i2} = ind2;
  end
end
% Create the PIC GP structure and set the inducing inputs and block indices
gpcf = gpcf_sexp('lengthScale', [1 1], 'magnSigma2', 0.2^2);
lik = lik_gaussian('sigma2', 0.2^2);
gp_pic = gp_set('type', 'PIC', 'lik', lik, 'cf', {gpcf}, ...
                'X_u', X_u, 'tr_index', trindex, 'jitterSigma2', 1e-4);
Now the cell array trindex contains the block index vectors for the training data. This means that, for example, the inputs and outputs x(trindex{i},:) and y(trindex{i},:) belong to the ith block. The optimization of the parameters and inducing inputs is done the same way as with FIC or a full GP model. In prediction, however, we have to give one extra input, tstindex, to gp_pred. This defines how the prediction inputs are appointed to the blocks, in the same manner as trindex appoints the training inputs.
Eft_pic = gp_pred(gp_pic, x, y, xt, 'tstind', tstindex);
Figure 4.2 shows the predicted surface. One should notice that the PIC prediction is discontinuous, whereas the predictions with FIC and the full GP are continuous. The discontinuities take place at the block boundaries and are a result of the discontinuous covariance function that PIC resembles. This issue is discussed in more detail by Vanhatalo et al. (2010).
4.3 Deterministic training conditional, subset of regressors and variational sparse approximation
The deterministic training conditional (DTC) is based on the works by Csató and Opper (2002) and Seeger et al. (2003) and was earlier called Projected Latent Variables (see Quiñonero-Candela and Rasmussen, 2005, for more details). The approximation can be constructed similarly to FIC and PIC by defining the inducing conditional, which in the case of DTC is
q(f | X, X_u, u, θ) = N(f | K_{f,u} K_{u,u}^{-1} u, 0).  (4.7)

This implies that the approximate prior over the latent variables is

q(f | X, X_u, θ) = N(f | 0, K_{f,u} K_{u,u}^{-1} K_{u,f}).  (4.8)
The deterministic training conditional is not, strictly speaking, a proper GP, since it uses a different covariance function for the latent variables appointed to the training inputs and for the latent variables at the prediction sites, f̃. The prior covariance for f̃ is the true covariance K_{f̃,f̃} instead of K_{f̃,u} K_{u,u}^{-1} K_{u,f̃}. This does not affect the predictive mean, since the cross covariance Cov[f, f̃] = K_{f,u} K_{u,u}^{-1} K_{u,f̃}, but it gives a larger predictive variance. An older version of DTC is the subset of regressors (SOR) sparse approximation, which utilizes K_{f̃,u} K_{u,u}^{-1} K_{u,f̃}. However, this resembles a singular Gaussian distribution and thus the predictive variance may be negative. DTC tries to fix this problem by using K_{f̃,f̃} (see Quiñonero-Candela and Rasmussen, 2005). DTC and SOR are identical in other respects than the predictive variance evaluation. In spatial statistics, SOR has been used by Banerjee et al. (2008) under the name Gaussian predictive process model.
The approximate prior of the variational approximation by Titsias (2009) is exactly the same as that of DTC. The difference between the two approximations is that in the variational setting the inducing inputs and covariance function parameters are optimized differently. The inducing inputs and parameters can be seen as variational parameters that should be chosen to maximize the variational lower bound between the true GP posterior and the sparse approximation. This leads to optimization of the modified log marginal likelihood

V(θ, X_u) = log[ N(y | 0, σ²I + Q_{f,f}) ] − (1/(2σ²)) tr( K_{f,f} − K_{f,u} K_{u,u}^{-1} K_{u,f} )  (4.9)

with a Gaussian observation model. With a non-Gaussian observation model, the variational lower bound is similar but σ²I is replaced by W^{-1} (Laplace approximation) or Σ̃ (EP).
Variational, DTC and SOR sparse approximation in GPstuff
The variational, DTC and SOR sparse approximations are constructed similarly to FIC.
Only the type of the GP changes:
gp_var = gp_set('type', 'VAR', 'lik', lik, 'cf', {gpcf}, 'X_u', X_u);
gp_dtc = gp_set('type', 'DTC', 'lik', lik, 'cf', {gpcf}, 'X_u', X_u);
gp_sor = gp_set('type', 'SOR', 'lik', lik, 'cf', {gpcf}, 'X_u', X_u);
4.3.1 Comparison of sparse GP models
Figure 4.2 shows the predictive mean of all the sparse approximations (the mean of SOR is the same as that of DTC). It should be noticed that the variational approximation is closest to the full GP solution in Figure 3.2. The FIC and PIC approximations are also fairly close to the full solution: FIC behaves rather differently in one corner of the region, whereas the latent surface predicted by PIC contains discontinuities. DTC suffers most at the borders of the region.
The sparse GP approximations are also compared in the demo demo_regression_sparse2, which demonstrates the differences between FIC, the variational approximation and DTC in another context. See also the discussions on the differences between these sparse approximations given by Quiñonero-Candela and Rasmussen (2005), Snelson (2007), Titsias (2009) and Alvarez et al. (2010).
4.3.2 Sparse GP models with non-Gaussian likelihoods
The extension of sparse GP models to non-Gaussian likelihoods is very straightforward in GPstuff. The user can define the sparse GP just as described in the previous two sections and then continue with the construction of the likelihood exactly the same way as with a full GP. The Laplace approximation, EP and the integration methods can be used with the same commands as with a full GP. This is demonstrated in demo_spatial1 and sketched briefly below.
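A minimal sketch (reusing the gpcf and X_u variables from the FIC example above, with a probit likelihood) of a sparse GP with a non-Gaussian likelihood:

% FIC sparse GP with a probit observation model and Laplace approximation
gp_fic = gp_set('type', 'FIC', 'lik', lik_probit(), 'cf', {gpcf}, ...
                'X_u', X_u, 'latent_method', 'Laplace', 'jitterSigma2', 1e-4);
gp_fic = gp_optim(gp_fic, x, y, 'opt', opt);
[Eft, Varft] = gp_pred(gp_fic, x, y, xt);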
Chapter 5
Model assessment and comparison
There are various means to assess the goodness of a model and its predictive performance, and GPstuff provides built-in functionality for many common test statistics. In this chapter, we briefly discuss model comparison and assessment in general and introduce a few basic methods that can be applied routinely with GPstuff's tools.
5.1 Marginal likelihood
The marginal likelihood is often used for model selection (see, e.g., Kass and Raftery, 1995). It corresponds to the ML II estimate, or with model priors to the MAP II estimate, in the model space, selecting the model with the highest marginal likelihood or highest marginal posterior probability. Its use in model selection is as justified as using a MAP point estimate for the parameters. It works well if the posterior is concentrated on a single model, that is, if a single model produces similar predictions as the Bayesian model average obtained by integrating over the model space.
In GPstuff, if the MAP estimate and the integration approximation (IA) give almost the same results, the marginal likelihood can be used as a quick estimate to compare models, but we recommend using cross-validation for more thorough model assessment and selection. A minimal comparison sketch is given below.
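Assuming two fitted GP structures gp1 and gp2 (hypothetical names) for the same data, their MAP solutions can be compared through the energy returned by gp_e, which is the un-normalized negative log marginal posterior of the parameters (negative log marginal likelihood plus negative log prior); a smaller value indicates a higher marginal posterior.

e1 = gp_e(gp_pak(gp1), gp1, x, y);
e2 = gp_e(gp_pak(gp2), gp2, x, y);
% The model with the smaller energy has the higher (approximate)
% marginal posterior at its MAP parameter values.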
5.2 Predictive performance estimates
In prediction problems it is natural to assess the predictive performance of the model by focusing on the model's predictive distribution (Good, 1952; Bernardo and Smith, 2000). The posterior predictive distribution of an output y_{n+1} given a new input x_{n+1} and the training data D = {(x_i, y_i); i = 1, 2, ..., n} is obtained by marginalizing over the unknown latent variable and the parameters given the model M,

p(y_{n+1} | x_{n+1}, D, M) = ∫ p(y_{n+1} | x_{n+1}, θ, φ, D, M) p(θ, φ | x_{n+1}, D, M) dθ dφ,  (5.1)

where p(y_{n+1} | x_{n+1}, θ, φ, D, M) = ∫ p(y_{n+1} | f_{n+1}, φ) p(f_{n+1} | x_{n+1}, θ, D, M) df_{n+1}. In the following, we will assume that knowing x_{n+1} does not give more information about θ or φ, that is, p(θ, φ | x_{n+1}, D, M) = p(θ, φ | D, M).
To estimate the predictive performance of the model we would like to compare the posterior predictive distribution to future observations from the same process that generated the given set of training data D. Agreement or discrepancy between the predictive distribution and the observations can be measured with a utility or loss function, u(y_{n+1}, x_{n+1}, D, M). Preferably, the utility u would be application-specific, measuring the expected benefit (or cost) of using the model. A good generic utility function is the log-score, called the log predictive likelihood, log p(y_{n+1} | x_{n+1}, D, M), when used for the predictive density. It measures how well the model estimates the whole predictive distribution (Bernardo, 1979) and is thus especially useful in model comparison.
Since future observations are usually not yet available, we need to estimate the expected utility by taking the expectation over the future data distribution,

ū = E_{(x_{n+1}, y_{n+1})}[ u(y_{n+1}, x_{n+1}, D, M) ].  (5.2)
There are several methods for estimating (5.2). GPstuff provides two commonly
used approaches: cross-validation and deviance information criterion.
5.3 Cross-validation
Cross-validation (CV) is an approach to compute a predictive performance estimate by re-using the observed data. As the distribution of (x_{n+1}, y_{n+1}) is unknown, we assume that it can be reasonably well approximated using the training data {(x_i, y_i); i = 1, 2, ..., n}. To avoid the double conditioning on the training data and to simulate the fact that the future observations are not in the training data, the ith observation (x_i, y_i) in the training data is left out, and the predictive distribution for y_i is then computed with a model that is fitted to all of the observations except (x_i, y_i). By repeating this for every point in the training data, we get a collection of leave-one-out cross-validation (LOO-CV) predictive densities

p(y_i | x_i, D^{\i}, M); i = 1, 2, ..., n,  (5.3)

where D^{\i} denotes all the elements of D except (x_i, y_i). To get the expected utility estimate, these predictive densities are compared to the actual y_i's using the utility u, and the expectation is taken over i,

ū_LOO = E_i[ u(y_i, x_i, D^{\i}, M) ].  (5.4)
The right-hand-side terms are conditioned on n − 1 data points, making the estimate almost unbiased.
5.3.1 Leave-one-out cross-validation
For a GP with a Gaussian noise model and given covariance parameters, the LOO-CV predictions can be computed using an analytical solution (Sundararajan and Keerthi, 2001), which is implemented in gp_loopred.
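A one-line usage sketch; the output ordering is assumed to follow the gp_pred-style convention (latent mean, latent variance, log predictive densities) and should be checked against the function's help text.

[Eft_loo, Varft_loo, lpyt_loo] = gp_loopred(gp, x, y);
mean(lpyt_loo)   % mean log predictive density, a LOO-CV estimate of the log-score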
5.3.2 k-fold cross-validation
The LOO-CV is computationally feasible only for the Gaussian likelihood and fixed parameters (with gp_looe). To reduce the computation time, in k-fold-CV we use only k (e.g. k = 10) distributions p(θ | D^{\s(i)}, M) and get a collection of predictive densities

p(y_i | x_i, D^{\s(i)}, M); i = 1, 2, ..., n,  (5.5)
where s(i) is a set of data points defined as follows: the data are divided into k groups so that their sizes are as nearly equal as possible, and s(i) is the set of data points in the group to which the ith data point belongs. The expected utility estimated by the k-fold-CV is then

ū_CV = E_i[ u(y_i, x_i, D^{\s(i)}, M) ].  (5.6)
Since the k-fold-CV predictive densities are based on smaller training data sets D^{\s(i)} than the full data set D, the expected utility estimate is slightly biased. This bias can be corrected using a first order bias correction (Burman, 1989):

ū_tr = E_i[ u(y_i, x_i, D, M) ]  (5.7)
ū_cvtr = E_j[ E_i[ u(y_i, x_i, D^{\s_j}, M) ] ]; j = 1, ..., k  (5.8)
ū_CCV = ū_CV + ū_tr − ū_cvtr,  (5.9)

where ū_tr is the expected utility evaluated with the training data given the training data and ū_cvtr is the average of the expected utilities evaluated with the training data given the k-fold-CV training sets.
GPstuff provides gp_kfcv, which computes the k-fold-CV and bias-corrected k-fold-CV estimates with the log-score and root mean squared error (RMSE). The function gp_kfcv also provides basic variance estimates for the predictive performance estimates. First, the mean expected utility ū_j for each of the k folds is computed. The ū_j's tend to be closer to Gaussian (due to the central limit theorem), and the variance of the expected utility estimate is then computed as (see, e.g., Dietterich, 1998)

Var[ū] ≈ Var_j[ū_j]/k.  (5.10)

Although some information is lost by first taking the sub-expectations, the estimate is a useful indicator of the related uncertainty. See Vehtari and Lampinen (2002) for more details on estimating the uncertainty in performance estimates.
5.4 DIC
The deviance information criterion (DIC) is another very popular model selection criterion (Spiegelhalter et al., 2002). DIC also estimates the predictive performance, but replaces the predictive distribution with a plug-in predictive distribution, where a plug-in estimate θ̂ is used, and uses the deviance as the loss function. For parametric models without any hierarchy it is usually written as

p_eff = E_{θ|D}[ D(y, θ) ] − D(y, E_{θ|D}[θ])  (5.11)
DIC = E_{θ|D}[ D(y, θ) ] + p_eff,  (5.12)

where p_eff is the effective number of parameters and D = −2 log(p(y | θ)) is the deviance. Since our models are hierarchical, we need to decide the parameters in focus (see Spiegelhalter et al., 2002, for a discussion on this). The parameters in focus
are those over which the expectations are taken when evaluating the effective number of parameters and DIC. In the above equations the focus is on the parameters, and in the case of a hierarchical GP model in GPstuff the latent variables would be integrated out before evaluating DIC. If we have a MAP estimate for the parameters, we may be interested in evaluating the DIC statistics with the focus on the latent variables. In this case the above formulation would be

p_D(θ) = E_{f|D,θ}[ D(y, f) ] − D(y, E_{f|D,θ}[f])  (5.13)
DIC = E_{f|D,θ}[ D(y, f) ] + p_D(θ).  (5.14)
Here the effective number of parameters is denoted differently, with p_D(θ), since now we are approximating the effective number of parameters in f conditionally on θ, which is different from p_eff. p_D(θ) is a function of the parameters and it measures to what extent the prior correlations are preserved in the posterior of the latent variables given θ. For non-informative data p_D(θ) = 0 and the posterior is the same as the prior. The greater p_D(θ) is, the more the model is fitted to the data, and large values compared to n indicate potential overfit. Also, a large p_D(θ) indicates that we cannot assume that the conditional posterior approaches normality by the central limit theorem. Thus, p_D(θ) can be used for assessing the goodness of the Laplace or EP approximation for the conditional posterior of the latent variables, as discussed by Rue et al. (2009) and Vanhatalo et al. (2010). The third option is to evaluate DIC with the focus on all the variables, [f, θ]. In this case the expectations are taken over p(f, θ | D).
5.5 Model assessment demos
The model assessment methods are demonstrated with the functions demo_modelassessment1 and demo_modelassessment2. The former compares the sparse GP approximations to the full GP with regression data, and the latter compares the logit and probit likelihoods in GP classification.
5.5.1 demo_modelassessment1
Assume that we have built our regression model with Gaussian noise and used an optimization method to find the MAP estimate for the parameters. We can then evaluate the effective number of latent variables and the DIC statistics:
p_eff_latent = gp_peff(gp, x, y);
[DIC_latent, p_eff_latent2] = gp_dic(gp, x, y, 'focus', 'latent');
where p_eff is evaluated with two different approximations. Since we have the MAP estimate for the parameters, the focus is on the latent variables. In this case we can also use gp_peff, which returns the effective number of parameters approximated as
p_D(θ) ≈ n − tr( K_{f,f}^{-1} (K_{f,f}^{-1} + σ_n^{-2} I)^{-1} )  (5.15)
(Spiegelhalter et al., 2002). When the focus is on the latent variables, the function
gp_dic evaluates the DIC statistics and the effective number of parameters as de-
scribed by the equations (5.13) and (5.14). The k-fold-CV expected utility estimate
can be evaluated as follows.
cvres = gp_kfcv(gp, x, y);
The function gp_kfcv takes the ready-made model structure gp and the training data x and y. It divides the data into k groups, conducts the inference separately for each of the training groups and evaluates the expected utilities with the test groups. Since no optional parameters are given, the inference is conducted using the MAP estimate for the parameters. The default division of the data is into 10 groups. The expected utilities and their variance estimates are stored in the structure cvres as follows:
cvres =
mlpd_cv: 0.0500
Var_lpd_cv: 0.0014
mrmse_cv: 0.2361
Var_rmse_cv: 1.4766e-04
mabs_cv: 0.1922
Var_abs_cv: 8.3551e-05
gp_kfcv also returns other statistics if more information is needed, and the function can be used to save the results automatically. However, these functionalities are not considered here; readers interested in a detailed analysis should read the help text of gp_kfcv.
Now we will turn our attention to inference methods other than the MAP estimate for the parameters. Assume we have a record structure from the gp_mc function with Markov chain samples of the parameters stored in it. In this case, we have two options for how to evaluate the DIC statistics: we can set the focus on the parameters or on all the variables (that is, parameters and latent variables). The two versions of DIC and the effective number of parameters are evaluated as follows:
rgp = gp_mc(gp, x, y, opt);
[DIC, p_eff] = gp_dic(rgp, x, y, 'focus', 'param');
[DIC2, p_eff2] = gp_dic(rgp, x, y, 'focus', 'all');
Here the first line performs the MCMC sampling with the options opt. The last two lines evaluate the DIC statistics. With Markov chain samples we cannot use the gp_peff function to evaluate p_D(θ), since that is a special function for models with fixed parameters. The k-fold-CV is conducted with MCMC methods as easily as with the MAP estimate. The only difference is that we have to define that we want to use MCMC and give the sampling options to gp_kfcv. These steps are done as follows:
opt.nsamples = 100;
opt.repeat = 4;
opt.hmc_opt = hmc2_opt;
opt.hmc_opt.steps = 4;
opt.hmc_opt.stepadj = 0.05;
opt.hmc_opt.persistence = 0;
opt.hmc_opt.decay = 0.6;
opt.hmc_opt.nsamples = 1;
hmc2('state', sum(100*clock));
cvres = gp_kfcv(gp, x, y, 'inf_method', 'MCMC', 'opt', opt);
With the integration approximation, evaluating the DIC and k-fold-CV statistics is similar to the MCMC approach. The steps required are:
opt.int_method = 'grid';
opt.step_size = 2;
opt.optimf = @fminscg;
gp_array = gp_ia(gp, x, y, opt);
models{3} = 'full_IA';
[DIC(3), p_eff(3)] = gp_dic(gp_array, x, y, 'focus', 'param');
[DIC2(3), p_eff2(3)] = gp_dic(gp_array, x, y, 'focus', 'all');
% Then the 10-fold cross-validation.
cvres = gp_kfcv(gp, x, y, 'inf_method', 'IA', 'opt', opt);
So far we have demonstrated how to use the DIC and k-fold-CV functions with a full GP. The functions can be used with sparse approximations in exactly the same way as with a full GP, and this is demonstrated in demo_modelassessment1 for FIC and PIC.
5.5.2 demo_modelassessment2
The functions gp_peff, gp_dic and gp_kfcv work similarly for non-Gaussian likelihoods as for a Gaussian one. The only difference is that the integration over the latent variables is done approximately. The way the latent variables are treated is defined in the field latent_method of the GP structure, and this is initialized when constructing the model, as discussed in Section 3.2. If we have conducted the analysis with the Laplace approximation and a MAP estimate for the parameters, we can evaluate the DIC and k-fold-CV statistics as follows:
p_eff_latent = gp_peff(gp, x, y);
[DIC_latent, p_eff_latent2] = gp_dic(gp, x, y, 'focus', 'latent');
% Evaluate the 10-fold cross-validation results.
cvres = gp_kfcv(gp, x, y);
These are exactly the same lines as presented earlier with the Gaussian likelihood. The difference is that the GP structure gp and the data x and y are different, and the integrations over the latent variables in gp_dic and gp_kfcv are done with respect to the approximate conditional posterior q(f | θ̂, D) (which was here assumed to be the Laplace approximation). The effective number of parameters returned by gp_peff is evaluated as in equation (5.15), with the modification that σ_n^{-2}I is replaced by W in the case of the Laplace approximation and by Σ̃^{-1} in the case of EP.
If expectation propagation is used for inference, the model assessment is conducted similarly to the Laplace approximation. The MCMC and IA solutions are also evaluated identically to the Gaussian case. For this reason the code is not repeated here.
Chapter 6
Covariance functions
In the previous chapters we have not paid much attention to the choice of the covariance function. However, GPstuff has a rather versatile collection of covariance functions, which can be combined in numerous ways. The different functions are collected in Appendix B. This chapter demonstrates some of the functions and ways to combine them.
6.1 Neural network covariance function
A good example of a covariance function with very different properties from the standard stationary covariance functions, such as the squared exponential or Matérn covariance functions, is the neural network covariance function. In this section we demonstrate its use in two simple regression problems. The squared exponential covariance function is taken as a reference and the code is found in demo_neuralnetCov. A minimal construction sketch is given below.
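A minimal sketch of constructing a GP with the neural network covariance function; the parameter values are hypothetical and the default priors are assumed.

gpcf_nn = gpcf_neuralnetwork('weightSigma2', 1, 'biasSigma2', 1);
gp_nn = gp_set('lik', lik_gaussian('sigma2', 0.1), 'cf', {gpcf_nn});
gp_nn = gp_optim(gp_nn, x, y, 'opt', optimset('TolFun', 1e-3, 'TolX', 1e-3));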
6.2 Additive models
In many practical situations, a GP prior with only one covariance function may be too restrictive, since such a construction can effectively model only one phenomenon. For example, the latent function may vary rather smoothly across the whole area of interest, but at the same time have fast local variations. In this case, a more reasonable model would be

f(x) = g(x) + h(x),  (6.1)
where the latent value function is a sum of two functions, one slowly and one fast varying. By placing a separate GP prior on both of the functions g and h we obtain the additive prior

f(x) | θ ~ GP(0, k_g(x, x′) + k_h(x, x′)).  (6.2)
The marginal likelihood and the posterior distribution of the latent variables are as before, with K_{f,f} = K_{g,g} + K_{h,h}. However, if we are interested in only, say, the phenomenon g, we can consider the h part of the latent function as correlated noise and evaluate the predictive distribution for g, which with the Gaussian observation model would be

g(x̃) | D, θ ~ GP( k_g(x̃, X)(K_{f,f} + σ²I)^{-1} y,  k_g(x̃, x̃′) − k_g(x̃, X)(K_{f,f} + σ²I)^{-1} k_g(X, x̃′) ).  (6.3)
Figure 6.1: GP latent mean predictions (using a MAP estimate) with the neural network or squared exponential covariance function. The 2D data is generated from a step function. Panel (a): neural network solution; panel (b): squared exponential solution.

Figure 6.2: GP solutions (MAP estimate) with the neural network or squared exponential covariance function, showing the GP mean, the GP 95% credible interval, the observations and the true latent function. Panel (a): neural network solution; panel (b): squared exponential solution.
With a non-Gaussian likelihood, the Laplace and EP approximations for this are similar, since only σ²I and (K_{f,f} + σ²I)^{-1} y change in the approximations.
The multiple length-scale model can also be formed using specific covariance functions. For example, the rational quadratic covariance function (gpcf_rq) can be seen as a scale mixture of squared exponential covariance functions (Rasmussen and Williams, 2006) and could be useful for data that contain both local and global phenomena. However, using sparse approximations with the rational quadratic would prevent it from modeling local phenomena. The additive model (6.2) suits the sparse GP formalism better, since it enables combining FIC with CS covariance functions.
As discussed in Section 4.2, FIC can be interpreted as a realization of a special kind of covariance function. By adding FIC with a CS covariance function, for example (4.1), one can construct a sparse additive GP prior, which implies the latent variable prior

f | X, X_u, θ ~ N(0, K_{f,u} K_{u,u}^{-1} K_{u,f} + Λ̂).  (6.4)
This prior will be referred to as CS+FIC. Here the matrix Λ̂ = Λ + k_{pp,q}(X, X) is sparse with the same sparsity structure as k_{pp,q}(X, X), and it is fast to use in computations and cheap to store. CS+FIC can be extended to have more than one component. However, it should be remembered that FIC works well only for long length-scale phenomena, and the computational benefits of CS functions are lost if their length-scale gets too large (Vanhatalo et al., 2010). For this reason the CS+FIC model should be constructed so that possible long length-scale phenomena are handled with the FIC part and the short length-scale phenomena with the CS part. The implementation of the CS+FIC model follows closely the implementation of FIC and PIC (for details, see Vanhatalo and Vehtari, 2008, 2010).
In the following sections we demonstrate the additive models with two problems. First we consider a full GP with a covariance function that is a sum of a periodic and a squared exponential covariance function. This GP prior is demonstrated for a Gaussian and a non-Gaussian likelihood. The second demo concentrates on sparse GPs in additive models: the FIC, PIC and CS+FIC sparse models are demonstrated with a data set that contains both long and short length-scale phenomena. A minimal CS+FIC construction sketch is given below.
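A minimal sketch of constructing a CS+FIC model; the covariance function parameter values are hypothetical placeholders and X_u denotes inducing inputs as in Section 4.2.

% Globally supported part (handled through the inducing inputs) plus a CS part
gpcf_glob = gpcf_sexp('lengthScale', 5, 'magnSigma2', 1);
gpcf_cs   = gpcf_ppcs2('nin', size(x,2), 'lengthScale', 1, 'magnSigma2', 1);
gp_csfic  = gp_set('type', 'CS+FIC', 'lik', lik_gaussian('sigma2', 0.1), ...
                   'cf', {gpcf_glob gpcf_cs}, 'X_u', X_u);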
6.2.1 Additive models demo: demo_periodic
In this section we discuss the demonstration program demo_periodic. It demonstrates the use of the periodic covariance function gpcf_periodic with two data sets: the Mauna Loa CO2 data (see, for example, Rasmussen and Williams, 2006) and the monthly Finnish drowning statistics 2002-2008. The first data set is a regression problem with Gaussian noise, whereas the second consists of count data that is modeled with a Poisson observation model. Here we describe only the regression problem; the other data set can be examined by running the demo.
We will analyze the Mauna Loa CO2 data with two additive models. The first one utilizes a covariance function that is a sum of a squared exponential and a piecewise polynomial, k_se(x, x′) + k_{pp,2}(x, x′). The solution of this model, showing the long and short length-scale phenomena separately, is visualized in Figure 6.3 together with the original data. This model interpolates the underlying function well, but as will be demonstrated later, its predictive properties into the future are not so good. Better predictive performance is obtained by adding up two squared exponential and one periodic covariance function, k_se(x, x′ | θ_1) + k_se(x, x′ | θ_2) + k_periodic(x, x′), which is built as follows.
Figure 6.3: The Mauna Loa CO2 data and the prediction for the long term and short term components (demo_regression2). Panel (a): the Mauna Loa CO2 data, CO2 concentration (ppmv) against year. Panel (b): the long term (right scale) and short term (left scale) components.
gpcf1 = gpcf_sexp('lengthScale', 67*12, 'magnSigma2', 66*66);
gpcfp = gpcf_periodic('lengthScale', 1.3, 'magnSigma2', 2.4*2.4);
gpcfp = gpcf_periodic(gpcfp, 'period', 12, 'optimPeriod', 1, ...
                      'lengthScale_sexp', 90*12, 'decay', 1);
lik = lik_gaussian('sigma2', 0.3);
gpcf2 = gpcf_sexp('lengthScale', 2, 'magnSigma2', 2);
pl = prior_t('s2', 10, 'nu', 3);
pn = prior_t('s2', 10, 'nu', 4);
gpcf1 = gpcf_sexp(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pl);
gpcf2 = gpcf_sexp(gpcf2, 'lengthScale_prior', pl, 'magnSigma2_prior', pl);
gpcfp = gpcf_periodic(gpcfp, 'lengthScale_prior', pl, 'magnSigma2_prior', pl);
gpcfp = gpcf_periodic(gpcfp, 'lengthScale_sexp_prior', pl, 'period_prior', pn);
lik = lik_gaussian(lik, 'sigma2_prior', pn);
gp = gp_set('lik', lik, 'cf', {gpcf1, gpcfp, gpcf2})
An additive model is constructed similarly to a model with just one covariance function. The only difference is that now we give more than one covariance function structure to gp_set. The inference with an additive model is conducted exactly the same way as in demo_regression1. The lines below summarize the hyperparameter optimization and conduct the prediction for the whole process f̃ and for the two components g and h, whose covariance functions are k_se(x, x′ | θ_1) and k_se(x, x′ | θ_2) + k_periodic(x, x′), respectively. The prediction for latent functions that are related to only a subset of the covariance functions used for training is done by setting the optional argument predcf. This argument tells which covariance functions are used for the prediction. If we want to use more than one covariance function for prediction, and there are more than two covariance functions in the model, the value of predcf is a vector.
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3, 'Display', 'iter');
gp = gp_optim(gp, x, y, 'opt', opt);
xt = [1:800]';
[Ef_full, Varf_full, Ey_full, Vary_full] = gp_pred(gp, x, y, xt);
[Ef_full1, Varf_full1] = gp_pred(gp, x, y, x, 'predcf', 1);
[Ef_full2, Varf_full2] = gp_pred(gp, x, y, x, 'predcf', [2 3]);
The two components Ef_full1 and Ef_full2 above are basically identical to
the component shown in Figure 6.3, which shows that there is no practical difference in
the interpolation performance between the two models considered in this demo. How-
ever, the additive model with periodic component has much more predictive power
6.2. ADDITIVE MODELS 45
Figure 6.4: The Mauna Loa CO2 data: predictions with two different models. Panel (a): two additive components, a model with covariance function k_se(x, x′) + k_{pp,2}(x, x′). Panel (b): three additive components with a periodic covariance function, k_se(x, x′ | θ_1) + k_se(x, x′ | θ_2) + k_periodic(x, x′). It can be seen that the latter has more predictive power.
into the future. This is illustrated in Figure 6.4, where one can see that the prediction with the non-periodic model starts to decrease towards the prior mean very quickly and does not extrapolate the periodicity, whereas the periodic model extrapolates the almost linear increase and the periodic behaviour. The MCMC or grid integration approach for the additive model is identical to regression with only one covariance function and is not repeated here.
6.2.2 Additive models with sparse approximations
The Mauna Loa CO2 data set is studied with sparse additive Gaussian processes in the demo demo_regression2. There the covariance is k_se(x, x′) + k_{pp,2}(x, x′), since the periodic covariance function does not work well with sparse approximations. The model construction and inference are conducted similarly as in the previous section, so we will not repeat them here. However, it is worth mentioning a few things that should be noticed when running the demo.
PIC works rather well for this data set, whereas FIC fails to recover the fast varying phenomenon. The reason for this is that the inducing inputs are too sparsely located, so that FIC cannot reveal the short length-scale phenomenon. In general, FIC is able to model only phenomena whose length-scale is long enough compared to the average distance between adjacent inducing inputs (see Vanhatalo et al., 2010, for details). PIC, on the other hand, is able to model also fast varying phenomena inside the blocks. Its drawback, however, is that the correlation structure is discontinuous, which may result in discontinuous predictions. The CS+FIC model corrects these deficiencies.
In FIC and PIC the inducing inputs are parameters of every covariance function, which means that all the correlations are circulated through the inducing inputs and the shortest length-scale the GP is able to model is defined by the locations of the inducing inputs. The CS+FIC sparse GP is built differently. There the CS covariance function does not utilize inducing inputs but evaluates the correlations exactly. This enables the GP model to capture both the long and the short length-scale phenomena. The GPstuff package is coded so that if the GP structure is defined to be CS+FIC, all the CS functions are treated outside the FIC approximation. Thus, the CS+FIC model requires that there is at least one CS covariance function and one globally supported function (such as the squared exponential). If there are more than two covariance functions in the
GP structure, all the globally supported functions utilize inducing inputs and all the CS functions are evaluated exactly and added to the sparse part of the approximation.
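As a concrete illustration, a minimal sketch of such a CS+FIC construction is shown below. The variable names (Xu for the inducing inputs, pl and pm for prior structures) and the parameter values are assumptions made for illustration, not lines taken from demo_regression2.

% A minimal CS+FIC sketch: one globally supported and one CS covariance
% function; Xu, pl and pm are assumed to exist in the workspace.
gpcf_global = gpcf_sexp('lengthScale', 67, 'magnSigma2', 1, ...
                        'lengthScale_prior', pl, 'magnSigma2_prior', pm);
gpcf_cs = gpcf_ppcs2('nin', 1, 'lengthScale', 2, 'magnSigma2', 1);
gp_csfic = gp_set('type', 'CS+FIC', 'lik', lik_gaussian(), ...
                  'cf', {gpcf_global gpcf_cs}, 'X_u', Xu);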
6.3 Additive covariance functions with selected variables
In the demo demo_regression_additive, we demonstrate how covariance functions can be modified so that they are functions of only a subset of the inputs. We will consider modelling an artificial 2D regression data set with additive covariance functions where the individual covariance functions use only either the first or the second input variable. That is, the covariance is k_1(x_1, x'_1 | θ_1) + k_2(x_2, x'_2 | θ_2), where the covariance functions are of the type k_1(x_1, x'_1 | θ_1): R → R instead of k(x, x' | θ): R^2 → R, which has been the usual case in the previous demos. Remember the notation x = [x_1, ..., x_d]^T and x' = [x'_1, ..., x'_d]^T. Solutions from covariance functions that use both input variables are also shown for comparison. In the regression we assume Gaussian noise. The hyperparameter values are set to their MAP estimates. The models considered in this demo utilize the following six covariance functions:
1. constant and linear
2. constant, squared exponential for the first input and linear for the second input
3. squared exponential for the first input and squared exponential for the second input
4. squared exponential
5. neural network for the first input and neural network for the second input
6. neural network.

We will demonstrate how to construct the first, second and fifth model.
A linear covariance function with a constant term can be constructed in GPstuff as

% constant covariance function
gpcf_c = gpcf_constant('constSigma2', 1);
gpcf_c = gpcf_constant(gpcf_c, 'constSigma2_prior', pt);
% linear covariance function
gpcf_l = gpcf_linear('coeffSigma2_prior', pt);
A Gaussian process using this covariance function is constructed as previously

gp = gp_set('lik', lik, 'cf', {gpcf_c gpcf_l}, 'jitterSigma2', jitter);

In this model, the covariance function is c + k_linear(x, x' | θ), x ∈ R^2, which means that the components of x are coupled in the covariance function k_linear. The constant term (gpcf_constant) is denoted by c.
The second model is more flexible. It contains a squared exponential, which is a function of the first input dimension x_1, and a linear covariance function, which is a function of the second input dimension x_2. The additive covariance function is k_se(x_1, x'_1 | θ_1) + k_linear(x_2, x'_2 | θ_2), which is a mapping from R^2 to R. With the squared exponential covariance function, the inputs to be used can be selected using a metric structure as follows:
% Covariance function for the first input variable
gpcf_s1 = gpcf_sexp('magnSigma2', 0.15, 'magnSigma2_prior', pt);
% create metric structure:
metric1 = metric_euclidean({[1]}, 'lengthScales', [0.5], 'lengthScales_prior', pt);
% set the metric to the covariance function structure:
gpcf_s1 = gpcf_sexp(gpcf_s1, 'metric', metric1);
Here we construct the covariance function structure just as before, and we also set a prior structure pt for the magnitude. The squared exponential covariance function is modified to depend only on a subset of the inputs by using a metric structure. In this example, we use metric_euclidean, which allows the user to group the inputs so that all the inputs in one group share the same length-scale. The metric structure has function handles which then evaluate, for example, the distance with this modified Euclidean metric. For example, for inputs x, x' ∈ R^4 the modified distance could be
r = \sqrt{ (x_1 - x'_1)^2 / l_1^2 + \left( (x_2 - x'_2)^2 + (x_3 - x'_3)^2 \right) / l_2^2 + (x_4 - x'_4)^2 / l_3^2 },   (6.5)
where the second and third input dimensions are given the same length-scale. This is different from the previously used distances, r = \sqrt{ \sum_{i=1}^d (x_i - x'_i)^2 / l_i^2 } and r = \sqrt{ \sum_{i=1}^d (x_i - x'_i)^2 / l^2 }, that can be defined by the covariance function itself. The metric structures can be used with any stationary covariance function, that is, with functions of the type k(x, x') = k(r).
The reason why this property is implemented with a separate structure is that this way the user does not need to modify the covariance function when redefining the distance; only a new metric file needs to be created. It should be noticed, though, that not all metrics lead to positive definite covariances with all covariance functions. For example, the squared exponential σ^2 exp(-r^2) is not positive definite with the metric induced by the L_1 norm, r = \sum_{i=1}^d |x_i - x'_i| / l_i, whereas the exponential σ^2 exp(-r) is.
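For concreteness, a metric corresponding to the grouping in equation (6.5), for four-dimensional inputs, could be set up roughly as follows; the initial length-scale values are illustrative assumptions, not lines from the demo.

% Sketch: group input dimensions 2 and 3 under a shared length-scale,
% as in equation (6.5); the values given are illustrative.
metric_grouped = metric_euclidean({[1] [2 3] [4]}, ...
                                  'lengthScales', [1 1 1], ...
                                  'lengthScales_prior', prior_t());
gpcf_grouped = gpcf_sexp('magnSigma2', 1);
gpcf_grouped = gpcf_sexp(gpcf_grouped, 'metric', metric_grouped);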
A metric structure cannot be used with the linear or neural network covariance functions, since they are not stationary, but a smaller set of inputs can be chosen by using the field selectedVariables. selectedVariables can also be used for stationary covariance functions as a shorthand that sets up a metric structure over the selected variables, as sketched below. With a metric structure it is also possible to build more elaborate input models, as discussed in the next section.
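The shorthand mentioned above could be used, for instance, as follows; this is an illustrative sketch rather than a line taken from the demo.

% Sketch: restrict a squared exponential to the first input variable with
% selectedVariables instead of an explicit metric structure.
gpcf_s1b = gpcf_sexp('selectedVariables', 1, 'magnSigma2', 0.15);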
In this demo, we select only the second input variable as
gpcf_l2 = gpcf_linear('selectedVariables', [2]);
gpcf_l2 = gpcf_linear(gpcf_l2, 'coeffSigma2_prior', pt);
The result with this model is shown in Figure 6.5(b).
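The two single-input covariance functions above can then be combined into the second model in the usual way. A minimal sketch (assuming a Gaussian likelihood and the training data x, y from the demo) is:

% Sketch: the additive model k_se(x1,x1') + k_linear(x2,x2'),
% MAP estimation and prediction to the training inputs.
gp2 = gp_set('lik', lik_gaussian(), 'cf', {gpcf_s1 gpcf_l2}, 'jitterSigma2', 1e-9);
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp2 = gp_optim(gp2, x, y, 'opt', opt);
[Ef2, Varf2] = gp_pred(gp2, x, y, x);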
The neural network covariance function is another non-stationary covariance function with which the metric structure cannot be used. However, a smaller set of input variables can be chosen similarly as with the linear covariance function, using the field selectedVariables. In this demo, we consider additive neural network covariance functions, each having one input variable, k_nn(x_1, x'_1 | θ_1) + k_nn(x_2, x'_2 | θ_2). In GPstuff this can be done as
gpcf_nn1 = gpcf_neuralnetwork('weightSigma2', 1, 'biasSigma2', 1, 'selectedVariables', [1]);
gpcf_nn1 = gpcf_neuralnetwork(gpcf_nn1, 'weightSigma2_prior', pt, 'biasSigma2_prior', pt);
gpcf_nn2 = gpcf_neuralnetwork('weightSigma2', 1, 'biasSigma2', 1, 'selectedVariables', [2]);
gpcf_nn2 = gpcf_neuralnetwork(gpcf_nn2, 'weightSigma2_prior', pt, 'biasSigma2_prior', pt);
gp = gp_set('lik', lik, 'cf', {gpcf_nn1, gpcf_nn2}, 'jitterSigma2', jitter);
The results from this and the other five models are shown in Figure 6.5.
6.4 Product of covariance functions
A product of two or more covariance functions is a valid covariance function as well. Such constructions may be useful in situations where the phenomenon is known to be separable. Combining covariance functions into the product form k_1(x, x') k_2(x, x') ... is straightforward in GPstuff. There is a special covariance function, gpcf_prod, for this purpose. For example, multiplying the exponential and Matérn covariance functions is done as follows:
[Figure 6.5: GP latent mean predictions (using a MAP estimate) with different additive and non-additive covariance functions: (a) c + k_linear(x, x' | θ); (b) k_se(x_1, x'_1 | θ_1) + k_linear(x_2, x'_2 | θ_2); (c) k_se(x_1, x'_1 | θ_1) + k_se(x_2, x'_2 | θ_2); (d) k_se(x, x' | θ_1); (e) k_nn(x_1, x'_1 | θ_1) + k_nn(x_2, x'_2 | θ_2); (f) k_nn(x, x' | θ_1). The 2D toy data is generated from an additive process.]
gpcf1 = gpcf_exp();
gpcf2 = gpcf_matern32();
gpcf = gpcf_prod('cf', {gpcf1, gpcf2});
Above we first initialized the two functions to be multiplied, and in the third line we constructed a covariance function structure which handles the actual multiplication. The product covariance can also be combined with metric structures. For example, if we want to model a temporal component with one covariance function and the spatial components with another, we can construct the covariance function k(x, x') = k_1(x_1, x'_1) k_2([x_2, x_3]^T, [x'_2, x'_3]^T) as follows:
metric1 = metric_euclidean({[1]});
metric2 = metric_euclidean({[2 3]});
gpcf1 = gpcf_exp('metric', metric1);
gpcf2 = gpcf_matern32('metric', metric2);
gpcf = gpcf_prod('cf', {gpcf1, gpcf2});
The above construction represents the covariance function

k(x, x') = k_exp(x_1, x'_1) k_matern32([x_2, x_3]^T, [x'_2, x'_3]^T).   (6.6)

The product covariance gpcf_prod can also be used to combine the categorical covariance gpcf_cat with other covariance functions to build hierarchical linear and non-linear models, as illustrated in demo_regression_hier.
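As a rough sketch of that idea (not taken from demo_regression_hier, and assuming that gpcf_cat accepts the selectedVariables field like the other covariance functions), a group indicator stored in the third input column could be combined with a squared exponential over the first two inputs as:

% Sketch: categorical covariance on a group indicator (input column 3)
% multiplied with a squared exponential over inputs 1 and 2.
cf_cat  = gpcf_cat('selectedVariables', 3);
cf_se   = gpcf_sexp('selectedVariables', [1 2]);
cf_hier = gpcf_prod('cf', {cf_cat, cf_se});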
Chapter 7
Special observation models
In this chapter we introduce a few more models that can be inferred with GPstuff. These models utilize different observation models than those considered so far.
7.1 Robust regression with Student-t likelihood
A commonly used observation model in GP regression is the Gaussian distribution. This is convenient since the inference is analytically tractable up to the covariance function parameters. However, a known limitation of the Gaussian observation model is its non-robustness, due to which outlying observations may significantly reduce the accuracy of the inference. A formal definition of robustness is given, for example, in terms of an outlier-prone observation model. The observation model is outlier-prone of order n if p(f | y_1, ..., y_{n+1}) → p(f | y_1, ..., y_n) as y_{n+1} → ∞ (O'Hagan, 1979; West, 1984). That is, the effect of a single conflicting observation on the posterior becomes asymptotically negligible as the observation approaches infinity. This contrasts heavily with the Gaussian observation model, where each observation influences the posterior no matter how far it is from the others. A well-known robust observation model is the Student-t distribution
Student-t distribution
y [ f , ,
t

n

i=1
(( + 1)/2)
(/2)

t
_
1 +
(y
i
f
i
)
2

2
t
_
(+1)/2
, (7.1)
where ν is the degrees of freedom and σ_t the scale parameter (Gelman et al., 2004). The Student-t distribution is outlier-prone of order 1, and it can reject up to m outliers if there are at least 2m observations in all (O'Hagan, 1979).
The Student-t distribution can be utilized as such, or it can be written via the scale mixture representation

y_i | f_i, α, U_i ~ N(f_i, α U_i)   (7.2)
U_i ~ Inv-χ^2(ν, τ^2),   (7.3)

where each observation has its own noise variance α U_i that is Inv-χ^2 distributed (Neal, 1997; Gelman et al., 2004). The degrees of freedom ν corresponds to the degrees of freedom in the Student-t distribution, and ατ corresponds to σ_t.
[Figure 7.1: An example of regression with outliers: on the left the Gaussian and on the right the Student-t observation model. The true function is plotted with a black line.]
In GPstuff both of the representations are implemented. The scale mixture repre-
sentation can be inferred only with MCMC and the Student-t observation model with
Laplace approximation and MCMC.
7.1.1 Regression with Student-t distribution
Here we will discuss demo_regression_robust. The demo contains five parts: 1) an optimization approach with Gaussian noise, 2) an MCMC approach with the scale mixture noise model and all parameters sampled, 3) a Laplace approximation for the Student-t likelihood optimizing all parameters, 4) an MCMC approach with the Student-t likelihood so that ν = 4, and 5) a Laplace approximation for the Student-t observation model so that ν = 4. We will demonstrate the steps for parts 2, 3 and 5.
The scale mixture model
The scale mixture representation of the Student-t observation model is implemented in the likelihood structure lik_gaussiansmt. It is very similar to the Gaussian likelihood in that it leads to a diagonal noise covariance matrix diag(U). The scale mixture model is efficient to handle with Gibbs sampling, since we are able to sample the parameters (U_i, α, τ^2) efficiently from their full conditionals with regular built-in samplers. For the degrees of freedom ν we use slice sampling. All the sampling steps are stored in the likelihood structure, and the gp_mc sampling function knows to use them.

Below we show the lines needed to perform the MCMC for the scale mixture model. The likelihood structure is constructed with a prior for the degrees of freedom (nu_prior), which means that we are going to sample ν as well. The degrees of freedom is often poorly identifiable, for which reason fixing its value to, for example, four is reasonable; this is why it is fixed by default. With this data set its sampling is safe, though. The following line initializes the GP structure, and the lines after that set the sampling options. The parameters of the squared exponential covariance function are sampled with HMC, and the options for this sampler are set in the structure hmc_opt.
gpcf1 = gpcf_sexp('lengthScale', 1, 'magnSigma2', 0.2^2);
gpcf1 = gpcf_sexp(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
lik = lik_gaussiansmt('ndata', n, 'sigma2', repmat(1,n,1), ...
                      'nu_prior', prior_logunif());
gp = gp_set('lik', lik, 'cf', {gpcf1}, 'jitterSigma2', 1e-9);
hmc_opt.steps = 10;
hmc_opt.stepadj = 0.06;
hmc_opt.nsamples = 1;
hmc2('state', sum(100*clock));
hmc_opt.persistence = 1;
hmc_opt.decay = 0.6;
% Sample
[r,g,opt] = gp_mc(gp, x, y, 'nsamples', 300, 'hmc_opt', hmc_opt);
The Student-t observation model with Laplace approximation
The Student-t observation model is implemented in lik_t. This is used similarly to the observation models in the classification setting. The difference is that now the likelihood also has hyperparameters. These parameters can be optimized alongside the covariance function parameters with the Laplace approximation. We just need to give them a prior (which is by default log-uniform) and include the likelihood in the parameter string (infer_params) that defines the optimized parameters. All this is done with the following lines.
pl = prior_t();
pm = prior_t();
gpcf1 = gpcf_sexp('lengthScale', 1, 'magnSigma2', 0.2^2);
gpcf1 = gpcf_sexp(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
% Create the likelihood structure
pn = prior_logunif();
lik = lik_t('nu', 4, 'nu_prior', prior_logunif(), ...
            'sigma2', 10, 'sigma2_prior', pn);
% Finally create the GP data structure
gp = gp_set('lik', lik, 'cf', {gpcf1}, 'jitterSigma2', 1e-6, ...
            'latent_method', 'Laplace');
% Set the options for the scaled conjugate optimization
opt = optimset('TolFun', 1e-4, 'TolX', 1e-4, 'Display', 'iter', 'Maxiter', 20);
% Optimize with the scaled conjugate gradient method
gp = gp_optim(gp, x, y, 'opt', opt);
% Predictions to test points
[Eft, Varft] = gp_pred(gp, x, y, xt);
The Student-t observation model with MCMC
When using MCMC for the Student-t observation model we need to define sampling options for the covariance function parameters, the latent variables and the likelihood parameters. After this we can run gp_mc and predict as before. All these steps are shown below.
pl = prior_t();
pm = prior_sqrtunif();
gpcf1 = gpcf_sexp('lengthScale', 1, 'magnSigma2', 0.2^2);
gpcf1 = gpcf_sexp(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
% Create the likelihood structure
pn = prior_logunif();
lik = lik_t('nu', 4, 'nu_prior', [], 'sigma2', 10, 'sigma2_prior', pn);
% ... Finally create the GP data structure
gp = gp_set('lik', lik, 'cf', {gpcf1}, 'jitterSigma2', 1e-4, ...
            'latent_method', 'MCMC');
f = gp_pred(gp, x, y, x);
gp = gp_set(gp, 'latent_opt', struct('f', f));
% Set the parameters for MCMC...
% Covariance parameter options
opt.hmc_opt.steps = 5;
opt.hmc_opt.stepadj = 0.05;
opt.hmc_opt.nsamples = 1;
% Latent variable options
opt.latent_opt.display = 0;
opt.latent_opt.repeat = 10;
opt.latent_opt.sample_latent_scale = 0.05;
% Sample
[rgp,g,opt] = gp_mc(gp, x, y, 'nsamples', 400, opt);
7.2 Models for spatial epidemiology
Spatial epidemiology concerns both describing and understanding the spatial variation in disease risk in geographically referenced health data. One of the main classes of spatial epidemiological studies is disease mapping, where the aim is to describe the overall disease distribution on a map and, for example, highlight areas of elevated or lowered mortality or morbidity risk (e.g. Lawson, 2001; Richardson, 2003; Elliot et al., 2001). The spatially referenced health data may be at point level, referring to continuously varying co-ordinates and showing, for example, the home residences of diseased people. More commonly, however, the data are at an areal level, referring to a finite sub-region of space, such as a county or country, and giving the counts of diseased people in each area (e.g. Banerjee et al., 2004).
In this section we will consider two disease mapping models, one that utilizes a Poisson observation model and another with a negative binomial observation model. The models follow the general approach discussed, for example, by Best et al. (2005). The data are aggregated into areas A_i with co-ordinates x_i = [x_1, x_2]^T. The mortality/morbidity in an area A_i is modeled as Poisson or negative binomial with mean e_i μ_i, where e_i is the standardized expected number of cases in the area A_i, and μ_i is the relative risk, which is given a Gaussian process prior.
The standardized expected number of cases e_i can be any positive real number that defines the expected mortality/morbidity count for the ith area. Common practice is to evaluate it following the idea of the directly standardized rate (e.g. Ahmad et al., 2000), where the rate in an area is standardized according to the age distribution of the population in that area. The expected value in the area A_i is obtained by summing the products of the rate and the population over the age-groups in the area,

e_i = \sum_{r=1}^{R} \frac{Y_r}{N_r} n_{ir},
where Y_r and N_r are the total numbers of deaths and people in the whole study area in the age-group r, and n_{ir} is the number of people in the age-group r in the area A_i. In the following demos e_i and y are calculated from real data that contain deaths from either alcohol related diseases or cerebral vascular diseases in Finland. The examples here are based on the works by Vanhatalo and Vehtari (2007) and Vanhatalo et al. (2010).
7.2.1 Disease mapping with Poisson likelihood: demo_spatial1
The model constructed in this section is the following:

y ~ \prod_{i=1}^{n} Poisson(y_i | exp(f_i) e_i)   (7.4)
f(x) | θ ~ GP(0, k(x, x' | θ))   (7.5)
θ ~ half-Student-t(ν, σ_t^2).   (7.6)

The vector y collects the numbers of deaths for each area. The co-ordinates of the areas are in the input vectors x, and θ contains the covariance function parameters. The co-ordinates are defined from the lower left corner of the area in 20 km steps. The model is constructed with the following lines.
gpcf1 = gpcf_matern32('lengthScale', 5, 'magnSigma2', 0.05);
pl = prior_t('s2', 10);
pm = prior_t();
gpcf1 = gpcf_matern32(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
lik = lik_poisson();
gp = gp_set('type', 'FIC', 'lik', lik, 'cf', {gpcf1}, 'X_u', Xu, ...
            'jitterSigma2', 1e-4, 'infer_params', 'covariance');
gp = gp_set(gp, 'latent_method', 'Laplace');
The FIC sparse approximation is used since the data set is rather large and inferring a full GP would be too slow. The inducing inputs Xu are set to a regular grid (not shown here) in the two dimensional lattice, and they are considered fixed. The extra parameter z in the optimization and prediction calls below tells the Laplace algorithm that there is an input which affects only the likelihood. This input is stored in ye, and it is the vector of expected numbers of deaths e = [e_1, ..., e_n]^T. In all of the previous examples we have had inputs only for the covariance function. However, if there are inputs for the likelihood, they should be given with an optional parameter-value pair whose indicator is 'z'. The model is now constructed, and we can optimize the parameters and evaluate the posterior predictive distribution of the latent variables.
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3, 'Display', 'iter');
gp = gp_optim(gp, x, y, 'z', ye, 'opt', opt);
[Ef, Varf] = gp_pred(gp, x, y, x, 'z', ye, 'tstind', [1:n]);
Here we predicted at the same locations that were used for training. Thus Ef and Varf contain the posterior mean and variance of the latent variables. In this case, the prediction functions (such as gpla_pred, for example) require the test index set for FIC as well. This is given with the parameter-value pair 'tstind', [1:n]. These have previously been used with PIC (see Section 4.2). FIC is a limiting case of PIC where each data point forms one block. Whenever we predict at new locations that have not been in the training set, we do not have to worry about the test index set, since all the test inputs define their own block. However, whenever we predict at exactly the same locations that are in the training set, we should appoint the test inputs to the same block as the respective training inputs. This is done with FIC by giving gp_pred a vector of indices telling which of the test inputs are in the training set ([1:n] here). The posterior mean and variance of the latent variables are shown in Figure 7.2.
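The distinction can be sketched as follows; xnew is a hypothetical matrix of new coordinates introduced only for illustration.

% Sketch: with FIC, predicting at genuinely new coordinates xnew needs no
% tstind, whereas predicting back at the training coordinates does.
[Ef_new, Varf_new] = gp_pred(gp, x, y, xnew, 'z', ye);
[Ef_tr,  Varf_tr ] = gp_pred(gp, x, y, x,    'z', ye, 'tstind', 1:n);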
[Figure 7.2: The posterior predictive mean (a) and variance (b) of the latent function in the demo_spatial1 data set, obtained with FIC.]
The demo also contains an MCMC implementation of the model, but it is not discussed here. Using a Markov chain sampler for the Poisson likelihood is a very straightforward extension of its usage in the classification model. The only difference is that we have to carry along the extra input e.
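As a rough, hedged sketch of that extension (the sampler settings and the exact prediction call are illustrative assumptions rather than lines from the demo):

% Sketch: MCMC for the same spatial Poisson model; the expected counts
% are passed with 'z' just as in the optimization call above.
gp = gp_set(gp, 'latent_method', 'MCMC');
rgp = gp_mc(gp, x, y, 'z', ye, 'nsamples', 220);
rgp = thin(rgp, 21, 2);   % drop burn-in and thin (amounts are illustrative)
[Ef_mc, Varf_mc] = gpmc_pred(rgp, x, y, x, 'z', ye, 'tstind', 1:n);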
7.2.2 Disease mapping with negative Binomial likelihood
The negative binomial distribution is a robust version of the Poisson distribution, similarly as the Student-t distribution can be considered a robustified Gaussian distribution (Gelman et al., 2004). In GPstuff it is parametrized as
y | f, e, r ~ \prod_{i=1}^{n} \frac{Γ(r + y_i)}{y_i! Γ(r)} \left( \frac{r}{r + μ_i} \right)^{r} \left( \frac{μ_i}{r + μ_i} \right)^{y_i}   (7.7)
f(x) | θ ~ GP(0, k(x, x' | θ))   (7.8)
θ ~ half-Student-t(ν, σ_t^2),   (7.9)

where μ_i = e_i exp(f(x_i)) and r is the dispersion parameter governing the variance.
The model is demonstrated in demo_spatial2, where the data are simulated so that the latent function is drawn randomly from a GP with a piecewise polynomial covariance function and the observed death cases are sampled from a negative binomial distribution. This is done in order to demonstrate the use of CS covariance functions with a non-Gaussian observation model. The CS covariance functions are used just like globally supported covariance functions but are much faster. The inference in the demo is conducted with the Laplace approximation and EP. The code for the Laplace approximation looks as follows:
gpcf1 = gpcf_ppcs2('nin', 2, 'lengthScale', 5, 'magnSigma2', 0.05);
pl = prior_t();
pm = prior_sqrtt('s2', 0.3);
gpcf1 = gpcf_ppcs2(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
% Create the likelihood structure
lik = lik_negbin();
% Create the GP data structure
gp = gp_set('lik', lik, 'cf', {gpcf1}, 'jitterSigma2', 1e-4);
% Set the approximate inference method to Laplace
gp = gp_set(gp, 'latent_method', 'Laplace');
% Set the options for the scaled conjugate optimization
opt = optimset('TolFun', 1e-2, 'TolX', 1e-2, 'Display', 'iter');
% Optimize with the scaled conjugate gradient method
gp = gp_optim(gp, x, y, 'z', ye, 'opt', opt);
% Check the sparsity of the training covariance matrix
C = gp_trcov(gp, xx);
nnz(C) / prod(size(C))      % proportion of non-zero elements
p = amd(C);                 % approximate minimum degree ordering
figure
spy(C(p,p))                 % visualize the sparsity pattern
% make prediction to the data points
[Ef, Varf] = gp_pred(gp, x, y, x, 'z', ye);
7.3 Log-Gaussian Cox process
A log-Gaussian Cox process is an inhomogeneous Poisson process model used for point data, with an unknown intensity function λ(x) modeled with a log-Gaussian process so that f(x) = log λ(x) (see Rathbun and Cressie, 1994; Møller et al., 1998). If the data are points X = {x_i; i = 1, 2, ..., n} on a finite region V in R^d, then the likelihood of the unknown function f is

L(X | f) = \exp\left( -\int_V \exp(f(x)) \, dx + \sum_{i=1}^{n} f(x_i) \right).   (7.10)

Evaluation of the likelihood would require nontrivial integration over the exponential of the Gaussian process. Møller et al. (1998) propose to discretise the region V and assume a locally constant intensity in subregions. This transforms the problem into a form equivalent to having a Poisson model for each subregion. The likelihood after the discretisation is

L(X | f) ≈ \prod_{k=1}^{K} Poisson(y_k | \exp(f(\tilde{x}_k))),   (7.11)

where \tilde{x}_k is the coordinate of the kth sub-region and y_k is the number of data points in it. Tokdar and Ghosh (2007) proved the posterior consistency in the limit when the sizes of the subregions go to zero.
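The following sketch illustrates the discretisation idea of equation (7.11) for one-dimensional data; it is not taken from demo_lgcp, and the variable t (a column vector of event coordinates), the number of bins and the inference settings are assumptions made only for illustration. The lgcp function described below wraps these steps.

% Sketch: bin 1D event coordinates t into K sub-regions and fit a Poisson GP
K = 100;
edges = linspace(min(t), max(t), K+1)';
yk = histc(t, edges); yk = yk(1:K);        % counts y_k per sub-region
xk = (edges(1:K) + edges(2:K+1))/2;        % sub-region mid-points
ek = repmat(edges(2)-edges(1), K, 1);      % sub-region lengths as exposure
gp = gp_set('lik', lik_poisson(), 'cf', {gpcf_sexp()}, 'latent_method', 'EP');
gp = gp_optim(gp, xk, yk, 'z', ek, 'opt', optimset('TolFun',1e-3,'TolX',1e-3));
[Ef, Varf] = gp_pred(gp, xk, yk, xk, 'z', ek);
% exp(Ef) approximates the intensity in each sub-region (events per unit length)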
In the 1D case the data are the coal mine disaster data from the R distribution (coal.rda), containing the dates of 191 coal mine explosions that killed ten or more men in Britain between 15 March 1851 and 22 March 1962; the computation with expectation propagation and CCD integration over the parameters took about 20 s. In the 2D case the data are the redwood data from the R distribution (redwoodfull.rda), containing 195 locations of redwood trees; the computation with the Laplace approximation and a MAP (type II) estimate for the parameters took about 3 s.
In Section 7.2.1 we demonstrated how fast non-MCMC inference for this model can be made, using the Laplace method or expectation propagation to integrate over the latent variables, in an application from spatial epidemiology.
[Figure 7.3: Two intensity surfaces estimated with a log-Gaussian Cox process in demo_lgcp, where the aim is to study the underlying intensity surface of a point process: (a) the temporal intensity of the coal mine disasters and (b) the spatial intensity of the redwood data.]
The log-Gaussian Cox process with the same techniques is implemented in the function lgcp for one or two dimensional input data. The usage of the function is demonstrated in demo_lgcp. This demo analyzes two data sets. The first one is the one dimensional coal mine disaster data (from the R distribution), which contain the dates of 191 coal mine explosions that killed ten or more men in Britain between 15 March 1851 and 22 March 1962. The analysis is conducted using expectation propagation and CCD integration over the parameters, and the results are shown in Figure 7.3. The second data set is the redwood data (from the R distribution), which contains 195 locations of redwood trees in a two dimensional lattice. The smoothed intensity surface is shown in Figure 7.3.
7.4 Binomial observation model
In this demo (demo_binomial1) we show how the binomial likelihood is used in the GPstuff toolbox. The inference in this example is done with the Laplace approximation and a squared exponential covariance function.
The binomial likelihood is defined as follows:

p(y | f, z) = \prod_{i=1}^{N} \frac{z_i!}{y_i! (z_i - y_i)!} p_i^{y_i} (1 - p_i)^{(z_i - y_i)},   (7.12)

where p_i = \exp(f_i)/(1 + \exp(f_i)) is the probability of success, and the vector z denotes the numbers of trials. In this demo, a Gaussian process prior is assumed for the latent variables f.
The binomial likelihood is initialised in GPstuff as

% Create the likelihood structure
lik = lik_binomial();
% Create the GP data structure
gp = gp_set('lik', lik, 'cf', {gpcf1}, 'jitterSigma2', 1e-8);
To use the binomial model, an extra parameter (the number of trials) needs to be given to each function that requires the data y. For example, the model is initialized and optimized as
[Figure 7.4: GP solution (a MAP estimate) with a squared exponential covariance function and a binomial likelihood in a toy example, showing the GP mean, the 95% credible interval, the observations and the true latent function.]
% Set the approximate inference method
gp = gp_set(gp, 'latent_method', 'Laplace');
gp = gp_optim(gp, x, y, 'z', N, 'opt', opt);
To make predictions with the binomial likelihood model without computing the predictive density, the total number of trials Nt at the test points needs to be provided (in addition to N, the total number of trials at the training points). In GPstuff this is done as follows:
% Set the total number of trials Nt at the grid points xgrid
[Eft_la, Varft_la, Eyt_la, Varyt_la] = ...
    gp_pred(gp, x, y, xgrid, 'z', N, 'zt', Ntgrid);
To compute the predictive densities at the test points xt, the test observations yt and the total number of trials Nt at the test points must additionally be given:

% To compute predictive densities at the test points xt, the total number
% of trials Nt must be set additionally:
[Eft_la, Varft_la, Eyt_la, Varyt_la, pyt_la] = ...
    gp_pred(gp, x, y, xt, 'z', N, 'yt', yt, 'zt', Nt);
7.5 Derivative observations in GP regression
Incorporating derivative observations into GP regression is fairly straightforward, because the derivative of a Gaussian process is a Gaussian process. In short, derivative observations are taken into account by extending the covariance matrices to include derivative observations. This is done by forming joint covariance matrices of function values and derivatives. The following equations (Rasmussen and Williams, 2006) state how the covariances between function values and derivatives, and between derivatives, are calculated:
Cov\left( f_i, \frac{\partial f_j}{\partial x_{d,j}} \right) = \frac{\partial k(x_i, x_j)}{\partial x_{d,j}}, \qquad Cov\left( \frac{\partial f_i}{\partial x_{d,i}}, \frac{\partial f_j}{\partial x_{e,j}} \right) = \frac{\partial^2 k(x_i, x_j)}{\partial x_{d,i} \, \partial x_{e,j}}.
The joint covariance matrix for function values and derivatives is of the following form:

K = \begin{bmatrix} K_{ff} & K_{fD} \\ K_{Df} & K_{DD} \end{bmatrix},
K_{ff}^{ij} = k(x_i, x_j), \quad K_{Df}^{ij} = \frac{\partial k(x_i, x_j)}{\partial x_{d,i}}, \quad K_{fD} = (K_{Df})^T, \quad K_{DD}^{ij} = \frac{\partial^2 k(x_i, x_j)}{\partial x_{d,i} \, \partial x_{e,j}}.

Prediction is done as usual, but with derivative observations the joint covariance matrices are used instead of the normal ones.
Using derivative observations in GPstuff requires two steps: when initializing the GP structure, one must set the option derivobs to 'on'. The second step is to form an observation vector of the right size. With inputs of size n × m, the observation vector with derivatives should be of size n + m·n. The observation vector is constructed by adding the partial derivative observations after the function value observations:
y_{obs} = \begin{bmatrix} y(x) \\ \dfrac{\partial y(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial y(x)}{\partial x_m} \end{bmatrix}.   (7.13)
A different noise level could be assumed for the function values and the derivative observations, but at the moment the implementation allows only the same noise level for all observations.
7.5.1 GP regression with derivatives: demo_derivativeobs
In this section we will go through the demonstration demo_derivativeobs. This demo presents the main differences between GP regression with and without derivative observations and shows how to use them with GPstuff. The demo is divided into two parts: in the first part the GP regression is done without derivative observations and in the second with them. Here we present the lines of the second part, because the first part is almost identical with just a few differences.

First we create the artificial data. Notice how the observation vector is defined differently for GP models with and without derivative observations. With derivative observations the observation vector includes the partial derivative observations, which are set as a column vector after the function value observations.
% Create the data
tp = 9;                          % number of training points - 1
x = -2:5/tp:2;
y = sin(x).*cos(x).^2;           % the underlying process f
dy = cos(x).^3 - 2*sin(x).^2.*cos(x);  % derivative of the process
koh = 0.06;                      % noise standard deviation
% Add noise
y = y + koh*randn(size(y));
dy = dy + koh*randn(size(dy));   % derivative obs are also noisy
x = x';
dy = dy';
y = y';       % observation vector without derivative observations
y2 = [y; dy]; % observation vector with derivative observations
The model constructed for regression is a full GP with a Gaussian likelihood. The covariance function is the squared exponential, which is the only covariance function compatible with derivative observations at the moment. The field derivobs is set in gp_set(...) so that the inference is done with derivative observations. The field DerivativeCheck should also be added to optimset(...) when computing the predictions so that derivative observations can be used.
gpcf1 = gpcf_sexp('lengthScale', 0.5, 'magnSigma2', .5);
pl = prior_t();      % a prior structure
pm = prior_sqrtt();  % a prior structure
gpcf1 = gpcf_sexp(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
gp = gp_set('cf', gpcf1, 'derivobs', 'on');
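The remaining steps could then look roughly as follows; this is a hedged sketch (the optimizer options and test points are illustrative assumptions, and demo_derivativeobs shows the exact calls):

% Sketch: optimize using the extended observation vector y2 and predict
opt = optimset('TolFun', 1e-4, 'TolX', 1e-4);
gp = gp_optim(gp, x, y2, 'opt', opt);
xt = linspace(-3, 3, 100)';
[Eft, Varft] = gp_pred(gp, x, y2, xt);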
[Figure 7.5: GP predictions of the process f (a) without and (b) with derivative observations, showing the prediction, the 95% interval, f(x) and the observations (and, in (b), the derivative observations).]
Figure 7.5 shows the predictions, the observations, the underlying process and the 95% confidence intervals for the GP models with and without derivative observations.
Chapter 8
Mean functions
In standard GP regression a zero mean function is assumed for the prior process. This is convenient, but there are nonetheless some advantages in using a specified mean function. The basic principle in GP regression with a mean function is to apply a zero mean GP to the difference between the observations and the mean function.
8.1 Explicit basis functions
Here we follow closely the presentation of Rasmussen and Williams (2006) on the subject and briefly present the main results. A mean function can be specified as a weighted sum of some basis functions h,

m = h(x)^T β,

with weights β. The target of modeling, the underlying process g, is assumed to be a sum of the mean function and a zero mean GP,

g = h(x)^T β + GP(0, K).

By assuming a Gaussian prior for the weights, β ~ N(b, B), the weight parameters can be integrated out, and the prior for g is another GP,

g ~ GP\left( h(x)^T b, \; K + h(x)^T B h(x) \right).

The predictive equations are obtained by using the mean and covariance of this prior in the zero mean GP predictive equations (2.9):

E(g_*) = E(f_*) + R^T \bar{β},   (8.1)
Cov(g_*) = Cov(f_*) + R^T \left( B^{-1} + H K_y^{-1} H^T \right)^{-1} R,   (8.2)
\bar{β} = \left( B^{-1} + H K_y^{-1} H^T \right)^{-1} \left( B^{-1} b + H K_y^{-1} y \right),
R = H_* - H K_y^{-1} K_*,
H = \begin{bmatrix} h_1(x) \\ h_2(x) \\ \vdots \\ h_k(x) \end{bmatrix}, where x is a row vector.
If the prior assumptions about the weights are vague, then B^{-1} → O (where O is a zero matrix) and the predictive equations (8.1) and (8.2) do not depend on b or B:

E(g_*) = E(f_*) + R^T β_v,   (8.3)
Cov(g_*) = Cov(f_*) + R^T \left( H K_y^{-1} H^T \right)^{-1} R,   (8.4)
β_v = \left( H K_y^{-1} H^T \right)^{-1} H K_y^{-1} y.
Corresponding to the exact and vague priors for the basis function weights, there are two versions of the marginal likelihood. With the exact prior the marginal likelihood is

\log p(y | X, b, B) = -\frac{1}{2} M^T N^{-1} M - \frac{1}{2} \log |K_y| - \frac{1}{2} \log |B| - \frac{1}{2} \log |A| - \frac{n}{2} \log 2π,
M = H^T b - y,
N = K_y + H^T B H,
A = B^{-1} + H K_y^{-1} H^T,
where n is the number of observations. Its derivative with respect to the hyperparameters is

\frac{\partial}{\partial θ_i} \log p(y | X, b, B) = \frac{1}{2} M^T N^{-1} \frac{\partial K_y}{\partial θ_i} N^{-1} M - \frac{1}{2} tr\left( K_y^{-1} \frac{\partial K_y}{\partial θ_i} \right) - \frac{1}{2} tr\left( A^{-1} \frac{\partial A}{\partial θ_i} \right),
\frac{\partial A}{\partial θ_i} = -H K_y^{-1} \frac{\partial K_y}{\partial θ_i} K_y^{-1} H^T.
With a vague prior the marginal likelihood is

\log p(y | X) = -\frac{1}{2} y^T K_y^{-1} y + \frac{1}{2} y^T C y - \frac{1}{2} \log |K_y| - \frac{1}{2} \log |A_v| - \frac{n - m}{2} \log 2π,
A_v = H K_y^{-1} H^T,
C = K_y^{-1} H^T A_v^{-1} H K_y^{-1},

where m is the rank of H^T. Its derivative is

\frac{\partial}{\partial θ_i} \log p(y | X) = \frac{1}{2} y^T P y + \frac{1}{2} \left( -y^T P G - G^T P y + G^T P G \right) - \frac{1}{2} tr\left( K_y^{-1} \frac{\partial K_y}{\partial θ_i} \right) - \frac{1}{2} tr\left( A_v^{-1} \frac{\partial A_v}{\partial θ_i} \right),   (8.5)
P = K_y^{-1} \frac{\partial K_y}{\partial θ_i} K_y^{-1},
G = H^T A_v^{-1} H K_y^{-1} y,

where we have used the fact that the matrices K_y^{-1}, \partial K_y / \partial θ_i and A_v are symmetric. The above expression (8.5) could be simplified a little further because y^T P G + G^T P y = 2 y^T P G.
8.1.1 Mean functions in GPstuff: demo_regression_meanf
Here we demonstrate, with demo_regression_meanf, how to use mean functions in GPstuff. GP regression is done for artificial one dimensional data with a full GP model and a Gaussian likelihood. In this presentation only the lines of code relevant to the mean function context are shown; otherwise the demonstration follows closely demo_regression1.

After creating the data and initializing the covariance and likelihood functions, we are ready to initialize the GP structure. The basis functions for the mean are given in a cell array of mean function structures (note the similarity to the use of covariance function structures).
% Create covariance and likelihood structures
gpcf = gpcf_sexp('lengthScale', [0.5], 'magnSigma2', .5);
lik = lik_gaussian('sigma2', 0.4^2);
% Initialize base functions for GP's mean function.
gpmf1 = gpmf_constant('prior_mean', .3, 'prior_cov', 1);
gpmf2 = gpmf_linear('prior_mean', .3, 'prior_cov', 1);
gpmf3 = gpmf_squared('prior_mean', .3, 'prior_cov', 1);
% Initialize gp structure
gp = gp_set('lik', lik, 'cf', {gpcf}, 'meanf', {gpmf1, gpmf2, gpmf3});
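The rest of the workflow is unchanged from ordinary GP regression; a minimal sketch (with illustrative optimizer tolerances, and x, y, xt assumed from the demo) is:

% Sketch: MAP estimate and prediction for the model with mean functions
opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
gp = gp_optim(gp, x, y, 'opt', opt);
[Eft, Varft] = gp_pred(gp, x, y, xt);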
Figure 8.1 presents the underlying process, the GP prediction, the mean function used, the 95% confidence interval and the observations.
[Figure 8.1: GP prediction for the process f with a mean function, showing the prediction, the 95% interval, f(x), the mean function and the observations.]
Appendix A
Function list
A.1 GP
THE GP TOOLS (in the GP-folder):
Gaussian process utilities:
GP_SET Create and modify a Gaussian Process structure.
GP_PAK Combine GP parameters into one vector.
GP_UNPAK Set GP parameters from vector to structure
GP_COV Evaluate covariance matrix between two input vectors.
GP_TRCOV Evaluate training covariance matrix (gp_cov + noise covariance).
GP_TRVAR Evaluate training variance vector.
GP_RND Random draws from the posterior Gaussian process
Covariance functions:
GPCF_CAT Create a categorical covariance function
GPCF_CONSTANT Create a constant covariance function
GPCF_EXP Create an exponential covariance function
GPCF_LINEAR Create a linear covariance function
GPCF_MATERN32 Create a Matern nu=3/2 covariance function
GPCF_MATERN52 Create a Matern nu=5/2 covariance function
GPCF_NEURALNETWORK Create a neural network covariance function
GPCF_PERIODIC Create a periodic covariance function
GPCF_PPCS0 Create a piece wise polynomial (q=0) covariance function
GPCF_PPCS1 Create a piece wise polynomial (q=1) covariance function
GPCF_PPCS2 Create a piece wise polynomial (q=2) covariance function
GPCF_PPCS3 Create a piece wise polynomial (q=3) covariance function
GPCF_PROD Create a product form covariance function
GPCF_RQ Create a rational quadratic covariance function
GPCF_SEXP Create a squared exponential covariance function
Likelihood functions:
LIK_GAUSSIAN Create a Gaussian likelihood structure
LIK_GAUSSIANSMT Create a Gaussian scale mixture approximating t
LIK_BINOMIAL Create a binomial likelihood structure
LIK_LOGIT Create a Logit likelihood structure
LIK_NEGBIN Create a Negbin likelihood structure
LIK_POISSON Create a Poisson likelihood structure
LIK_PROBIT Create a Probit likelihood structure
LIK_T Create a Student-t likelihood structure
Inference utilities:
GP_E Evaluate energy function (un-normalized negative marginal
log posterior)
GP_G Evaluate gradient of energy (GP_E) for Gaussian Process
GP_EG Evaluate both GP_E and GP_G. Useful in optimisation.
GP_PRED Make predictions with Gaussian process
GPEP_E Conduct Expectation propagation and return negative marginal
log posterior estimate
GPEP_G Evaluate gradient of EP's negative marginal log posterior
estimate
GPEP_PRED Predictions with Gaussian Process EP approximation
GPLA_E Construct Laplace approximation and return negative marginal
log posterior estimate
GPLA_G Evaluate gradient of Laplace approximation's negative marginal
log posterior estimate
GPLA_PRED Predictions with Gaussian Process Laplace approximation
GP_MC Markov chain sampling for Gaussian process models
GPMC_PRED Predictions with Gaussian Process MCMC approximation.
GPMC_PREDS Conditional predictions with Gaussian Process MCMC
approximation.
GP_IA Integration approximation with grid, Monte Carlo or
CCD integration
GPIA_PRED Prediction with Gaussian Process GP_IA solution.
LGCP Log Gaussian Cox Process intensity estimate for 1D and
2D data
Model assessment and comparison:
GP_DIC The DIC statistics and effective number of parameters in a GP model
GP_KFCV K-fold cross validation for a GP model
GP_LOOE Evaluate the leave-one-out predictive density in case of
Gaussian observation model
GP_LOOG Evaluate the gradient of the leave-one-out predictive
density (GP_LOOE) in case of Gaussian observation model
GP_EPLOOE Evaluate the leave-one-out predictive density in case of
non-Gaussian observation model and EP
EP_LOOPRED Leave-one-out-predictions with Gaussian Process EP approximation
GP_PEFF The effective number of parameters in GP model with focus
on latent variables.
Metrics:
METRIC_EUCLIDEAN An Euclidean distance for Gaussian process models.
Misc:
LDLROWMODIFY Function to modify the sparse Cholesky factorization
             L*D*L' = C, when a row and column k of C have changed
LDLROWUPDATE Multiple-rank update or downdate of a sparse LDL factorization.
SPINV Evaluate the sparsified inverse matrix
SCALED_HMC A scaled hybrid Monte Carlo sampling for latent values
SCALED_MH A scaled Metropolis-Hastings sampling for latent values
GP_INSTALL Matlab function to compile all the c-files to mex in the
GPstuff/gp folder.
Demonstration programs:
DEMO_BINOMIAL1 Demonstration of Gaussian process model with binomial
likelihood
DEMO_BINOMIAL_APC Demonstration for modeling age-period-cohort data
by a binomial model combined with GP prior.
DEMO_CLASSIFIC Classification problem demonstration for 2 classes
DEMO_LGCP Demonstration for a log Gaussian Cox process
with inference via EP or Laplace approximation
DEMO_MODELASSESMENT1 Demonstration for model assessment with DIC, number
of effective parameters and ten-fold cross validation
DEMO_MODELASSESMENT2 Demonstration for model assessment when the observation
model is non-Gaussian
DEMO_NEURALNETWORKCOV Demonstration of Gaussian process with a neural
network covariance function
DEMO_PERIODIC Regression problem demonstration for periodic data
DEMO_REGRESSION1 Regression problem demonstration for 2-input
function with Gaussian process
DEMO_REGRESSION_PPCS Regression problem demonstration for 2-input
function with Gaussian process using CS covariance
DEMO_REGRESSION_ADDITIVE1 Regression problem demonstration with additive model
DEMO_REGRESSION_ADDITIVE2 Regression demonstration with additive Gaussian
process using linear, squared exponential and
neural network covariance functions
DEMO_REGRESSION_HIER Hierarchical regression demonstration
DEMO_REGRESSION_ROBUST A regression demo with Student-t distribution as a
residual model.
DEMO_REGRESSION_SPARSE1 Regression problem demonstration for 2-input
function with sparse Gaussian processes
DEMO_REGRESSION_SPARSE2 Regression demo comparing different sparse
approximations
DEMO_SPATIAL1 Demonstration for a disease mapping problem
with Gaussian process prior and Poisson likelihood
DEMO_SPATIAL2 Demonstration for a disease mapping problem with
Gaussian process prior and negative binomial
observation model
A.2 Diagnostic tools
THE DIAGNOSTIC TOOLS (in the diag-folder):
Convergence diagnostics
PSRF - Potential Scale Reduction Factor
CPSRF - Cumulative Potential Scale Reduction Factor
MPSRF - Multivariate Potential Scale Reduction Factor
CMPSRF - Cumulative Multivariate Potential Scale Reduction Factor
IPSRF - Interval-based Potential Scale Reduction Factor
CIPSRF - Cumulative Interval-based Potential Scale Reduction Factor
KSSTAT - Kolmogorov-Smirnov goodness-of-fit hypothesis test
HAIR - Brooks' hairiness convergence diagnostic
CUSUM - Yu-Mykland convergence diagnostic for MCMC
SCORE - Calculate score-function convergence diagnostic
GBINIT - Initial iterations for Gibbs iteration diagnostic
GBITER - Estimate number of additional Gibbs iterations
Time series analysis
ACORR - Estimate autocorrelation function of time series
ACORRTIME - Estimate autocorrelation evolution of time series (simple)
GEYER_ICSE - Compute autocorrelation time tau using Geyer's
initial convex sequence estimator
(requires Optimization toolbox)
GEYER_IMSE - Compute autocorrelation time tau using Geyer's
initial monotone sequence estimator
Kernel density estimation etc.:
KERNEL1 - 1D Kernel density estimation of data
KERNELS - Kernel density estimation of independent components of data
KERNELP - 1D Kernel density estimation, with automatic kernel width
NDHIST - Normalized histogram of N-dimensional data
HPDI - Estimates the Bayesian HPD intervals
Manipulation of MCMC chains
THIN - Delete burn-in and thin MCMC-chains
JOIN - Join similar structures of arrays to one structure of arrays
BATCH - Batch MCMC sample chain and evaluate mean/median of batches
Misc:
CUSTATS - Calculate cumulative statistics of data
BBPRCTILE - Bayesian bootstrap percentile
GRADCHEK - Checks a user-defined gradient function using finite
differences.
DERIVATIVECHECK - Compare user-supplied derivatives to
finite-differencing derivatives.
A.3 Distributions
PROBABILITY DISTRIBUTION FUNCTIONS (in the dist-folder):
Priors
PRIOR_FIXED Fix parameter to its current value
PRIOR_GAMMA Gamma prior structure
PRIOR_INVGAMMA Inverse-gamma prior structure
PRIOR_LAPLACE Laplace (double exponential) prior structure
PRIOR_LOGLOGUNIF Uniform prior structure for the log(log(parameter))
PRIOR_LOGUNIF Uniform prior structure for the logarithm of the parameter
PRIOR_GAUSSIAN Gaussian prior structure
PRIOR_LOGGAUSSIAN Log-Gaussian prior structure
PRIOR_SINVCHI2 Scaled inverse-chi-square prior structure
PRIOR_T Student-t prior structure
PRIOR_SQRTT Student-t prior structure for the square root of the
parameter
PRIOR_UNIF Uniform prior structure
PRIOR_SQRTUNIF Uniform prior structure for the square root of the
parameter
Probability density functions
BETA_LPDF - Beta log-probability density function (lpdf).
BETA_PDF - Beta probability density function (pdf).
DIR_LPDF - Log probability density function of uniform Dirichlet
distribution
DIR_PDF - Probability density function of uniform Dirichlet
distribution
GAM_CDF - Cumulative of Gamma probability density function (cdf).
GAM_LPDF - Log of Gamma probability density function (lpdf).
GAM_PDF - Gamma probability density function (pdf).
GEO_LPDF - Geometric log probability density function (lpdf).
INVGAM_LPDF - Inverse-Gamma log probability density function.
INVGAM_PDF - Inverse-Gamma probability density function.
LAPLACE_LPDF - Laplace log-probability density function (lpdf).
LAPLACE_PDF - Laplace probability density function (pdf).
LOGN_LPDF - Log normal log-probability density function (lpdf)
LOGT_LPDF - Log probability density function (lpdf) for log Student's T
MNORM_LPDF - Multivariate-Normal log-probability density function (lpdf).
MNORM_PDF - Multivariate-Normal probability density function (pdf).
NORM_LPDF - Normal log-probability density function (lpdf).
NORM_PDF - Normal probability density function (pdf).
POISS_LPDF - Poisson log-probability density function.
POISS_PDF - Poisson probability density function.
SINVCHI2_LPDF - Scaled inverse-chi log-probability density function.
SINVCHI2_PDF - Scaled inverse-chi probability density function.
T_LPDF - Student's T log-probability density function (lpdf)
T_PDF - Student's T probability density function (pdf)
Random number generators
CATRAND - Random matrices from categorical distribution.
DIRRAND - Uniform Dirichlet random vectors
EXPRAND - Random matrices from exponential distribution.
GAMRAND - Random matrices from gamma distribution.
INTRAND - Random matrices from uniform integer distribution.
INVGAMRAND - Random matrices from inverse gamma distribution
INVGAMRAND1 - Random matrices from inverse gamma distribution
INVWISHRND - Random matrices from inverse Wishart distribution.
NORMLTRAND - Random draws from a left-truncated normal
distribution, with mean = mu, variance = sigma2
NORMRTRAND - Random draws from a right-truncated normal
distribution, with mean = mu, variance = sigma2
NORMTRAND - Random draws from a normal truncated to interval
NORMTZRAND - Random draws from a normal distribution truncated by zero
SINVCHI2RAND - Random matrices from scaled inverse-chi distribution
TRAND - Random numbers from Student's t-distribution
UNIFRAND - Generate uniform random numbers from the interval [A,B]
WISHRND - Random matrices from Wishart distribution.
Others
KERNELP - Kernel density estimator for one dimensional distribution.
HAMMERSLEY - Hammersley quasi-random sequence
A.4 Monte Carlo
MONTE CARLO FUNCTIONS (in the mc-folder):
BBMEAN - Bayesian bootstrap mean
GIBBS - Gibbs sampling
HMC2 - Hybrid Monte Carlo sampling.
HMC2_OPT - Default options for Hybrid Monte Carlo sampling.
HMEAN - Harmonic mean
METROP2 - Markov Chain Monte Carlo sampling with Metropolis algorithm.
METROP2_OPT - Default options for Metropolis sampling.
RESAMPDET - Deterministic resampling
RESAMPRES - Residual resampling
RESAMPSIM - Simple random resampling
RESAMPSTR - Stratified resampling
SLS - Markov Chain Monte Carlo sampling using Slice Sampling
SLS_OPT - Default options for Slice Sampling
SLS1MM - 1-dimensional fast minmax Slice Sampling
SLS1MM_OPT - Default options for SLS1MM_OPT
SOFTMAX2 - Softmax transfer function
A.5 Miscellaneous
MISCELLANEOUS FUNCTIONS (in the misc-folder):
CVIT - Create itr and itst indices for k-fold-cv
CVITR - Create itr and itst indices for k-fold-cv with random
permutation
MAPCOLOR - returns a colormap ranging from blue through gray
to red
MAPCOLOR2 - Create a blue-gray-red colormap.
M2KML - Converts GP prediction results to a KML file
QUAD_MOMENTS - Calculate the 0th, 1st and 2nd moment of a given (unnormalized)
probability distribution
RANDPICK - Pick element from x randomly
If x is matrix, pick row from x randomly.
STR2FUN - Compatibility wrapper to str2func
SET_PIC - Set the inducing inputs and blocks for two dimensional input data
WMEAN - weighted mean
A.6 Optimization
OPTIMIZATION FUNCTIONS (in the optim-folder):
BSEARCH - Finds the minimum of a combinatorial function using backward search
BSEARCH_OPT - Default options for backward search
FSEARCH - Finds the minimum of a combinatorial function using forward search
FSEARCH_OPT - Default options for forward search
SCGES - Scaled conjugate gradient optimization with early stopping
SCGES_OPT - Default options for scaled conjugate gradient optimization
SCGES - Scaled conjugate gradient optimization with early stopping (new options structure).
SCG2 - Scaled conjugate gradient optimization
SCG2_OPT - Default options for scaled conjugate gradient optimization (scg2) (new options structure).
FMINSCG - Scaled conjugate gradient optimization
FMINLBFGS - (Limited memory) Quasi Newton
Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Appendix B
Covariance functions
In this section we summarize all the covariance functions in the GPstuff package.
Squared exponential covariance function (gpcf_sexp)
Probably the most widely-used covariance function is the squared exponential (SE)

k(x_i, x_j) = σ^2_{sexp} \exp\left( -\frac{1}{2} \sum_{k=1}^{d} \frac{(x_{i,k} - x_{j,k})^2}{l_k^2} \right).   (B.1)

The length-scale l_k governs the correlation scale in input dimension k and the magnitude σ^2_{sexp} the overall variability of the process. A squared exponential covariance function leads to very smooth GPs that are infinitely many times mean square differentiable.
Exponential covariance function (gpcf_exp)
The exponential covariance function is defined as

k(x_i, x_j) = σ^2_{exp} \exp\left( -\sqrt{ \sum_{k=1}^{d} \frac{(x_{i,k} - x_{j,k})^2}{l_k^2} } \right).   (B.2)

The parameters l_k and σ^2_{exp} have a similar role as with the SE covariance function. The exponential covariance function leads to very rough GPs that are not mean square differentiable.
Matérn class of covariance functions (gpcf_maternXX)

The Matérn class of covariance functions is given by

k_ν(x_i, x_j) = σ^2_{m} \frac{2^{1-ν}}{Γ(ν)} \left( \sqrt{2ν}\, r \right)^{ν} K_ν\left( \sqrt{2ν}\, r \right),   (B.3)

where r = \left( \sum_{k=1}^{d} \frac{(x_{i,k} - x_{j,k})^2}{l_k^2} \right)^{1/2}. The parameter ν governs the smoothness of the process, and K_ν is a modified Bessel function (Abramowitz and Stegun, 1970, sec. 9.6). The Matérn covariance functions can be represented in a simpler form when ν is a
half-integer. The Matérn covariance functions with ν = 3/2 (gpcf_matern32) and ν = 5/2 (gpcf_matern52) are

k_{ν=3/2}(x_i, x_j) = σ^2_{m} \left( 1 + \sqrt{3}\, r \right) \exp\left( -\sqrt{3}\, r \right)   (B.4)
k_{ν=5/2}(x_i, x_j) = σ^2_{m} \left( 1 + \sqrt{5}\, r + \frac{5 r^2}{3} \right) \exp\left( -\sqrt{5}\, r \right).   (B.5)
Neural network covariance function (gpcf_neuralnetwork)
A neural network with a suitable transfer function and prior distribution converges to a GP as the number of hidden units in the network approaches infinity (Neal, 1996; Williams, 1996; Rasmussen and Williams, 2006). A nonstationary neural network covariance function is

k(x_i, x_j) = \frac{2}{π} \sin^{-1}\left( \frac{ 2\, \tilde{x}_i^T Σ \tilde{x}_j }{ \sqrt{ (1 + 2\, \tilde{x}_i^T Σ \tilde{x}_i)(1 + 2\, \tilde{x}_j^T Σ \tilde{x}_j) } } \right),   (B.6)

where \tilde{x} = (1, x_1, ..., x_d)^T is an input vector augmented with 1, and Σ = diag(σ^2_0, σ^2_1, ..., σ^2_d) is a diagonal weight prior, where σ^2_0 is the variance of the bias parameter controlling the function's offset from the origin. The variances of the weight parameters are σ^2_1, ..., σ^2_d; with small values for the weights the neural network covariance function produces smooth and rigid looking functions, while larger values for the weight variances produce more flexible solutions.
Constant covariance function (gpcf_constant)
Perhaps the simplest covariance function is the constant covariance function

k(x_i, x_j) = σ^2   (B.7)

with variance parameter σ^2. This function can be used to implement the constant term in the dot product covariance function (Rasmussen and Williams, 2006) reviewed below.
Linear covariance function (gpcf_linear)
The linear covariance function is

k(x_i, x_j) = x_i^T Σ x_j,   (B.8)

where the diagonal matrix Σ = diag(σ^2_1, ..., σ^2_D) contains the prior variances of the linear model coefficients. Combining this with the constant function above, we can form the covariance function which Rasmussen and Williams (2006) call a dot product covariance function,

k(x_i, x_j) = σ^2 + x_i^T Σ x_j.   (B.9)
Piecewise polynomial functions (gpcf_ppcsX)
The piecewise polynomial functions are the only compactly supported covariance functions (see Section 4) in GPstuff. There are four of them, with the following forms:

k_{pp0}(x_i, x_j) = σ^2 (1 - r)_{+}^{j}   (B.10)
k_{pp1}(x_i, x_j) = σ^2 (1 - r)_{+}^{j+1} \left( (j + 1) r + 1 \right)   (B.11)
k_{pp2}(x_i, x_j) = \frac{σ^2}{3} (1 - r)_{+}^{j+2} \left( (j^2 + 4j + 3) r^2 + (3j + 6) r + 3 \right)   (B.12)
k_{pp3}(x_i, x_j) = \frac{σ^2}{15} (1 - r)_{+}^{j+3} \left( (j^3 + 9j^2 + 23j + 15) r^3 + (6j^2 + 36j + 45) r^2 + (15j + 45) r + 15 \right),   (B.13)

where j = \lfloor d/2 \rfloor + q + 1. These functions correspond to processes that are 2q times mean square differentiable at zero and positive definite up to the dimension d (Wendland, 2005). The covariance functions are named gpcf_ppcs0, gpcf_ppcs1, gpcf_ppcs2, and gpcf_ppcs3.
Rational quadratic covariance function (gpcf_rq)
The rational quadratic (RQ) covariance function (Rasmussen and Williams, 2006)

k_{RQ}(x_i, x_j) = \left( 1 + \frac{1}{2α} \sum_{k=1}^{d} \frac{(x_{i,k} - x_{j,k})^2}{l_k^2} \right)^{-α}   (B.14)

can be seen as a scale mixture of squared exponential covariance functions with different length-scales. The smaller the parameter α > 0 is, the more diffuse the length-scales of the mixing components are. The parameter l_k > 0 characterizes the typical length-scale of the individual components in input dimension k.
Periodic covariance function (gpcf_periodic)
Many real world systems exhibit periodic phenomena, which can be modelled with a periodic covariance function. One possible construction (Rasmussen and Williams, 2006) is

k(x_i, x_j) = \exp\left( -\sum_{k=1}^{d} \frac{ 2 \sin^2\left( π (x_{i,k} - x_{j,k}) / γ \right) }{ l_k^2 } \right),   (B.15)

where the parameter γ controls the inverse length of the periodicity and l_k the smoothness of the process in dimension k.
Product covariance function (gpcf_prod)

A product of two or more covariance functions, k_1(x, x') k_2(x, x') ..., is a valid covariance function as well. Combining covariance functions in product form can be done with gpcf_prod, for which the user can freely specify the covariance functions to be multiplied with each other from the collection of covariance functions implemented in GPstuff.
Categorical covariance function (gpcf_cat)
The categorical covariance function gpcf_cat returns correlation 1 if the input values are equal and 0 otherwise:

k(x_i, x_j) = \begin{cases} 1 & \text{if } x_i - x_j = 0 \\ 0 & \text{otherwise.} \end{cases}   (B.16)

The categorical covariance function can be combined with other covariance functions using gpcf_prod, for example, to produce hierarchical models.
Appendix C
Observation models
Here, we summarize all the observation models in GPstuff. Most of them are implemented in files lik_*, which serves as a reminder that at the inference step they are treated as likelihood functions.
Gaussian (lik_gaussian)
The i.i.d. Gaussian noise model with variance \sigma^2 is

\mathbf{y} \,|\, \mathbf{f}, \sigma^2 \sim N(\mathbf{f}, \sigma^2 \mathbf{I}).    (C.1)
Student-t (lik_t, lik_gaussiansmt)
The Student-t observation model (implemented in lik_t) is

\mathbf{y} \,|\, \mathbf{f}, \nu, \sigma_t \sim \prod_{i=1}^{n} \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi}\,\sigma_t} \left(1 + \frac{(y_i - f_i)^2}{\nu\sigma_t^2}\right)^{-(\nu+1)/2},    (C.2)

where \nu is the degrees of freedom and \sigma_t the scale parameter. The scale mixture version of the Student-t distribution is implemented in lik_gaussiansmt and it is parametrized as

y_i \,|\, f_i, \alpha, U_i \sim N(f_i, \alpha U_i)    (C.3)
U_i \sim \text{Inv-}\chi^2(\nu, \tau^2),    (C.4)

where each observation has its own noise variance \alpha U_i (Neal, 1997; Gelman et al., 2004).
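For reference, a minimal sketch (plain MATLAB, illustrative variable names) of evaluating the Student-t log-likelihood (C.2) for given latent values:

  % Student-t log-likelihood (C.2); y and f are n x 1 vectors,
  % nu the degrees of freedom and sigma_t the scale parameter.
  y = randn(10, 1);  f = zeros(10, 1);  nu = 4;  sigma_t = 0.5;
  loglik = sum( gammaln((nu + 1)/2) - gammaln(nu/2) ...
                - 0.5*log(nu*pi) - log(sigma_t) ...
                - (nu + 1)/2 * log1p((y - f).^2 ./ (nu*sigma_t^2)) );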
Logit (lik_logit)
The logit transformation gives the probability of y_i being 1 or -1 as

p_{\text{logit}}(y_i \,|\, f_i) = \frac{1}{1 + \exp(-y_i f_i)}.    (C.5)
Probit (lik_probit)
The probit transformation gives the probability of y_i being 1 or -1 as

p_{\text{probit}}(y_i \,|\, f_i) = \Phi(y_i f(\mathbf{x}_i)) = \int_{-\infty}^{y_i f(\mathbf{x}_i)} N(z \,|\, 0, 1)\, dz.    (C.6)
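Both class-conditional probabilities can be evaluated directly; the sketch below is plain MATLAB with invented variable names, using erfc for the standard normal cumulative distribution to avoid toolbox dependencies.

  % Logit (C.5) and probit (C.6) likelihood terms for labels y in {-1, +1}.
  y = [1 -1 1]';  f = [0.3 -1.2 2.0]';
  p_logit  = 1 ./ (1 + exp(-y .* f));
  p_probit = 0.5 * erfc(-(y .* f) / sqrt(2));   % Phi(y_i * f_i) via erfc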
Poisson (lik_poisson)
The Poisson observation model with expected number of cases e is
\mathbf{y} \,|\, \mathbf{f}, \mathbf{e} \sim \prod_{i=1}^{n} \text{Poisson}(y_i \,|\, \exp(f_i) e_i).    (C.7)
Negative-Binomial (lik_negbin)
The negative-binomial is a robustified version of the Poisson distribution. It is parametrized as

\mathbf{y} \,|\, \mathbf{f}, \mathbf{e}, r \sim \prod_{i=1}^{n} \frac{\Gamma(r + y_i)}{y_i!\,\Gamma(r)} \left(\frac{r}{r + \mu_i}\right)^{r} \left(\frac{\mu_i}{r + \mu_i}\right)^{y_i},    (C.8)

where \mu_i = e_i \exp(f_i), r is the dispersion parameter governing the variance, e_i is the expected number of cases and y_i is a positive integer giving the observed count.
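A sketch of evaluating the negative-binomial log-likelihood (C.8) in plain MATLAB (illustrative variable names; e holds the expected counts and r the dispersion):

  % Negative-binomial log-likelihood (C.8) with mean mu_i = e_i * exp(f_i).
  y = [2 0 5]';  f = [0.1 -0.4 0.7]';  e = [1.5 2.0 1.0]';  r = 3;
  mu = e .* exp(f);
  loglik = sum( gammaln(r + y) - gammaln(y + 1) - gammaln(r) ...
                + r*log(r) + y.*log(mu) - (r + y).*log(r + mu) );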
Binomial (lik_binomial)
The binomial observation model with probability of success p_i = \exp(f_i)/(1 + \exp(f_i)) is

\mathbf{y} \,|\, \mathbf{f}, \mathbf{z} \sim \prod_{i=1}^{N} \frac{z_i!}{y_i!(z_i - y_i)!}\, p_i^{y_i} (1 - p_i)^{z_i - y_i}.    (C.9)

Here, z_i denotes the number of trials and y_i is the number of successes.
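A corresponding sketch for the binomial log-likelihood (C.9), again plain MATLAB with invented variable names and gammaln used for the log-factorials:

  % Binomial log-likelihood (C.9); z holds the numbers of trials,
  % y the numbers of successes and f the latent values.
  y = [3 0 7]';  z = [10 5 8]';  f = [-0.2 0.5 1.3]';
  p = 1 ./ (1 + exp(-f));                        % p_i = exp(f_i)/(1 + exp(f_i))
  loglik = sum( gammaln(z + 1) - gammaln(y + 1) - gammaln(z - y + 1) ...
                + y.*log(p) + (z - y).*log(1 - p) );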
Appendix D
Priors
This chapter lists all the priors implemented in the GPstuff package.
Gaussian prior (prior_gaussian)
The Gaussian distribution is parametrized as
p(\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(\theta - \mu)^2\right),    (D.1)

where \mu is a location parameter and \sigma^2 is a scale parameter.
Log-Gaussian prior (prior_loggaussian)
The log-Gaussian distribution is parametrized as
p(\theta) = \frac{1}{\theta\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(\log(\theta) - \mu)^2\right),    (D.2)

where \mu is a location parameter and \sigma^2 is a scale parameter.
Laplace prior (prior_laplace)
The Laplace distribution is parametrized as
p(\theta) = \frac{1}{2\sigma} \exp\left(-\frac{|\theta - \mu|}{\sigma}\right),    (D.3)

where \mu is a location parameter and \sigma > 0 is a scale parameter.
Student-t prior (prior_t)
The Student-t distribution is parametrized as
p(\theta) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi\sigma^2}} \left(1 + \frac{(\theta - \mu)^2}{\nu\sigma^2}\right)^{-(\nu+1)/2},    (D.4)

where \mu is a location parameter, \sigma^2 is a scale parameter and \nu > 0 is the degrees of freedom.
79
80 APPENDIX D. PRIORS
Square root Student-t prior (prior_sqrtt)
The square root Student-t distribution is parametrized as
p(\theta^{1/2}) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi\sigma^2}} \left(1 + \frac{(\theta^{1/2} - \mu)^2}{\nu\sigma^2}\right)^{-(\nu+1)/2},    (D.5)

where \mu is a location parameter, \sigma^2 is a scale parameter and \nu > 0 is the degrees of freedom.
Scaled inverse-\chi^2 prior (prior_sinvchi2)
The scaled inverse-\chi^2 distribution is parametrized as

p(\theta) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} (s^2)^{\nu/2}\, \theta^{-(\nu/2+1)} e^{-\nu s^2/(2\theta)},    (D.6)

where s^2 is a scale parameter and \nu > 0 is the degrees of freedom parameter.
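As the least standard of the densities listed here, (D.6) is sketched below in plain MATLAB on the log scale (illustrative variable names, not GPstuff code):

  % Log density of the scaled inverse-chi-squared distribution (D.6)
  % at theta, with scale s2 and degrees of freedom nu.
  theta = 0.7;  s2 = 1;  nu = 4;
  logp = (nu/2)*log(nu/2) - gammaln(nu/2) + (nu/2)*log(s2) ...
         - (nu/2 + 1)*log(theta) - nu*s2/(2*theta);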
Gamma prior (prior_gamma)
The gamma distribution is parametrized as
p(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta^{\alpha - 1} e^{-\beta\theta},    (D.7)

where \alpha > 0 is a shape parameter and \beta > 0 is an inverse scale parameter.
Inverse-gamma prior (prior_invgamma)
The inverse-gamma distribution is parametrized as
p(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta^{-(\alpha + 1)} e^{-\beta/\theta},    (D.8)

where \alpha > 0 is a shape parameter and \beta > 0 is a scale parameter.
Uniform prior (prior_unif)
The uniform prior is parametrized as
p(\theta) \propto 1.    (D.9)
Square root uniform prior (prior_sqrtunif)
The square root uniform prior is parametrized as
p(\theta^{1/2}) \propto 1.    (D.10)
Log-uniform prior (prior_logunif)
The log-uniform prior is parametrized as
p(\log(\theta)) \propto 1.    (D.11)
Log-log-uniform prior (prior_loglogunif)
The log-log-uniform prior is parametrized as
p(\log(\log(\theta))) \propto 1.    (D.12)
Appendix E
Transformation of hyperparameters
The inference on the parameters of the covariance functions is conducted mainly in a transformed space. The most often used transformation is the log transformation, which has the advantage that the parameter space (0, \infty) is mapped onto (-\infty, \infty). The change of parametrization has to be taken into account in the evaluation of the probability densities of the model. If a parameter \theta with probability density p_\theta(\theta) is transformed into the parameter w = f(\theta) with an equal number of components, the probability density of w is given by

p_w(w) = |J|\, p_\theta(f^{-1}(w)),    (E.1)

where J is the Jacobian of the transformation \theta = f^{-1}(w). Parameter transformations are discussed briefly, for example, in Gelman et al. (2004, p. 24).
Under the log transformation w = \log(\theta), the probability densities p_\theta(\theta) are changed to the densities

p_w(w) = |J|\, p_\theta(\exp(w)) = |J|\, p_\theta(\theta),    (E.2)

where the Jacobian is J = \partial\exp(w)/\partial w = \exp(w) = \theta. Now, given the Gaussian observation model (see Section 2.1.2), the posterior of w can be written as

p_w(w \,|\, \mathcal{D}) \propto p(\mathbf{y} \,|\, X, \theta)\, p(\theta \,|\, \gamma)\, \theta,    (E.3)

which leads to the energy function

E(w) = -\log p(\mathbf{y} \,|\, X, \theta) - \log p(\theta \,|\, \gamma) - \log(|\theta|) = E(\theta) - \log(\theta),

where the absolute value signs around \theta are dropped in the last form because \theta is strictly positive. Thus, the log transformation simply adds the term -\log(\theta) to the energy function.
The inference on w also requires the gradients of the energy function E(w). These
can be obtained easily with the chain rule

\frac{\partial E(w)}{\partial w} = \frac{\partial\left[E(\theta) - \log(|J|)\right]}{\partial\theta}\, \frac{\partial\theta}{\partial w} = \left(\frac{\partial E(\theta)}{\partial\theta} - \frac{\partial\log(|J|)}{\partial\theta}\right) \frac{\partial\theta}{\partial w} = \left(\frac{\partial E(\theta)}{\partial\theta} - \frac{1}{|J|}\frac{\partial|J|}{\partial\theta}\right) J.    (E.4)

Here we have used the fact that the last term, the derivative of \theta with respect to w, is the same as the Jacobian J = \partial\theta/\partial w = \partial f^{-1}/\partial w. In the case of the log transformation the Jacobian can be replaced by \theta, and the gradient reduces to the simple expression

\frac{\partial E(w)}{\partial w} = \frac{\partial E(\theta)}{\partial\theta}\,\theta - 1.    (E.5)
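The bookkeeping above can be checked with a small toy example (plain MATLAB, not GPstuff internals; the quadratic energy is an arbitrary choice made only for this illustration):

  % Log transformation w = log(theta): E(w) = E(theta) - log(theta) and,
  % by (E.5), dE(w)/dw = dE(theta)/dtheta * theta - 1.
  E   = @(theta) 0.5*(theta - 2).^2;            % toy energy in theta space
  dE  = @(theta) theta - 2;                     % its gradient
  Ew  = @(w) E(exp(w)) - w;                     % energy in the transformed space
  dEw = @(w) dE(exp(w)).*exp(w) - 1;            % gradient via the chain rule
  w = 0.3;
  err = (Ew(w + 1e-6) - Ew(w - 1e-6))/2e-6 - dEw(w);   % finite-difference check, ~0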
Bibliography
Abramowitz, M. and Stegun, I. A. (1970). Handbook of mathematical functions. Dover Publications, Inc.
Ahmad, O. B., Boschi-Pinto, C., Lopez, A. D., Murray, C. J., Lozano, R., and Inoue, M. (2000). Age standardization of rates: A new WHO standard. GPE Discussion Paper Series, 31.
Alvarez, M. A., Luengo, D., Titsias, M. K., and Lawrence, N. D. (2010). Efficient multioutput Gaussian processes through variational inducing kernels. JMLR Workshop and Conference Proceedings, 9:25–32.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2004). Hierarchical Modelling and Analysis for Spatial Data. Chapman & Hall/CRC.
Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society B, 70(4):825–848.
Bernardo, J. M. (1979). Expected information as expected utility. Annals of Statistics, 7(3):686–690.
Bernardo, J. M. and Smith, A. F. M. (2000). Bayesian Theory. John Wiley & Sons, Ltd.
Best, N., Richardson, S., and Thomson, A. (2005). A comparison of Bayesian spatial models for disease mapping. Statistical Methods in Medical Research, 14:35–59.
Buhmann, M. D. (2001). A new class of radial basis functions with compact support. Mathematics of Computation, 70(233):307–318.
Burman, P. (1989). A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503–514.
Christensen, O. F., Roberts, G. O., and Sköld, M. (2006). Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models. Journal of Computational and Graphical Statistics, 15:1–17.
Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley & Sons, Inc.
Csató, L. and Opper, M. (2002). Sparse online Gaussian processes. Neural Computation, 14(3):641–669.
Davis, T. A. (2006). Direct Methods for Sparse Linear Systems. SIAM.
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1924.
Diggle, P. J. and Ribeiro, P. J. (2007). Model-based Geostatistics. Springer Science+Business Media, LLC.
Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). Model-based geostatistics. Journal of the Royal Statistical Society. Series C (Applied Statistics), 47(3):299–350.
Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
Elliot, P., Wakefield, J., Best, N., and Briggs, D., editors (2001). Spatial Epidemiology: Methods and Applications. Oxford University Press.
Finkenstädt, B., Held, L., and Isham, V. (2007). Statistical Methods for Spatio-Temporal Systems. Chapman & Hall/CRC.
Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523.
Gaspari, G. and Cohn, S. (1999). Construction of correlation functions in two and three dimensions. Quarterly Journal of the Royal Meteorological Society, 125(554):723–757.
Gelfand, A. E., Diggle, P. J., Fuentes, M., and Guttorp, P. (2010). Handbook of Spatial Statistics. CRC Press.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515–533.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis. Chapman & Hall/CRC, second edition.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):721–741.
Gibbs, M. N. and Mackay, D. J. C. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458–1464.
Gilks, W., Richardson, S., and Spiegelhalter, D. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall.
Gneiting, T. (1999). Correlation functions for atmospheric data analysis. Quarterly Journal of the Royal Meteorological Society, 125:2449–2464.
Gneiting, T. (2002). Compactly supported correlation functions. Journal of Multivariate Analysis, 83:493–508.
Goel, P. K. and Degroot, M. H. (1981). Information about hyperparameters in hierarchical models. Journal of the American Statistical Association, 76(373):140–147.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society. Series B (Methodological), 14(1):107–114.
Grewal, M. S. and Andrews, A. P. (2001). Kalman Filtering: Theory and Practice Using Matlab. Wiley Interscience, second edition.
Harville, D. A. (1997). Matrix Algebra From a Statistician's Perspective. Springer-Verlag.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.
Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545–1555.
Kuss, M. and Rasmussen, C. E. (2005). Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704.
Lawrence, N. (2007). Learning for larger datasets with the Gaussian process latent variable model. In Meila, M. and Shen, X., editors, Proceedings of the Eleventh International Workshop on Artificial Intelligence and Statistics. Omnipress.
Lawson, A. B. (2001). Statistical Methods in Spatial Epidemiology. John Wiley & Sons, Ltd.
Matheron, G. (1973). The intrinsic random functions and their applications. Advances in Applied Probability, 5(3):439–468.
Minka, T. (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology.
Møller, J., Syversveen, A. R., and Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25:451–482.
Moreaux, G. (2008). Compactly supported radial covariance functions. Journal of Geodesy, 82(7):431–443.
Neal, R. (1998). Regression and classification using Gaussian process priors. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics 6, pages 475–501. Oxford University Press.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
Neal, R. M. (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report 9702, Dept. of Statistics and Dept. of Computer Science, University of Toronto.
Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3):705–767.
Nickisch, H. and Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078.
O'Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B, 40(1):1–42.
O'Hagan, A. (1979). On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society. Series B, 41(3):358–367.
O'Hagan, A. (2004). Dicing with the unknown. Significance, 1:132–133.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(3):1939–1959.
Rasmussen, C. E. (1996). Evaluations of Gaussian Processes and Other Methods for Non-linear Regression. PhD thesis, University of Toronto.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.
Rathbun, S. L. and Cressie, N. (1994). Asymptotic properties of estimators for the parameters of spatial inhomogeneous Poisson point processes. Advances in Applied Probability, 26(1):122–154.
Richardson, S. (2003). Spatial models in epidemiological applications. In Green, P. J., Hjort, N. L., and Richardson, S., editors, Highly Structured Stochastic Systems, pages 237–259. Oxford University Press.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer, second edition.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society B, 71(2):319–392.
Sanchez, S. M. and Sanchez, P. J. (2005). Very large fractional factorials and central composite designs. ACM Transactions on Modeling and Computer Simulation, 15:362–377.
Sansò, F. and Schuh, W.-D. (1987). Finite covariance functions. Journal of Geodesy, 61(4):331–347.
Seeger, M. (2005). Expectation propagation for exponential families. Technical report, Max Planck Institute for Biological Cybernetics, Tübingen, Germany.
Seeger, M. (2008). Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759–813.
Seeger, M., Williams, C. K. I., and Lawrence, N. (2003). Fast forward selection to speed up sparse Gaussian process regression. In Bishop, C. M. and Frey, B. J., editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics.
Snelson, E. (2007). Flexible and Efficient Gaussian Process Models for Machine Learning. PhD thesis, University College London.
Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neural Information Processing Systems 18. The MIT Press.
Snelson, E. and Ghahramani, Z. (2007). Local and global sparse Gaussian process approximations. In Meila, M. and Shen, X., editors, Artificial Intelligence and Statistics 11. Omnipress.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B, 64(4):583–639.
Storkey, A. (1999). Efficient Covariance Matrix Methods for Bayesian Gaussian Processes and Hopfield Neural Networks. PhD thesis, University of London.
Sundararajan, S. and Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103–1118.
Takahashi, K., Fagan, J., and Chen, M.-S. (1973). Formation of a sparse bus impedance matrix and its application to short circuit study. In Power Industry Computer Application Conference Proceedings. IEEE Power Engineering Society.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86.
Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. JMLR Workshop and Conference Proceedings, 5:567–574.
Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137:34–42.
Vanhatalo, J., Pietiläinen, V., and Vehtari, A. (2010). Approximate inference for disease mapping with sparse Gaussian processes. Statistics in Medicine, 29(15):1580–1607.
Vanhatalo, J. and Vehtari, A. (2007). Sparse log Gaussian processes via MCMC for spatial epidemiology. JMLR Workshop and Conference Proceedings, 1:73–89.
Vanhatalo, J. and Vehtari, A. (2008). Modelling local and global phenomena with sparse Gaussian processes. In McAllester, D. A. and Myllymäki, P., editors, Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 571–578.
Vanhatalo, J. and Vehtari, A. (2010). Speeding up the binary Gaussian process classification. In Grünwald, P. and Spirtes, P., editors, Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 1–9.
Vehtari, A. and Lampinen, J. (2002). Bayesian model assessment and comparison using cross-validation predictive densities. Neural Computation, 14(10):2439–2468.
Wendland, H. (1995). Piecewise polynomial, positive definite and compactly supported radial functions of minimal degree. Advances in Computational Mathematics, 4(1):389–396.
Wendland, H. (2005). Scattered Data Approximation. Cambridge University Press.
West, M. (1984). Outlier models and prior distributions in Bayesian linear regression. Journal of the Royal Statistical Society. Series B, 46(3):431–439.
Wiener, N. (1949). Extrapolation, Interpolation, and Smoothing of Stationary Time Series. MIT Press.
Williams, C. K. I. (1996). Computing with infinite networks. In Mozer, M. C., Jordan, M. I., and Petsche, T., editors, Advances in Neural Information Processing Systems, volume 9. The MIT Press.
Williams, C. K. I. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 514–520. MIT Press.
Wu, Z. (1995). Compactly supported positive definite radial functions. Advances in Computational Mathematics, 4(1):283–292.