
The Expectation-Maximization (EM) Algorithm

Dr. Simon J.D. Prince


Dept. of Computer Science,
University College London
1 Overview of Generative Models and the EM Algorithm
The Expectation-Maximization (EM) algorithm is used to fit parameters to models where there
are hidden variables. It is frequently used for unsupervised learning of generative models. In
this section we present a general overview of the use of the EM algorithm. In subsequent
sections we present worked examples of its use.
1.1 Generative Model
Assume that we have some visible data {x_1 … x_n}. We would like to describe that data using
a generative model that depends on some parameters θ, which apply to all the data, and hidden
variables {h_1 … h_n} associated with each data point. In the most general form, we can express
this as:

x_i = f(θ, h_i) + g(θ, h_i, ε_i)    (1)
The observed data vector x_i is built up from a deterministic component, f(θ, h_i), and a stochastic
component, g(θ, h_i, ε_i). The term ε_i is a random noise term drawn from some known probability
distribution Pr(ε).
Each data point x_i has a hidden variable h_i associated with it. The hidden variable is generally
of smaller dimensionality than the original data. There is also a small set of parameters θ which
relate to the whole data set {x_1 … x_n}. Each data point x_i is generated deterministically by
combining the appropriate hidden variable h_i with the parameters θ. Ignoring the stochastic
term, we can see that:

x_i ≈ f(θ, h_i)    (2)
Typically, the total number of parameters {θ, h_1 … h_n} will be less than that of the original data
{x_1 … x_n}. In other words, the deterministic part of the model produces a compressed approximation
to the original dataset. We have exploited existing structure in the data {x_1 … x_n} to
describe most of the data variance with a smaller number of parameters.
The remaining unexplained variance is ascribed to the stochastic (noise) term.
1.2 Generating Samples
Let's assume we know the parameters θ. If this is a generative model, then how can we generate
examples from it? Before we can do this, we must know the prior distribution for the hidden
variable, Pr(h). Then the procedure to generate an example x_i is simple:
1. Generate a random hidden vector h_i from the prior distribution Pr(h).
2. Generate a random example of the noise ε_i from the probability distribution Pr(ε).
3. Calculate the deterministic term, f(θ, h_i), and the stochastic term, g(θ, h_i, ε_i), and sum them to
get x_i as in Equation 1.
1.3 Conditional and Joint Probability Densities
The generative model can also be written as a probability distribution describing the probability
of getting an observation x_i given the hidden variable h_i and the parameters θ. The data will
be less likely if we require extreme amounts of noise to adjust the prediction of the deterministic
component to match the data.
To summarize, the generative model can equivalently be written as the probability distribution
Pr(x_i|h_i, θ). To infer the parameters, we will also need to know the joint distribution of the
observed and hidden variables:

Pr(x_i, h_i|θ) = Pr(x_i|h_i, θ) Pr(h_i|θ)
             = Pr(x_i|h_i, θ) Pr(h_i)    (3)
1.4 Inference: The EM Algorithm
Our goal: We aim to maximize the likelihood of the data with respect to the parameters. In
other words, we seek:

θ̂ = argmax_θ [ Pr(θ) ∏_{i=1}^n Pr(x_i|θ) ]    (4)
As usual, instead of maximizing the likelihood directly, we take the logarithm. Since the
logarithm is a monotonic transformation, the maximum will be in the same place.

θ̂ = argmax_θ [ log Pr(θ) + ∑_{i=1}^n log Pr(x_i|θ) ]
  = argmax_θ [ log Pr(θ) + ∑_{i=1}^n log ∫ Pr(x_i, h_i|θ) dh_i ]    (5)
We could try to optimize these parameters directly. However, this often becomes mathematically
unattractive. The EM algorithm takes an alternative strategy: it defines a lower bound on the
log likelihood. It then increases this lower bound by alternating between the Expectation step
(E-Step), in which the lower bound is maximized with respect to a distribution over the hidden
variables q(h), and the Maximization step (M-Step), in which the lower bound is maximized with
respect to the parameters θ.
If the lower bound on the likelihood keeps increasing with each step, then we will eventually
reach a maximum in the likelihood. Note, however, that there is no guarantee that this is a global
maximum.
The particular lower bound that we choose is:

B(q(h), θ) = ∫ q(h) log [ Pr(x, h|θ) / q(h) ] dh    (6)
The complete procedure is:
1. E-Step - Optimize B(q(h), θ) with respect to a distribution over the hidden variables, q(h):

   q^[t](h) = argmax_{q(h)} B(q(h), θ)    (7)

   It can be shown that B(q(h), θ) is maximized when

   q^[t](h) = Pr(h|x, θ^[t-1])    (8)
2. M-Step - Optimize the lower bound with respect to the parameters θ:

   θ^[t] = argmax_θ B(q(h), θ)
        = argmax_θ ∫ q(h) log [ Pr(x, h|θ) / q(h) ] dh
        = argmax_θ [ ∫ q(h) log Pr(x, h|θ) dh − ∫ q(h) log q(h) dh ]
        = argmax_θ ∫ q(h) log Pr(x, h|θ) dh

   where we have eliminated the second term in the last step because it has no dependence
   on the parameters θ. Substituting in the result from the E-Step, we get:

   θ^[t] = argmax_θ ∫ Pr(h|x, θ^[t-1]) log Pr(x, h|θ) dh    (9)
1.5 Summary of the E-M Algorithm
Alternate between the E-Steps and the M-Steps:
E-Step:

q^[t](h) = Pr(h|x, θ^[t-1])    (10)

M-Step:

θ^[t] = argmax_θ ∫ q^[t](h) log Pr(x, h|θ) dh    (11)
2 Mixtures of Gaussians
We will deal with the simple case where the data has been generated by a mixture of two
univariate Gaussians, with different means μ_1 and μ_2, and different standard deviations σ_1 and
σ_2. The probability with which data is generated from the first and second Gaussian is f and
1 − f respectively, where 0 ≤ f ≤ 1.
2.1 Generative Model
The generative model can be expressed as:

x_i = μ_1 + σ_1 ε_i   if h_i = 1
x_i = μ_2 + σ_2 ε_i   if h_i = 2    (12)
where ε_i is a zero-mean Gaussian distributed noise term with unit standard deviation. The
discrete hidden variable h_i takes values of either 1 or 2 and indicates which Gaussian the
particular data point was generated from. The deterministic part of the generative model is the
first term on the right-hand side (modulated by the particular value of h_i). The stochastic part of
the generative model is the second term on the right-hand side (also dependent on the particular
value of h_i).
2.2 Generating Samples
In order to generate new examples from the model, we first generate a value of h_i, which has a
prior distribution:

Pr(h = 1) = f
Pr(h = 2) = 1 − f

Then we generate a random example of the noise ε_i and substitute these values into Equation 12.
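The three-step sampling procedure above can be sketched in NumPy; the function name and the particular parameter values used are illustrative choices, not part of the notes:

```python
import numpy as np

def sample_mixture(n, f, mu1, mu2, sigma1, sigma2, rng=None):
    """Draw n samples from the two-Gaussian mixture of Equation 12."""
    rng = np.random.default_rng() if rng is None else rng
    # 1. Draw the hidden variable h_i from its prior: Pr(h=1) = f, Pr(h=2) = 1 - f.
    h = np.where(rng.random(n) < f, 1, 2)
    # 2. Draw zero-mean, unit-variance Gaussian noise eps_i.
    eps = rng.standard_normal(n)
    # 3. Combine the deterministic and stochastic parts as in Equation 12.
    x = np.where(h == 1, mu1 + sigma1 * eps, mu2 + sigma2 * eps)
    return x, h

x, h = sample_mixture(10000, f=0.3, mu1=-2.0, mu2=3.0, sigma1=0.5, sigma2=1.0)
```

A histogram of x would show two bumps, with roughly a fraction f of the mass around μ_1.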
2.3 Conditional and Joint Probability Densities
The conditional density of x_i given h_i is:

Pr(x_i|h_i = 1, θ) = (1/√(2πσ_1²)) exp[ −(x_i − μ_1)² / (2σ_1²) ]

Pr(x_i|h_i = 2, θ) = (1/√(2πσ_2²)) exp[ −(x_i − μ_2)² / (2σ_2²) ]
The joint probability density can be calculated using

Pr(x_i, h_i|θ) = Pr(x_i|h_i, θ) Pr(h_i|θ)
             = Pr(x_i|h_i, θ) Pr(h_i)    (13)
since there is no probabilistic relationship between the hidden variables h_i and the parameters
θ. Substituting in, we get the following expression for the probability density:

Pr(x_i, h_i = 1|θ) = (f/√(2πσ_1²)) exp[ −(x_i − μ_1)² / (2σ_1²) ]

Pr(x_i, h_i = 2|θ) = ((1 − f)/√(2πσ_2²)) exp[ −(x_i − μ_2)² / (2σ_2²) ]
2.4 Inference: the E-M Algorithm
Given a set of data x = {x_1 … x_n}, infer the parameters θ = {μ_1, μ_2, σ_1, σ_2, f}. We use the EM
algorithm:
1. E-Step - Optimize B(q(h), θ) with respect to a distribution over the hidden variables, q(h).
   From before:

   q^[t](h) = Pr(h|x, θ^[t-1])    (14)
Using Bayes' rule:

Pr(h_i = 1|x_i, θ^[t-1]) = Pr(x_i|h_i = 1, θ^[t-1]) Pr(h_i = 1) / ∑_{j=1}^2 Pr(x_i|h_i = j, θ^[t-1]) Pr(h_i = j)    (15)

and

Pr(h_i = 2|x_i, θ^[t-1]) = Pr(x_i|h_i = 2, θ^[t-1]) Pr(h_i = 2) / ∑_{j=1}^2 Pr(x_i|h_i = j, θ^[t-1]) Pr(h_i = j)    (16)
2. M-Step - Optimize the lower bound with respect to the parameters θ. From before:

   θ^[t] = argmax_θ ∫ Pr(h|x, θ^[t-1]) log Pr(x, h|θ) dh    (17)

   Substituting in for this particular case:

   θ^[t] = argmax_θ ∑_{i=1}^n ∑_{j=1}^2 Pr(h_i = j|x_i, θ^[t-1]) log Pr(x_i, h_i = j|θ)    (18)
Expanding and substituting in, we get:

θ^[t] = argmax_θ ∑_{i=1}^n [ Pr(h_i = 1|x_i, θ^[t-1]) ( log f − log √(2πσ_1²) − (x_i − μ_1)²/(2σ_1²) )
                           + Pr(h_i = 2|x_i, θ^[t-1]) ( log(1 − f) − log √(2πσ_2²) − (x_i − μ_2)²/(2σ_2²) ) ]
We take derivatives with respect to f, μ_1, μ_2, σ_1 and σ_2, set the resulting expressions
to zero, and solve for these parameters to find:

f^[t] = ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1]) / ∑_{i=1}^n [ Pr(h_i = 1|x_i, θ^[t-1]) + Pr(h_i = 2|x_i, θ^[t-1]) ]    (19)

μ_1^[t] = ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1]) x_i / ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1])    (20)

μ_2^[t] = ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1]) x_i / ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1])    (21)

(σ_1^[t])² = ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1]) (x_i − μ_1^[t])² / ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1])    (22)

(σ_2^[t])² = ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1]) (x_i − μ_2^[t])² / ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1])    (23)
2.5 Summary of the E-M Algorithm for Mixtures of Gaussians
Alternate between the E-Steps and the M-Steps:
E-Step:

Pr(h_i = 1|x_i, θ^[t-1]) = Pr(x_i, h_i = 1|θ^[t-1]) / ∑_{j=1}^2 Pr(x_i, h_i = j|θ^[t-1])

Pr(h_i = 2|x_i, θ^[t-1]) = Pr(x_i, h_i = 2|θ^[t-1]) / ∑_{j=1}^2 Pr(x_i, h_i = j|θ^[t-1])    (24)

where:

Pr(x_i, h_i = 1|θ^[t-1]) = (f^[t-1]/√(2π(σ_1^[t-1])²)) exp[ −(x_i − μ_1^[t-1])² / (2(σ_1^[t-1])²) ]

Pr(x_i, h_i = 2|θ^[t-1]) = ((1 − f^[t-1])/√(2π(σ_2^[t-1])²)) exp[ −(x_i − μ_2^[t-1])² / (2(σ_2^[t-1])²) ]
M-Step:

f^[t] = ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1]) / ∑_{i=1}^n [ Pr(h_i = 1|x_i, θ^[t-1]) + Pr(h_i = 2|x_i, θ^[t-1]) ]

μ_1^[t] = ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1]) x_i / ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1])

μ_2^[t] = ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1]) x_i / ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1])

(σ_1^[t])² = ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1]) (x_i − μ_1^[t])² / ∑_{i=1}^n Pr(h_i = 1|x_i, θ^[t-1])

(σ_2^[t])² = ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1]) (x_i − μ_2^[t])² / ∑_{i=1}^n Pr(h_i = 2|x_i, θ^[t-1])    (25)
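The complete E-Step/M-Step loop for the two-Gaussian mixture can be written compactly in NumPy. This is a minimal sketch under an assumed initialisation (means at the data extremes, a shared standard deviation); the function name and initialisation are our own choices, not part of the notes:

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """Fit theta = (f, mu1, mu2, sigma1, sigma2) by EM (Equations 24 and 25)."""
    # Crude initialisation: means at the data extremes, shared spread.
    f, mu1, mu2 = 0.5, x.min(), x.max()
    s1 = s2 = x.std()
    for _ in range(n_iter):
        # E-Step: responsibilities Pr(h_i = j | x_i, theta) via Bayes' rule.
        p1 = f * np.exp(-(x - mu1)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
        p2 = (1 - f) * np.exp(-(x - mu2)**2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
        r1 = p1 / (p1 + p2)
        r2 = 1.0 - r1
        # M-Step: closed-form updates (Equations 19-23).
        f = r1.mean()                        # sum_i r1 divided by n
        mu1 = (r1 * x).sum() / r1.sum()
        mu2 = (r2 * x).sum() / r2.sum()
        s1 = np.sqrt((r1 * (x - mu1)**2).sum() / r1.sum())
        s2 = np.sqrt((r2 * (x - mu2)**2).sum() / r2.sum())
    return f, mu1, mu2, s1, s2
```

Run on data drawn from a known two-component mixture, the loop recovers the generating parameters up to sampling error, though as noted in Section 1.4 it is only guaranteed to find a local maximum of the likelihood.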
3 Factor Analysis
In factor analysis, a d-dimensional multivariate dataset {x_1 … x_n} is modelled as a weighted
linear sum of k factors f_1 … f_k. The weights are the hidden variables and differ for each data
point. The remaining variation is described as Gaussian noise. This model is closely related to
principal components analysis.
3.1 Generative Model
The generative model is given by:

x_i = F h_i + ε_i    (26)

where F is a d × k matrix containing the factors f_1 … f_k in its columns. The hidden variable h_i is
a k × 1 vector. The term ε_i is a d × 1 Gaussian noise term with mean 0 and diagonal covariance
Σ = Diag[σ_1² … σ_d²].
The term F h_i is the deterministic part of the model. The term ε_i is the stochastic part. There
is a hidden variable h_i associated with each observed data point x_i. The unknown parameters,
which relate to all of the data, are θ = {F, Σ}.
3.2 Generating Samples
In order to generate samples from the model, we also need to know the prior on the hidden
variable, Pr(h). We will define this to be Gaussian with mean 0 and identity covariance, so that:

Pr(h) = G_h[0, I]    (27)
To generate an example x_i from the model, we first generate a random example of the hidden
variable h_i and calculate the deterministic term F h_i. Then we generate a stochastic term ε_i. We
sum these to get the final simulated sample x_i.
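This sampling recipe can be sketched as follows; the function name and the vectorised layout (one sample per row) are our own conventions:

```python
import numpy as np

def sample_factor_analysis(n, F, sigma_sq, rng=None):
    """Draw n samples x_i = F h_i + eps_i (Equation 26), with h_i ~ G[0, I]
    and eps_i ~ G[0, Diag(sigma_sq)]. Returns an n x d array, one sample per row."""
    rng = np.random.default_rng() if rng is None else rng
    d, k = F.shape
    h = rng.standard_normal((n, k))                        # hidden variables from the prior
    eps = rng.standard_normal((n, d)) * np.sqrt(sigma_sq)  # diagonal Gaussian noise
    return h @ F.T + eps
```

A useful sanity check is that the sample covariance of the output approaches F F^T + Σ, the marginal covariance implied by the model.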
3.3 Joint and Conditional Distributions
From the generative model (Equation 26) we can see that the conditional probability distribution
can be calculated as:

Pr(x_i|h_i) = G_{x_i}[F h_i, Σ]    (28)
The joint probability density can be calculated using

Pr(x_i, h_i|θ) = Pr(x_i|h_i, θ) Pr(h_i|θ)
             = Pr(x_i|h_i, θ) Pr(h_i)    (29)

Substituting in, we get:

Pr(x_i, h_i|θ) = Pr(x_i|h_i, θ) Pr(h_i)
             = G_{x_i}[F h_i, Σ] · G_{h_i}[0, I]    (30)
We can express the first term as a Gaussian in h and then combine the two terms (see Appendices
A and B for details).

Pr(x_i, h_i|θ) = G_{x_i}[F h_i, Σ] · G_{h_i}[0, I]
             = κ_1(x_i) G_{h_i}[(F^T Σ^{-1} F)^{-1} F^T Σ^{-1} x_i, (F^T Σ^{-1} F)^{-1}] · G_{h_i}[0, I]
             = κ_2(x_i) G_{h_i}[(F^T Σ^{-1} F + I)^{-1} F^T Σ^{-1} x_i, (F^T Σ^{-1} F + I)^{-1}]    (31)
where κ_1 and κ_2 are constant factors. Using the matrix inversion lemma and some algebraic
manipulation (see Appendices C and D), we can re-arrange this expression to get:

Pr(x_i, h_i|θ) = κ_2(x_i) G_{h_i}[(F^T Σ^{-1} F + I)^{-1} F^T Σ^{-1} x_i, (F^T Σ^{-1} F + I)^{-1}]
             = κ_2(x_i) G_{h_i}[F^T (F F^T + Σ)^{-1} x_i, I − F^T (F F^T + Σ)^{-1} F]    (32)
3.4 Inference: the E-M Algorithm
Given a set of data X = {x_1 … x_n}, infer the parameters θ = {F, Σ}. We use the EM algorithm:
1. E-Step - Optimize B(q(h), θ) with respect to a distribution over the hidden variables, q(h).
   From before:

   q^[t](h) = Pr(h|x, θ^[t-1])    (33)
Using Bayes' rule:

Pr(h_i|x_i, θ^[t-1]) = Pr(x_i|h_i, θ^[t-1]) Pr(h_i) / ∫ Pr(x_i|h_i, θ^[t-1]) Pr(h_i) dh_i
                    = G_{h_i}[F^T (F F^T + Σ)^{-1} x_i, I − F^T (F F^T + Σ)^{-1} F]    (34)
The constant κ_2 disappears as it exactly cancels with the denominator. This follows from
the fact that the left-hand side must be a proper probability distribution; the purpose of the
denominator is exactly to make this the case.
2. M-Step - Optimize the lower bound with respect to the parameters θ. From before:

   θ^[t] = argmax_θ ∫ Pr(h|x, θ^[t-1]) log Pr(x, h|θ) dh    (35)
        = argmax_θ E_h[ log Pr(x, h|θ) ]    (36)

   We first consider the expression log Pr(x, h|θ); we will re-introduce the expectation later.
   Substituting in from Equations 27 and 28, we get:
log Pr(x, h|θ) = log Pr(x|h, θ) + log Pr(h)
             = 0.5 log det(Σ^{-1}) − 0.5 (x_i − F h)^T Σ^{-1} (x_i − F h) − 0.5 h^T h + const    (37)
Taking the derivatives with respect to F and Σ and equating the results to zero:

∂/∂F = ∑_{i=1}^n Σ^{-1} [ x_i h_i^T − F h_i h_i^T ] = 0    (38)
from which we get:

F^[t] = [ ∑_{i=1}^n x_i h_i^T ] [ ∑_{i=1}^n h_i h_i^T ]^{-1}    (39)
Taking derivatives with respect to Σ^{-1}:

∂/∂Σ^{-1} = (n/2) Σ − (1/2) ∑_{i=1}^n [ x_i x_i^T − x_i h_i^T F^T − F h_i x_i^T + F h_i h_i^T F^T ] = 0    (40)
Re-arranging, we see that:

Σ^[t] = (1/n) Diag[ ∑_{i=1}^n ( x_i x_i^T − F h_i x_i^T ) ]    (41)
Unfortunately, we don't know the value of the terms h_i. Referring back to Equation 35, we
see that the solution is to multiply by Pr(h_i|x_i, θ^[t-1]) and integrate. In other words, we
take the expected value of the terms involving h_i. Considering Equations 39 and 41, we
need the terms E[h_i] and E[h_i h_i^T]. These are simply the mean and second moment about
zero of the distribution Pr(h_i|x_i, θ). From Equation 44, we see:
E[h_i] = F^T (F F^T + Σ)^{-1} x_i    (42)

E[h_i h_i^T] = I − F^T (F F^T + Σ)^{-1} F + E[h_i] E[h_i]^T    (43)
where the last term in the second equation adds in the missing component due to the
non-zero mean. These expectations are usually taken during the M-Step.
Notice that this is slightly different from the previous example, in which the hidden variable
h_i was discrete. We must now integrate over the continuous variable, rather than sum, and
consequently we must find closed-form solutions for the expected value of the terms involving h.
3.5 Summary of the E-M Algorithm for Factor Analysis
Alternate between the E-Steps and the M-Steps:
E-Step: For every i:

Pr(h_i|x_i, θ^[t-1]) = G_{h_i}[F^T (F F^T + Σ)^{-1} x_i, I − F^T (F F^T + Σ)^{-1} F]    (44)

which implies:

E[h_i] = F^T (F F^T + Σ)^{-1} x_i    (45)

E[h_i h_i^T] = I − F^T (F F^T + Σ)^{-1} F + E[h_i] E[h_i]^T    (46)

where the iteration superscripts have been omitted for clarity, but needless to say F ≡ F^[t-1]
and Σ ≡ Σ^[t-1].
M-Step:

F^[t] = [ ∑_{i=1}^n x_i E[h_i]^T ] [ ∑_{i=1}^n E[h_i h_i^T] ]^{-1}    (47)

Σ^[t] = (1/n) Diag[ ∑_{i=1}^n ( x_i x_i^T − F E[h_i] x_i^T ) ]    (48)
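The E-Step and M-Step above translate directly into NumPy. This is a minimal sketch under assumed choices that are not part of the notes (random initialisation of F, isotropic initialisation of Σ, and a fixed iteration count); for efficiency, the posterior moments for all data points are computed with a single matrix inverse per iteration:

```python
import numpy as np

def em_factor_analysis(X, k, n_iter=100):
    """Fit theta = {F, Sigma} by EM (Equations 44-48). X is n x d, one point per
    row; returns F (d x k) and the diagonal noise variances sigma_sq (length d)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    F = rng.standard_normal((d, k))              # arbitrary initialisation
    sigma_sq = np.full(d, X.var(axis=0).mean())
    for _ in range(n_iter):
        # E-Step: posterior moments E[h_i] and E[h_i h_i^T] (Equations 45 and 46).
        W = F.T @ np.linalg.inv(F @ F.T + np.diag(sigma_sq))  # F^T (F F^T + Sigma)^-1
        Eh = X @ W.T                                 # row i is E[h_i]
        S = n * (np.eye(k) - W @ F) + Eh.T @ Eh      # sum_i E[h_i h_i^T]
        # M-Step: closed-form updates (Equations 47 and 48).
        F = (X.T @ Eh) @ np.linalg.inv(S)
        sigma_sq = np.diag(X.T @ X - F @ (Eh.T @ X)) / n
    return F, sigma_sq
```

Note that F itself is only identifiable up to a rotation of the factors, so a fit is best judged by how well F F^T + Σ matches the data covariance rather than by comparing F entry by entry.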
4 Why does the EM Algorithm work?
4.1 Summary of the E-M Algorithm
Alternate between the E-Steps and the M-Steps:
E-Step:

q^[t](h) = Pr(h|x, θ^[t-1])    (49)

M-Step:

θ^[t] = argmax_θ ∫ q^[t](h) log Pr(x, h|θ) dh    (50)
4.2 Sketch Proof of the EM Algorithm
1. We assert that the function

   B(q(h), θ^[t]) = ∫ q(h) log [ Pr(x, h|θ^[t]) / q(h) ] dh    (51)

   is a lower bound on the log likelihood, so that

   log Pr(x|θ^[t]) ≥ ∫ q(h) log [ Pr(x, h|θ^[t]) / q(h) ] dh    (52)

   In the above equation, q(h) is a probability distribution over h. We prove that this is a
   bound in Section 4.3.
2. We assert that equality holds if and only if

   q(h) = Pr(h|x, θ^[t])    (53)

   so that

   log Pr(x|θ^[t]) = ∫ Pr(h|x, θ^[t]) log [ Pr(x, h|θ^[t]) / Pr(h|x, θ^[t]) ] dh
                  = ∫ Pr(h|x, θ^[t]) log Pr(x, h|θ^[t]) dh − ∫ Pr(h|x, θ^[t]) log Pr(h|x, θ^[t]) dh    (54)

   We prove this in Section 4.4. Note that calculating Pr(h|x, θ^[t]) is the E-Step.
3. We now maximize the bound with respect to the parameters θ, which necessarily gives:

   log Pr(x|θ^[t]) ≤ ∫ Pr(h|x, θ^[t]) log Pr(x, h|θ^[t+1]) dh − ∫ Pr(h|x, θ^[t]) log Pr(h|x, θ^[t]) dh    (55)

   where we treat the distribution q(h) as fixed so that the second term does not affect the
   re-estimation of θ. Note that maximizing the first term is the M-Step.
4. Recombining the terms, we see that:

   log Pr(x|θ^[t]) ≤ ∫ Pr(h|x, θ^[t]) log [ Pr(x, h|θ^[t+1]) / Pr(h|x, θ^[t]) ] dh    (56)
5. The R.H.S. is now of the same form as the bound in Equation 51, but the distribution q(h)
   is not the correct one to force equality, so

   log Pr(x|θ^[t]) ≤ ∫ Pr(h|x, θ^[t]) log [ Pr(x, h|θ^[t+1]) / Pr(h|x, θ^[t]) ] dh ≤ log Pr(x|θ^[t+1])    (57)

   as required.
4.3 Proof of the Bound
In this section we prove that B(q(h), θ^[t]) is a lower bound on the log likelihood at time t.

log Pr(x|θ^[t]) = log ∫ Pr(x, h|θ^[t]) dh
              = log ∫ q(h) [ Pr(x, h|θ^[t]) / q(h) ] dh    (58)

By Jensen's inequality, if q(h) is a probability distribution:

log ∫ q(h) [ Pr(x, h|θ^[t]) / q(h) ] dh ≥ ∫ q(h) log [ Pr(x, h|θ^[t]) / q(h) ] dh    (59)

In words, Jensen's inequality states that for a concave function like the logarithm, the function
applied to a weighted sum of points is always greater than or equal to the weighted sum of the
function applied to the points individually, i.e. the R.H.S. is a lower bound on the log likelihood.
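A quick numeric illustration of Jensen's inequality for the logarithm, with q(h) played by a discrete weight vector; the specific numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(5)
w /= w.sum()                   # q(h): a discrete probability distribution
y = rng.random(5) * 10.0       # positive values playing the role of Pr(x,h)/q(h)
lhs = np.log(np.sum(w * y))    # log of the weighted sum
rhs = np.sum(w * np.log(y))    # weighted sum of the logs
```

For any such w and positive y, lhs is greater than or equal to rhs, with equality only when the y values are all identical.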
4.4 Maximizing the Bound
We need to show that

log Pr(x|θ^[t]) = ∫ q(h) log [ Pr(x, h|θ^[t]) / q(h) ] dh    (60)

if and only if

q(h) = Pr(h|x, θ^[t])    (61)

Substituting into the R.H.S. of Equation 60:
∫ q(h) log [ Pr(x, h|θ^[t]) / q(h) ] dh = ∫ q(h) log [ Pr(x|θ^[t]) Pr(h|x, θ^[t]) / q(h) ] dh
    = ∫ q(h) log Pr(x|θ^[t]) dh − ∫ q(h) log [ q(h) / Pr(h|x, θ^[t]) ] dh
    = log Pr(x|θ^[t]) ∫ q(h) dh − ∫ q(h) log [ q(h) / Pr(h|x, θ^[t]) ] dh
    = log Pr(x|θ^[t]) − ∫ q(h) log [ q(h) / Pr(h|x, θ^[t]) ] dh    (62)
It can be seen that the second term on the R.H.S. disappears if

q(h) = Pr(h|x, θ^[t])    (63)

as each log term becomes log[1] = 0.
In fact, the second term on the R.H.S. is the Kullback-Leibler divergence between q(h) and
Pr(h|x, θ^[t]). This is a measure of the difference between probability distributions: it is always
non-negative, and is zero exactly when the two distributions are identical. Further information
can be found in any textbook on probability theory.
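A small numeric illustration of these properties for discrete distributions; the helper function and the example distributions are our own:

```python
import numpy as np

def kl_divergence(q, p):
    """Discrete Kullback-Leibler divergence KL(q || p) = sum_i q_i log(q_i / p_i)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0               # terms with q_i = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
```

kl_divergence(q, p) is positive for these distinct distributions and exactly zero for identical ones. It is also asymmetric in its arguments, which is why it is a divergence rather than a true distance.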
APPENDICES
A Gaussian Relation #1: Product
The product of two Gaussians is also a Gaussian. To see this, consider that the exponent of
each Gaussian is a quadratic in x. When we multiply the two Gaussians, we add the exponents.
This produces the sum of two quadratics, which is itself another quadratic. In fact, the following
relation holds:

G_x(a, A) · G_x(b, B) ∝ G_x[ (A^{-1} + B^{-1})^{-1} (A^{-1} a + B^{-1} b), (A^{-1} + B^{-1})^{-1} ]    (64)
Proof:

G_x(a, A) · G_x(b, B) = (1/((2π)^n |A|^{1/2} |B|^{1/2})) exp[ −0.5( (x − a)^T A^{-1} (x − a) + (x − b)^T B^{-1} (x − b) ) ]
    = κ exp[ −0.5( x^T (A^{-1} + B^{-1}) x − x^T (A^{-1} a + B^{-1} b) − (a^T A^{-1} + b^T B^{-1}) x ) ]    (65)
where the exponential terms that do not depend on x have been subsumed into the constant
κ. It is clear from the quadratic term that this can be re-arranged to form a Gaussian with
covariance (A^{-1} + B^{-1})^{-1}. We can complete the square:
G_x(a, A) · G_x(b, B)
 = κ_2 exp[ −0.5( x^T (A^{-1} + B^{-1}) x − x^T (A^{-1} a + B^{-1} b) − (a^T A^{-1} + b^T B^{-1}) x
            + (A^{-1} a + B^{-1} b)^T (A^{-1} + B^{-1})^{-1} (A^{-1} a + B^{-1} b) ) ]
 = κ_2 exp[ −0.5 ( x − (A^{-1} + B^{-1})^{-1} (A^{-1} a + B^{-1} b) )^T (A^{-1} + B^{-1}) ( x − (A^{-1} + B^{-1})^{-1} (A^{-1} a + B^{-1} b) ) ]
 = κ_2 G_x[ (A^{-1} + B^{-1})^{-1} (A^{-1} a + B^{-1} b), (A^{-1} + B^{-1})^{-1} ]    (66)
as required.
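Since Relation 64 only claims proportionality, a quick numeric check is that the ratio of the two sides is constant in x. A univariate sketch, with illustrative values for a, A, b and B:

```python
import numpy as np

def gauss(x, m, v):
    """Univariate Gaussian density G_x(m, v), where v is the variance."""
    return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2.0 * np.pi * v)

a, A, b, B = 1.0, 2.0, -0.5, 0.5          # illustrative means and variances
V = 1.0 / (1.0 / A + 1.0 / B)             # (A^-1 + B^-1)^-1
m = V * (a / A + b / B)                   # (A^-1 + B^-1)^-1 (A^-1 a + B^-1 b)
xs = np.linspace(-3.0, 3.0, 7)
ratio = gauss(xs, a, A) * gauss(xs, b, B) / gauss(xs, m, V)  # constant in x
```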
B Gaussian Relation #2
We will also need a second Gaussian relation. Consider a Gaussian in x with a mean that is a
linear function H of y. It can be shown that this Gaussian can be re-expressed in terms of y:

G_x[H y, Σ] ∝ G_y[ (H^T Σ^{-1} H)^{-1} H^T Σ^{-1} x, (H^T Σ^{-1} H)^{-1} ]    (67)
Proof:

G_x[H y, Σ] = κ exp[ −0.5( (x − H y)^T Σ^{-1} (x − H y) ) ]
           = κ exp[ −0.5( y^T H^T Σ^{-1} H y − y^T H^T Σ^{-1} x − x^T Σ^{-1} H y + x^T Σ^{-1} x ) ]    (68)
This exponent is a quadratic function in y with covariance (H^T Σ^{-1} H)^{-1}. Completing the square
and absorbing all terms that do not depend on y into the constant:
G_x[H y, Σ] = κ exp[ −0.5( y^T H^T Σ^{-1} H y − y^T H^T Σ^{-1} x − x^T Σ^{-1} H y + x^T Σ^{-1} x ) ]
 = κ_2 exp[ −0.5( y^T H^T Σ^{-1} H y − y^T H^T Σ^{-1} x − x^T Σ^{-1} H y + x^T Σ^{-1} H (H^T Σ^{-1} H)^{-1} H^T Σ^{-1} x ) ]
 = κ_2 exp[ −0.5 ( y − (H^T Σ^{-1} H)^{-1} H^T Σ^{-1} x )^T (H^T Σ^{-1} H) ( y − (H^T Σ^{-1} H)^{-1} H^T Σ^{-1} x ) ]
 = κ_2 G_y[ (H^T Σ^{-1} H)^{-1} H^T Σ^{-1} x, (H^T Σ^{-1} H)^{-1} ]    (69)
as required.
C Matrix Lemma #1
Consider the d × d matrix P, the k × k matrix R, and the k × d matrix H, where P and R are
symmetric, positive definite covariance matrices. The following equality holds:

(P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1} = P H^T (H P H^T + R)^{-1}    (70)
Proof: Note that trivially

H^T R^{-1} H P H^T + H^T = H^T + H^T R^{-1} H P H^T

Factoring each side gives

H^T R^{-1} (H P H^T + R) = (P^{-1} + H^T R^{-1} H) P H^T    (71)
Pre-multiplying both sides by (P^{-1} + H^T R^{-1} H)^{-1} and post-multiplying by (H P H^T + R)^{-1}:

(P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1} = P H^T (H P H^T + R)^{-1}    (72)
as required.
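The identity in Equation 72 can also be checked numerically with random symmetric positive definite P and R; the dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2
H = rng.standard_normal((k, d))   # k x d, as in the lemma statement
Mp = rng.standard_normal((d, d))
P = np.eye(d) + Mp @ Mp.T         # random symmetric positive definite, d x d
Mr = rng.standard_normal((k, k))
R = np.eye(k) + Mr @ Mr.T         # random symmetric positive definite, k x k
inv = np.linalg.inv
lhs = inv(inv(P) + H.T @ inv(R) @ H) @ H.T @ inv(R)
rhs = P @ H.T @ inv(H @ P @ H.T + R)   # equal to lhs by the lemma
```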
D Matrix Relation #2: Inversion Lemma
Consider the d × d matrix P, the k × k matrix R, and the k × d matrix H, where P and R are
symmetric, positive definite covariance matrices. The following equality holds:

(P^{-1} + H^T R^{-1} H)^{-1} = P − P H^T (H P H^T + R)^{-1} H P    (73)

This is known as the matrix inversion lemma.
Proof:

(P^{-1} + H^T R^{-1} H)^{-1}
 = (P^{-1} + H^T R^{-1} H)^{-1} (I + H^T R^{-1} H P − H^T R^{-1} H P)
 = (P^{-1} + H^T R^{-1} H)^{-1} [ (P^{-1} + H^T R^{-1} H) P − H^T R^{-1} H P ]
 = P − (P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1} H P    (74)
Now, applying Matrix Lemma #1 to the term (P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1}:

(P^{-1} + H^T R^{-1} H)^{-1} = P − (P^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1} H P
                            = P − P H^T (H P H^T + R)^{-1} H P    (75)
as required.
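As with the previous lemma, Equation 75 can be checked numerically on random symmetric positive definite matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 2
H = rng.standard_normal((k, d))   # k x d, as in the lemma statement
Mp = rng.standard_normal((d, d))
P = np.eye(d) + Mp @ Mp.T         # random symmetric positive definite, d x d
Mr = rng.standard_normal((k, k))
R = np.eye(k) + Mr @ Mr.T         # random symmetric positive definite, k x k
inv = np.linalg.inv
lhs = inv(inv(P) + H.T @ inv(R) @ H)
rhs = P - P @ H.T @ inv(H @ P @ H.T + R) @ H @ P   # equal to lhs by the lemma
```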
