Speech Recognition
Hung-Shin Lee
Institute of Information Science, Academia Sinica
Dept. of Electrical Engineering, National Taiwan University
References -2
9) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37–46, 2004.
10) N. Kumar and R. Gopinath, "Multiple linear transform," in Proc. ICASSP, 2001.
11) N. Kumar and A. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, no. 4, pp. 283–297, 1998.
12) A. Ljolje, "The importance of cepstral parameter correlations in speech recognition," Computer Speech and Language, vol. 8, pp. 223–232, 1994.
13) B. Flury, "Common principal components in k groups," Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, "An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.
Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition
• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
Introduction to Feature Decorrelation -1
• Definition of Covariance Matrix:
– Random vector: X = [x_1, ..., x_n]^T, where each x_i is a random variable
– Covariance matrix: Σ = cov(X) = E[(X − E[X])(X − E[X])^T], with entries Σ_ij = cov(x_i, x_j)
Introduction to Feature Decorrelation -2
• Feature Decorrelation (FD)
– To find transformations Θ that make all variables or parameters (nearly) uncorrelated:

cov(X̃_i, X̃_j) = 0 for all transformed variables X̃_i, X̃_j, i ≠ j
Θ^T Σ Θ = D

where Σ is the covariance matrix (it can be global or depend on each class) and D is a diagonal matrix
Introduction to Feature Decorrelation -3
• Why Feature Decorrelation (FD)?
– In many speech recognition systems, the observation density function for each HMM state is modeled as a mixture of diagonal-covariance Gaussians
– For the sake of computational simplicity, the off-diagonal elements of the covariance matrices of the Gaussians are assumed to be close to zero

f(x) = Σ_i π_i N(x; μ_i, Σ_i)
     = Σ_i π_i (2π)^(-d/2) |Σ_i|^(-1/2) exp(−(1/2)(x − μ_i)^T Σ_i^(-1) (x − μ_i))

where f is the observation density function and each Σ_i is (assumed to be) a diagonal matrix
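As a sketch of why diagonal covariances simplify the computation, the mixture density above can be evaluated with no matrix inversion at all; every weight, mean, and variance below is an illustrative stand-in, not data from the slides:

```python
import numpy as np

def gmm_density_diag(x, weights, means, variances):
    """Mixture of diagonal-covariance Gaussians: f(x) = sum_i pi_i N(x; mu_i, Sigma_i).

    With Sigma_i diagonal, |Sigma_i| is just the product of the variances and the
    quadratic form reduces to a per-dimension sum, so no matrix inversion is needed.
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    total = 0.0
    for pi_i, mu_i, var_i in zip(weights, means, variances):
        norm = (2 * np.pi) ** (-d / 2) * np.prod(var_i) ** -0.5
        quad = np.sum((x - mu_i) ** 2 / var_i)   # (x - mu)^T Sigma^{-1} (x - mu)
        total += pi_i * norm * np.exp(-0.5 * quad)
    return total

# Two-component mixture in 2-D (illustrative values)
weights = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
variances = [np.array([1.0, 1.0]), np.array([0.5, 2.0])]
val = gmm_density_diag(np.array([1.0, 0.5]), weights, means, variances)
```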
FD on Speech Recognition -1
• Approaches for FD can be divided into two categories:
Feature-space and Model-space
• Feature-space Schemes
– Hard to find a single transform which decorrelates all elements
of the feature vector for all states
• Model-space Schemes
– A different transform is selected depending on which
component the observation was hypothesized to be generated
from
– In the limit, a transform may be used for each component, which is equivalent to a full-covariance-matrix system
FD on Speech Recognition -2
• Feature-space Schemes on LVCSR
[Block diagram: feature-space decorrelation (DCT, PCA, LDA, MLLT, ...) is applied to the features before AM training; the AM, LM, and lexicon then drive decoding]
FD on Speech Recognition -3
• Model-space Schemes on LVCSR
[Block diagram: speech signal → front-end preprocessing → training/test data; model-space decorrelation is applied after AM training; the AM, LM, and lexicon drive decoding of the test speech into textual results]
FD on Speech Recognition -4
• Feature-Space Schemes
– Without label information: Discrete Cosine Transform (DCT), Principal Component Analysis (PCA)
– With label information: Linear Discriminant Analysis (LDA), Common Principal Components (CPC)
• Model-Space Schemes
– Maximum Likelihood Linear Transform (MLLT), Extended Maximum Likelihood Linear Transform (EMLLT), Semi-Tied Covariance Matrices, Multiple Linear Transforms (MLT)
Discrete Cosine Transform -1
• Discrete Cosine Transform (DCT)
– Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
– Applied to the log-energies of the output filters (Mel-scaled filterbank) during the MFCC parameterization:

c_j = sqrt(2/n) Σ_{i=1}^{n} x_i cos(π j (i − 0.5) / n), for j = 0, 1, ..., m < n

where x_i is the i-th coordinate of the input vector x and c_j is the j-th coordinate of the output vector c
– Partial decorrelation
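A minimal sketch of this DCT; the function name and the filterbank energies below are illustrative, not the MATBN data:

```python
import numpy as np

def dct_coeffs(x, m):
    """c_j = sqrt(2/n) * sum_{i=1..n} x_i * cos(pi*j*(i-0.5)/n), for j = 0..m-1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    i = np.arange(1, n + 1)
    return np.array([np.sqrt(2.0 / n) * np.sum(x * np.cos(np.pi * j * (i - 0.5) / n))
                     for j in range(m)])

# 18 illustrative filterbank output energies -> 13 cepstral coefficients (m < n)
log_energies = np.log([12.0, 10.5, 9.8, 9.0, 8.1, 7.9, 7.2, 6.8, 6.1,
                       5.9, 5.2, 4.8, 4.1, 3.9, 3.2, 2.8, 2.1, 1.9])
cepstra = dct_coeffs(log_energies, 13)
```

A quick sanity check on the basis: for j ≥ 1 the cosine rows sum to zero, so a constant input vector produces only a c_0 term.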
Discrete Cosine Transform -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[Figure: covariance-matrix surface plots; left: 18 Mel-scaled filterbank log-energies, right: 13 Mel-cepstral coefficients (using DCT)]
Principal Component Analysis -1
• Principal Component Analysis (PCA)
– Based on the calculation of the major directions of variation of a set of data points in a high-dimensional space
– Extracts the directions of the greatest variance, assuming that the less variation the data has, the less information it carries about the features
– Principal components V = [v_1, ..., v_p]: the largest eigenvectors of the total covariance matrix Σ

Σ v_i = λ_i v_i
V^T Σ V = D (a diagonal matrix)
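The diagonalization above can be checked numerically; the correlated data here is synthetic, standing in for the real features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated data standing in for the feature vectors
A = rng.normal(size=(5, 5))
X = rng.normal(size=(1000, 5)) @ A.T

Sigma = np.cov(X, rowvar=False)            # total covariance matrix
eigvals, V = np.linalg.eigh(Sigma)         # Sigma v_i = lambda_i v_i
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, V = eigvals[order], V[:, order]

D = V.T @ Sigma @ V                        # V^T Sigma V = D (diagonal)
off_diag = D - np.diag(np.diag(D))
```

Unlike LDA below, the PCA basis V is orthogonal, so V^T V = I holds as well.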
Principal Component Analysis -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[Figure: covariance-matrix surface plots before and after the PCA projection]
Linear Discriminant Analysis -1
• Linear Discriminant Analysis (LDA)
– Seeks a linear transformation matrix Θ that satisfies

Θ = arg max_Θ trace((Θ^T S_W Θ)^(-1) (Θ^T S_B Θ))

– Θ is formed by the largest eigenvectors of S_W^(-1) S_B
– The LDA subspace is not orthogonal (but the PCA subspace is): Θ^T Θ ≠ I
– The LDA subspace makes the transformed variables statistically uncorrelated
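A small sketch of the LDA estimate via the eigenvectors of S_W^(-1) S_B, assuming two synthetic classes (the class means and the scatter definitions below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical classes with unit within-class covariance
X1 = rng.normal(size=(500, 3))
X2 = rng.normal(size=(500, 3)) + np.array([2.0, 1.0, 0.0])
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
m = (m1 + m2) / 2

Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)    # within-class scatter
Sb = np.outer(m1 - m, m1 - m) + np.outer(m2 - m, m2 - m)    # between-class scatter

# Theta is formed by the largest eigenvectors of Sw^{-1} Sb
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(eigvals.real)[::-1]
Theta = eigvecs.real[:, order]
```

Because Sw^(-1) Sb is not symmetric, the resulting basis Θ is generally not orthogonal, matching the Θ^T Θ ≠ I remark above.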
Linear Discriminant Analysis -2
• From any two distinct eigenvalue/eigenvector pairs (λ_i, θ_i) and (λ_j, θ_j):

S_B θ_i = λ_i S_W θ_i
S_B θ_j = λ_j S_W θ_j

– Pre-multiplying by θ_j^T and θ_i^T, respectively:

θ_j^T S_B θ_i = λ_i θ_j^T S_W θ_i
θ_i^T S_B θ_j = λ_j θ_i^T S_W θ_j

– Since θ_j^T S_B θ_i = θ_i^T S_B θ_j (by symmetry of S_B), it follows that λ_i θ_j^T S_W θ_i = λ_j θ_i^T S_W θ_j
Linear Discriminant Analysis -4
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[Figure: covariance-matrix surface plots; left: 162-dimensional spliced Mel-filterbank features, right: 39-dimensional LDA-transformed subspace]
Maximum Likelihood Linear Transform
• Maximum Likelihood Linear Transform (MLLT) comes in two types:
– Feature-based: (Gopinath, 1998)
– Model-based: (Olsen, 2002), (Olsen, 2004)
Feature-based MLLT -1
• Feature-based Maximum Likelihood Linear Transform (F-MLLT)
– Tries to alleviate four problems:
(a) Data insufficiency, implying unreliable models
(b) Large storage requirement
(c) Large computational requirement
(d) ML is not discriminating between classes
– Solutions
(a)-(c): Sharing parameters across classes
(d): Appealing to LDA
Feature-based MLLT -2
• Maximum Likelihood Modeling
– The likelihood of the training data {(x_i, l_i)} is given by

p(x_1^N, {μ_li}, {Σ_li}) = Π_{i=1}^{N} (2π)^(-d/2) |Σ_li|^(-1/2) exp(−(1/2)(x_i − μ_li)^T Σ_li^(-1) (x_i − μ_li))    (Appendix A)

log p(x_1^N, {μ_j}, {Σ_j}) = −(Nd/2) log(2π) − Σ_{j=1}^{C} (n_j/2) [ (m_j − μ_j)^T Σ_j^(-1) (m_j − μ_j) + trace(Σ_j^(-1) S_j) + log|Σ_j| ]
Feature-based MLLT -3
• The idea of maximum likelihood estimation (MLE) is to choose the parameters {μ̂_j} and {Σ̂_j} so as to maximize log p(x_1^N, {μ_j}, {Σ_j})
– ML Estimators: {μ̂_j} and {Σ̂_j}
Feature-based MLLT -4
• Multiclass ML Modeling
– The training data is modeled with Gaussians (μ̂_j, Σ̂_j)
– The ML estimators: μ̂_j = m_j, Σ̂_j = S_j    (Appendix B)
– The log-ML value:

log p*(x_1^N) = log p(x_1^N, {μ̂_j}, {Σ̂_j})
             = −(Nd/2)(log(2π) + 1) − Σ_{j=1}^{C} (n_j/2) log|S_j| = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|S_j|
Feature-based MLLT -5
• Constrained ML – Diagonal Covariance
– The ML estimators: μ̂_j = m_j, Σ̂_j = diag(S_j)
– The log-ML value:

log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(S_j)|

– How to choose Θ for each class? (the Jacobian enters here; see Appendix C)
Feature-based MLLT -6
• Multiclass ML Modeling – Some Issues
– If the sample size for each class is not large enough, then the ML parameter estimates may have large variance and hence be unreliable
– The storage requirement for the model: O(Cd^2)
– The computational requirement: O(Cd^2) (Why?)
Feature-based MLLT -7
• Multiclass ML Modeling – Some Issues (cont.)
– Parameter sharing across classes reduces the number of parameters, the storage requirements, and the computational requirements
– It is hard to justify that parameter sharing is more discriminating
Feature-based MLLT -8
• Multiclass ML Modeling – Some Issues (cont.)
– We can globally transform the data with a unimodular matrix Θ (det(Θ) = 1) and model the transformed data with diagonal Gaussians (there is a loss in likelihood, too)
Feature-based MLLT -9
• Another constrained MLE with sharing of parameters
– Equal Covariance
– Clustering
– The ML value:

log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(Θ_Cj S_j Θ_Cj^T)| + Σ_{j=1}^{C} n_j log|Θ_Cj|
Feature-based MLLT -10
• One Cluster with Diagonal Covariance
– When the number of clusters is one, there is a single global transformation Θ, and the classes are modeled as diagonal Gaussians in this feature space
– The ML estimators: μ̂_j = m_j, Σ̂_j = Θ^(-T) diag(Θ^T S_j Θ) Θ^(-1)
– The log-ML value:

log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| + N log|Θ|
Feature-based MLLT -11
• Optimization – the numerical approach:
– The objective function:

F(Θ) = Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| − N log|Θ|

– Its gradient with respect to Θ:

G(Θ) = Σ_{j=1}^{C} n_j S_j Θ diag(Θ^T S_j Θ)^(-1) − N Θ^(-T)
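A minimal numerical sketch of this optimization, using plain fixed-step gradient descent on synthetic class statistics; a real system would likely use a more careful optimizer, and all counts and covariances below are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_spd(d):
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

d, C = 4, 3
S = [random_spd(d) for _ in range(C)]   # synthetic per-class sample covariances S_j
n = np.array([100.0, 150.0, 120.0])     # synthetic class counts n_j
N = n.sum()

def F(Theta):
    """F(Theta) = sum_j n_j/2 * log|diag(Theta^T S_j Theta)| - N log|det Theta|."""
    val = -N * np.log(abs(np.linalg.det(Theta)))
    for nj, Sj in zip(n, S):
        val += nj / 2 * np.sum(np.log(np.diag(Theta.T @ Sj @ Theta)))
    return val

def grad_F(Theta):
    """dF/dTheta = sum_j n_j S_j Theta diag(Theta^T S_j Theta)^{-1} - N Theta^{-T}."""
    G = -N * np.linalg.inv(Theta).T
    for nj, Sj in zip(n, S):
        G += nj * (Sj @ Theta) / np.diag(Theta.T @ Sj @ Theta)  # scales each column
    return G

Theta = np.eye(d)
history = [F(Theta)]
for _ in range(200):                    # plain fixed-step gradient descent
    Theta = Theta - 1e-4 * grad_F(Theta)
    history.append(F(Theta))
```

Minimizing F corresponds to maximizing the diagonal-constrained log-likelihood of the previous slide.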
Model-based MLLT -1
• In model-based MLLT, instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
– A non-singular linear transformation matrix Θ shared over a set of components
– The diagonal elements in the matrix Λ_j for each class j
• Log-likelihood function:

L(Θ; X) = (1/N) Σ_{i=1}^{N} log f(x_i | Θ)
Model-based MLLT -3
• The parameters of the M-MLLT model for each HMM state, {π_j, μ_j, Λ_j} for j = 1, ..., m together with the shared Θ, are estimated using a generalized expectation-maximization (EM) algorithm*
• The auxiliary "Q-function" should be introduced
– The Q-function satisfies an inequality guaranteeing that any increase in Q does not decrease the log-likelihood

* A. P. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
Model-based MLLT -4
• The auxiliary Q-function is

Q(Θ; Θ̂) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} γ_ij L_j(x_i)

where γ_ij is the posterior probability of component j given observation i, and

L_j(x_i) = log(π_j N(x_i; μ_j, Σ_j))

is the component log-likelihood of observation i
Model-based MLLT -5
• Gales' Approach for deriving Θ
– Estimate the mean and the component weight, which are independent of the other model parameters:

π_j = (1/N) Σ_{i=1}^{N} γ_ij
μ_j = Σ_{i=1}^{N} γ_ij x_i / Σ_{i=1}^{N} γ_ij
λ_ij = θ_i S_j θ_i^T

where λ_ij is the i-th entry of the diagonal variance of component j and θ_i is the i-th row vector of Θ
Semi-Tied Covariance Matrices -1
• Introduction to Semi-Tied Covariance Matrices
– A natural extension of the state-specific rotation scheme
– The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
– The optimization is performed using a simple iterative scheme, which is guaranteed to increase the likelihood of the training data
Semi-Tied Covariance Matrices -2
• State-Specific Rotation
– A full covariance matrix is calculated for each state in the system and decomposed into its eigenvectors and eigenvalues:

Σ_full^(s) = U^(s) Λ^(s) U^(s)T
Semi-Tied Covariance Matrices -4
• Semi-Tied Covariance Matrices
– Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:

Σ^(m) = H^(r) Σ_diag^(m) H^(r)T

where Σ_diag^(m) is the component-specific diagonal covariance element and H^(r) is the semi-tied, class-dependent, non-diagonal matrix, which may be tied over a set of components
Semi-Tied Covariance Matrices -5
• Semi-Tied Covariance Matrices (cont.)
– It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted:

M = {π^(m), μ^(m), Σ_diag^(m), Θ^(r)}, where Θ^(r) = H^(r)^(-1)
Semi-Tied Covariance Matrices -6
• The auxiliary Q-function:

Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) [ log(|Θ̂^(r)|^2 / |Σ̂_diag^(m)|) − (ο(τ) − μ̂^(m))^T Θ̂^(r)T Σ̂_diag^(m)-1 Θ̂^(r) (ο(τ) − μ̂^(m)) ]

where γ_m(τ) is the posterior occupancy of component m at time τ
Semi-Tied Covariance Matrices -7
• If all the model parameters are to be simultaneously optimized, then the Q-function may be rewritten as

Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) log( |Θ̂^(r)|^2 / |diag(Θ̂^(r) W^(m) Θ̂^(r)T)| ) − dβ

where

μ̂^(m) = Σ_τ γ_m(τ) ο(τ) / Σ_τ γ_m(τ)
W^(m) = Σ_τ γ_m(τ) (ο(τ) − μ̂^(m))(ο(τ) − μ̂^(m))^T / Σ_τ γ_m(τ)
β = Σ_{m∈M(r),τ} γ_m(τ)
Semi-Tied Covariance Matrices -8
• The ML estimate of the diagonal element of the covariance matrix is given by

Σ̂_diag^(m) = diag(Θ̂^(r) W^(m) Θ̂^(r)T)

• The re-estimation formulae for the component weights and transition probabilities are identical to the standard HMM cases (Rabiner, 1989)
Semi-Tied Covariance Matrices -10
• How to carry out step (3)?
– Optimizing the semi-tied transform requires an iterative estimation scheme even after fixing all other model parameters
– Selecting a particular row of Θ̂^(r), θ̂_i^(r), and rewriting the former Q-function using the current set {Σ̂_diag^(r)}, noting that |Θ̂^(r)| = θ̂_i^(r)T c_i:

Q(M, M̂; {Σ̂_diag^(r)}) = Σ_{m∈M(r),τ} γ_m(τ) [ log((θ̂_i^(r)T c_i)^2) − log|Σ̂_diag^(m)| − Σ_j (θ̂_j^(r)T (ο(τ) − μ̂^(m)))^2 / σ_diag,j^(m)2 ]

where c_i is the i-th row vector of the cofactors of Θ̂^(r) and σ_diag,j^(m)2 is the j-th leading diagonal element of Σ̂_diag^(m)
Semi-Tied Covariance Matrices -11
• How to carry out step (3)? (cont.)
– It is shown that the ML estimate for the i-th row of the semi-tied transform, θ̂_i^(r), is given by

θ̂_i^(r) = c_i G^(ri)-1 sqrt( β / (c_i G^(ri)-1 c_i^T) )

G^(ri) = Σ_{m∈M(r)} (1 / σ_diag,i^(m)2) W^(m) Σ_τ γ_m(τ)
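The row-wise update can be sketched as follows, using synthetic occupancies γ_m and scatter matrices W^(m) in place of real statistics; the check is that the transform-dependent part of the Q-function does not decrease under the iteration:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_spd(d):
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

d, M = 3, 4
W = [random_spd(d) for _ in range(M)]        # synthetic scatter matrices W^(m)
gamma = np.array([50.0, 80.0, 60.0, 40.0])   # synthetic occupancies sum_tau gamma_m(tau)
beta = gamma.sum()

def Q_tied(Theta):
    """Transform-dependent part of the Q-function:
    sum_m gamma_m [ 2 log|det Theta| - log|diag(Theta W_m Theta^T)| ]."""
    val = 0.0
    for gm, Wm in zip(gamma, W):
        val += gm * (2 * np.log(abs(np.linalg.det(Theta)))
                     - np.sum(np.log(np.diag(Theta @ Wm @ Theta.T))))
    return val

Theta = np.eye(d)
q_hist = [Q_tied(Theta)]
for _ in range(10):                          # outer iterations
    for i in range(d):                       # update each row theta_i in turn
        # cofactor row c_i = det(Theta) * (row i of Theta^{-T})
        c = np.linalg.det(Theta) * np.linalg.inv(Theta).T[i]
        # G^(ri) = sum_m gamma_m / sigma_{m,i}^2 * W^(m), sigma^2 from current Theta
        G_i = sum(gm / (Theta[i] @ Wm @ Theta[i]) * Wm for gm, Wm in zip(gamma, W))
        cG = c @ np.linalg.inv(G_i)
        Theta[i] = cG * np.sqrt(beta / (cG @ c))
    q_hist.append(Q_tied(Theta))
```

The update keeps θ_i c_i^T positive, so the determinant never collapses through zero during the iteration.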
Semi-Tied Covariance Matrices -12
• It can be shown that

Q(M, M̂) ≥ Q(M, M̂; {Σ̂_diag^(r)})
Extended MLLT -1
• The extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision matrix modeling:

P_j = Σ_j^(-1) = Θ Λ_j Θ^T = Σ_{k=1}^{D} λ_kj θ_k θ_k^T

where the θ_k form the basis
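A small sketch of the basis expansion above; the dimensions, basis, and coefficients are illustrative (the coefficients are taken positive here for simplicity, so the precision is trivially positive definite):

```python
import numpy as np

rng = np.random.default_rng(4)
d, D = 3, 5                                 # D basis vectors for a d-dim precision
Theta = rng.normal(size=(d, D))             # basis directions theta_k as columns
lam = rng.uniform(0.5, 2.0, size=D)         # per-component coefficients lambda_kj

# P_j = Theta Lambda_j Theta^T = sum_k lambda_kj theta_k theta_k^T
P_expansion = sum(lam[k] * np.outer(Theta[:, k], Theta[:, k]) for k in range(D))
P_matrix = Theta @ np.diag(lam) @ Theta.T
Sigma_j = np.linalg.inv(P_matrix)           # component covariance = inverse precision
```

With D equal to d this reduces to the M-MLLT parameterization of the previous slides.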
Common Principal Components -1
• Why Common Principal Components (CPC)?
– We often deal with the situation of the same variables being measured on objects from different groups, and the covariance structure may vary from group to group
– But sometimes the covariance matrices of different groups look somehow similar, and it seems reasonable to assume that the covariance matrices have a common basic structure
Common Principal Components -3
• Assume the n_i S_i are independently distributed as W_p(n_i, Σ_i). The common likelihood function is

L(Σ_1, ..., Σ_k) = C × Π_{i=1}^{k} exp( trace(−(n_i/2) Σ_i^(-1) S_i) ) |Σ_i|^(-n_i/2)
Common Principal Components -4
• Assume H_C holds for some orthogonal matrix β, and Λ_i = diag(λ_i1, ..., λ_ip); then

log|Σ_i| = Σ_{j=1}^{p} log λ_ij, i = 1, ..., k
trace(Σ_i^(-1) S_i) = trace(β Λ_i^(-1) β^T S_i) = trace(Λ_i^(-1) β^T S_i β) = Σ_{j=1}^{p} β_j^T S_i β_j / λ_ij

• Therefore

g(Σ_1, ..., Σ_k) = g(β_1, ..., β_p, λ_11, ..., λ_1p, λ_21, ..., λ_kp)
                = Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij )
Common Principal Components -5
• The function g is to be minimized under the restrictions β_h^T β_j = 0 if h ≠ j, and 1 if h = j. With Lagrange multipliers:

G(Σ_1, ..., Σ_k) = g(Σ_1, ..., Σ_k) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j
Common Principal Components -6
• Minimization:

∂G(Σ_1, ..., Σ_k)/∂λ_ij = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} (log λ_ij + β_j^T S_i β_j / λ_ij) ]/∂λ_ij = 0
⇒ λ_ij = β_j^T S_i β_j, i = 1, ..., k; j = 1, ..., p
Common Principal Components -7
• Minimization (cont.):

∂G(Σ_1, ..., Σ_k)/∂β_j = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} (log λ_ij + β_j^T S_i β_j / λ_ij) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j ]/∂β_j = 0

⇒ Σ_{i=1}^{k} (n_i S_i β_j / λ_ij) − γ_j β_j − Σ_{h=1, h≠j}^{p} γ_jh β_h = 0, j = 1, ..., p

– Multiplying on the left by β_j^T (and using λ_ij = β_j^T S_i β_j) gives γ_j = Σ_{i=1}^{k} n_i, j = 1, ..., p
Common Principal Components -8
• Minimization (cont.): Thus

Σ_{i=1}^{k} (n_i S_i β_j / λ_ij) − (Σ_{i=1}^{k} n_i) β_j − Σ_{h=1, h≠j}^{p} γ_jh β_h = 0, j = 1, ..., p
Common Principal Components -9
• Minimization (cont.): Premultiplying by β_l^T (l ≠ j) and subtracting the analogous equation for β_l eliminates the multipliers:

Σ_{i=1}^{k} (n_i β_l^T S_i β_j / λ_ij) − Σ_{i=1}^{k} (n_i β_l^T S_i β_j / λ_il) = 0

β_l^T [ Σ_{i=1}^{k} n_i (λ_il − λ_ij)/(λ_il λ_ij) S_i ] β_j = 0, l, j = 1, ..., p; l ≠ j
Common Principal Components -10
• Solving procedure of CPC – the FG Algorithm
– F-Algorithm
[Algorithm listing: figure not recovered]

Common Principal Components -11
– G-Algorithm
[Algorithm listing: figure not recovered]
Common Principal Components -12
• Likelihood Ratio Test
– The sample common principal components: U_i = β̂^T X_i, i = 1, ..., k

χ^2 = Σ_{i=1}^{k} n_i log( |diag(F_i)| / |F_i| ) = Σ_{i=1}^{k} n_i log( Π_{j=1}^{p} f_jj^(i) / Π_{j=1}^{p} l_ij )

where the l_ij are the eigenvalues of F_i
Common Principal Components -13
• Likelihood Ratio Test (cont.)
– The likelihood ratio criterion is a measure of the simultaneous diagonalizability of k positive definite symmetric matrices
– The CPCs can be viewed as obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
– It can also be seen from another viewpoint, Hadamard's inequality:

|F_i| ≤ |diag(F_i)|
Common Principal Components -14
• Actually, CPC can also be viewed as another measure of "deviation from diagonality":

φ(F_i) = |diag(F_i)| / |F_i| ≥ 1
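This measure is easy to verify numerically; the two matrices below are illustrative, one with off-diagonal correlation and one already diagonal:

```python
import numpy as np

def deviation_from_diagonality(F):
    """phi(F) = |diag(F)| / |F|, which is >= 1 for positive definite F by
    Hadamard's inequality, with equality iff F is diagonal."""
    return np.prod(np.diag(F)) / np.linalg.det(F)

F_corr = np.array([[2.0, 0.8, 0.3],
                   [0.8, 1.5, 0.4],
                   [0.3, 0.4, 1.0]])      # correlated, positive definite
F_diag = np.diag([2.0, 1.5, 1.0])        # already diagonal

phi_corr = deviation_from_diagonality(F_corr)   # strictly greater than 1
phi_diag = deviation_from_diagonality(F_diag)   # exactly 1
```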
Comparison Between CPC and F-MLLT
• CPC tries to minimize (where the estimates are KNOWN):

Σ_{i=1}^{C} n_i ( log|Σ_i| + trace(Σ_i^(-1) S_i) )
Appendix A -1
• Show that

Π_{i=1}^{N} (2π)^(-d/2) |Σ_li|^(-1/2) exp(−(1/2)(x_i − μ_li)^T Σ_li^(-1) (x_i − μ_li))
= (2π)^(-Nd/2) exp( −Σ_{j=1}^{C} (n_j/2) [ (m_j − μ_j)^T Σ_j^(-1) (m_j − μ_j) + trace(Σ_j^(-1) S_j) + log|Σ_j| ] )
Appendix A -2
For each class C_j:

Σ_{i=1}^{N} (x_i − μ_li)^T Σ_li^(-1) (x_i − μ_li)
= Σ_{i=1}^{N} (x_i − m_li + m_li − μ_li)^T Σ_li^(-1) (x_i − m_li + m_li − μ_li)
= Σ_{i=1}^{N} [ (x_i − m_li)^T Σ_li^(-1) (x_i − m_li) + (m_li − μ_li)^T Σ_li^(-1) (m_li − μ_li)
  + (x_i − m_li)^T Σ_li^(-1) (m_li − μ_li) + (m_li − μ_li)^T Σ_li^(-1) (x_i − m_li) ]

The cross terms vanish, since within each class C_j

Σ_{x_i∈C_j} (x_i − m_j)^T Σ_j^(-1) (m_j − μ_j) = [ Σ_{x_i∈C_j} (x_i − m_j) ]^T Σ_j^(-1) (m_j − μ_j) = 0

Hence

= Σ_j trace( Σ_j^(-1) Σ_{x_i∈C_j} (x_i − m_j)(x_i − m_j)^T ) + Σ_j n_j (m_j − μ_j)^T Σ_j^(-1) (m_j − μ_j)
= Σ_j n_j trace(Σ_j^(-1) S_j) + Σ_j n_j (m_j − μ_j)^T Σ_j^(-1) (m_j − μ_j)
= Σ_j n_j [ (m_j − μ_j)^T Σ_j^(-1) (m_j − μ_j) + trace(Σ_j^(-1) S_j) ]
Appendix B
• ML estimators for log p(x_1^N, {μ_j}, {Σ_j}):

∂ log p(x_1^N, {μ_j}, {Σ_j}) / ∂μ_j = n_j Σ_j^(-1) (m_j − μ_j) = 0
⇒ μ̂_j = m_j
Appendix C -1
• Change of Variable Theorem
– Consider a one-to-one mapping g: ℝ^n → ℝ^n, Y = g(X)
– Equal probability: the probability of falling in a region in space X should be the same as the probability of falling in the corresponding region in space Y
– Suppose the region dx_1 dx_2 ... dx_n maps to the region dA in the Y space. Equating probabilities, we have

f_Y(y_1, ..., y_n) dA = f_X(x_1, ..., x_n) dx_1 ... dx_n
Appendix C -2
• Change of Variable Theorem (cont.)
– The volume of the hyper-parallelepiped dA can be calculated by the Jacobian determinant:

dA = | det [ ∂y_i/∂x_j ] | dx_1 ... dx_n

where [ ∂y_i/∂x_j ] is the n × n matrix whose (i, j) entry is ∂y_i/∂x_j
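For a linear one-to-one map the theorem can be checked directly: with Y = AX and X standard Gaussian, equating probabilities gives f_Y(y)·|det A| = f_X(x). The matrix A and the point x below are arbitrary illustrative choices:

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = len(x)
    diff = x - mean
    return (np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
            / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov)))

# Linear one-to-one map Y = g(X) = A X; its Jacobian matrix is A, so dA = |det A| dx
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])
x = np.array([0.3, -0.7])
y = A @ x

# X ~ N(0, I)  =>  Y ~ N(0, A A^T); equating probabilities: f_Y(y) |det A| = f_X(x)
f_x = gauss_pdf(x, np.zeros(2), np.eye(2))
f_y = gauss_pdf(y, np.zeros(2), A @ A.T)
```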
Appendix D -1
• If A is a square matrix, then the minor of entry a_ij is denoted by M_ij and is defined to be the determinant of the submatrix that remains after the i-th row and the j-th column are deleted from A. The number (−1)^(i+j) M_ij is denoted by c_ij and is called the cofactor of a_ij.
• Given the matrix

B = [ b11 b12 b13
      b21 b22 b23
      b31 b32 b33 ]

for example, c_33 = (−1)^(3+3) M_33 and c_23 = (−1)^(2+3) M_23, where

M_23 = det [ b11 b12
             b31 b32 ] = b11 b32 − b12 b31
Appendix D -2
• Given the n by n matrix A = [a_ij]:
– The determinant of A can be written as the sum of its cofactors multiplied by the entries that generated them.

(cofactor expansion along the j-th column)
det(A) = a_1j c_1j + a_2j c_2j + a_3j c_3j + ... + a_nj c_nj = A^(j)T c^(j)

(cofactor expansion along the i-th row)
det(A) = a_i1 c_i1 + a_i2 c_i2 + a_i3 c_i3 + ... + a_in c_in = A_(i) c_(i)^T
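The cofactor expansion can be verified against a library determinant; the matrix B here is illustrative:

```python
import numpy as np

def minor(A, i, j):
    """Determinant of A with row i and column j deleted."""
    sub = np.delete(np.delete(A, i, axis=0), j, axis=1)
    return np.linalg.det(sub)

def cofactor(A, i, j):
    return (-1) ** (i + j) * minor(A, i, j)

def det_by_row_expansion(A, i=0):
    """det(A) = sum_j a_ij * c_ij (cofactor expansion along row i)."""
    n = A.shape[0]
    return sum(A[i, j] * cofactor(A, i, j) for j in range(n))

B = np.array([[1.0, 2.0, 3.0],
              [0.0, 4.0, 5.0],
              [1.0, 0.0, 6.0]])
```

The same expansion along any row (or, transposing, any column) yields the same value, which is what makes the row-wise semi-tied update via cofactors possible.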