
Feature Decorrelation on Speech Recognition

Hung-Shin Lee
Institute of Information Science, Academia Sinica
Dept. of Electrical Engineering, National Taiwan University

2009-10-09 @ IIS, Academia Sinica


References -1
1) J. Psutka and L. Muller, “Comparison of various feature decorrelation techniques in automatic speech recognition,” in Proc. CITSA 2006.
2) Dr. Berlin Chen’s lecture slides: http://berlin.csie.ntnu.edu.tw
3) Batlle et al., “Feature decorrelation methods in speech recognition - a comparative study,” in Proc. ICSLP 1998.
4) K. Demuynck et al., “Improved feature decorrelation for HMM-based speech recognition,” in Proc. ICSLP 1998.
5) W. Krzanowski, Principles of Multivariate Analysis - A User’s Perspective, Oxford Press, 1988.
6) R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classification,” in Proc. ICASSP 1998.
7) M. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 272–281, 1999.
8) P. Olsen and R. Gopinath, “Modeling inverse covariance matrices by basis expansion,” in Proc. ICASSP 2002.

References -2
9) P. Olsen and R. Gopinath, “Modeling inverse covariance matrices by basis expansion,” IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37–46, 2004.
10) N. Kumar and R. Gopinath, “Multiple linear transform,” in Proc. ICASSP 2001.
11) N. Kumar and A. Andreou, “Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition,” Speech Communication, vol. 26, no. 4, pp. 283–297, 1998.
12) A. Ljolje, “The importance of cepstral parameter correlations in speech recognition,” Computer Speech and Language, vol. 8, pp. 223–232, 1994.
13) B. Flury, “Common principal components in k groups,” Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, “An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form,” SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Introduction to Feature Decorrelation -1
• Definition of the covariance matrix:
– Random vector (each x_i is a random variable):
X = [x_1, ..., x_n]^T
– Mean vector and covariance matrix (E[·] denotes the expected value):
μ ≡ E[X]
Σ ≡ E[(X − μ)(X − μ)^T]

Introduction to Feature Decorrelation -2
• Feature Decorrelation (FD)
– To find transformations Θ that make all transformed variables or parameters (nearly) uncorrelated:
cov(X̃_i, X̃_j) = 0, ∀ X̃_i, X̃_j, i ≠ j
where X̃_i denotes a transformed random variable
– Or, equivalently, to make the covariance matrix Σ diagonal (not necessarily the identity):
Θ^T Σ Θ = D (a diagonal matrix)
The covariance matrix can be global or class-dependent (a minimal sketch follows below).
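A minimal NumPy sketch of this definition, added here for illustration only (not part of the original slides): take Θ to be the eigenvector matrix of a sample covariance Σ, so that Θ^T Σ Θ becomes diagonal. All data and names are hypothetical.

```python
import numpy as np

# Toy data: 1000 correlated 3-dimensional feature vectors (hypothetical).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(1000, 3)) @ A.T          # rows are feature vectors

Sigma = np.cov(X, rowvar=False)               # sample covariance matrix
eigvals, Theta = np.linalg.eigh(Sigma)        # Sigma = Theta diag(eigvals) Theta^T

D = Theta.T @ Sigma @ Theta                   # ~diagonal: off-diagonals near zero
X_tilde = X @ Theta                           # decorrelated features
print(np.round(D, 6))
print(np.round(np.cov(X_tilde, rowvar=False), 6))
```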

Introduction to Feature Decorrelation -3
• Why Feature Decorrelation (FD)?
– In many speech recognition systems, the observation density function f(x) of each HMM state is modeled as a mixture of diagonal-covariance Gaussians
– For the sake of computational simplicity, the off-diagonal elements of the covariance matrices of the Gaussians are assumed to be close to zero:
f(x) = Σ_i π_i N(x; μ_i, Σ_i) = Σ_i π_i (2π)^{-d/2} |Σ_i|^{-1/2} exp( −(1/2)(x − μ_i)^T Σ_i^{-1}(x − μ_i) )
where each Σ_i is a diagonal matrix (a small sketch of the resulting density evaluation follows below)
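As an illustration (not from the slides), a minimal NumPy sketch of why the diagonal assumption is computationally convenient: the quadratic form reduces to per-dimension sums and no matrix inversion is needed. All names and values are hypothetical.

```python
import numpy as np

def diag_gmm_logpdf(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at a single frame x.

    weights: (M,), means/variances: (M, d). With diagonal covariances the
    quadratic form reduces to a sum over dimensions -- no matrix inversion."""
    d = x.shape[0]
    diff = x - means                                        # (M, d)
    log_comp = (np.log(weights)
                - 0.5 * (d * np.log(2 * np.pi)
                         + np.sum(np.log(variances), axis=1)
                         + np.sum(diff ** 2 / variances, axis=1)))
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))         # log-sum-exp

# Hypothetical 2-component mixture in 3 dimensions.
x = np.array([0.3, -1.0, 0.5])
print(diag_gmm_logpdf(x, np.array([0.4, 0.6]), np.zeros((2, 3)), np.ones((2, 3))))
```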

FD on Speech Recognition -1
• Approaches for FD can be divided into two categories:
Feature-space and Model-space
• Feature-space Schemes
– Hard to find a single transform which decorrelates all elements
of the feature vector for all states

• Model-space Schemes
– A different transform is selected depending on which
component the observation was hypothesized to be generated
from
– In the limit, a transform may be used for each component, which is equivalent to a full-covariance-matrix system
FD on Speech Recognition -2
• Feature-space Schemes on LVCSR
[Block diagram: Speech Signal → Front-End Preprocessing → Feature-space Decorrelation (DCT, PCA, LDA, MLLT, …) → Training Data → AM Training → AM, and → Test Data → Decoding (with AM, LM, Lexicon) → Textual Results]

FD on Speech Recognition -3
• Model-space Schemes on LVCSR
[Block diagram: Speech Signal → Front-End Preprocessing → Training Data → AM Training → AM, and → Test Data → Decoding (with AM, LM, Lexicon) → Textual Results; Model-space Decorrelation (MLLT, EMLLT, MLT, Semi-Tied, …) is applied within AM training]

FD on Speech Recognition -4

Feature-Space Schemes
– Without label information: Discrete Cosine Transform (DCT), Principal Component Analysis (PCA)
– With label information: Linear Discriminant Analysis (LDA), Common Principal Components (CPC)
Model-Space Schemes
– Maximum Likelihood Linear Transform (MLLT), Extended Maximum Likelihood Linear Transform (EMLLT), Semi-Tied Covariance Matrices, Multiple Linear Transforms (MLT)

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Discrete Cosine Transform -1
• Discrete Cosine Transform (DCT)
– Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
– Applied to the log-energies of the filter outputs (Mel-scaled filterbank) during MFCC parameterization:
c_j = sqrt(2/n) Σ_{i=1}^{n} x_i cos( (π j / n)(i − 0.5) ),  for j = 0, 1, ..., m < n
where x_i is the i-th coordinate of the input vector x and c_j is the j-th coordinate of the output vector c
– Achieves only partial decorrelation (a small sketch follows below)
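A minimal sketch (not from the slides) of applying this transform to a vector of log filterbank energies; the sqrt(2/n) normalization follows the formula above, and the dimensions (18 channels → 13 cepstra) echo the example on the next slide. All names and data are hypothetical.

```python
import numpy as np

def dct_cepstrum(log_energies, num_ceps):
    """Map n log filterbank energies to the first num_ceps cepstral coefficients
    using c_j = sqrt(2/n) * sum_i x_i cos(pi*j*(i-0.5)/n)."""
    n = len(log_energies)
    i = np.arange(1, n + 1)                       # filter index i = 1..n
    j = np.arange(num_ceps)[:, None]              # cepstral index j = 0..m-1
    basis = np.cos(np.pi * j * (i - 0.5) / n)     # (num_ceps, n) cosine basis
    return np.sqrt(2.0 / n) * basis @ log_energies

# Hypothetical 18-channel log filterbank frame -> 13 cepstral coefficients.
frame = np.log(np.random.default_rng(1).uniform(1.0, 10.0, size=18))
print(dct_cepstrum(frame, 13))
```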

Discrete Cosine Transform -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[3-D surface plots of the covariance matrix: left, 18 Mel-scaled filterbank channels; right, 13 Mel-cepstral coefficients (using DCT)]

Principal Component Analysis -1
• Principal Component Analysis (PCA)
– Based on the calculation of the major directions of variation of a set of data points in a high-dimensional space
– Extracts the directions of greatest variance, assuming that the less variation the data shows, the less information it carries about the features
– Principal components V = [v_1, ..., v_p]: the largest eigenvectors of the total covariance matrix Σ:
Σ v_i = λ_i v_i  ⟹  V^T Σ V = D (a diagonal matrix; a small sketch follows below)
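An illustrative NumPy sketch (not from the slides): the eigenvectors of the total covariance matrix form V, and V^T Σ V is diagonal. The 162 → 39 dimensions echo the MATBN example on the next slide, but the data here are random and hypothetical.

```python
import numpy as np

def pca_transform(X, p):
    """Return the p principal directions (largest-eigenvalue eigenvectors of the
    total covariance matrix) and the projected, decorrelated features."""
    Sigma = np.cov(X, rowvar=False)               # total covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    V = eigvecs[:, np.argsort(eigvals)[::-1][:p]] # keep the p largest
    return V, X @ V                               # V^T Sigma V is diagonal

# Hypothetical spliced feature vectors: 500 frames of dimension 162 -> 39.
X = np.random.default_rng(2).normal(size=(500, 162))
V, Z = pca_transform(X, 39)
D = V.T @ np.cov(X, rowvar=False) @ V
print(np.allclose(D, np.diag(np.diag(D)), atol=1e-8))   # True: D is diagonal
```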

Principal Component Analysis -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[3-D surface plots of the covariance matrix: left, 162-dimensional spliced Mel-filterbank features; right, 39-dimensional PCA-transformed subspace]

Linear Discriminant Analysis -1
• Linear Discriminant Analysis (LDA)
– Seeks a linear transformation matrix Θ that satisfies
Θ = arg max_Θ trace( (Θ^T S_W Θ)^{-1} (Θ^T S_B Θ) )
– Θ is formed by the largest eigenvectors of S_W^{-1} S_B
– The LDA subspace is not orthogonal (but the PCA subspace is): Θ^T Θ ≠ I
– The LDA subspace makes the transformed variables statistically uncorrelated (a small sketch follows below)
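A hedged sketch (not from the slides) of the LDA transform via the generalized eigenproblem S_B θ = λ S_W θ using scipy.linalg.eigh; the eigenvectors it returns are normalized so that θ^T S_W θ = 1, which matches the normalization adopted two slides below. The data and class structure are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, p):
    """Largest eigenvectors of Sw^-1 Sb via the generalized eigenproblem
    Sb * theta = lambda * Sw * theta (Sw symmetric positive definite)."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
    eigvals, eigvecs = eigh(Sb, Sw)                    # ascending eigenvalues
    # Theta^T Sw Theta and Theta^T Sb Theta are then diagonal.
    return eigvecs[:, np.argsort(eigvals)[::-1][:p]]

# Hypothetical 3-class toy data in 10 dimensions, reduced to 2.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=k, size=(100, 10)) for k in range(3)])
y = np.repeat([0, 1, 2], 100)
print(lda_transform(X, y, 2).shape)                    # (10, 2)
```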

Linear Discriminant Analysis -2
• From any two distinct eigenvalue/eigenvector pairs (λ_i, θ_i) and (λ_j, θ_j):
S_B θ_i = λ_i S_W θ_i
S_B θ_j = λ_j S_W θ_j
– Pre-multiplying by θ_j^T and θ_i^T, respectively:
θ_j^T S_B θ_i = λ_i θ_j^T S_W θ_i
θ_i^T S_B θ_j = λ_j θ_i^T S_W θ_j
– Since θ_j^T S_B θ_i = θ_i^T S_B θ_j, it follows that λ_i θ_j^T S_W θ_i = λ_j θ_i^T S_W θ_j
– Since θ_j^T S_W θ_i = θ_i^T S_W θ_j and λ_i ≠ λ_j, this forces θ_j^T S_W θ_i = θ_i^T S_W θ_j = 0
∴ θ_i^T S_W θ_j = 0 for all i ≠ j
Ref. (Krzanowski, 1988)
Linear Discriminant Analysis -3
• To overcome the arbitrary scaling of θ_i, it is usual to adopt the normalization
θ_i^T S_W θ_i = 1  ⟹  θ_i^T S_B θ_i = λ_i θ_i^T S_W θ_i = λ_i
∴ Θ^T S_W Θ = I,  Θ^T S_B Θ = Λ = diag(λ_1, ..., λ_p)

• The total covariance matrix S_T = S_W + S_B is therefore also transformed into a diagonal matrix

Linear Discriminant Analysis -4
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[3-D surface plots of the covariance matrix: left, 162-dimensional spliced Mel-filterbank features; right, 39-dimensional LDA-transformed subspace]

Maximum Likelihood Linear Transform
• Maximum Likelihood Linear Transform (MLLT) seems to
have two types:
– Feature-based: (Gopinath, 1998)
– Model-based: (Olsen, 2002), (Olsen, 2004)

• The common goal of the two types: find a global linear transformation matrix that decorrelates features.

Feature-based MLLT -1
• Feature-based Maximum Likelihood Linear Transform
(F-MLLT)
– Tries to alleviate four problems:
(a) Data insufficiency implying unreliable models
(b) Large storage requirement
(c) Large computational requirement
(d) ML is not discriminating between classes

– Solutions
(a)-(c): Sharing parameters across classes
(d): Appealing to LDA

Feature-based MLLT -2
• Maximum Likelihood Modeling
– The likelihood of the training data {(x_i, l_i)}, where l_i is the class label of x_i, is given by
p(x_1^N, {μ_j}, {Σ_j}) = ∏_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( −(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1}(x_i − μ_{l_i}) )   (see Appendix A)
– Log-likelihood:
log p(x_1^N, {μ_j}, {Σ_j}) = −(Nd/2) log(2π) − Σ_{j=1}^{C} (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1}(m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ]
where m_j and S_j are the sample mean and sample covariance of class j, and {μ_j}, {Σ_j} are the model parameters (ML estimators to be chosen)

Feature-based MLLT -3
• The idea of maximum likelihood estimation (MLE) is to choose the parameters {μ̂_j} and {Σ̂_j} so as to maximize log p(x_1^N, {μ_j}, {Σ_j})
– ML estimators: {μ̂_j} and {Σ̂_j}

(What is the difference between an “estimator” and an “estimate”?)

Feature-based MLLT -4
• Multiclass ML Modeling
– The training data of each class j is modeled with a Gaussian (μ̂_j, Σ̂_j)
– The ML estimators: μ̂_j = m_j,  Σ̂_j = S_j   (see Appendix B)
– The log-ML value:
log p*(x_1^N) = log p(x_1^N, {μ̂_j}, {Σ̂_j}) = −(Nd/2)(log(2π) + 1) − Σ_{j=1}^{C} (n_j/2) log|S_j| = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|S_j|
– There is “no interaction” between the classes, and therefore unconstrained ML modeling is not “discriminating”

Feature-based MLLT -5
• Constrained ML – Diagonal Covariance
– The ML estimators: μ̂_j = m_j,  Σ̂_j = diag(S_j)
– The log-ML value:
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(S_j)|
– If one linearly transforms the data of each class by Θ_j and models it with a diagonal Gaussian, the ML value is
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} [ (n_j/2) log|diag(Θ_j^T S_j Θ_j)| − n_j log|Θ_j| ]
where the log|Θ_j| term is the Jacobian of the transformation (see Appendix C)
How should Θ_j be chosen for each class?
Feature-based MLLT -6
• Multiclass ML Modeling – Some Issues
– If the sample size for each class is not large enough then the
ML parameter estimates may have large variance and hence be
unreliable
– The storage requirement for the model: O(Cd²)
– The computational requirement: O(Cd²) (Why?)

– The parameters for each class are obtained independently: the ML principle does not allow for discrimination between classes

Feature-based MLLT -7
• Multiclass ML Modeling – Some Issues (cont.)
– Parameter sharing across classes reduces the number of parameters, the storage requirements, and the computational requirements
– It is hard to justify that parameter sharing is more discriminating

– We can appeal to Fisher’s criterion of LDA and a result of Campbell to argue that sometimes constrained ML modeling is discriminating.
But what is “discriminating”?

Feature-based MLLT -8
• Multiclass ML Modeling – Some Issues (cont.)
– We can globally transform the data with a unimodular matrix Θ (det(Θ) = 1) and model the transformed data with diagonal Gaussians (there is a loss in likelihood, too)

– Among all possible transformations Θ, we choose the one that incurs the least loss in likelihood
(In essence, we will find a linearly transformed (shared) feature space in which the diagonal Gaussian assumption is almost valid)

Feature-based MLLT -9
• Other constrained MLE schemes with parameter sharing:
– Equal covariance
– Clustering

• Covariance Diagonalization and Cluster Transformation
– Classes are grouped into clusters; let C_j denote the cluster containing class j
– Each cluster is modeled with a diagonal Gaussian in a transformed feature space
– The ML estimators: μ̂_j = m_j,  Σ̂_j = Θ_{C_j}^{-T} diag(Θ_{C_j}^T S_j Θ_{C_j}) Θ_{C_j}^{-1}
– The ML value:
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} [ (n_j/2) log|diag(Θ_{C_j}^T S_j Θ_{C_j})| − n_j log|Θ_{C_j}| ]

Feature-based MLLT -10
• One Cluster with Diagonal Covariance
– When the number of clusters is one, there is a single global transformation, and the classes are modeled as diagonal Gaussians in this feature space
– The ML estimators: μ̂_j = m_j,  Σ̂_j = Θ^{-T} diag(Θ^T S_j Θ) Θ^{-1}
– The log-ML value:
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| + N log|Θ|
– The optimal Θ can be obtained by the optimization
Θ = arg min_Θ [ Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| − N log|Θ| ]

Feature-based MLLT -11
• Optimization – the numerical approach:
– The objective function:
F(Θ) = Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| − N log|Θ|
– Differentiating F with respect to Θ gives the gradient
∇F(Θ) = Σ_{j=1}^{C} n_j S_j Θ diag(Θ^T S_j Θ)^{-1} − N Θ^{-T}
– Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques, and a full matrix S_j must be stored for each class (a minimal sketch follows below)
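A minimal numerical sketch of this optimization (not the original authors' implementation): SciPy's L-BFGS driven by the objective and gradient above. The class covariances and counts are hypothetical, and the sketch adds no special safeguards beyond working with |det Θ|.

```python
import numpy as np
from scipy.optimize import minimize

def fmllt_objective_and_grad(theta_flat, S_list, n_list, N):
    """F(Theta) = sum_j (n_j/2) log|diag(Theta^T S_j Theta)| - N log|det Theta|
    and its gradient with respect to Theta (flattened for the optimizer)."""
    d = S_list[0].shape[0]
    Theta = theta_flat.reshape(d, d)
    F = -N * np.log(abs(np.linalg.det(Theta)))
    grad = -N * np.linalg.inv(Theta).T
    for S, n in zip(S_list, n_list):
        diag = np.diag(Theta.T @ S @ Theta)            # per-dimension variances
        F += 0.5 * n * np.sum(np.log(diag))
        grad += n * (S @ Theta) / diag                  # = n * S Theta diag(...)^-1
    return F, grad.ravel()

# Hypothetical class statistics: 3 classes, dimension 4, 200 frames each.
rng = np.random.default_rng(4)
S_list = [np.cov(rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4)), rowvar=False)
          for _ in range(3)]
n_list = [200, 200, 200]
res = minimize(fmllt_objective_and_grad, np.eye(4).ravel(), jac=True,
               args=(S_list, n_list, sum(n_list)), method="L-BFGS-B")
Theta = res.x.reshape(4, 4)
print(res.fun, np.linalg.det(Theta))
```

Note that F is invariant to a global rescaling of Θ, so the unimodular constraint does not need to be enforced explicitly during this unconstrained minimization.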

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Model-based MLLT -1
• In model-based MLLT, instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
– A non-singular linear transformation matrix Θ shared over a set of components
– The diagonal elements of the matrix Λ_j for each class j

• The precision matrices are constrained to be of the form
P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{d} λ_kj θ_k θ_k^T,   Λ_j = diag(λ_1j, ..., λ_dj)
where P_j is the precision matrix and θ_k^T is the k-th row of Θ
Ref. (Olsen, 2004)
Model-based MLLT -2
• M-MLLT fits within the standard maximum-likelihood criterion used for training HMMs, with each state represented by a GMM:
f(x | Θ) = Σ_{j=1}^{m} π_j N(x; μ_j, Σ_j)
where π_j is the j-th component weight

• Log-likelihood function:
L(Θ : X) = (1/N) Σ_{i=1}^{N} log f(x_i | Θ)

Model-based MLLT -3
• The parameters of the M-MLLT model for each HMM state, Θ = {π_1^m, μ_1^m, Λ_1^m, Θ}, are estimated using a generalized expectation-maximization (EM) algorithm*
• An auxiliary Q-function is introduced.
– The Q-function satisfies the inequality
L(Θ : X) − L(Θ̂ : X) ≥ Q(Θ, Θ̂) − Q(Θ̂, Θ̂)
where Θ = {π_1^m, μ_1^m, Λ_1^m, Θ} are the new parameters and Θ̂ = {π̂_1^m, μ̂_1^m, Λ̂_1^m, Θ̂} the old ones

* A. P. Dempster et al., “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
Model-based MLLT -4
• The auxiliary Q-function is
Q(Θ; Θ̂) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} γ_ij L_j(x_i)
         = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} (γ_ij/2) ( 2 log π_j − d log(2π) + log|P_j| − trace(P_j (x_i − μ_j)(x_i − μ_j)^T) )
where L_j(x_i) = log(π_j N(x_i; μ_j, Σ_j)) is the component log-likelihood of observation i, and
γ_ij = exp(L̂_j(x_i)) / Σ_{k=1}^{m} exp(L̂_k(x_i))
is the a posteriori probability of Gaussian component j given observation i (a small responsibility sketch follows below)
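A small illustrative sketch (not from the slides) of computing the posteriors γ_ij for an M-MLLT-style model with a shared transform Θ and per-component diagonal precisions Λ_j; names and data are hypothetical.

```python
import numpy as np

def responsibilities(X, weights, means, precisions_diag, Theta):
    """Posterior gamma_ij of each Gaussian component j for each frame x_i,
    with precision matrices P_j = Theta^T Lambda_j Theta."""
    Y = X @ Theta.T                                   # work in the transformed space
    mu_t = means @ Theta.T                            # transformed component means
    d = X.shape[1]
    # log|P_j| = sum_k log(lambda_kj) + 2 log|det Theta|
    log_det_P = (np.sum(np.log(precisions_diag), axis=1)
                 + 2.0 * np.log(abs(np.linalg.det(Theta))))
    quad = np.einsum('ijd,jd->ij',
                     (Y[:, None, :] - mu_t[None, :, :]) ** 2, precisions_diag)
    log_comp = np.log(weights) + 0.5 * (log_det_P - d * np.log(2 * np.pi)) - 0.5 * quad
    log_comp -= log_comp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_comp)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Hypothetical 2-component model in 3 dimensions over 5 frames.
rng = np.random.default_rng(5)
gamma = responsibilities(rng.normal(size=(5, 3)), np.array([0.5, 0.5]),
                         rng.normal(size=(2, 3)), np.ones((2, 3)), np.eye(3))
print(gamma.sum(axis=1))                              # each row sums to 1
```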

Model-based MLLT -5
• Gales’ approach for deriving Θ
1. Estimate the means and the component weights, which are independent of the other model parameters:
π_j = (1/N) Σ_{i=1}^{N} γ_ij,   μ_j = Σ_{i=1}^{N} γ_ij x_i / Σ_{i=1}^{N} γ_ij
2. Use the current estimate of the transform Θ, and estimate the set of class-specific diagonal variances:
λ_ij = θ_i S_j θ_i^T
where λ_ij is the i-th diagonal variance of component j and θ_i is the i-th row vector of Θ
Ref. (Gales, 1999)


Model-based MLLT -6
• Gales’ approach for deriving Θ (cont.)
3. Estimate the transform Θ row by row, using the current set of diagonal covariances:
θ̂_i = c_i G_i^{-1} sqrt( N / (c_i G_i^{-1} c_i^T) ),   G_i = Σ_j ( N_j / σ̂_diag,i^(j)2 ) S_j
where c_i is the i-th row vector of the cofactor matrix of the current Θ (see Appendix D)
4. Go to step 2 until convergence, or until an appropriate criterion is satisfied

Semi-Tied Covariance Matrices -1
• Introduction to Semi-Tied Covariance Matrices
– A natural extension of the state-specific rotation scheme
– The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
– The optimization is performed using a simple iterative scheme,
which is guaranteed to increase the likelihood of the training
data

– An alternative approach to solve the optimization problem of DHLDA

Semi-Tied Covariance Matrices -2
• State-Specific Rotation
– A full covariance matrix is calculated for each state in the system and is decomposed into its eigenvectors and eigenvalues:
Σ_full^(s) = U^(s) Λ^(s) U^(s)T
– All data from that state is then decorrelated using the calculated eigenvectors:
o^(s)(τ) = U^(s)T o(τ)
– Multiple diagonal-covariance Gaussian components are then trained
Ref. (Gales, 1999), (Kumar, 1998)


Semi-Tied Covariance Matrices -3
• Drawbacks of State-Specific Rotation
– It does not fit within the standard ML estimation framework for
training HMM’s

– The transforms are not related to the multiple-component models being used to model the data

Semi-Tied Covariance Matrices -4
• Semi-Tied Covariance Matrices
– Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
Σ^(m) = H^(r) Σ_diag^(m) H^(r)T
where Σ_diag^(m) is the component-specific diagonal covariance element and H^(r) is the semi-tied, class-dependent, non-diagonal matrix, which may be tied over a set of components

Semi-Tied Covariance Matrices -5
• Semi-Tied Covariance Matrices (cont.)
– It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted
– Parameters for each component m:
M = {π^(m), μ^(m), Σ_diag^(m), Θ^(r)},   Θ^(r) = H^(r)−1

Semi-Tied Covariance Matrices -6
• The auxiliary Q-function
Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) [ log( |Θ̂^(r)|² / |Σ̂_diag^(m)| ) − (o(τ) − μ̂^(m))^T Θ̂^(r)T Σ̂_diag^(m)−1 Θ̂^(r) (o(τ) − μ̂^(m)) ]
where γ_m(τ) = p(q_m(τ) | M, O_T) is the posterior probability of component m at time τ; the ratio |Θ̂^(r)|² / |Σ̂_diag^(m)| plays the role of 1/|Σ^(m)| and Θ̂^(r)T Σ̂_diag^(m)−1 Θ̂^(r) the role of Σ^(m)−1

Semi-Tied Covariance Matrices -7
• If all the model parameters are to be optimized simultaneously, the Q-function may be rewritten as
Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) log( |Θ̂^(r)|² / |diag(Θ̂^(r) W^(m) Θ̂^(r)T)| ) − dβ
where
W^(m) = Σ_τ γ_m(τ) (o(τ) − μ̂^(m))(o(τ) − μ̂^(m))^T / Σ_τ γ_m(τ)
μ̂^(m) = Σ_τ γ_m(τ) o(τ) / Σ_τ γ_m(τ)
β = Σ_{m∈M(r),τ} γ_m(τ)

Semi-Tied Covariance Matrices -8
• The ML estimate of the diagonal elements of the covariance matrix is given by
Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )
• The re-estimation formulae for the component weights and transition probabilities are identical to the standard HMM case (Rabiner, 1989)

• Unfortunately, optimizing the new Q-function is nontrivial and more complicated, so an alternative approach is proposed next
Semi-Tied Covariance Matrices -9
• Gales’ approach for optimizing the Q-function
1. Estimate the means, which are independent of the other model parameters
2. Using the current estimate of the semi-tied transform Θ̂^(r) and Σ̂_diag^(m) = diag(Θ̂^(r) W^(m) Θ̂^(r)T), estimate the set of component-specific diagonal variances. This set of parameters will be denoted {Σ̂_diag^(r)} = {Σ̂_diag^(m), m ∈ M(r)}
3. Estimate the transform Θ̂^(r) using the current set {Σ̂_diag^(r)}
4. Go to (2) until convergence, or until an appropriate criterion is satisfied

Semi-Tied Covariance Matrices -10
• How to carry out step (3)?
– Optimizing the semi-tied transform requires an iterative estimation scheme even after fixing all other model parameters
– Selecting a particular row of Θ̂^(r), θ̂_i^(r), and rewriting the former Q-function using the current set {Σ̂_diag^(r)}:
Q(M, M̂; {Σ̂_diag^(r)}) = Σ_{m∈M(r),τ} γ_m(τ) [ log( (θ̂_i^(r) c_i^T)² ) − log|Σ̂_diag^(m)| − Σ_j (θ̂_j^(r) ô^(m)(τ))² / σ_diag,j^(m)2 ]
where ô^(m)(τ) = o(τ) − μ̂^(m), c_i is the i-th row vector of the cofactor matrix of Θ̂^(r), and σ_diag,j^(m)2 is the j-th leading diagonal element of Σ̂_diag^(m)

Semi-Tied Covariance Matrices -11
• How to carry out step (3)? (cont.)
– It can be shown that the ML estimate of the i-th row of the semi-tied transform, θ̂_i^(r), is given by (a small sketch of this update follows below)
θ̂_i^(r) = c_i G^(ri)−1 sqrt( β / (c_i G^(ri)−1 c_i^T) ),   G^(ri) = Σ_{m∈M(r)} (1 / σ̂_diag,i^(m)2) W^(m) Σ_τ γ_m(τ)
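A hedged sketch of a row-by-row update of this form, assuming the per-component statistics W^(m), the occupation counts, and the current diagonal variances have already been accumulated; in the full scheme of the previous slides the diagonal variances would be re-estimated between iterations. All names and the toy statistics are hypothetical.

```python
import numpy as np

def update_semitied_transform(Theta, W, counts, sigma_diag, n_iter=10):
    """Row-by-row re-estimation of a semi-tied transform.

    Theta:      (d, d) current transform
    W:          (M, d, d) component-specific covariance statistics W^(m)
    counts:     (M,) total occupation sum_tau gamma_m(tau) per component
    sigma_diag: (M, d) current component-specific diagonal variances
    """
    d = Theta.shape[0]
    beta = counts.sum()
    # Per-row statistics G^(ri) = sum_m (counts_m / sigma_{m,i}) * W^(m)
    G = np.einsum('m,mi,mjk->ijk', counts, 1.0 / sigma_diag, W)   # (d, d, d)
    for _ in range(n_iter):
        for i in range(d):
            # i-th row of the cofactor matrix of the *current* Theta
            cof = np.linalg.det(Theta) * np.linalg.inv(Theta).T
            c_i = cof[i]
            rhs = c_i @ np.linalg.inv(G[i])
            Theta[i] = rhs * np.sqrt(beta / (rhs @ c_i))
    return Theta

# Hypothetical toy statistics: 2 components in 3 dimensions.
rng = np.random.default_rng(6)
W = np.stack([np.cov(rng.normal(size=(50, 3)), rowvar=False) for _ in range(2)])
Theta = update_semitied_transform(np.eye(3), W, np.array([50.0, 50.0]), np.ones((2, 3)))
print(np.round(Theta, 3))
```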

Semi-Tied Covariance Matrices -12
• It can be shown that
Q(M, M̂) ≥ Q(M, M̂; {Σ̂_diag^(r)})
with equality when the diagonal elements of the covariance matrices are given by Σ̂_diag^(m) = diag(Θ̂^(r) W^(m) Θ̂^(r)T)

• During recognition the log-likelihood is based on
log( L(o(τ); μ^(m), Σ^(m), Θ^(r)) ) = log( N(o^(r)(τ); Θ^(r) μ^(m), Σ_diag^(m)) ) + log|Θ^(r)|
where o^(r)(τ) = Θ^(r) o(τ) and log|Θ^(r)| = (1/2) log|Θ^(r)|²   (a small sketch follows below)
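A small sketch (not from the slides) of this recognition-time computation; Θ^(r) μ^(m) and log|Θ^(r)| can be precomputed per transform class, so the per-frame cost stays that of a diagonal Gaussian.

```python
import numpy as np

def semitied_loglike(o, mu, var_diag, Theta):
    """Per-frame log-likelihood of one semi-tied Gaussian component:
    log N(Theta o; Theta mu, Sigma_diag) + log|det Theta|."""
    d = o.shape[0]
    y = Theta @ o - Theta @ mu                       # transformed, mean-removed frame
    log_det = np.log(abs(np.linalg.det(Theta)))      # constant per transform class
    return (log_det
            - 0.5 * (d * np.log(2 * np.pi)
                     + np.sum(np.log(var_diag))
                     + np.sum(y ** 2 / var_diag)))

# Hypothetical component in 3 dimensions.
rng = np.random.default_rng(7)
Theta = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
print(semitied_loglike(rng.normal(size=3), np.zeros(3), np.ones(3), Theta))
```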
Extended MLLT -1
• Extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision-matrix modeling:
P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{D} λ_kj θ_k θ_k^T
where the θ_k θ_k^T form the basis
– Note that in EMLLT D ≥ d, while in M-MLLT D = d
– The λ_kj are not required to be positive!
– The {λ_kj} have to be chosen such that P_j is positive definite
– The authors provide two algorithms for iteratively updating the parameters
Ref. (Olsen, 2002), (Olsen, 2004)


Extended MLLT -2
• The highlight of EMLLT:
– More flexible: the number of basis elements {θ k θTk } can be
gradually varied from d (MLLT) to d(d+1)/2 (Full) by controlling
the value of D

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Common Principal Components -1
• Why Common Principal Components (CPC)?
– We often deal with situations in which the same variables are measured on objects from different groups, and the covariance structure may vary from group to group
– But sometimes the covariance matrices of the different groups look somewhat similar, and it seems reasonable to assume that they share a common basic structure

• Goal of CPC – to find a rotation that diagonalizes the covariance matrices simultaneously
Ref. (Flury, 1984), (Flury, 1986)


Common Principal Components -2
• Hypothesis of CPCs:
H_C: B^T Σ_i B = Λ_i (diagonal),  i = 1, ..., k
– The common principal components (CPCs): U_i = B^T x_i
– Note that no canonical ordering of the columns of B need be given, since the rank order of the diagonal elements of Λ_i is not necessarily the same for all groups

Common Principal Components -3
• Assume the n_i S_i are independently distributed as W_p(n_i, Σ_i). The common likelihood function is
L(Σ_1, ..., Σ_k) = C × ∏_{i=1}^{k} exp( trace( −(n_i/2) Σ_i^{-1} S_i ) ) |Σ_i|^{−n_i/2}

• Instead of maximizing the likelihood function, we minimize
g(Σ_1, ..., Σ_k) = −2 log L(Σ_1, ..., Σ_k) + 2 log C = Σ_{i=1}^{k} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) )

Common Principal Components -4
• Assume H_C holds for some orthogonal matrix β = [β_1, ..., β_p], and Λ_i = diag(λ_i1, ..., λ_ip); then
log|Σ_i| = Σ_{j=1}^{p} log λ_ij,  i = 1, ..., k
trace(Σ_i^{-1} S_i) = trace(β Λ_i^{-1} β^T S_i) = trace(Λ_i^{-1} β^T S_i β) = Σ_{j=1}^{p} β_j^T S_i β_j / λ_ij
• Therefore
g(Σ_1, ..., Σ_k) = g(β_1, ..., β_p, λ_11, ..., λ_1p, λ_21, ..., λ_kp) = Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij )

Common Principal Components -5
• The function g is to be minimized under the orthonormality restrictions
β_h^T β_j = 0 if h ≠ j,  β_h^T β_h = 1
• Introducing Lagrange multipliers γ_h and γ_hj, we thus wish to minimize the function
G(Σ_1, ..., Σ_k) = g(Σ_1, ..., Σ_k) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j

Common Principal Components -6
• Minimization with respect to λ_ij:
∂G(Σ_1, ..., Σ_k)/∂λ_ij = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij ) ]/∂λ_ij = 0
⟹ λ_ij = β_j^T S_i β_j,  i = 1, ..., k;  j = 1, ..., p   (key point; keep it in mind)
⟹ trace(Σ_i^{-1} S_i) = p,  i = 1, ..., k

Common Principal Components -7
• Minimization with respect to β_j (cont.):
∂G(Σ_1, ..., Σ_k)/∂β_j = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij ) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j ]/∂β_j = 0
⟹ Σ_{i=1}^{k} (n_i S_i β_j / λ_ij) − γ_j β_j − Σ_{h≠j} γ_jh β_h = 0,  j = 1, ..., p
– Multiplying on the left by β_j^T (and using λ_ij = β_j^T S_i β_j) gives γ_j = Σ_{i=1}^{k} n_i,  j = 1, ..., p

Common Principal Components -8
• Minimization (cont.): Thus
Σ_{i=1}^{k} n_i S_i β_j / λ_ij = ( Σ_{i=1}^{k} n_i ) β_j + Σ_{h≠j} γ_jh β_h,  j = 1, ..., p
– Multiplying on the left by β_l^T (l ≠ j) implies
Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij = γ_jl,  j = 1, ..., p,  l ≠ j
– Note that β_j^T S_i β_l = β_l^T S_i β_j and γ_jl = γ_lj, so swapping the roles of j and l gives
Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = γ_jl,  j = 1, ..., p,  l ≠ j

Common Principal Components -9
• Minimization (cont.): Subtracting the two expressions for γ_jl,
Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij − Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = 0
⟹ β_l^T [ Σ_{i=1}^{k} n_i (λ_il − λ_ij)/(λ_il λ_ij) S_i ] β_j = 0,  l, j = 1, ..., p;  l ≠ j   (the optimization objective)
– These p(p−1)/2 equations have to be solved under the orthonormality conditions β^T β = I_p and λ_ij = β_j^T S_i β_j

Common Principal Components -10
• Solving procedure of CPC – the FG algorithm
– F-Algorithm [algorithm listing not reproduced in this transcript; see (Flury and Gautschi, 1986)]

Common Principal Components -11
– G-Algorithm [algorithm listing not reproduced in this transcript; see (Flury and Gautschi, 1986)]

Common Principal Components -12
• Likelihood Ratio Test
– The sample common principal components: U_i = β̂^T X_i,  i = 1, ..., k
– For the i-th group, the transformed covariance matrix is F_i = β̂^T S_i β̂,  i = 1, ..., k
– Since Λ̂_i = diag(F_i), i = 1, ..., k, the test statistic can be written as a function of the F_i alone:
χ² = Σ_{i=1}^{k} n_i log( |diag(F_i)| / |F_i| ) = Σ_{i=1}^{k} n_i log( ∏_{j=1}^{p} f_jj^(i) / ∏_{j=1}^{p} l_ij )
where the l_ij are the eigenvalues of F_i
Common Principal Components -13
• Likelihood Ratio Test (cont.)
– The likelihood ratio criterion is a measure of the simultaneous diagonalizability of k p.d.s. matrices
– The CPCs can be viewed as obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
– It can also be seen from the viewpoint of Hadamard’s inequality: |F_i| ≤ |diag(F_i)|

Common Principal Components -14
• Actually, CPC can also be viewed through another measure of “deviation from diagonality”:
φ(F_i) = |diag(F_i)| / |F_i| ≥ 1
– The CPC criterion can then be written as
Φ(F_1, ..., F_k; n_1, ..., n_k) = ∏_{i=1}^{k} ( φ(F_i) )^{n_i}
– Let F_i = B^T A_i B for a given orthogonal matrix B; the CPC solution minimizes this criterion:
Φ_0(A_1, ..., A_k; n_1, ..., n_k) = min_B Φ(B^T A_1 B, ..., B^T A_k B; n_1, ..., n_k)
(a small sketch follows below)
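An illustrative sketch (not from the slides) of the deviation-from-diagonality measure; the toy matrices are built to share a common eigenvector basis, so rotating by that basis drives the criterion to its minimum of 1.

```python
import numpy as np

def deviation_from_diagonality(F):
    """phi(F) = |diag(F)| / |F| >= 1, with equality iff F is diagonal
    (Hadamard's inequality for positive definite F)."""
    return np.prod(np.diag(F)) / np.linalg.det(F)

def cpc_criterion(S_list, n_list, B):
    """Phi = prod_i phi(B^T S_i B)^{n_i}; the CPC rotation B minimizes this."""
    return np.prod([deviation_from_diagonality(B.T @ S @ B) ** n
                    for S, n in zip(S_list, n_list)])

# Hypothetical example: two 3x3 covariance matrices sharing the same eigenvectors.
rng = np.random.default_rng(8)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))           # a common orthogonal basis
S1 = Q @ np.diag([3.0, 1.0, 0.5]) @ Q.T
S2 = Q @ np.diag([0.2, 2.0, 1.0]) @ Q.T
print(cpc_criterion([S1, S2], [1, 1], np.eye(3)))      # > 1 in general
print(cpc_criterion([S1, S2], [1, 1], Q))              # ~1: Q diagonalizes both
```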

Comparison Between CPC and F-MLLT
• CPC tries to minimize
Σ_{i=1}^{C} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) ),  with Σ_i = B Λ_i B^T,  i = 1, ..., C
(the estimates are KNOWN in the original space)

• F-MLLT tries to minimize
(1/2) Σ_{j=1}^{C} N_j log|diag(A^T S_j A)| − N log|A|
(the estimates are UNKNOWN)

Appendix A -1
• Show that
∏_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( −(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1}(x_i − μ_{l_i}) )
= (2π)^{-Nd/2} exp( −Σ_j (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1}(m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ] )

Appendix A -2
First note that, for each class C_j, the cross terms vanish:
Σ_{x_i∈C_j} (x_i − m_j)^T Σ_j^{-1} (m_j − μ_j) = [ Σ_{x_i∈C_j} (x_i − m_j) ]^T Σ_j^{-1} (m_j − μ_j) = 0
Then
Σ_{i=1}^{N} (x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i})
= Σ_{i=1}^{N} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )^T Σ_{l_i}^{-1} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )
= Σ_{i=1}^{N} [ (x_i − m_{l_i})^T Σ_{l_i}^{-1} (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i})^T Σ_{l_i}^{-1} (m_{l_i} − μ_{l_i}) ]   (cross terms = 0)
= Σ_j trace( Σ_j^{-1} Σ_{x_i∈C_j} (x_i − m_j)(x_i − m_j)^T ) + Σ_j n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
= Σ_j n_j ( trace(Σ_j^{-1} S_j) + (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) )
Combining this with the product of the normalization constants, ∏_i (2π)^{-d/2}|Σ_{l_i}|^{-1/2} = (2π)^{-Nd/2} exp( −Σ_j (n_j/2) log|Σ_j| ), gives the stated identity.

Appendix B
• ML estimators for log p(x_1^N, {μ_j}, {Σ_j}):
∂ log p(x_1^N, {μ_j}, {Σ_j}) / ∂μ_j = n_j Σ_j^{-1} (m_j − μ_j) = 0  ⟹  μ̂_j = m_j

∂ log p(x_1^N, {μ_j}, {Σ_j}) / ∂Σ_j ∝ ∂( trace(Σ_j^{-1} S_j) + log|Σ_j| ) / ∂Σ_j = −Σ_j^{-T} S_j^T Σ_j^{-T} + Σ_j^{-T} = 0  ⟹  Σ̂_j = S_j
(evaluated at μ_j = m_j, where the quadratic term vanishes)

Appendix C -1
• Change of Variable Theorem
– Consider a one-to-one mapping g: ℝ^n → ℝ^n, Y = g(X)
– Equal probability: the probability of falling in a region of the X space must equal the probability of falling in the corresponding region of the Y space
– Suppose the region dx_1 dx_2 ... dx_n maps to the region dA in the Y space. Equating probabilities, we have
f_Y(y_1, ..., y_n) dA = f_X(x_1, ..., x_n) dx_1 ... dx_n
– The region dA is a hyper-parallelepiped described by the vectors
(∂y_1/∂x_1 dx_1, ..., ∂y_n/∂x_1 dx_1), (∂y_1/∂x_2 dx_2, ..., ∂y_n/∂x_2 dx_2), ..., (∂y_1/∂x_n dx_n, ..., ∂y_n/∂x_n dx_n)

Appendix C -2
• Change of Variable Theorem (cont.)
– The volume dA of the hyper-parallelepiped is given by the determinant
dA = | det( [∂y_i/∂x_j]_{i,j=1..n} ) | dx_1 ... dx_n = |J| dx_1 ... dx_n
where J is the Jacobian of the function g
– So,
f_Y(y_1, ..., y_n) |J| dx_1 ... dx_n = f_X(x_1, ..., x_n) dx_1 ... dx_n
⟹ f_Y(y_1, ..., y_n) = |J|^{-1} f_X(x_1, ..., x_n)

Appendix D -1
• If A is a square matrix, then the minor of entry a_ij, denoted M_ij, is defined to be the determinant of the submatrix that remains after the i-th row and the j-th column are deleted from A. The number (−1)^{i+j} M_ij is denoted by c_ij and is called the cofactor of a_ij.
• For example, given the 3×3 matrix B = [b_11 b_12 b_13; b_21 b_22 b_23; b_31 b_32 b_33]:
M_23 = det[ b_11 b_12; b_31 b_32 ] = b_11 b_32 − b_12 b_31,   c_23 = (−1)^{2+3} M_23,   c_33 = (−1)^{3+3} M_33

Appendix D -2
• Given an n×n matrix A = [a_ij]:
– The determinant of A can be written as the sum of its cofactors multiplied by the entries that generated them (a small sketch follows below)
(cofactor expansion along the j-th column):
det(A) = a_1j c_1j + a_2j c_2j + a_3j c_3j + ... + a_nj c_nj = A_(j)^T c_(j)
(cofactor expansion along the i-th row):
det(A) = a_i1 c_i1 + a_i2 c_i2 + a_i3 c_i3 + ... + a_in c_in = A_(i) c_(i)^T
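A small sketch (not from the slides): for a nonsingular matrix the whole cofactor matrix can be obtained as det(A)·A^{-T}, which is how the row vectors c_(i) used in the MLLT and semi-tied row updates are typically computed in practice.

```python
import numpy as np

def cofactor_matrix(A):
    """Cofactor matrix of a nonsingular square matrix: cof(A) = det(A) * A^{-T}.
    Row i of cof(A) is the vector c_(i) used in the MLLT / semi-tied row updates."""
    return np.linalg.det(A) * np.linalg.inv(A).T

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
C = cofactor_matrix(A)
# Cofactor expansion along each row recovers det(A): sum_j a_ij * c_ij.
print(np.allclose((A * C).sum(axis=1), np.linalg.det(A)))   # True
```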