
Feature Decorrelation on Speech Recognition

Hung-Shin Lee
Institute of Information Science, Academia Sinica
Dept. of Electrical Engineering, National Taiwan University

2009-10-09 @ IIS, Academia Sinica


References -1
1) J. Psutka and L. Muller, “Comparison of various feature decorrelation techniques in automatic speech recognition,” in Proc. CITSA 2006.
2) Dr. Berlin Chen’s lecture slides: http://berlin.csie.ntnu.edu.tw
3) Batlle et al., “Feature decorrelation methods in speech recognition - a comparative study,” in Proc. ICSLP 1998.
4) K. Demuynck et al., “Improved feature decorrelation for HMM-based speech recognition,” in Proc. ICSLP 1998.
5) W. Krzanowski, Principles of Multivariate Analysis - A User’s Perspective, Oxford Press, 1988.
6) R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classification,” in Proc. ICASSP 1998.
7) M. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 272–281, 1999.
8) P. Olsen and R. Gopinath, “Modeling inverse covariance matrices by basis expansion,” in Proc. ICASSP 2002.

References -2
9) P. Olsen and R. Gopinath, “Modeling inverse covariance matrices by basis expansion,” IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37–46, 2004.
10) N. Kumar and R. Gopinath, “Multiple linear transform,” in Proc. ICASSP 2001.
11) N. Kumar and A. Andreou, “Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition,” Speech Communication, vol. 26, no. 4, pp. 283–297, 1998.
12) A. Ljolje, “The importance of cepstral parameter correlations in speech recognition,” Computer Speech and Language, vol. 8, pp. 223–232, 1994.
13) B. Flury, “Common principal components in k groups,” Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, “An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form,” SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Introduction to Feature Decorrelation -1
• Definition of the covariance matrix:
– Random vector (each x_i is a random variable):
X = [x_1, ..., x_n]^T
– Mean vector and covariance matrix (E[·] denotes the expected value):
μ ≡ E[X]
Σ ≡ E[(X − μ)(X − μ)^T]

Introduction to Feature Decorrelation -2
• Feature Decorrelation (FD)
– To find transformations Θ that make all transformed variables or parameters (nearly) uncorrelated:
cov(X̃_i, X̃_j) = 0, ∀ X̃_i, X̃_j, i ≠ j
where X̃_i denotes a transformed random variable
– Or, equivalently, to make the covariance matrix Σ diagonal (not necessarily the identity):
Θ^T Σ Θ = D (a diagonal matrix)
The covariance matrix can be global or class-dependent (a minimal sketch follows below).
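A minimal NumPy sketch of this definition, added here for illustration only (not part of the original slides): take Θ to be the eigenvector matrix of a sample covariance Σ, so that Θ^T Σ Θ becomes diagonal. All data and names are hypothetical.

```python
import numpy as np

# Toy data: 1000 correlated 3-dimensional feature vectors (hypothetical).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(1000, 3)) @ A.T          # rows are feature vectors

Sigma = np.cov(X, rowvar=False)               # sample covariance matrix
eigvals, Theta = np.linalg.eigh(Sigma)        # Sigma = Theta diag(eigvals) Theta^T

D = Theta.T @ Sigma @ Theta                   # ~diagonal: off-diagonals near zero
X_tilde = X @ Theta                           # decorrelated features
print(np.round(D, 6))
print(np.round(np.cov(X_tilde, rowvar=False), 6))
```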

Introduction to Feature Decorrelation -3
• Why Feature Decorrelation (FD)?
– In many speech recognition systems, the observation density function f(x) of each HMM state is modeled as a mixture of diagonal-covariance Gaussians
– For the sake of computational simplicity, the off-diagonal elements of the covariance matrices of the Gaussians are assumed to be close to zero:
f(x) = Σ_i π_i N(x; μ_i, Σ_i) = Σ_i π_i (2π)^{-d/2} |Σ_i|^{-1/2} exp( −(1/2)(x − μ_i)^T Σ_i^{-1}(x − μ_i) )
where each Σ_i is a diagonal matrix (a small sketch of the resulting density evaluation follows below)
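As an illustration (not from the slides), a minimal NumPy sketch of why the diagonal assumption is computationally convenient: the quadratic form reduces to per-dimension sums and no matrix inversion is needed. All names and values are hypothetical.

```python
import numpy as np

def diag_gmm_logpdf(x, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at a single frame x.

    weights: (M,), means/variances: (M, d). With diagonal covariances the
    quadratic form reduces to a sum over dimensions -- no matrix inversion."""
    d = x.shape[0]
    diff = x - means                                        # (M, d)
    log_comp = (np.log(weights)
                - 0.5 * (d * np.log(2 * np.pi)
                         + np.sum(np.log(variances), axis=1)
                         + np.sum(diff ** 2 / variances, axis=1)))
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))         # log-sum-exp

# Hypothetical 2-component mixture in 3 dimensions.
x = np.array([0.3, -1.0, 0.5])
print(diag_gmm_logpdf(x, np.array([0.4, 0.6]), np.zeros((2, 3)), np.ones((2, 3))))
```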

FD on Speech Recognition -1
• Approaches for FD can be divided into two categories:
Feature-space and Model-space
• Feature-space Schemes
– Hard to find a single transform which decorrelates all elements
of the feature vector for all states

• Model-space Schemes
– A different transform is selected depending on which
component the observation was hypothesized to be generated
from
– In the limit, a transform may be used for each component, which is equivalent to a full-covariance-matrix system
FD on Speech Recognition -2
• Feature-space Schemes on LVCSR
[Block diagram: Speech Signal → Front-End Preprocessing → Feature-space Decorrelation (DCT, PCA, LDA, MLLT, …) → Training Data → AM Training → AM, and → Test Data → Decoding (with AM, LM, Lexicon) → Textual Results]

FD on Speech Recognition -3
• Model-space Schemes on LVCSR
[Block diagram: Speech Signal → Front-End Preprocessing → Training Data → AM Training → AM, and → Test Data → Decoding (with AM, LM, Lexicon) → Textual Results; Model-space Decorrelation (MLLT, EMLLT, MLT, Semi-Tied, …) is applied within AM training]

FD on Speech Recognition -4

Feature-Space Schemes
– Without label information: Discrete Cosine Transform (DCT), Principal Component Analysis (PCA)
– With label information: Linear Discriminant Analysis (LDA), Common Principal Components (CPC)
Model-Space Schemes
– Maximum Likelihood Linear Transform (MLLT), Extended Maximum Likelihood Linear Transform (EMLLT), Semi-Tied Covariance Matrices, Multiple Linear Transforms (MLT)

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Discrete Cosine Transform -1
• Discrete Cosine Transform (DCT)
– Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
– Applied to the log-energies of the filter outputs (Mel-scaled filterbank) during MFCC parameterization:
c_j = sqrt(2/n) Σ_{i=1}^{n} x_i cos( (π j / n)(i − 0.5) ),  for j = 0, 1, ..., m < n
where x_i is the i-th coordinate of the input vector x and c_j is the j-th coordinate of the output vector c
– Achieves only partial decorrelation (a small sketch follows below)
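A minimal sketch (not from the slides) of applying this transform to a vector of log filterbank energies; the sqrt(2/n) normalization follows the formula above, and the dimensions (18 channels → 13 cepstra) echo the example on the next slide. All names and data are hypothetical.

```python
import numpy as np

def dct_cepstrum(log_energies, num_ceps):
    """Map n log filterbank energies to the first num_ceps cepstral coefficients
    using c_j = sqrt(2/n) * sum_i x_i cos(pi*j*(i-0.5)/n)."""
    n = len(log_energies)
    i = np.arange(1, n + 1)                       # filter index i = 1..n
    j = np.arange(num_ceps)[:, None]              # cepstral index j = 0..m-1
    basis = np.cos(np.pi * j * (i - 0.5) / n)     # (num_ceps, n) cosine basis
    return np.sqrt(2.0 / n) * basis @ log_energies

# Hypothetical 18-channel log filterbank frame -> 13 cepstral coefficients.
frame = np.log(np.random.default_rng(1).uniform(1.0, 10.0, size=18))
print(dct_cepstrum(frame, 13))
```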

Discrete Cosine Transform -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[3-D surface plots of the covariance matrix: left, 18 Mel-scaled filterbank channels; right, 13 Mel-cepstral coefficients (using DCT)]

Principal Component Analysis -1
• Principal Component Analysis (PCA)
– Based on the calculation of the major directions of variation of a set of data points in a high-dimensional space
– Extracts the directions of greatest variance, assuming that the less variation the data shows, the less information it carries about the features
– Principal components V = [v_1, ..., v_p]: the largest eigenvectors of the total covariance matrix Σ:
Σ v_i = λ_i v_i  ⟹  V^T Σ V = D (a diagonal matrix; a small sketch follows below)
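An illustrative NumPy sketch (not from the slides): the eigenvectors of the total covariance matrix form V, and V^T Σ V is diagonal. The 162 → 39 dimensions echo the MATBN example on the next slide, but the data here are random and hypothetical.

```python
import numpy as np

def pca_transform(X, p):
    """Return the p principal directions (largest-eigenvalue eigenvectors of the
    total covariance matrix) and the projected, decorrelated features."""
    Sigma = np.cov(X, rowvar=False)               # total covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    V = eigvecs[:, np.argsort(eigvals)[::-1][:p]] # keep the p largest
    return V, X @ V                               # V^T Sigma V is diagonal

# Hypothetical spliced feature vectors: 500 frames of dimension 162 -> 39.
X = np.random.default_rng(2).normal(size=(500, 162))
V, Z = pca_transform(X, 39)
D = V.T @ np.cov(X, rowvar=False) @ V
print(np.allclose(D, np.diag(np.diag(D)), atol=1e-8))   # True: D is diagonal
```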

Principal Component Analysis -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[3-D surface plots of the covariance matrix: left, 162-dimensional spliced Mel-filterbank features; right, 39-dimensional PCA-transformed subspace]

Linear Discriminant Analysis -1
• Linear Discriminant Analysis (LDA)
– Seeks a linear transformation matrix Θ that satisfies
Θ = arg max_Θ trace( (Θ^T S_W Θ)^{-1} (Θ^T S_B Θ) )
– Θ is formed by the largest eigenvectors of S_W^{-1} S_B
– The LDA subspace is not orthogonal (but the PCA subspace is): Θ^T Θ ≠ I
– The LDA subspace makes the transformed variables statistically uncorrelated (a small sketch follows below)
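A hedged sketch (not from the slides) of the LDA transform via the generalized eigenproblem S_B θ = λ S_W θ using scipy.linalg.eigh; the eigenvectors it returns are normalized so that θ^T S_W θ = 1, which matches the normalization adopted two slides below. The data and class structure are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, p):
    """Largest eigenvectors of Sw^-1 Sb via the generalized eigenproblem
    Sb * theta = lambda * Sw * theta (Sw symmetric positive definite)."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
    eigvals, eigvecs = eigh(Sb, Sw)                    # ascending eigenvalues
    # Theta^T Sw Theta and Theta^T Sb Theta are then diagonal.
    return eigvecs[:, np.argsort(eigvals)[::-1][:p]]

# Hypothetical 3-class toy data in 10 dimensions, reduced to 2.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=k, size=(100, 10)) for k in range(3)])
y = np.repeat([0, 1, 2], 100)
print(lda_transform(X, y, 2).shape)                    # (10, 2)
```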

Linear Discriminant Analysis -2
• From any two distinct eigenvalue/eigenvector pairs (λ_i, θ_i) and (λ_j, θ_j):
S_B θ_i = λ_i S_W θ_i
S_B θ_j = λ_j S_W θ_j
– Pre-multiplying by θ_j^T and θ_i^T, respectively:
θ_j^T S_B θ_i = λ_i θ_j^T S_W θ_i
θ_i^T S_B θ_j = λ_j θ_i^T S_W θ_j
– Since θ_j^T S_B θ_i = θ_i^T S_B θ_j, it follows that λ_i θ_j^T S_W θ_i = λ_j θ_i^T S_W θ_j
– Since θ_j^T S_W θ_i = θ_i^T S_W θ_j and λ_i ≠ λ_j, this forces θ_j^T S_W θ_i = θ_i^T S_W θ_j = 0
∴ θ_i^T S_W θ_j = 0 for all i ≠ j
Ref. (Krzanowski, 1988)
Linear Discriminant Analysis -3
• To overcome the arbitrary scaling of θ_i, it is usual to adopt the normalization
θ_i^T S_W θ_i = 1  ⟹  θ_i^T S_B θ_i = λ_i θ_i^T S_W θ_i = λ_i
∴ Θ^T S_W Θ = I,  Θ^T S_B Θ = Λ = diag(λ_1, ..., λ_p)

• The total covariance matrix S_T = S_W + S_B is therefore also transformed into a diagonal matrix

Linear Discriminant Analysis -4
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
[3-D surface plots of the covariance matrix: left, 162-dimensional spliced Mel-filterbank features; right, 39-dimensional LDA-transformed subspace]

Maximum Likelihood Linear Transform
• Maximum Likelihood Linear Transform (MLLT) seems to
have two types:
– Feature-based: (Gopinath, 1998)
– Model-based: (Olsen, 2002), (Olsen, 2004)

• The common goal of the two types: find a global linear transformation matrix that decorrelates features.

Feature-based MLLT -1
• Feature-based Maximum Likelihood Linear Transform
(F-MLLT)
– Tries to alleviate four problems:
(a) Data insufficiency implying unreliable models
(b) Large storage requirement
(c) Large computational requirement
(d) ML is not discriminating between classes

– Solutions
(a)-(c): Sharing parameters across classes
(d): Appealing to LDA

Feature-based MLLT -2
• Maximum Likelihood Modeling
– The likelihood of the training data {(x_i, l_i)}, where l_i is the class label of x_i, is given by
p(x_1^N, {μ_j}, {Σ_j}) = ∏_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( −(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1}(x_i − μ_{l_i}) )   (see Appendix A)
– Log-likelihood:
log p(x_1^N, {μ_j}, {Σ_j}) = −(Nd/2) log(2π) − Σ_{j=1}^{C} (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1}(m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ]
where m_j and S_j are the sample mean and sample covariance of class j, and {μ_j}, {Σ_j} are the model parameters (ML estimators to be chosen)

Feature-based MLLT -3
• The idea of maximum likelihood estimation (MLE) is to choose the parameters {μ̂_j} and {Σ̂_j} so as to maximize log p(x_1^N, {μ_j}, {Σ_j})
– ML estimators: {μ̂_j} and {Σ̂_j}

(What is the difference between an “estimator” and an “estimate”?)

Feature-based MLLT -4
• Multiclass ML Modeling
– The training data of each class j is modeled with a Gaussian (μ̂_j, Σ̂_j)
– The ML estimators: μ̂_j = m_j,  Σ̂_j = S_j   (see Appendix B)
– The log-ML value:
log p*(x_1^N) = log p(x_1^N, {μ̂_j}, {Σ̂_j}) = −(Nd/2)(log(2π) + 1) − Σ_{j=1}^{C} (n_j/2) log|S_j| = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|S_j|
– There is “no interaction” between the classes, and therefore unconstrained ML modeling is not “discriminating”

Feature-based MLLT -5
• Constrained ML – Diagonal Covariance
– The ML estimators: μ̂_j = m_j,  Σ̂_j = diag(S_j)
– The log-ML value:
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(S_j)|
– If one linearly transforms the data of each class by Θ_j and models it with a diagonal Gaussian, the ML value is
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} [ (n_j/2) log|diag(Θ_j^T S_j Θ_j)| − n_j log|Θ_j| ]
where the log|Θ_j| term is the Jacobian of the transformation (see Appendix C)
How should Θ_j be chosen for each class?
Feature-based MLLT -6
• Multiclass ML Modeling – Some Issues
– If the sample size for each class is not large enough then the
ML parameter estimates may have large variance and hence be
unreliable
– The storage requirement for the model: O(Cd²)
– The computational requirement: O(Cd²) (Why?)

– The parameters for each class are obtained independently: the ML principle does not allow for discrimination between classes

Feature-based MLLT -7
• Multiclass ML Modeling – Some Issues (cont.)
– Parameter sharing across classes reduces the number of parameters, the storage requirements, and the computational requirements
– It is hard to justify that parameter sharing is more discriminating

– We can appeal to Fisher’s criterion of LDA and a result of Campbell to argue that sometimes constrained ML modeling is discriminating.
But what is “discriminating”?

Feature-based MLLT -8
• Multiclass ML Modeling – Some Issues (cont.)
– We can globally transform the data with a unimodular matrix Θ (det(Θ) = 1) and model the transformed data with diagonal Gaussians (there is a loss in likelihood, too)

– Among all possible transformations Θ, we choose the one that incurs the least loss in likelihood
(In essence, we will find a linearly transformed (shared) feature space in which the diagonal Gaussian assumption is almost valid)

Feature-based MLLT -9
• Other constrained MLE schemes with parameter sharing:
– Equal covariance
– Clustering

• Covariance Diagonalization and Cluster Transformation
– Classes are grouped into clusters; let C_j denote the cluster containing class j
– Each cluster is modeled with a diagonal Gaussian in a transformed feature space
– The ML estimators: μ̂_j = m_j,  Σ̂_j = Θ_{C_j}^{-T} diag(Θ_{C_j}^T S_j Θ_{C_j}) Θ_{C_j}^{-1}
– The ML value:
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} [ (n_j/2) log|diag(Θ_{C_j}^T S_j Θ_{C_j})| − n_j log|Θ_{C_j}| ]

Feature-based MLLT -10
• One Cluster with Diagonal Covariance
– When the number of clusters is one, there is a single global transformation, and the classes are modeled as diagonal Gaussians in this feature space
– The ML estimators: μ̂_j = m_j,  Σ̂_j = Θ^{-T} diag(Θ^T S_j Θ) Θ^{-1}
– The log-ML value:
log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| + N log|Θ|
– The optimal Θ can be obtained by the optimization
Θ = arg min_Θ [ Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| − N log|Θ| ]

Feature-based MLLT -11
• Optimization – the numerical approach:
– The objective function:
F(Θ) = Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| − N log|Θ|
– Differentiating F with respect to Θ gives the gradient
∇F(Θ) = Σ_{j=1}^{C} n_j S_j Θ diag(Θ^T S_j Θ)^{-1} − N Θ^{-T}
– Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques, and a full matrix S_j must be stored for each class (a minimal sketch follows below)
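A minimal numerical sketch of this optimization (not the original authors' implementation): SciPy's L-BFGS driven by the objective and gradient above. The class covariances and counts are hypothetical, and the sketch adds no special safeguards beyond working with |det Θ|.

```python
import numpy as np
from scipy.optimize import minimize

def fmllt_objective_and_grad(theta_flat, S_list, n_list, N):
    """F(Theta) = sum_j (n_j/2) log|diag(Theta^T S_j Theta)| - N log|det Theta|
    and its gradient with respect to Theta (flattened for the optimizer)."""
    d = S_list[0].shape[0]
    Theta = theta_flat.reshape(d, d)
    F = -N * np.log(abs(np.linalg.det(Theta)))
    grad = -N * np.linalg.inv(Theta).T
    for S, n in zip(S_list, n_list):
        diag = np.diag(Theta.T @ S @ Theta)            # per-dimension variances
        F += 0.5 * n * np.sum(np.log(diag))
        grad += n * (S @ Theta) / diag                  # = n * S Theta diag(...)^-1
    return F, grad.ravel()

# Hypothetical class statistics: 3 classes, dimension 4, 200 frames each.
rng = np.random.default_rng(4)
S_list = [np.cov(rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4)), rowvar=False)
          for _ in range(3)]
n_list = [200, 200, 200]
res = minimize(fmllt_objective_and_grad, np.eye(4).ravel(), jac=True,
               args=(S_list, n_list, sum(n_list)), method="L-BFGS-B")
Theta = res.x.reshape(4, 4)
print(res.fun, np.linalg.det(Theta))
```

Note that F is invariant to a global rescaling of Θ, so the unimodular constraint does not need to be enforced explicitly during this unconstrained minimization.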

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Model-based MLLT -1
• In model-based MLLT, instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
– A non-singular linear transformation matrix Θ shared over a set of components
– The diagonal elements of the matrix Λ_j for each class j

• The precision matrices are constrained to be of the form
P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{d} λ_kj θ_k θ_k^T,   Λ_j = diag(λ_1j, ..., λ_dj)
where P_j is the precision matrix and θ_k^T is the k-th row of Θ
Ref. (Olsen, 2004)
Model-based MLLT -2
• M-MLLT fits within the standard maximum-likelihood criterion used for training HMMs, with each state represented by a GMM:
f(x | Θ) = Σ_{j=1}^{m} π_j N(x; μ_j, Σ_j)
where π_j is the j-th component weight

• Log-likelihood function:
L(Θ : X) = (1/N) Σ_{i=1}^{N} log f(x_i | Θ)

Model-based MLLT -3
• The parameters of the M-MLLT model for each HMM state, Θ = {π_1^m, μ_1^m, Λ_1^m, Θ}, are estimated using a generalized expectation-maximization (EM) algorithm*
• An auxiliary Q-function is introduced.
– The Q-function satisfies the inequality
L(Θ : X) − L(Θ̂ : X) ≥ Q(Θ, Θ̂) − Q(Θ̂, Θ̂)
where Θ = {π_1^m, μ_1^m, Λ_1^m, Θ} are the new parameters and Θ̂ = {π̂_1^m, μ̂_1^m, Λ̂_1^m, Θ̂} the old ones

* A. P. Dempster et al., “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39, pp. 1–38, 1977.
Model-based MLLT -4
• The auxiliary Q-function is
Q(Θ; Θ̂) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} γ_ij L_j(x_i)
         = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} (γ_ij/2) ( 2 log π_j − d log(2π) + log|P_j| − trace(P_j (x_i − μ_j)(x_i − μ_j)^T) )
where L_j(x_i) = log(π_j N(x_i; μ_j, Σ_j)) is the component log-likelihood of observation i, and
γ_ij = exp(L̂_j(x_i)) / Σ_{k=1}^{m} exp(L̂_k(x_i))
is the a posteriori probability of Gaussian component j given observation i (a small responsibility sketch follows below)
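A small illustrative sketch (not from the slides) of computing the posteriors γ_ij for an M-MLLT-style model with a shared transform Θ and per-component diagonal precisions Λ_j; names and data are hypothetical.

```python
import numpy as np

def responsibilities(X, weights, means, precisions_diag, Theta):
    """Posterior gamma_ij of each Gaussian component j for each frame x_i,
    with precision matrices P_j = Theta^T Lambda_j Theta."""
    Y = X @ Theta.T                                   # work in the transformed space
    mu_t = means @ Theta.T                            # transformed component means
    d = X.shape[1]
    # log|P_j| = sum_k log(lambda_kj) + 2 log|det Theta|
    log_det_P = (np.sum(np.log(precisions_diag), axis=1)
                 + 2.0 * np.log(abs(np.linalg.det(Theta))))
    quad = np.einsum('ijd,jd->ij',
                     (Y[:, None, :] - mu_t[None, :, :]) ** 2, precisions_diag)
    log_comp = np.log(weights) + 0.5 * (log_det_P - d * np.log(2 * np.pi)) - 0.5 * quad
    log_comp -= log_comp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    gamma = np.exp(log_comp)
    return gamma / gamma.sum(axis=1, keepdims=True)

# Hypothetical 2-component model in 3 dimensions over 5 frames.
rng = np.random.default_rng(5)
gamma = responsibilities(rng.normal(size=(5, 3)), np.array([0.5, 0.5]),
                         rng.normal(size=(2, 3)), np.ones((2, 3)), np.eye(3))
print(gamma.sum(axis=1))                              # each row sums to 1
```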

Model-based MLLT -5
• Gales’ approach for deriving Θ
1. Estimate the means and the component weights, which are independent of the other model parameters:
π_j = (1/N) Σ_{i=1}^{N} γ_ij,   μ_j = Σ_{i=1}^{N} γ_ij x_i / Σ_{i=1}^{N} γ_ij
2. Use the current estimate of the transform Θ, and estimate the set of class-specific diagonal variances:
λ_ij = θ_i S_j θ_i^T
where λ_ij is the i-th diagonal variance of component j and θ_i is the i-th row vector of Θ
Ref. (Gales, 1999)


Model-based MLLT -6
• Gales’ approach for deriving Θ (cont.)
3. Estimate the transform Θ row by row, using the current set of diagonal covariances:
θ̂_i = c_i G_i^{-1} sqrt( N / (c_i G_i^{-1} c_i^T) ),   G_i = Σ_j ( N_j / σ̂_diag,i^(j)2 ) S_j
where c_i is the i-th row vector of the cofactor matrix of the current Θ (see Appendix D)
4. Go to step 2 until convergence, or until an appropriate criterion is satisfied

Semi-Tied Covariance Matrices -1
• Introduction to Semi-Tied Covariance Matrices
– A natural extension of the state-specific rotation scheme
– The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
– The optimization is performed using a simple iterative scheme,
which is guaranteed to increase the likelihood of the training
data

– An alternative approach to solve the optimization problem of DHLDA

Semi-Tied Covariance Matrices -2
• State-Specific Rotation
– A full covariance matrix is calculated for each state in the system and is decomposed into its eigenvectors and eigenvalues:
Σ_full^(s) = U^(s) Λ^(s) U^(s)T
– All data from that state is then decorrelated using the calculated eigenvectors:
o^(s)(τ) = U^(s)T o(τ)
– Multiple diagonal-covariance Gaussian components are then trained
Ref. (Gales, 1999), (Kumar, 1998)


Semi-Tied Covariance Matrices -3
• Drawbacks of State-Specific Rotation
– It does not fit within the standard ML estimation framework for
training HMM’s

– The transforms are not related to the multiple-component models being used to model the data

Semi-Tied Covariance Matrices -4
• Semi-Tied Covariance Matrices
– Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
Σ^(m) = H^(r) Σ_diag^(m) H^(r)T
where Σ_diag^(m) is the component-specific diagonal covariance element and H^(r) is the semi-tied, class-dependent, non-diagonal matrix, which may be tied over a set of components

Semi-Tied Covariance Matrices -5
• Semi-Tied Covariance Matrices (cont.)
– It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted
– Parameters for each component m:
M = {π^(m), μ^(m), Σ_diag^(m), Θ^(r)},   Θ^(r) = H^(r)−1

Semi-Tied Covariance Matrices -6
• The auxiliary Q-function
Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) [ log( |Θ̂^(r)|² / |Σ̂_diag^(m)| ) − (o(τ) − μ̂^(m))^T Θ̂^(r)T Σ̂_diag^(m)−1 Θ̂^(r) (o(τ) − μ̂^(m)) ]
where γ_m(τ) = p(q_m(τ) | M, O_T) is the posterior probability of component m at time τ; the ratio |Θ̂^(r)|² / |Σ̂_diag^(m)| plays the role of 1/|Σ^(m)| and Θ̂^(r)T Σ̂_diag^(m)−1 Θ̂^(r) the role of Σ^(m)−1

Semi-Tied Covariance Matrices -7
• If all the model parameters are to be optimized simultaneously, the Q-function may be rewritten as
Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) log( |Θ̂^(r)|² / |diag(Θ̂^(r) W^(m) Θ̂^(r)T)| ) − dβ
where
W^(m) = Σ_τ γ_m(τ) (o(τ) − μ̂^(m))(o(τ) − μ̂^(m))^T / Σ_τ γ_m(τ)
μ̂^(m) = Σ_τ γ_m(τ) o(τ) / Σ_τ γ_m(τ)
β = Σ_{m∈M(r),τ} γ_m(τ)

Semi-Tied Covariance Matrices -8
• The ML estimate of the diagonal elements of the covariance matrix is given by
Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )
• The re-estimation formulae for the component weights and transition probabilities are identical to the standard HMM case (Rabiner, 1989)

• Unfortunately, optimizing the new Q-function is nontrivial and more complicated, so an alternative approach is proposed next
Semi-Tied Covariance Matrices -9
• Gales’ approach for optimizing the Q-function
1. Estimate the means, which are independent of the other model parameters
2. Using the current estimate of the semi-tied transform Θ̂^(r) and Σ̂_diag^(m) = diag(Θ̂^(r) W^(m) Θ̂^(r)T), estimate the set of component-specific diagonal variances. This set of parameters will be denoted {Σ̂_diag^(r)} = {Σ̂_diag^(m), m ∈ M(r)}
3. Estimate the transform Θ̂^(r) using the current set {Σ̂_diag^(r)}
4. Go to (2) until convergence, or until an appropriate criterion is satisfied

Semi-Tied Covariance Matrices -10
• How to carry out step (3)?
– Optimizing the semi-tied transform requires an iterative estimation scheme even after fixing all other model parameters
– Selecting a particular row of Θ̂^(r), θ̂_i^(r), and rewriting the former Q-function using the current set {Σ̂_diag^(r)}:
Q(M, M̂; {Σ̂_diag^(r)}) = Σ_{m∈M(r),τ} γ_m(τ) [ log( (θ̂_i^(r) c_i^T)² ) − log|Σ̂_diag^(m)| − Σ_j (θ̂_j^(r) ô^(m)(τ))² / σ_diag,j^(m)2 ]
where ô^(m)(τ) = o(τ) − μ̂^(m), c_i is the i-th row vector of the cofactor matrix of Θ̂^(r), and σ_diag,j^(m)2 is the j-th leading diagonal element of Σ̂_diag^(m)

Semi-Tied Covariance Matrices -11
• How to carry out step (3)? (cont.)
– It can be shown that the ML estimate of the i-th row of the semi-tied transform, θ̂_i^(r), is given by (a small sketch of this update follows below)
θ̂_i^(r) = c_i G^(ri)−1 sqrt( β / (c_i G^(ri)−1 c_i^T) ),   G^(ri) = Σ_{m∈M(r)} (1 / σ̂_diag,i^(m)2) W^(m) Σ_τ γ_m(τ)
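A hedged sketch of a row-by-row update of this form, assuming the per-component statistics W^(m), the occupation counts, and the current diagonal variances have already been accumulated; in the full scheme of the previous slides the diagonal variances would be re-estimated between iterations. All names and the toy statistics are hypothetical.

```python
import numpy as np

def update_semitied_transform(Theta, W, counts, sigma_diag, n_iter=10):
    """Row-by-row re-estimation of a semi-tied transform.

    Theta:      (d, d) current transform
    W:          (M, d, d) component-specific covariance statistics W^(m)
    counts:     (M,) total occupation sum_tau gamma_m(tau) per component
    sigma_diag: (M, d) current component-specific diagonal variances
    """
    d = Theta.shape[0]
    beta = counts.sum()
    # Per-row statistics G^(ri) = sum_m (counts_m / sigma_{m,i}) * W^(m)
    G = np.einsum('m,mi,mjk->ijk', counts, 1.0 / sigma_diag, W)   # (d, d, d)
    for _ in range(n_iter):
        for i in range(d):
            # i-th row of the cofactor matrix of the *current* Theta
            cof = np.linalg.det(Theta) * np.linalg.inv(Theta).T
            c_i = cof[i]
            rhs = c_i @ np.linalg.inv(G[i])
            Theta[i] = rhs * np.sqrt(beta / (rhs @ c_i))
    return Theta

# Hypothetical toy statistics: 2 components in 3 dimensions.
rng = np.random.default_rng(6)
W = np.stack([np.cov(rng.normal(size=(50, 3)), rowvar=False) for _ in range(2)])
Theta = update_semitied_transform(np.eye(3), W, np.array([50.0, 50.0]), np.ones((2, 3)))
print(np.round(Theta, 3))
```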

Semi-Tied Covariance Matrices -12
• It can be shown that
Q(M, M̂) ≥ Q(M, M̂; {Σ̂_diag^(r)})
with equality when the diagonal elements of the covariance matrices are given by Σ̂_diag^(m) = diag(Θ̂^(r) W^(m) Θ̂^(r)T)

• During recognition the log-likelihood is based on
log( L(o(τ); μ^(m), Σ^(m), Θ^(r)) ) = log( N(o^(r)(τ); Θ^(r) μ^(m), Σ_diag^(m)) ) + log|Θ^(r)|
where o^(r)(τ) = Θ^(r) o(τ) and log|Θ^(r)| = (1/2) log|Θ^(r)|²   (a small sketch follows below)
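A small sketch (not from the slides) of this recognition-time computation; Θ^(r) μ^(m) and log|Θ^(r)| can be precomputed per transform class, so the per-frame cost stays that of a diagonal Gaussian.

```python
import numpy as np

def semitied_loglike(o, mu, var_diag, Theta):
    """Per-frame log-likelihood of one semi-tied Gaussian component:
    log N(Theta o; Theta mu, Sigma_diag) + log|det Theta|."""
    d = o.shape[0]
    y = Theta @ o - Theta @ mu                       # transformed, mean-removed frame
    log_det = np.log(abs(np.linalg.det(Theta)))      # constant per transform class
    return (log_det
            - 0.5 * (d * np.log(2 * np.pi)
                     + np.sum(np.log(var_diag))
                     + np.sum(y ** 2 / var_diag)))

# Hypothetical component in 3 dimensions.
rng = np.random.default_rng(7)
Theta = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
print(semitied_loglike(rng.normal(size=3), np.zeros(3), np.ones(3), Theta))
```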
Extended MLLT -1
• Extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision-matrix modeling:
P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{D} λ_kj θ_k θ_k^T
where the θ_k θ_k^T form the basis
– Note that in EMLLT D ≥ d, while in M-MLLT D = d
– The λ_kj are not required to be positive!
– The {λ_kj} have to be chosen such that P_j is positive definite
– The authors provide two algorithms for iteratively updating the parameters
Ref. (Olsen, 2002), (Olsen, 2004)


Extended MLLT -2
• The highlight of EMLLT:
– More flexible: the number of basis elements {θ k θTk } can be
gradually varied from d (MLLT) to d(d+1)/2 (Full) by controlling
the value of D

Outline
• Introduction to Feature Decorrelation (FD)
– FD on Speech Recognition

• Feature-based Decorrelation
(DCT, PCA, LDA, F-MLLT)

• Model-based Decorrelation
(M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)

• Common Principal Component (CPC)

Common Principal Components -1
• Why Common Principal Components (CPC)?
– We often deal with situations in which the same variables are measured on objects from different groups, and the covariance structure may vary from group to group
– But sometimes the covariance matrices of the different groups look somewhat similar, and it seems reasonable to assume that they share a common basic structure

• Goal of CPC – to find a rotation that diagonalizes the covariance matrices simultaneously
Ref. (Flury, 1984), (Flury, 1986)


Common Principal Components -2
• Hypothesis of CPCs:
H_C: B^T Σ_i B = Λ_i (diagonal),  i = 1, ..., k
– The common principal components (CPCs): U_i = B^T x_i
– Note that no canonical ordering of the columns of B need be given, since the rank order of the diagonal elements of Λ_i is not necessarily the same for all groups

Common Principal Components -3
• Assume the n_i S_i are independently distributed as W_p(n_i, Σ_i). The common likelihood function is
L(Σ_1, ..., Σ_k) = C × ∏_{i=1}^{k} exp( trace( −(n_i/2) Σ_i^{-1} S_i ) ) |Σ_i|^{−n_i/2}

• Instead of maximizing the likelihood function, we minimize
g(Σ_1, ..., Σ_k) = −2 log L(Σ_1, ..., Σ_k) + 2 log C = Σ_{i=1}^{k} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) )

Common Principal Components -4
• Assume H_C holds for some orthogonal matrix β = [β_1, ..., β_p], and Λ_i = diag(λ_i1, ..., λ_ip); then
log|Σ_i| = Σ_{j=1}^{p} log λ_ij,  i = 1, ..., k
trace(Σ_i^{-1} S_i) = trace(β Λ_i^{-1} β^T S_i) = trace(Λ_i^{-1} β^T S_i β) = Σ_{j=1}^{p} β_j^T S_i β_j / λ_ij
• Therefore
g(Σ_1, ..., Σ_k) = g(β_1, ..., β_p, λ_11, ..., λ_1p, λ_21, ..., λ_kp) = Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij )

Common Principal Components -5
• The function g is to be minimized under the orthonormality restrictions
β_h^T β_j = 0 if h ≠ j,  β_h^T β_h = 1
• Introducing Lagrange multipliers γ_h and γ_hj, we thus wish to minimize the function
G(Σ_1, ..., Σ_k) = g(Σ_1, ..., Σ_k) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j

Common Principal Components -6
• Minimization with respect to λ_ij:
∂G(Σ_1, ..., Σ_k)/∂λ_ij = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij ) ]/∂λ_ij = 0
⟹ λ_ij = β_j^T S_i β_j,  i = 1, ..., k;  j = 1, ..., p   (key point; keep it in mind)
⟹ trace(Σ_i^{-1} S_i) = p,  i = 1, ..., k

Common Principal Components -7
• Minimization with respect to β_j (cont.):
∂G(Σ_1, ..., Σ_k)/∂β_j = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij ) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j ]/∂β_j = 0
⟹ Σ_{i=1}^{k} (n_i S_i β_j / λ_ij) − γ_j β_j − Σ_{h≠j} γ_jh β_h = 0,  j = 1, ..., p
– Multiplying on the left by β_j^T (and using λ_ij = β_j^T S_i β_j) gives γ_j = Σ_{i=1}^{k} n_i,  j = 1, ..., p

Common Principal Components -8
• Minimization (cont.): Thus
Σ_{i=1}^{k} n_i S_i β_j / λ_ij = ( Σ_{i=1}^{k} n_i ) β_j + Σ_{h≠j} γ_jh β_h,  j = 1, ..., p
– Multiplying on the left by β_l^T (l ≠ j) implies
Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij = γ_jl,  j = 1, ..., p,  l ≠ j
– Note that β_j^T S_i β_l = β_l^T S_i β_j and γ_jl = γ_lj, so swapping the roles of j and l gives
Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = γ_jl,  j = 1, ..., p,  l ≠ j

Common Principal Components -9
• Minimization (cont.): Subtracting the two expressions for γ_jl,
Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij − Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = 0
⟹ β_l^T [ Σ_{i=1}^{k} n_i (λ_il − λ_ij)/(λ_il λ_ij) S_i ] β_j = 0,  l, j = 1, ..., p;  l ≠ j   (the optimization objective)
– These p(p−1)/2 equations have to be solved under the orthonormality conditions β^T β = I_p and λ_ij = β_j^T S_i β_j

Common Principal Components -10
• Solving procedure of CPC – the FG algorithm
– F-Algorithm [algorithm listing not reproduced in this transcript; see (Flury and Gautschi, 1986)]

Common Principal Components -11
– G-Algorithm [algorithm listing not reproduced in this transcript; see (Flury and Gautschi, 1986)]

Common Principal Components -12
• Likelihood Ratio Test
– The sample common principal components: U_i = β̂^T X_i,  i = 1, ..., k
– For the i-th group, the transformed covariance matrix is F_i = β̂^T S_i β̂,  i = 1, ..., k
– Since Λ̂_i = diag(F_i), i = 1, ..., k, the test statistic can be written as a function of the F_i alone:
χ² = Σ_{i=1}^{k} n_i log( |diag(F_i)| / |F_i| ) = Σ_{i=1}^{k} n_i log( ∏_{j=1}^{p} f_jj^(i) / ∏_{j=1}^{p} l_ij )
where the l_ij are the eigenvalues of F_i
Common Principal Components -13
• Likelihood Ratio Test (cont.)
– The likelihood ratio criterion is a measure of the simultaneous diagonalizability of k p.d.s. matrices
– The CPCs can be viewed as obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
– It can also be seen from the viewpoint of Hadamard’s inequality: |F_i| ≤ |diag(F_i)|

Common Principal Components -14
• Actually, CPC can also be viewed through another measure of “deviation from diagonality”:
φ(F_i) = |diag(F_i)| / |F_i| ≥ 1
– The CPC criterion can then be written as
Φ(F_1, ..., F_k; n_1, ..., n_k) = ∏_{i=1}^{k} ( φ(F_i) )^{n_i}
– Let F_i = B^T A_i B for a given orthogonal matrix B; the CPC solution minimizes this criterion:
Φ_0(A_1, ..., A_k; n_1, ..., n_k) = min_B Φ(B^T A_1 B, ..., B^T A_k B; n_1, ..., n_k)
(a small sketch follows below)
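An illustrative sketch (not from the slides) of the deviation-from-diagonality measure; the toy matrices are built to share a common eigenvector basis, so rotating by that basis drives the criterion to its minimum of 1.

```python
import numpy as np

def deviation_from_diagonality(F):
    """phi(F) = |diag(F)| / |F| >= 1, with equality iff F is diagonal
    (Hadamard's inequality for positive definite F)."""
    return np.prod(np.diag(F)) / np.linalg.det(F)

def cpc_criterion(S_list, n_list, B):
    """Phi = prod_i phi(B^T S_i B)^{n_i}; the CPC rotation B minimizes this."""
    return np.prod([deviation_from_diagonality(B.T @ S @ B) ** n
                    for S, n in zip(S_list, n_list)])

# Hypothetical example: two 3x3 covariance matrices sharing the same eigenvectors.
rng = np.random.default_rng(8)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))           # a common orthogonal basis
S1 = Q @ np.diag([3.0, 1.0, 0.5]) @ Q.T
S2 = Q @ np.diag([0.2, 2.0, 1.0]) @ Q.T
print(cpc_criterion([S1, S2], [1, 1], np.eye(3)))      # > 1 in general
print(cpc_criterion([S1, S2], [1, 1], Q))              # ~1: Q diagonalizes both
```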

Comparison Between CPC and F-MLLT
• CPC tries to minimize
Σ_{i=1}^{C} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) ),  with Σ_i = B Λ_i B^T,  i = 1, ..., C
(the estimates are KNOWN in the original space)

• F-MLLT tries to minimize
(1/2) Σ_{j=1}^{C} N_j log|diag(A^T S_j A)| − N log|A|
(the estimates are UNKNOWN)

Appendix A -1
• Show that
∏_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( −(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1}(x_i − μ_{l_i}) )
= (2π)^{-Nd/2} exp( −Σ_j (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1}(m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ] )

Appendix A -2
First note that, for each class C_j, the cross terms vanish:
Σ_{x_i∈C_j} (x_i − m_j)^T Σ_j^{-1} (m_j − μ_j) = [ Σ_{x_i∈C_j} (x_i − m_j) ]^T Σ_j^{-1} (m_j − μ_j) = 0
Then
Σ_{i=1}^{N} (x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i})
= Σ_{i=1}^{N} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )^T Σ_{l_i}^{-1} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )
= Σ_{i=1}^{N} [ (x_i − m_{l_i})^T Σ_{l_i}^{-1} (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i})^T Σ_{l_i}^{-1} (m_{l_i} − μ_{l_i}) ]   (cross terms = 0)
= Σ_j trace( Σ_j^{-1} Σ_{x_i∈C_j} (x_i − m_j)(x_i − m_j)^T ) + Σ_j n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
= Σ_j n_j ( trace(Σ_j^{-1} S_j) + (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) )
Combining this with the product of the normalization constants, ∏_i (2π)^{-d/2}|Σ_{l_i}|^{-1/2} = (2π)^{-Nd/2} exp( −Σ_j (n_j/2) log|Σ_j| ), gives the stated identity.

Appendix B
• ML estimators for log p(x_1^N, {μ_j}, {Σ_j}):
∂ log p(x_1^N, {μ_j}, {Σ_j}) / ∂μ_j = n_j Σ_j^{-1} (m_j − μ_j) = 0  ⟹  μ̂_j = m_j

∂ log p(x_1^N, {μ_j}, {Σ_j}) / ∂Σ_j ∝ ∂( trace(Σ_j^{-1} S_j) + log|Σ_j| ) / ∂Σ_j = −Σ_j^{-T} S_j^T Σ_j^{-T} + Σ_j^{-T} = 0  ⟹  Σ̂_j = S_j
(evaluated at μ_j = m_j, where the quadratic term vanishes)

Appendix C -1
• Change of Variable Theorem
– Consider a one-to-one mapping g: ℝ^n → ℝ^n, Y = g(X)
– Equal probability: the probability of falling in a region of the X space must equal the probability of falling in the corresponding region of the Y space
– Suppose the region dx_1 dx_2 ... dx_n maps to the region dA in the Y space. Equating probabilities, we have
f_Y(y_1, ..., y_n) dA = f_X(x_1, ..., x_n) dx_1 ... dx_n
– The region dA is a hyper-parallelepiped described by the vectors
(∂y_1/∂x_1 dx_1, ..., ∂y_n/∂x_1 dx_1), (∂y_1/∂x_2 dx_2, ..., ∂y_n/∂x_2 dx_2), ..., (∂y_1/∂x_n dx_n, ..., ∂y_n/∂x_n dx_n)

Appendix C -2
• Change of Variable Theorem (cont.)
– The volume dA of the hyper-parallelepiped is given by the determinant
dA = | det( [∂y_i/∂x_j]_{i,j=1..n} ) | dx_1 ... dx_n = |J| dx_1 ... dx_n
where J is the Jacobian of the function g
– So,
f_Y(y_1, ..., y_n) |J| dx_1 ... dx_n = f_X(x_1, ..., x_n) dx_1 ... dx_n
⟹ f_Y(y_1, ..., y_n) = |J|^{-1} f_X(x_1, ..., x_n)

Appendix D -1
• If A is a square matrix, then the minor of entry a_ij, denoted M_ij, is defined to be the determinant of the submatrix that remains after the i-th row and the j-th column are deleted from A. The number (−1)^{i+j} M_ij is denoted by c_ij and is called the cofactor of a_ij.
• For example, given the 3×3 matrix B = [b_11 b_12 b_13; b_21 b_22 b_23; b_31 b_32 b_33]:
M_23 = det[ b_11 b_12; b_31 b_32 ] = b_11 b_32 − b_12 b_31,   c_23 = (−1)^{2+3} M_23,   c_33 = (−1)^{3+3} M_33

Appendix D -2
• Given an n×n matrix A = [a_ij]:
– The determinant of A can be written as the sum of its cofactors multiplied by the entries that generated them (a small sketch follows below)
(cofactor expansion along the j-th column):
det(A) = a_1j c_1j + a_2j c_2j + a_3j c_3j + ... + a_nj c_nj = A_(j)^T c_(j)
(cofactor expansion along the i-th row):
det(A) = a_i1 c_i1 + a_i2 c_i2 + a_i3 c_i3 + ... + a_in c_in = A_(i) c_(i)^T
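A small sketch (not from the slides): for a nonsingular matrix the whole cofactor matrix can be obtained as det(A)·A^{-T}, which is how the row vectors c_(i) used in the MLLT and semi-tied row updates are typically computed in practice.

```python
import numpy as np

def cofactor_matrix(A):
    """Cofactor matrix of a nonsingular square matrix: cof(A) = det(A) * A^{-T}.
    Row i of cof(A) is the vector c_(i) used in the MLLT / semi-tied row updates."""
    return np.linalg.det(A) * np.linalg.inv(A).T

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
C = cofactor_matrix(A)
# Cofactor expansion along each row recovers det(A): sum_j a_ij * c_ij.
print(np.allclose((A * C).sum(axis=1), np.linalg.det(A)))   # True
```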