
Speech Recognition Based on a Bayesian Approach

February 2006

Graduate School of Science and Engineering Waseda University

Shinji Watanabe

ABSTRACT
Speech recognition is a very important technology, which functions as a human interface that converts speech information into text information. Conventional speech recognition systems have been developed by many researchers using a common database. Therefore, currently available systems relate to the specific environment of the database, which lacks robustness. This lack of robustness is an obstacle as regards applying speech recognition technology in practice, and improving robustness has been a common worldwide challenge in the fields of acoustic and language studies. Acoustic studies have taken mainly two directions: the improvement of acoustic models beyond the conventional Hidden Markov Model (HMM), and the improvement of the acoustic model learning method beyond the conventional Maximum Likelihood (ML) approach. This thesis addresses the challenge in terms of improving the learning method by employing a Bayesian approach.

This thesis defines the term Bayesian approach to include a consideration of the posterior distribution of any variable, as well as the prior distribution. That is to say, all the variables introduced when models are parameterized, such as model parameters and latent variables, are regarded as probabilistic variables, and their posterior distributions are obtained based on the Bayes rule. The difference between the Bayesian and ML approaches is that the estimation target is the distribution function in the Bayesian approach whereas it is the parameter value in the ML approach. Based on this posterior distribution estimation, the Bayesian approach can generally achieve more robust model construction and classification than an ML approach. In fact, the Bayesian approach has the following three advantages:

(A) Effective utilization of prior knowledge through prior distributions (prior utilization)
(B) Model selection in the sense of maximizing a probability for the posterior distribution of model complexity (model selection)
(C) Robust classification by marginalizing model parameters (robust classification)

However, the Bayesian approach requires complex integral and expectation computations to obtain posterior distributions when models have latent variables. The acoustic model in speech recognition has the latent variables included in an HMM and a Gaussian Mixture Model (GMM). Therefore, the Bayesian approach cannot be applied to speech recognition without losing the above advantages. For example, the Maximum A Posteriori (MAP) based framework approximates the posterior distribution of the parameter, which loses two of the above advantages although MAP can utilize prior information. Bayesian Information Criterion and Bayesian Predictive Classification based frameworks partially realize the Bayesian advantages for model selection and robust classification, respectively, in speech recognition by approximating the posterior distribution calculation. However, these frameworks cannot benefit from both advantages simultaneously.

Recently, a Variational Bayesian (VB) approach was proposed in the learning theory field, which avoids complex computations by employing the variational approximation technique. In the VB approach, approximate posterior distributions (VB posterior distributions) can be obtained effectively by iterative calculations similar to the expectation-maximization algorithm in the ML approach, while the three advantages provided by the Bayesian approaches are still retained. This thesis proposes a total Bayesian framework, Variational Bayesian Estimation and Clustering for speech recognition (VBEC), where all acoustic procedures of speech recognition (acoustic modeling and speech classification) are based on the VB posterior distribution. VBEC is based on the following four formulations:

1. Setting the output and prior distributions for the model parameters of the standard acoustic models represented by HMMs and GMMs (setting)
2. Estimating the VB posterior distributions for the model parameters based on the VB Baum-Welch algorithm similar to the conventional ML based Baum-Welch algorithm (training)
3. Calculating VBEC objective functions, which are used for model selection (selection)
4. Classifying speech based on a predictive distribution, which is analytically derived as the Student's t-distribution from the marginalization of model parameters based on the VB posterior distribution (classification)

VBEC performs the model construction process, which includes model setting, training and selection (1st, 2nd and 3rd), and the classification process (4th) based on the Bayesian approach. Thus, VBEC can be regarded as a total Bayesian framework for speech recognition. This thesis introduces the above four formulations, and shows the effectiveness of the Bayesian approach through speech recognition experiments. The first set of experiments shows the effectiveness of the Bayesian acoustic model construction, including the prior utilization and model selection. This work shows the effectiveness of the prior utilization for the sparse training data problem. This thesis also shows the effectiveness of the model selection for clustering context-dependent HMM states and selecting the GMM components, respectively. The second set of experiments achieves the automatic determination of acoustic model topologies by expanding the Bayesian model selection function in the above acoustic model construction. The topologies are determined by clustering context-dependent HMM states and by selecting the GMM components simultaneously, and the process takes much less time than conventional manual construction with the same level of performance. The final set of experiments focuses on the classification process, and shows the effectiveness of VBEC as regards the problem of the mismatch between training and input speech by applying the robust classification advantages to an acoustic model adaptation task.

ABSTRACT IN JAPANESE

Contents
ABSTRACT
ABSTRACT IN JAPANESE
CONTENTS
LIST OF NOTATIONS
LIST OF FIGURES
LIST OF TABLES

1 Introduction
    1.1 Background
    1.2 Goal
    1.3 Overview

2 Formulation
    2.1 Maximum likelihood and Bayesian approach
    2.2 Variational Bayesian (VB) approach
        2.2.1 VB-EM algorithm
        2.2.2 VB posterior distribution for model structure
    2.3 Variational Bayesian Estimation and Clustering for speech recognition (VBEC)
        2.3.1 Output and prior distributions
        2.3.2 VB Baum-Welch algorithm
        2.3.3 VBEC objective function
        2.3.4 VB posterior based Bayesian predictive classification
    2.4 Summary

3 Bayesian acoustic model construction
    3.1 Introduction
    3.2 Efficient VB Baum-Welch algorithm
    3.3 Clustering context-dependent HMM states using VBEC
        3.3.1 Phonetic decision tree clustering
        3.3.2 Maximum likelihood approach
        3.3.3 Information criterion approach
        3.3.4 VBEC approach
    3.4 Determining the number of mixture components using VBEC
    3.5 Experiments
        3.5.1 Prior utilization
        3.5.2 Prior parameter dependence
        3.5.3 Model selection for HMM states
        3.5.4 Model selection for Gaussian mixtures
        3.5.5 Model selection over HMM states and Gaussian mixtures
    3.6 Summary

4 Determination of acoustic model topology
    4.1 Introduction
    4.2 Determination of acoustic model topology using VBEC
        4.2.1 Strategy for reaching optimum model topology
        4.2.2 HMM state clustering based on Gaussian mixture model
        4.2.3 Estimation of inheritable node statistics
        4.2.4 Monophone HMM statistics estimation
    4.3 Preliminary experiments
        4.3.1 Maximum likelihood manual construction
        4.3.2 VBEC automatic construction based on 2-phase search
    4.4 Experiments
        4.4.1 Determination of acoustic model topology using VBEC
        4.4.2 Computational efficiency
        4.4.3 Prior parameter dependence
    4.5 Summary

5 Bayesian speech classification
    5.1 Introduction
    5.2 Bayesian predictive classification using VBEC
        5.2.1 Predictive distribution
        5.2.2 Student's t-distribution
        5.2.3 Relationship between Bayesian prediction approaches
    5.3 Experiments
        5.3.1 Bayesian predictive classification in total Bayesian framework
        5.3.2 Supervised speaker adaptation
        5.3.3 Computational efficiency
    5.4 Summary

6 Conclusions
    6.1 Review of work
    6.2 Related work
    6.3 Future work
    6.4 Summary

ACKNOWLEDGMENTS
ACKNOWLEDGMENTS IN JAPANESE
BIBLIOGRAPHY
LIST OF WORK
APPENDICES
    A.1 Upper bound of Kullback-Leibler divergence for posterior distributions
        A.1.1 Model parameter
        A.1.2 Latent variable
        A.1.3 Model structure
    A.2 Variational calculation for VB posterior distributions
        A.2.1 Model parameter
        A.2.2 Latent variable
        A.2.3 Model structure
    A.3 VB posterior calculation
        A.3.1 Model parameter
        A.3.2 Latent variable
    A.4 Student's t-distribution using VB posteriors

LIST OF NOTATIONS
Abbreviations
ML : Maximum Likelihood (page i)
HMM : Hidden Markov Model (page i)
GMM : Gaussian Mixture Model (page i)
VB : Variational Bayes (page ii)
VBEC : Variational Bayesian Estimation and Clustering for speech recognition (page ii)
EM : Expectation-Maximization (page 1)
MAP : Maximum A Posteriori (page 1)
BIC : Bayesian Information Criterion (page 3)
MDL : Minimum Description Length (page 3)
BPC : Bayesian Predictive Classification (page 3)
VB-BPC : VB posterior based BPC (page 4)
LVCSR : Large Vocabulary Continuous Speech Recognition (page 8)
MFCC : Mel Frequency Cepstrum Coefficients (page 10)
RHS : Right Hand Side (page 23)
MLC : ML-based Classification (page 23)
IWR : Isolated Word Recognition (page 36)
JNAS : Japanese Newspaper Article Sentences (page 36)
MMIXTURE : GMM based phonetic decision tree method utilizing Gaussian mixture statistics of monophone HMM (page 53)
MSINGLE : GMM based phonetic decision tree method utilizing single Gaussian statistics of monophone HMM (page 53)
AMP : Acoustic Model Plant (page 61)
BPC(MAP) : Dirac posterior based BPC (page 67)
UBPC : Uniform posterior based BPC (page 67)
SOLON : NTT Speech recognizer with OutLook On the Next generation (page 72)
CSJ : Corpus of Spontaneous Japanese (page 75)
SI : Speaker Independent (page 75)

Abbreviations of organizations

ASJ : Acoustical Society of Japan (page 37)
JEIDA : Japan Electronic Industry Development Association (page 37)
IEEE : Institute of Electrical and Electronics Engineers
SSPR : Spontaneous Speech Processing and Recognition
NIPS : Neural Information Processing Systems
ICSLP : International Conference on Spoken Language Processing
ICASSP : International Conference on Acoustics, Speech, and Signal Processing
IEICE : Institute of Electronics, Information and Communication Engineers

General notations

p(\cdot), q(\cdot) : Probabilistic distribution functions
O : Set of feature vectors of training data
x : Set of feature vectors of input data
\Theta : Set of model parameters
m : Model structure index
Z : Set of latent variables
c : Category index

Speech recognition notations

e : Speech example index
E : Number of speech examples
t : Frame index
T_e : Number of frames in example e
d : Dimension index
D : Number of dimensions
O_e^t \in \mathbb{R}^D : Feature vector of training speech at frame t of example e
O = \{O_e^t | t = 1, ..., T_e, e = 1, ..., E\} : Set of feature vectors of training speech
x^t \in \mathbb{R}^D : Feature vector of input speech at frame t
x = \{x^t | t = 1, ..., T\} : Set of feature vectors of input speech
W : Sentence (word sequence)

Acoustic model notations

i, j : HMM state indices
J : Number of temporal HMM states in a phoneme
k : Mixture component index
L : Number of mixture components in an HMM state
s_e^t : HMM state index at frame t of example e
S = \{s_e^t | t = 1, ..., T_e, e = 1, ..., E\} : Set of HMM states
v_e^t : Mixture component index at frame t of example e
V = \{v_e^t | t = 1, ..., T_e, e = 1, ..., E\} : Set of mixture components
a_{ij} : State transition probability from state i to state j
w_{jk} : k-th weight factor of mixture component for state j
\mu_{jk} : Gaussian parameter for mean vector of component k in state j
\Sigma_{jk} : Gaussian parameter for covariance matrix of component k in state j
\alpha_{e,j}^t : Forward probability at frame t of example e in state j
\beta_{e,j}^t : Backward probability at frame t of example e in state j
\xi_{e,ij}^t : Transition probability from state i to state j at frame t of example e
\gamma_{e,jk}^t : Occupation probability of mixture component k in state j at frame t of example e
\gamma_{ij}, \gamma_{jk} : 0th order statistics (occupation count)
M_{jk} : 1st order statistics
V_{jk} : 2nd order statistics
\bar{O} : Set of sufficient statistics

Notations of prior and posterior parameters and VB values

\{\phi_{ij}\}_{j=1}^{J} : Dirichlet distribution parameter for \{a_{ij}\}_{j=1}^{J}
\{\phi_{jk}\}_{k=1}^{L} : Dirichlet distribution parameter for \{w_{jk}\}_{k=1}^{L}
\nu_{jk} : Normal distribution parameter for \mu_{jk}
\xi_{jk} : Normal distribution parameter for \mu_{jk}
\eta_{jk} : Gamma distribution parameter for \Sigma_{jk,d}
R_{jk,d} : Gamma distribution parameter for \Sigma_{jk,d}
\Phi : Set of prior or posterior parameters
F^m : VB objective function for model structure m
F_{\Theta}^m : VB objective function of model parameter for model structure m
F_{S,V}^m : VB objective function of latent variables S and V for model structure m

Notations of phonetic decision tree clustering

n : Tree node index
r : Root node index
Q : Phonetic question index
\Delta L^{Q}(n) : Gain of log likelihood for question Q in node n
\Delta L_{BIC/MDL}^{Q}(n) : Gain of BIC/MDL objective function for question Q in node n
\Delta F^{Q}(n) : Gain of VB objective function for question Q in node n

Special function notations

\Gamma(\cdot) : Gamma function
\Psi(\cdot) : Digamma function
\delta(\cdot) : Dirac delta function
\Phi(\cdot) : Cumulative function of the standard Gaussian distribution

Definitions of probabilistic distribution functions and normalization constants

Normal distribution:
N(O|\mu, \Sigma) \equiv C_N(\Sigma) \exp\left( -\frac{1}{2}(O-\mu)' \Sigma^{-1} (O-\mu) \right),
C_N(\Sigma) \equiv \left( (2\pi)^D |\Sigma| \right)^{-\frac{1}{2}}

Dirichlet distribution:
D(\{a_j\}_{j=1}^{J} | \{\phi_j\}_{j=1}^{J}) \equiv C_D(\{\phi_j\}_{j=1}^{J}) \prod_{j=1}^{J} (a_j)^{\phi_j - 1},
C_D(\{\phi_j\}_{j=1}^{J}) \equiv \Gamma\left(\sum_{j=1}^{J} \phi_j\right) \Big/ \prod_{j=1}^{J} \Gamma(\phi_j)

Gamma distribution:
G(\Sigma^{-1} | \eta, r) \equiv C_G(\eta, r)\, (\Sigma^{-1})^{\frac{\eta}{2}-1} \exp\left( -\frac{r\,\Sigma^{-1}}{2} \right),
C_G(\eta, r) \equiv (r/2)^{\frac{\eta}{2}} \big/ \Gamma(\eta/2)

Student's t-distribution:
St(x|\mu, \lambda, \eta) \equiv C_{St}(\lambda, \eta) \left[ 1 + \frac{\lambda}{\eta}(x-\mu)^2 \right]^{-\frac{\eta+1}{2}},
C_{St}(\lambda, \eta) \equiv \frac{\Gamma\left(\frac{\eta+1}{2}\right)}{\Gamma\left(\frac{\eta}{2}\right)\Gamma\left(\frac{1}{2}\right)} \left(\frac{\lambda}{\eta}\right)^{\frac{1}{2}}

List of Figures
1.1 Automatic speech recognition.
1.2 Chapter flow of thesis.
2.1 General scheme of statistical model construction and classification.
2.2 Hidden Markov model for each phoneme unit. A standard acoustic model for phoneme /a/. T, S, G and D denote search spaces of HMM-temporal, HMM-contextual, GMM and feature vector topologies, respectively.
2.3 Hidden Markov model for each phoneme unit. A state is represented by the Gaussian mixture below it. There are three states and three Gaussian components in this figure.
2.4 VB Baum-Welch algorithm.
2.5 Total speech recognition frameworks based on VBEC and ML-BIC/MDL.
3.1 Efficient VB Baum-Welch algorithm.
3.2 A set of all triphone HMM states */a_i/* in the i-th state sequence is clustered based on the phonetic decision tree method.
3.3 Splitting a set of triphone HMM states in node n into two sets in yes node n^Q_Y and no node n^Q_N by answering phonetic question Q.
3.4 Tree structure in each HMM state.
3.5 Acoustic model selection of VBEC: two-phase procedure.
3.6 The left figure shows recognition rates according to the amounts of training data based on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure shows an enlarged view of the left figure for 25-1,500 utterances. The horizontal axis is scaled logarithmically.
3.7 Number of splits according to amount of training data (23,000 sentences).
3.8 The left figure shows recognition rates according to the amounts of training data based on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure shows an enlarged view of the left figure for more than 1,000 utterances. The horizontal axis is scaled logarithmically.
3.9 Total number of clustered states according to the amounts of training data based on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The horizontal and vertical axes are scaled logarithmically.
3.10 Objective functions and recognition rates according to the number of clustered states.
3.11 Objective functions and word accuracies according to the increase in the number of total clustered triphone HMM states.
3.12 Total objective function F^m and word accuracy according to the increase in the number of mixture components per state.
4.1 Distributional sketch of the acoustic model topology.
4.2 Optimum model search for an acoustic model.
4.3 Estimation of inheritable GMM statistics during the splitting process.
4.4 Model evaluation test using Test1 (a) and Test2 (b). The contour maps denote word accuracy distributions for the total number of clustered states and the number of components per state. The horizontal and vertical axes are scaled logarithmically.
4.5 Determined model topologies and their recognition rates (MSINGLE). The horizontal and vertical axes are scaled logarithmically.
4.6 Determined model topologies and their recognition rates (MMIXTURE). The horizontal and vertical axes are scaled logarithmically.
4.7 Word accuracies and objective functions using GMM state clustering (MSINGLE). The horizontal axis is scaled logarithmically.
4.8 Word accuracies and objective functions using GMM state clustering (MMIXTURE). The horizontal axis is scaled logarithmically.
5.1 (a) shows the Gaussian (Gauss(x)) derived from BPC, the uniform distribution based predictive distribution (UBPC(x)) derived from UBPC in Eq. (5.6), the variance-rescaled Gaussian (Gauss2(x)) derived from VB-BPC-MEAN in Eq. (5.9), and two Student's t-distributions (St1(x) and St2(x)) derived from VB-BPC in Eq. (5.9). (b) employs the logarithmic scale of the vertical axes in (a) to emphasize the behavior of each distribution tail. The parameters corresponding to mean and variance are the same for all distributions. The hyper-parameters of UBPC are set at C = 3.0 and 0.9. The rescaling parameter of Gauss2(x) is 1. The degrees of freedom (DoF) of the Student's t-distributions are 1 for St1(x) and 100 for St2(x).
5.2 Recognition rate for various amounts of training data. The horizontal axis is scaled logarithmically.
5.3 Word accuracy for various amounts of adaptation data. The horizontal axis is scaled logarithmically.

List of Tables
1.1 Comparison of VBEC and other Bayesian frameworks in terms of Bayesian advantages
2.1 Speech recognition terms corresponding with statistical learning theory terms
2.2 Training specifications for ML and VB
3.1 Examples of questions for phoneme /a/
3.2 Experimental conditions for isolated word recognition task
3.3 Experimental conditions for LVCSR (read speech) task
3.4 Prior distribution parameters. O^r, M^r and V^r denote the 0th, 1st, and 2nd statistics of a root node (monophone HMM state), respectively.
3.5 Recognition rates in each prior distribution parameter. The model was trained using data consisting of 10 sentences.
3.6 Recognition rates in each prior distribution parameter. The model was trained by using data consisting of 150 sentences.
3.7 Word accuracies for total numbers of clustered states and Gaussians per state. The contour graph on the right is obtained from these results. The recognition result obtained with the best manual tuning with ML was 92.0 and that obtained automatically with VBEC was 91.1.
4.1 Experimental conditions for LVCSR (read speech) task
4.2 Experimental conditions for isolated word recognition
4.3 Comparison with iterative and non-iterative state clustering
4.4 Prior parameter dependence
4.5 Robustness of acoustic model topology determined by VBEC for different speech data sets.
5.1 Relationship between BPCs
5.2 Experimental conditions for isolated word recognition task
5.3 Prior distribution parameters
5.4 Configuration of VBEC and ML based approaches
5.5 Experimental conditions for LVCSR speaker adaptation task
5.6 Prior distribution parameters
5.7 Experimental results for model adaptation experiments for each speaker based on VB-BPC, VB-BPC-MEAN, UBPC and BPC(MAP). The best scores among the four methods are highlighted with a bold font.
6.1 Technical trend of speech recognition using variational Bayes

Chapter 1 Introduction
1.1 Background
Speech information processing is one of the most important human interface topics in the field of computer science. In particular, speech recognition, which converts speech information into text information, as shown in Figure 1.1, is the core technology for allowing computers to understand the human intent. Speech recognition has been studied for a number of years, and the preliminary technique of phoneme recognition has now progressed to word recognition and large vocabulary continuous speech recognition [1-4], where the vocabulary size of the state-of-the-art recognizer amounts to 1.8 million [5].

Figure 1.1: Automatic speech recognition.

The current successes in speech recognition are based on pattern recognition, which uses statistical learning theory. Maximum Likelihood (ML) methods have become the standard techniques for constructing acoustic and language models for speech recognition. ML methods guarantee that ML estimates approach the true values of the parameters. ML methods have been used in various aspects of statistical learning, and especially for acoustic modeling in speech recognition, since the Expectation-Maximization (EM) algorithm [6] is a practical way of obtaining the local optimum solution for the training of latent variable models. Therefore, acoustic modeling based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) has been developed greatly by using the ML-EM approach [7-9]. Other training methods have also been proposed with which to train acoustic model parameters, such as discriminative training methods [10-13], Maximum A Posteriori (MAP) methods [14, 15], and quasi-Bayes methods [16, 17].

However, the performance of current speech recognition systems is far from satisfactory. Specifically, the recognition performance is much poorer than the human recognition ability since speech recognition has a distinct lack of robustness, which is crucial for practical use. In a real environment, there are many fluctuations originating from various factors such as the speaker, context dependence, speaking style and noise. In fact, the performance of acoustic models trained using read speech decreases greatly when the models are used to recognize spontaneous speech due to the mismatch between the read and spontaneous speech environments [18]. Therefore, most of the problems posed by current speech recognition techniques result from a lack of robustness. This lack of robustness is an obstacle in terms of the practical application of speech recognition technology, and improving robustness has been a common worldwide challenge in the field of acoustic and language studies.

Acoustic studies have taken mainly two directions: the improvement of acoustic models beyond the conventional HMM, and the improvement of the acoustic model learning method beyond the conventional ML approach. This thesis addresses the challenge in terms of improving the learning method by employing a Bayesian approach. This thesis defines the term Bayesian approach to mean that it considers the posterior distribution of any variable, as well as the prior distribution. That is to say, all the variables introduced when models are parameterized, such as model parameters and latent variables, are regarded as probabilistic variables, and their posterior distributions are obtained based on the Bayes rule. The difference between the Bayesian and ML approaches is that the target of estimation is the distribution function in the Bayesian approach whereas it is the parameter value in the ML approach. Based on this posterior distribution estimation, the Bayesian approach can generally achieve more robust model construction and classification than an ML approach [19, 20]. In fact, the Bayesian approach has the following three advantages:

(A) Effective utilization of prior knowledge through prior distributions (prior utilization)
(B) Model selection in the sense of maximizing a probability for the posterior distribution of model complexity (model selection)
(C) Robust classification by marginalizing model parameters (robust classification)

However, the Bayesian approach requires complex integral and expectation computations to obtain posterior distributions when models have latent variables. The acoustic model in speech recognition has the latent variables included in an HMM and a Gaussian Mixture Model (GMM). Therefore, the Bayesian approach cannot be applied to speech recognition without losing the above advantages. For example, the Maximum A Posteriori based framework approximates the posterior distribution of the parameter, which loses two of the above advantages although MAP can utilize prior information. Bayesian Information Criterion (BIC)¹ and Bayesian Predictive Classification (BPC) based frameworks partially realize Bayesian advantages for model selection and robust classification, respectively, in speech recognition by approximating the posterior distribution calculation [15, 21, 22]. However, these frameworks cannot benefit from both advantages simultaneously, as shown in Table 1.1.

Table 1.1: Comparison of VBEC and other Bayesian frameworks in terms of Bayesian advantages

Bayesian advantage           VBEC   MAP   BIC/MDL   Conventional BPC
(A) Prior utilization        yes    yes   -         -
(B) Model selection          yes    -     yes       -
(C) Robust classification    yes    -     -         yes

¹ BIC and the Minimum Description Length (MDL) criterion have been independently proposed, but they are practically the same. Therefore, they are identified in this thesis and referred to as BIC/MDL.

1.2 Goal

One goal of this work is to provide speech recognition based on a Bayesian approach to overcome the lack of robustness described above by utilizing the three Bayesian advantages. Recently, a Variational Bayesian (VB) approach was proposed in the learning theory field that avoids complex computations by employing the variational approximation technique [23-26]. With this VB approach, approximate posterior distributions (VB posterior distributions) can be obtained effectively by iterative calculations similar to the Expectation-Maximization algorithm used in the ML approach, while the three advantages of the Bayesian approaches are still retained. Therefore, to realize the goal, a new speech recognition framework is formulated using VB to replace the ML approaches with the Bayesian approaches in speech recognition. A total Bayesian framework is proposed, Variational Bayesian Estimation and Clustering for speech recognition (VBEC), where all acoustic procedures for speech recognition (acoustic model construction and speech classification) are based on the VB posterior distribution. VBEC includes the three Bayesian advantages unlike the conventional Bayesian approaches, as shown in Table 1.1. Therefore, this study also confirms the effectiveness of the three Bayesian advantages, prior utilization, model selection and robust classification, in VBEC experimentally.

1.3 Overview

This subsection provides an overview of the work (Figure 1.2) with reference to related papers.

Chapter 2 discusses the formulation of VBEC compared with those of the conventional ML approaches. VBEC is based on the following four formulations:

1. Setting the output and prior distributions for the model parameters of the standard acoustic models represented by HMMs and GMMs (setting)
2. Estimating the VB posterior distributions for the model parameters based on the VB Baum-Welch algorithm similar to the conventional ML based Baum-Welch algorithm (training)
3. Calculating VBEC objective functions, which are used for model selection (selection)
4. Classifying speech based on a predictive distribution, which is analytically derived as the Student's t-distribution from the marginalization of model parameters based on the VB posterior distribution (classification)

Therefore, VBEC performs the model construction process, which includes model setting, training and selection (1st, 2nd and 3rd), and the classification process (4th) based on the Bayesian approach [27, 28]. Thus, VBEC can be regarded as a total Bayesian framework for speech recognition. Based on the above four formulations, this thesis shows the effectiveness of the Bayesian advantages through speech recognition experiments.

Chapter 3 describes the construction of the acoustic model through the consistent use of Bayesian approaches based on the 1st, 2nd and 3rd formulations [27-30]. The VB Baum-Welch algorithm is applied to acoustic modeling to estimate the VB posteriors after setting the prior distributions. The effectiveness of the prior utilization is shown in cases where there is a small amount of speech recognition training data. In addition, Bayesian model selection is applied to the phonetic decision tree clustering and the selection of GMM components. Thus, the effectiveness of VBEC for acoustic model construction is confirmed experimentally.

Chapter 4 describes the automatic determination of acoustic model topologies achieved by expanding the VBEC function of the Bayesian model selection presented in the acoustic model construction in Chapter 3 [31, 32]. The determination is realized by clustering context-dependent HMM states and by selecting the GMM components simultaneously, and the process takes much less time than conventional manual construction with the same level of performance.

Chapter 5 focuses on speech classification based on Bayesian Predictive Classification using VB posteriors (VB-BPC), and compares VB-BPC with the other classification methods theoretically and experimentally. The chapter reveals the superior performance of VBEC compared with the other classification methods in practical tasks by applying robust classification to acoustic model adaptation [33, 34].

Finally, Chapter 6 reviews this thesis and discusses related and future work.


Figure 1.2: Chapter flow of thesis.



Chapter 2 Formulation
This chapter begins by describing the difference between Bayesian and conventional Maximum Likelihood (ML) approaches using general terms of statistical learning theory in Section 2.1. Particular attention is paid to the advantages and disadvantages of the Bayesian approaches over the conventional ML approaches. Then, in Section 2.2, the general solution of the Variational Bayesian (VB) approach is explained, which is an approximate realization of the Bayesian approaches. Here, the general terms of statistical learning theory are also used. Finally, Section 2.3 explains the application of the VB approach to acoustic model construction and classification.

The first two sections deal with a general scheme for statistical model construction and classification. Namely, as shown in Figure 2.1, a model is obtained with model parameter \Theta and model structure m using training data O, and unknown data x is classified into category c based on the model. Readers engaged in speech recognition may find it easier to follow these two sections by regarding the statistical learning theory terms as speech recognition terms, as shown in Table 2.1.

Table 2.1: Speech recognition terms corresponding with statistical learning theory terms
          Statistical learning theory terms    Speech recognition terms
O         Training data                        Speech feature vector
c         Category                             Word, phoneme, triphone, etc.
\Theta    Model parameter                      State transition probability, weight factor, Gaussian parameters, etc.
Z         Latent variable                      Sequences of HMM states, sequences of Gaussian mixture components, etc.
m         Model structure                      Number of HMM states, number of Gaussian mixture components, prior parameters, etc.

2.1 Maximum likelihood and Bayesian approach


This section briefly reviews the Bayesian approach in contrast with the ML approach by addressing the three Bayesian advantages explicitly in terms of general learning issues, as shown in Figure 2.1. The Bayesian approach is based on posterior distributions, while the ML approach is based on model parameters.

Figure 2.1: General scheme of statistical model construction and classification.

Let O be a training data set of feature vectors. The posterior distribution for a set of model parameters \Theta^{(c)} of category c is obtained with the famous Bayes theorem as follows:

p(\Theta^{(c)}|O, m) = \int \frac{p(O|\Theta, m)\, p(\Theta|m)}{p(O|m)}\, d\Theta^{(\bar{c})},    (2.1)

where p(\Theta|m) is a prior distribution for \Theta, and m denotes the model structure index, for example, the number of Gaussian Mixture Model (GMM) components and Hidden Markov Model (HMM) states. Here, \bar{c} represents the set of all categories without c. In this thesis, we can also regard the prior parameter setting as the model structure, and include its variations in index m. From Eq. (2.1), prior information can be utilized via the estimation of the posterior distribution, which depends on prior distributions. Therefore, the Bayesian approach is superior to the ML approach for the following reason:

(A) Prior utilization

Familiar applications based on prior utilization in speech recognition are Large Vocabulary Continuous Speech Recognition (LVCSR) using language and lexicon models as priors [7], and speaker adaptation using speaker independent models as priors [15]. In addition, the Bayesian approach has two major advantages over the ML approach: (B) model selection and (C) robust classification. These two advantages are derived from the posterior distributions. First, by regarding m as a probabilistic variable, we can consider the posterior distribution p(m|O) for model structure m.
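To make advantage (A) concrete, the following sketch (not part of the thesis) contrasts an ML estimate with a conjugate Bayesian posterior for a single Gaussian mean with known variance; the prior mean mu0 plays the role of prior knowledge (for instance, a speaker-independent model), and the function and parameter names are illustrative assumptions rather than the thesis notation.

```python
import numpy as np

def ml_mean(x):
    """ML estimate of a Gaussian mean: just the sample average."""
    return np.mean(x)

def bayes_posterior_mean(x, mu0, xi0):
    """Posterior mean of a Gaussian mean with known variance, under a
    conjugate Gaussian prior centered at mu0.  xi0 acts as an equivalent
    prior sample count, so with little data the estimate stays close to
    the prior knowledge mu0 instead of following noisy observations."""
    n = len(x)
    return (xi0 * mu0 + n * np.mean(x)) / (xi0 + n)

# With only 3 observations, the ML estimate follows the noisy data,
# while the Bayesian estimate is pulled toward the prior mean.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=3)
print("ML   :", ml_mean(x))
print("Bayes:", bayes_posterior_mean(x, mu0=0.0, xi0=10.0))
```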


Once p(m|O) is obtained, an appropriate model structure that maximizes the posterior probability can be selected¹ as follows:

\hat{m} = \mathop{\mathrm{argmax}}_{m}\, p(m|O).    (2.2)

Second, once the posterior distribution p(\Theta^{(c)}|O, m) is estimated for all categories, the category for input data x is determined by:

\hat{c} = \mathop{\mathrm{argmax}}_{c} \int p(x|\Theta^{(c)}, m)\, p(\Theta^{(c)}|O, m)\, d\Theta^{(c)}.    (2.3)

The parameters are integrated out in Eq. (2.3) so that the effect of over-training is mitigated, and robust classification is obtained.

Although the Bayesian approach is often superior to the ML approach because of the above three advantages, the integral and expectation calculations make any practical use of the Bayesian approach very difficult. In particular, when a model includes latent (hidden) variables, the calculation becomes more complex. Let Z be a set of discrete latent variables. Then, with a fixed model structure m, the posterior distributions for model parameters p(\Theta^{(c)}|O, m) and p(Z^{(c)}|O, m) are expressed as follows:

p(\Theta^{(c)}|O, m) = \sum_{Z} \int \frac{p(O, Z|\Theta, m)\, p(\Theta|m)}{p(O|m)}\, d\Theta^{(\bar{c})}    (2.4)

and

p(Z^{(c)}|O, m) = \sum_{Z^{(\bar{c})}} \int \frac{p(O, Z|\Theta, m)\, p(\Theta|m)}{p(O|m)}\, d\Theta.    (2.5)

The posterior distribution for the model structure p(m|O) is expressed as follows:

p(m|O) = \sum_{Z} \int \frac{p(O, Z|\Theta, m)\, p(\Theta|m)\, p(m)}{p(O)}\, d\Theta,    (2.6)

where p(m) denotes a prior distribution for model structure m. These equations cannot be solved analytically. The acoustic model for speech recognition includes latent variables in HMMs and GMMs, and the total number of model parameters amounts to more than one million. In addition, these parameters depend on each other hierarchically, as shown in Figure 2.2. Solving all integrals and expectations numerically requires huge amounts of computation time. Therefore, when applying the Bayesian approach to acoustic modeling for speech recognition, an effective approximation technique is necessary.

¹ In a strict Bayesian sense, probabilistic variable m should be marginalized using the posterior distribution for a model structure p(m|O). However, this means that we should prepare various structure models in parallel, which would require a lot of memory and computation time. This would be unsuitable for such a large task as speech recognition. Therefore, in this thesis, one appropriate model is selected rather than dealing with various models based on p(m|O).
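As a minimal illustration of the robust classification in Eq. (2.3) (not from the thesis), the sketch below compares an ML plug-in score with a Bayesian predictive score for a one-dimensional Gaussian class whose mean carries a conjugate Gaussian posterior; marginalizing the mean inflates the predictive variance, which is what makes the score less sensitive to a poorly estimated parameter. All variable names are illustrative assumptions.

```python
import math

def gaussian_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def ml_score(x, mean_hat, var):
    """ML plug-in: score the input with the point estimate of the mean."""
    return gaussian_logpdf(x, mean_hat, var)

def bayes_predictive_score(x, post_mean, post_var, var):
    """Bayesian predictive: integrate the Gaussian likelihood over a
    Gaussian posterior on the mean.  The result is again Gaussian, with
    variance var + post_var, i.e. a heavier, more forgiving spread."""
    return gaussian_logpdf(x, post_mean, var + post_var)

# A class trained on very little data has an uncertain mean estimate
# (large posterior variance), so the predictive score penalizes an
# outlying input far less harshly than the plug-in score does.
x = 3.0
print("ML plug-in      :", ml_score(x, mean_hat=0.0, var=1.0))
print("Bayes predictive:", bayes_predictive_score(x, post_mean=0.0, post_var=4.0, var=1.0))
```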


Figure 2.2: Hidden Markov model for each phoneme unit. A standard acoustic model for phoneme /a/. T, S, G and D denote search spaces of HMM-temporal, HMM-contextual, GMM and feature vector topologies, respectively.

2.2 Variational Bayesian (VB) approach


This section focuses on the VB approach and derives general solutions for the VB posterior distributions q(\Theta|O, m), q(Z|O, m), and q(m|O) to approximate the true corresponding posterior distributions. To begin with, it is assumed that

q(\Theta, Z|O, m) = \prod_{c} q(\Theta^{(c)}|O^{(c)}, m)\, q(Z^{(c)}|O^{(c)}, m).    (2.7)

This assumption means that probabilistic variables associated with each category are statistically independent from other categories. The speech data used in this thesis is well transcribed and the label information is reliable. In addition, the frequently used feature extraction (e.g., Mel Frequency Cepstrum Coefficients (MFCC)) from the speech is good enough for the statistical independence of the observation data to be guaranteed. Therefore, the assumption of class independence is reasonable.

2.2.1 VB-EM algorithm


VB posterior distributions for model parameters

This subsection discusses VB posterior distributions for model parameters with fixed model structure m. Initially, the arbitrary posterior distribution q(\Theta^{(c)}|O, m) is introduced and the Kullback-Leibler (KL) divergence [35] between q(\Theta^{(c)}|O, m) and the true posterior distribution p(\Theta^{(c)}|O, m) is considered:

KL[q(\Theta^{(c)}|O, m)\,|\,p(\Theta^{(c)}|O, m)] = \int q(\Theta^{(c)}|O, m) \log \frac{q(\Theta^{(c)}|O, m)}{p(\Theta^{(c)}|O, m)}\, d\Theta^{(c)}.    (2.8)

Substituting Eq. (2.4) into Eq. (2.8) and using Jensen's inequality, the inequality of Eq. (2.9) is obtained as follows:

KL[q(\Theta^{(c)}|O, m)\,|\,p(\Theta^{(c)}|O, m)] \leq \log p(O|m) - F^m[q(\Theta|O, m), q(Z|O, m)],    (2.9)



where

F^m[q(\Theta|O, m), q(Z|O, m)] \equiv \left\langle \log \frac{p(O, Z|\Theta, m)\, p(\Theta|m)}{q(\Theta|O, m)\, q(Z|O, m)} \right\rangle_{q(\Theta|O, m),\, q(Z|O, m)}.    (2.10)

Here, the brackets denote the expectation, i.e., \langle g(y) \rangle_{p(y)} \equiv \int g(y)\, p(y)\, dy for a continuous variable y and \langle g(n) \rangle_{p(n)} \equiv \sum_{n} g(n)\, p(n) for a discrete variable n. The derivation of the inequality is shown in detail in Appendix A.1.1. The inequality (2.9) is strict unless q(\Theta|O, m) = p(\Theta|O, m) and q(Z|O, m) = p(Z|O, m), i.e., the arbitrary posterior distribution q is equivalent to the true posterior distribution p. From the assumption Eq. (2.7), F^m is decomposed into each category as follows:

F^m[q(\Theta|O, m), q(Z|O, m)]
= \sum_{c} \left\langle \log \frac{p(O^{(c)}, Z^{(c)}|\Theta^{(c)}, m)\, p(\Theta^{(c)}|m)}{q(\Theta^{(c)}|O^{(c)}, m)\, q(Z^{(c)}|O^{(c)}, m)} \right\rangle_{q(\Theta^{(c)}|O^{(c)}, m),\, q(Z^{(c)}|O^{(c)}, m)}
= \sum_{c} F^{m,(c)}[q(\Theta^{(c)}|O^{(c)}, m), q(Z^{(c)}|O^{(c)}, m)].    (2.11)

This indicates that the total objective function is calculated by summing up all objective functions for each category. From inequality (2.9), q(\Theta^{(c)}|O, m) approaches p(\Theta^{(c)}|O, m) as the right-hand side decreases. Therefore, the optimal posterior distribution can be obtained by a variational method, which results in minimizing the right-hand side. Since the term \log p(O|m) can be disregarded, the minimization is changed to the maximization of F^m with respect to q(\Theta^{(c)}|O, m), and is given by the following variational equation:

\frac{\partial}{\partial q(\Theta^{(c)}|O, m)} F^m[q(\Theta|O, m), q(Z|O, m)] = \frac{\partial}{\partial q(\Theta^{(c)}|O, m)} F^{m,(c)}[q(\Theta^{(c)}|O^{(c)}, m), q(Z^{(c)}|O^{(c)}, m)] = 0.    (2.12)

From this equation, the optimal VB posterior distribution \tilde{q}(\Theta^{(c)}|O, m) is obtained as follows:

\tilde{q}(\Theta^{(c)}|O, m) \propto p(\Theta^{(c)}|m) \exp\left( \left\langle \log p(O^{(c)}, Z^{(c)}|\Theta^{(c)}, m) \right\rangle_{q(Z^{(c)}|O^{(c)}, m)} \right).    (2.13)

This variational calculation is shown in detail in Appendix A.2.1. In this thesis, a tilde (~) is added to indicate variationally optimized values or functions.

VB posterior distributions for latent variables

A similar method is used to obtain the optimal VB posterior distribution \tilde{q}(Z^{(c)}|O, m). An inequality similar to Eq. (2.9) is obtained by considering the KL divergence between the arbitrary posterior distribution q(Z^{(c)}|O, m) and the true posterior distribution p(Z^{(c)}|O, m) as follows:

KL[q(Z^{(c)}|O, m)\,|\,p(Z^{(c)}|O, m)] \leq \log p(O|m) - F^m[q(\Theta|O, m), q(Z|O, m)].    (2.14)

The derivation of the inequality is detailed in Appendix A.1.2.

The optimal VB posterior distribution \tilde{q}(Z^{(c)}|O, m) is also obtained by maximizing F^m with respect to q(Z^{(c)}|O, m) with the variational method as follows:

\tilde{q}(Z^{(c)}|O, m) \propto \exp\left( \left\langle \log p(O^{(c)}, Z^{(c)}|\Theta^{(c)}, m) \right\rangle_{q(\Theta^{(c)}|O^{(c)}, m)} \right).    (2.15)

This variational calculation is shown in detail in Appendix A.2.2.

VB-EM algorithm

Equations (2.13) and (2.15) are closed-form expressions, and these optimizations can be effectively performed by iterative calculations analogous to the Expectation and Maximization (EM) algorithm [6], which increases F^m at every iteration up to a converged value. Then, Eqs. (2.13) and (2.15), respectively, correspond to the Maximization step (M-step) and the Expectation step (E-step) in the VB approach. Therefore, by substituting \tilde{q} into q, these equations can be represented as follows:

\tilde{q}(\Theta^{(c)}|O, m) \propto p(\Theta^{(c)}|m) \exp\left( \left\langle \log p(O^{(c)}, Z^{(c)}|\Theta^{(c)}, m) \right\rangle_{\tilde{q}(Z^{(c)}|O^{(c)}, m)} \right)
\tilde{q}(Z^{(c)}|O, m) \propto \exp\left( \left\langle \log p(O^{(c)}, Z^{(c)}|\Theta^{(c)}, m) \right\rangle_{\tilde{q}(\Theta^{(c)}|O^{(c)}, m)} \right)    (2.16)

Note that optimal posterior distributions for a particular category can be obtained simply by using the category's variables, i.e., we are not concerned with the other categories in the calculation, since Eq. (2.16) only depends on category c, which is based on the assumption given by Eq. (2.7). Finally, to compare the VB approach with the conventional ML approach for training latent variable models, the training specifications for ML and VB are summarized, as shown in Table 2.2.

Table 2.2: Training specifications for ML and VB

      Training    Min-max optimization    Objective function
ML    ML-EM       Differential method     Q function
VB    VB-EM       Variational method      F^m functional
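The coupled updates in Eq. (2.16) can be sketched in a few lines for a deliberately simple latent-variable model (this example is not from the thesis): a mixture whose component Gaussians are fixed and whose only unknown parameter is the weight vector with a Dirichlet prior. The alternation between the responsibilities (E-step, q(Z)) and the Dirichlet posterior (M-step, q(Theta)) mirrors the VB-EM iteration; names such as phi0 are illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma

def vb_em_mixture_weights(x, means, var, phi0, n_iter=50):
    """VB-EM for mixture weights only: components N(means[k], var) are fixed,
    the weights have a Dirichlet(phi0) prior, and the labels z_n are latent.
    Returns the Dirichlet posterior parameters and the responsibilities."""
    x = np.asarray(x, dtype=float)[:, None]         # (N, 1)
    means = np.asarray(means, dtype=float)[None, :] # (1, K)
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - means) ** 2 / var)
    phi = np.full(means.shape[1], phi0, dtype=float)
    for _ in range(n_iter):
        # E-step: q(z_n = k) uses the expected log weights under q(w)
        log_r = digamma(phi) - digamma(phi.sum()) + log_lik
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: Dirichlet posterior = prior + expected occupation counts
        phi = phi0 + r.sum(axis=0)
    return phi, r

phi, r = vb_em_mixture_weights(x=[-2.1, -1.9, 0.1, 2.0, 2.2, 1.8],
                               means=[-2.0, 0.0, 2.0], var=0.25, phi0=1.0)
print("posterior Dirichlet parameters:", phi)
```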

2.2.2 VB posterior distribution for model structure


The VB posterior distributions for a model structure are derived in the same way as in Section 2.2.1, and model selection is discussed by employing the posterior distribution. Arbitrary posterior distribution q(m|O) is introduced and the KL divergence between q(m|O) and the true posterior distribution p(m|O) is considered:

KL[q(m|O)\,|\,p(m|O)] = \sum_{m} q(m|O) \log \frac{q(m|O)}{p(m|O)}.    (2.17)

Substituting Eq. (2.6) into Eq. (2.17) and using Jensen's inequality, the inequality of Eq. (2.18) can be obtained as follows:

KL[q(m|O)\,|\,p(m|O)] \leq \log p(O) + \left\langle \log \frac{q(m|O)}{\exp\left( F^m[q(\Theta|O, m), q(Z|O, m)] \right) p(m)} \right\rangle_{q(m|O)}.    (2.18)

Similar to the discussion in Section 2.2.1, from the inequality (2.18), q(m|O) approaches p(m|O) as the right-hand side decreases. The derivation of the inequality is detailed in Appendix A.1.3. Therefore, the optimal posterior distribution for a model structure can be obtained by a variational method that results in minimizing the right-hand side as follows:

\tilde{q}(m|O) \propto p(m) \exp\left( F^m[q(\Theta|O, m), q(Z|O, m)] \right).    (2.19)

This variational calculation is shown in detail in Appendix A.2.3. Assuming that p(m) is a uniform distribution², the proportion relation between \tilde{q}(m|O) and F^m is obtained as follows based on the convexity of the logarithmic function:

F^m \geq F^{m'} \Leftrightarrow \tilde{q}(m'|O) \leq \tilde{q}(m|O).    (2.20)

Therefore, an optimal model structure in the sense of maximum posterior probability estimation can be selected as follows:

\hat{m} = \mathop{\mathrm{argmax}}_{m}\, \tilde{q}(m|O) = \mathop{\mathrm{argmax}}_{m}\, F^m.    (2.21)

This indicates that by maximizing the total F^m with respect to q(\Theta|O, m), q(Z|O, m), and m, we can obtain the optimal parameter distributions and select the optimal model structure simultaneously [25, 26].

² We can set an informative prior distribution for p(m).
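A minimal sketch (not from the thesis) of the selection rule in Eqs. (2.19)-(2.21): given the converged objective F^m for each candidate structure, the selected structure maximizes log p(m) + F^m, which reduces to maximizing F^m alone under a uniform structure prior. The candidate labels and values are purely illustrative.

```python
def select_structure(f_values, log_prior=None):
    """Pick the structure m maximizing log p(m) + F^m (Eqs. 2.19 and 2.21).
    `f_values` maps a structure label to its converged VB objective F^m;
    `log_prior` optionally maps labels to log p(m) (uniform if omitted)."""
    def score(m):
        lp = 0.0 if log_prior is None else log_prior[m]
        return lp + f_values[m]
    return max(f_values, key=score)

# Hypothetical candidates: number of Gaussian components per HMM state.
f_values = {1: -1.52e6, 2: -1.49e6, 4: -1.47e6, 8: -1.48e6}
print("selected number of components:", select_structure(f_values))

# A non-uniform prior can penalize large structures, as in Eq. (2.19).
log_prior = {m: -0.01 * m for m in f_values}
print("with structure prior:", select_structure(f_values, log_prior))
```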

2.3 Variational Bayesian Estimation and Clustering for speech recognition (VBEC)
The four formulations are obtained by using the VB framework with which to perform acoustic model construction (model setting, training and selection) and speech classification consistently based on the Bayesian approach. Consequently, the conventional formulation based on the ML approaches is replaced with the formulation based on the Bayesian approach as follows:

Set output distributions → Set output distributions and prior distributions (Section 2.3.1)
ML Baum-Welch algorithm → VB Baum-Welch algorithm (Section 2.3.2)
Log likelihood → VBEC objective function (Section 2.3.3)
ML based classification → VB-BPC (Section 2.3.4)

These four formulations are explained in the following four subsections, by applying the acoustic model for speech recognition to the general solution in Section 2.2.

Figure 2.3: Hidden Markov model for each phoneme unit. A state is represented by the Gaussian mixture below it. There are three states and three Gaussian components in this figure.

2.3.1 Output and prior distributions


Setting of the output and prior distributions is required when calculating the VB posterior distributions. The output distribution is obtained based on a left-to-right HMM with a GMM for each state, as shown in Figure 2.3, which has been widely used to represent a phoneme category in acoustic models for speech recognition.

Output distribution

Let O_e = \{O_e^t \in \mathbb{R}^D | t = 1, ..., T_e\} be a sequential speech data set for an example e of a phoneme category and O = \{O_e | e = 1, ..., E\} be the total data set of a phoneme category. Since the formulations for the posterior distributions are common to all phoneme categories, the phoneme category index c is omitted from this section to simplify the equation forms. D is used to denote the dimension number of the feature vector and T_e to denote the frame number for an example e. The output distribution, which represents a phoneme acoustic model, is expressed by

p(O, S, V|\Theta, m) = \prod_{e=1}^{E} \prod_{t=1}^{T_e} a_{s_e^{t-1} s_e^t}\, w_{s_e^t v_e^t}\, b_{s_e^t v_e^t}(O_e^t),    (2.22)


where S is a set of sequences of HMM states, V is a set of sequences of Gaussian mixture components, and s_e^t and v_e^t denote the state and mixture components at frame t of example e. Here, S and V are sets of discrete latent variables, which are the concrete forms of Z in Section 2.1. The parameter a_{ij} denotes the state transition probability from state i to state j, and w_{jk} is the k-th weight factor of the Gaussian mixture for state j. In addition, b_{jk}(O_e^t) (= N(O_e^t|\mu_{jk}, \Sigma_{jk})) denotes the Gaussian with the mean vector \mu_{jk} and covariance matrix \Sigma_{jk}, defined as:

N(O_e^t|\mu_{jk}, \Sigma_{jk}) \equiv (2\pi)^{-\frac{D}{2}} |\Sigma_{jk}|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (O_e^t - \mu_{jk})'\, \Sigma_{jk}^{-1}\, (O_e^t - \mu_{jk}) \right),    (2.23)

where |\cdot| and ' denote the determinant and the transpose of the matrix, respectively, while \Theta = \{a_{ij}, w_{jk}, \mu_{jk}, \Sigma_{jk}^{-1} | i, j = 1, ..., J, k = 1, ..., L\} is a set of model parameters. Here, J denotes the number of states in an HMM sequence and L denotes the number of Gaussian components in a state.
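For reference, a small sketch (not part of the thesis) of the per-frame output score implied by Eqs. (2.22)-(2.23) for one HMM state with a diagonal-covariance GMM, which is the usual case in acoustic modeling; the weight, mean and variance arrays are illustrative assumptions.

```python
import numpy as np

def gmm_state_log_likelihood(o_t, weights, means, variances):
    """log sum_k w_jk N(o_t | mu_jk, diag(var_jk)) for one HMM state j.
    o_t: (D,) feature vector; weights: (L,); means, variances: (L, D)."""
    diff = o_t[None, :] - means                                # (L, D)
    log_gauss = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + diff ** 2 / variances, axis=1)
    log_terms = np.log(weights) + log_gauss                    # (L,)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())             # log-sum-exp

o_t = np.array([0.2, -0.1, 0.5])
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [0.3, -0.2, 0.6]])
variances = np.ones((2, 3))
print(gmm_state_log_likelihood(o_t, weights, means, variances))
```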

Prior distribution

Conjugate distributions, which are based on the exponential function, are easy to use as prior distributions since the function forms of prior and posterior distributions become the same [15, 19, 20]. Then, a distribution is selected where the probabilistic variable constraint is the same as that of the model parameter. The state transition probability a_{ij} and the mixture weight factor w_{jk} have the constraints that \sum_j a_{ij} = 1 and \sum_k w_{jk} = 1. Therefore, Dirichlet distributions are used for a_{ij} and w_{jk}, where the variables of the Dirichlet distribution satisfy the above constraint. Similarly, the diagonal elements of the inverse covariance matrix \Sigma_{jk}^{-1} are always positive, and the Gamma distribution is used. The mean vector \mu_{jk} goes from -\infty to \infty, and the Gaussian is used. Thus, the prior distribution for acoustic model parameters is expressed as follows:

p(\Theta|m) \cong \prod_{i,j,k} p(\{a_{ij}\}_{j=1}^{J}|m)\, p(\{w_{jk}\}_{k=1}^{L}|m)\, p(\mu_{jk}, \Sigma_{jk}|m)
= \prod_{i,j,k} D(\{a_{ij}\}_{j=1}^{J}|\{\phi_{ij}^{0}\}_{j=1}^{J})\, D(\{w_{jk}\}_{k=1}^{L}|\{\phi_{jk}^{0}\}_{k=1}^{L})\, N(\mu_{jk}|\nu_{jk}^{0}, (\xi_{jk}^{0})^{-1}\Sigma_{jk}) \prod_{d=1}^{D} G(\Sigma_{jk,d}^{-1}|\eta_{jk}^{0}, R_{jk,d}^{0}).    (2.24)

Here, \Phi^{0} \equiv \{\phi_{ij}^{0}, \phi_{jk}^{0}, \nu_{jk}^{0}, \xi_{jk}^{0}, \eta_{jk}^{0}, R_{jk}^{0} | i, j = 1, ..., J, k = 1, ..., L\} is a set of prior parameters. In Eq. (2.24), D denotes a Dirichlet distribution and G denotes a gamma distribution. The prior distributions of a_{ij} and w_{jk} are represented by the Dirichlet distributions, and the prior distribution of \mu_{jk} and \Sigma_{jk} is represented by the normal-gamma distribution. If the covariance matrix elements are off the diagonal, a normal-Wishart distribution is used as the prior distribution of \mu_{jk} and \Sigma_{jk}.

The explicit forms of the distributions are defined as follows:

D(\{a_{ij}\}_{j=1}^{J}|\{\phi_{ij}^{0}\}_{j=1}^{J}) \equiv C_D(\{\phi_{ij}^{0}\}_{j=1}^{J}) \prod_{j} (a_{ij})^{\phi_{ij}^{0}-1}
D(\{w_{jk}\}_{k=1}^{L}|\{\phi_{jk}^{0}\}_{k=1}^{L}) \equiv C_D(\{\phi_{jk}^{0}\}_{k=1}^{L}) \prod_{k} (w_{jk})^{\phi_{jk}^{0}-1}
N(\mu_{jk}|\nu_{jk}^{0}, (\xi_{jk}^{0})^{-1}\Sigma_{jk}) \equiv C_N(\xi_{jk}^{0})\, |\Sigma_{jk}|^{-\frac{1}{2}} \exp\left( -\frac{\xi_{jk}^{0}}{2} (\mu_{jk}-\nu_{jk}^{0})'\, \Sigma_{jk}^{-1}\, (\mu_{jk}-\nu_{jk}^{0}) \right)
G(\Sigma_{jk,d}^{-1}|\eta_{jk}^{0}, R_{jk,d}^{0}) \equiv C_G(\eta_{jk}^{0}, R_{jk,d}^{0})\, (\Sigma_{jk,d}^{-1})^{\frac{\eta_{jk}^{0}}{2}-1} \exp\left( -\frac{R_{jk,d}^{0}\, \Sigma_{jk,d}^{-1}}{2} \right)    (2.25)

where

C_D(\{\phi_{ij}^{0}\}_{j=1}^{J}) \equiv \Gamma\left(\sum_{j=1}^{J} \phi_{ij}^{0}\right) \Big/ \prod_{j=1}^{J} \Gamma(\phi_{ij}^{0})
C_D(\{\phi_{jk}^{0}\}_{k=1}^{L}) \equiv \Gamma\left(\sum_{k=1}^{L} \phi_{jk}^{0}\right) \Big/ \prod_{k=1}^{L} \Gamma(\phi_{jk}^{0})
C_N(\xi_{jk}^{0}) \equiv (\xi_{jk}^{0}/2\pi)^{\frac{D}{2}}
C_G(\eta_{jk}^{0}, R_{jk,d}^{0}) \equiv (R_{jk,d}^{0}/2)^{\frac{\eta_{jk}^{0}}{2}} \big/ \Gamma(\eta_{jk}^{0}/2).    (2.26)

In the Bayesian approach, an important problem is how to set the prior parameters. In this thesis, two kinds of prior parameters, \nu^{0} and R^{0}, are set using sufficient amounts of data from:

Statistics of higher hierarchy models in acoustic models for the acoustic model construction task.
Statistics of speaker independent models for the speaker adaptation task.

The other parameters (\phi_{ij}^{0}, \phi_{jk}^{0}, \xi_{jk}^{0} and \eta_{jk}^{0}) have a meaning as regards tuning the balance between the values obtained from training data and the above statistics. These parameters are set appropriately based on experiments, and the dependence on these prior parameters is discussed in the experimental chapters.
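One possible reading of this prior-setting recipe in code (an illustrative sketch, not the thesis implementation): the location-type hyperparameters nu0 and R0 are copied from the statistics of a higher-level model (e.g., a monophone HMM state or a speaker-independent model), while phi0, xi0 and eta0 are left as tuning constants. All names and the dataclass layout are assumptions made for this example.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussPrior:
    phi0: float        # Dirichlet parameter for mixture weights
    xi0: float         # pseudo-count for the mean
    nu0: np.ndarray    # prior mean (D,)
    eta0: float        # pseudo-count for the precision
    R0: np.ndarray     # prior scale of the precision, diagonal (D,)

def prior_from_parent_stats(gamma_r, M_r, V_r, phi0=1.0, xi0=1.0, eta0=1.0):
    """Set nu0 and R0 from the 0th/1st/2nd order statistics (gamma_r, M_r, V_r)
    of a parent ("higher hierarchy") model, e.g. a monophone HMM state."""
    mean_r = M_r / gamma_r                      # parent mean
    var_r = V_r / gamma_r - mean_r ** 2         # parent (diagonal) variance
    return GaussPrior(phi0=phi0, xi0=xi0, nu0=mean_r,
                      eta0=eta0, R0=eta0 * var_r)

# Hypothetical parent statistics for a 3-dimensional feature.
gamma_r = 500.0
M_r = np.array([10.0, -5.0, 2.0])
V_r = np.array([600.0, 400.0, 300.0])
print(prior_from_parent_stats(gamma_r, M_r, V_r))
```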

2.3.2 VB Baum-Welch algorithm


This subsection introduces concrete forms of the VB posterior distributions for model parameters q(\Theta|O, m) and for latent variables q(Z|O, m) in acoustic modeling, which are efficiently computed by VB iterative calculations within the VB framework. This calculation is effectively performed by the VB Baum-Welch algorithm shown in Figure 2.4.

Figure 2.4: VB Baum-Welch algorithm.

VB M-step
First, the VB M-step for acoustic model training is explained. This is solved by substituting the acoustic model setting in Section 2.3.1 into the general solution for the VB M-step in Section 2.2. The derivation is found in Appendix A.3.1. The calculated results for the optimal VB posterior distributions for the model parameters are summarized as follows:

q(\Theta|O,m) \propto \prod_{i,j,k} q(\{a_{ij}\}_{j=1}^{J}|O,m)\, q(\{w_{jk}\}_{k=1}^{L}|O,m)\, q(\mu_{jk},\Sigma_{jk}|O,m)
= \prod_{i,j,k} \mathcal{D}(\{a_{ij}\}_{j=1}^{J} \mid \{\phi_{ij}\}_{j=1}^{J})\, \mathcal{D}(\{w_{jk}\}_{k=1}^{L} \mid \{\phi_{jk}\}_{k=1}^{L})\, \mathcal{N}(\mu_{jk} \mid \nu_{jk},(\xi_{jk})^{-1}\Sigma_{jk}) \prod_{d=1}^{D}\mathcal{G}(\Sigma_{jk,d}^{-1} \mid \eta_{jk},R_{jk,d}).   (2.27)

The concrete forms of the distributions are defined as follows:
\mathcal{D}(\{a_{ij}\}_{j=1}^{J} \mid \{\phi_{ij}\}_{j=1}^{J}) \equiv C_{\mathcal{D}}(\{\phi_{ij}\}_{j=1}^{J}) \prod_{j} (a_{ij})^{\phi_{ij}-1}
\mathcal{D}(\{w_{jk}\}_{k=1}^{L} \mid \{\phi_{jk}\}_{k=1}^{L}) \equiv C_{\mathcal{D}}(\{\phi_{jk}\}_{k=1}^{L}) \prod_{k} (w_{jk})^{\phi_{jk}-1}
\mathcal{N}(\mu_{jk} \mid \nu_{jk},(\xi_{jk})^{-1}\Sigma_{jk}) \equiv C_{\mathcal{N}}(\xi_{jk})\, |\Sigma_{jk}|^{-\frac{1}{2}} \exp\left( -\frac{\xi_{jk}}{2}(\mu_{jk}-\nu_{jk})'\Sigma_{jk}^{-1}(\mu_{jk}-\nu_{jk}) \right)   (2.28)
\mathcal{G}(\Sigma_{jk,d}^{-1} \mid \eta_{jk},R_{jk,d}) \equiv C_{\mathcal{G}}(\eta_{jk},R_{jk,d}) \left(\Sigma_{jk,d}^{-1}\right)^{\frac{\eta_{jk}}{2}-1} \exp\left( -\frac{R_{jk,d}}{2}\Sigma_{jk,d}^{-1} \right)
where
C_{\mathcal{D}}(\{\phi_{ij}\}_{j=1}^{J}) \equiv \Gamma\!\left(\textstyle\sum_{j=1}^{J}\phi_{ij}\right) \big/ \prod_{j=1}^{J}\Gamma(\phi_{ij})
C_{\mathcal{D}}(\{\phi_{jk}\}_{k=1}^{L}) \equiv \Gamma\!\left(\textstyle\sum_{k=1}^{L}\phi_{jk}\right) \big/ \prod_{k=1}^{L}\Gamma(\phi_{jk})
C_{\mathcal{N}}(\xi_{jk}) \equiv \left(\xi_{jk}/2\pi\right)^{\frac{D}{2}}
C_{\mathcal{G}}(\eta_{jk},R_{jk,d}) \equiv \left(R_{jk,d}/2\right)^{\frac{\eta_{jk}}{2}} \big/ \Gamma(\eta_{jk}/2)   (2.29)
Note that Eqs. (2.24) and (2.27) are members of the same function family, and the only difference is that the set of prior parameters Φ^0 in Eq. (2.24) is replaced with the set of posterior distribution parameters Φ ≡ {φ_ij, φ_jk, ξ_jk, ν_jk, η_jk, R_jk | i, j = 1, ..., J, k = 1, ..., L} in Eq. (2.27). The conjugate prior distribution is adopted because the posterior distribution is then theoretically a member of the same function family as the prior distribution and is obtained analytically, which is a characteristic of the exponential distribution family. Here, the elements of Φ are defined as:
\phi_{ij} = \phi_{ij}^{0} + \bar{\zeta}_{ij}
\phi_{jk} = \phi_{jk}^{0} + \bar{\gamma}_{jk}
\xi_{jk} = \xi_{jk}^{0} + \bar{\gamma}_{jk}
\nu_{jk} = \frac{\xi_{jk}^{0}\nu_{jk}^{0} + \bar{M}_{jk}}{\xi_{jk}^{0} + \bar{\gamma}_{jk}}
\eta_{jk} = \eta_{jk}^{0} + \bar{\gamma}_{jk}
R_{jk} = \mathrm{diag}\!\left[ R_{jk}^{0} + \bar{V}_{jk} - \frac{1}{\bar{\gamma}_{jk}}\bar{M}_{jk}(\bar{M}_{jk})' + \frac{\xi_{jk}^{0}\bar{\gamma}_{jk}}{\xi_{jk}^{0}+\bar{\gamma}_{jk}}\left(\frac{\bar{M}_{jk}}{\bar{\gamma}_{jk}}-\nu_{jk}^{0}\right)\left(\frac{\bar{M}_{jk}}{\bar{\gamma}_{jk}}-\nu_{jk}^{0}\right)' \right]   (2.30)
where diag[·] denotes the diagonalization operation that sets the off-diagonal elements to zero. ζ̄_ij, γ̄_jk, M̄_jk and V̄_jk denote the 0th, 1st and 2nd order sufficient statistics, respectively, and are defined as follows:
\bar{\zeta}_{ij} \equiv \sum_{e,t} \zeta_{e,ij}^{t}
\bar{\gamma}_{jk} \equiv \sum_{e,t} \gamma_{e,jk}^{t}
\bar{M}_{jk} \equiv \sum_{e,t} \gamma_{e,jk}^{t}\, O_{e}^{t}
\bar{V}_{jk} \equiv \sum_{e,t} \gamma_{e,jk}^{t}\, O_{e}^{t} (O_{e}^{t})'   (2.31)
These sufficient statistics {ζ̄_ij, γ̄_jk, M̄_jk, V̄_jk | i, j = 1, ..., J, k = 1, ..., L} are computed by using ζ^t_{e,ij} and γ^t_{e,jk}, defined as follows:
\zeta_{e,ij}^{t} \equiv q(s_{e}^{t-1}=i,\, s_{e}^{t}=j \mid O, m)
\gamma_{e,jk}^{t} \equiv q(s_{e}^{t}=j,\, v_{e}^{t}=k \mid O, m)   (2.32)

Here, ζ^t_{e,ij} is a VB transition posterior distribution, which denotes the transition probability from state i to state j at frame t of example e, and γ^t_{e,jk} is a VB occupation posterior distribution, which denotes the occupation probability of mixture component k in state j at frame t of example e, in the VB approach. Therefore, Φ can be calculated from Φ^0, ζ^t_{e,ij} and γ^t_{e,jk}, enabling q(Θ|O, m) to be obtained.
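As an illustration of Eq. (2.30), the following is a minimal sketch of the VB M-step hyperparameter update for one Gaussian component, assuming diagonal covariances, once the sufficient statistics of Eq. (2.31) have been accumulated; the function name and argument layout are illustrative only.

    import numpy as np

    def vb_m_step_gaussian(gamma_bar, M_bar, V_bar, phi0, xi0, nu0, eta0, R0):
        """Posterior hyperparameter update of Eq. (2.30) for one Gaussian component.

        gamma_bar : scalar 0th-order statistic
        M_bar     : (D,) 1st-order statistic
        V_bar     : (D,) diagonal 2nd-order statistic
        The remaining arguments are the prior hyperparameters."""
        M_bar, V_bar, nu0, R0 = (np.asarray(a, dtype=float) for a in (M_bar, V_bar, nu0, R0))
        phi = phi0 + gamma_bar
        xi = xi0 + gamma_bar
        nu = (xi0 * nu0 + M_bar) / (xi0 + gamma_bar)
        eta = eta0 + gamma_bar
        mean = M_bar / gamma_bar
        R = (R0 + V_bar - M_bar * M_bar / gamma_bar
             + (xi0 * gamma_bar / (xi0 + gamma_bar)) * (mean - nu0) ** 2)
        return phi, xi, nu, eta, R

The returned phi corresponds to the weight update φ_jk = φ^0_jk + γ̄_jk; the transition update φ_ij = φ^0_ij + ζ̄_ij is analogous.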

VB transition probability (VB E-step)
By using the variational calculation, the VB transition probability ζ^t_{e,ij} is obtained as follows:
\zeta_{e,ij}^{t} = q(s_{e}^{t-1}=i,\, s_{e}^{t}=j \mid O, m)
\propto \exp\left( \left\langle \log p(O, s_{e}^{t-1}=i, s_{e}^{t}=j \mid \Theta, m) \right\rangle_{\tilde{q}(\Theta|O,m)} \right)
\approx \exp\left( \left\langle \log \alpha_{e,i}^{t-1}\, a_{ij} \sum_{k} w_{jk}\, b_{jk}(O_{e}^{t})\, \beta_{e,j}^{t} \right\rangle_{\tilde{q}(\Theta|O,m)} \right)   (2.33)
α^{t-1}_{e,i} is the forward probability of example e in state i at frame t-1, and β^t_{e,j} is the backward probability of example e in state j at frame t. Therefore, ζ^t_{e,ij} is obtained as follows:
\zeta_{e,ij}^{t} \cong \frac{\alpha_{e,i}^{t-1}\, \tilde{a}_{ij} \sum_{k} \tilde{w}_{jk}\, \tilde{b}_{jk}(O_{e}^{t})\, \beta_{e,j}^{t}}{\sum_{j} \alpha_{e,j}^{T_{e}}}   (2.34)
Here, ã_ij, w̃_jk and b̃_jk(O^t_e) are defined as follows:
\tilde{a}_{ij} \equiv \exp\left( \Psi(\phi_{ij}) - \Psi\!\left(\textstyle\sum_{j}\phi_{ij}\right) \right)
\tilde{w}_{jk} \equiv \exp\left( \Psi(\phi_{jk}) - \Psi\!\left(\textstyle\sum_{k}\phi_{jk}\right) \right)
\tilde{b}_{jk}(O_{e}^{t}) \equiv \exp\left( -\frac{1}{2}\left[ D\log 2\pi + \frac{D}{\xi_{jk}} + \sum_{d}\left\{ \log\frac{R_{jk,d}}{2} - \Psi\!\left(\frac{\eta_{jk}}{2}\right) \right\} + \eta_{jk}(O_{e}^{t}-\nu_{jk})'(R_{jk})^{-1}(O_{e}^{t}-\nu_{jk}) \right] \right)   (2.35)

where Ψ(y) is the digamma function, defined as Ψ(y) ≡ ∂/∂y log Γ(y). These are solved by substituting the acoustic model setting in Section 2.3.1 into the general solution for q(Z|O, m) in Section 2.2. The derivation is found in Appendix A.3.1. α and β are the VB forward and backward probabilities, defined as:
\alpha_{e,j}^{t} \equiv \sum_{i} \alpha_{e,i}^{t-1}\, \tilde{a}_{ij} \sum_{k} \tilde{w}_{jk}\, \tilde{b}_{jk}(O_{e}^{t})
\beta_{e,j}^{t} \equiv \sum_{i} \tilde{a}_{ji} \sum_{k} \tilde{w}_{ik}\, \tilde{b}_{ik}(O_{e}^{t+1})\, \beta_{e,i}^{t+1}   (2.36)
α^{t=0}_{e,j} and β^{t=T_e}_{e,j} are initialized appropriately.

VB occupation probability (VB E-step)


Similarly, γ^t_{e,jk} is obtained from the variational calculation as follows:
\gamma_{e,jk}^{t} = q(s_{e}^{t}=j,\, v_{e}^{t}=k \mid O, m)
\propto \exp\left( \left\langle \log p(O, s_{e}^{t}=j, v_{e}^{t}=k \mid \Theta, m) \right\rangle_{\tilde{q}(\Theta|O,m)} \right)
\approx \exp\left( \left\langle \log \sum_{i} \alpha_{e,i}^{t-1}\, a_{ij}\, w_{jk}\, b_{jk}(O_{e}^{t})\, \beta_{e,j}^{t} \right\rangle_{\tilde{q}(\Theta|O,m)} \right)   (2.37)
Therefore,
\gamma_{e,jk}^{t} \cong \frac{\sum_{i} \alpha_{e,i}^{t-1}\, \tilde{a}_{ij}\, \tilde{w}_{jk}\, \tilde{b}_{jk}(O_{e}^{t})\, \beta_{e,j}^{t}}{\sum_{i} \alpha_{e,i}^{T_{e}}}   (2.38)

Thus, ζ^t_{e,ij} and γ^t_{e,jk} are calculated efficiently by using a probabilistic assignment via the familiar forward-backward algorithm. This algorithm is called the VB forward-backward algorithm.

Similar to the VB forward-backward algorithm, the Viterbi algorithm is also derived within the VB approach by exchanging the summation over i for a maximization over i in the calculation of the VB forward probability α^t_{e,j} in Eq. (2.36). This algorithm is called the VB Viterbi algorithm. Thus, VB posteriors can be calculated iteratively in the same way as the Baum-Welch algorithm, even for a complicated sequential model that includes latent variables, such as an HMM and a GMM for acoustic models. These calculations are referred to as the VB Baum-Welch algorithm, as proposed in [27, 28]. VBEC is based on the VB Baum-Welch algorithm.
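As a concrete illustration of Eq. (2.36), the following is a minimal sketch of the VB forward recursion in the log domain for a single example. The per-frame term log Σ_k w̃_jk b̃_jk(O^t_e) is assumed to be supplied by a separate emission-score routine (for instance the efficient form derived in Section 3.2), and the function and argument names are illustrative only.

    import numpy as np
    from scipy.special import logsumexp

    def vb_forward(log_a_tilde, log_emission, log_alpha0):
        """Log-domain VB forward recursion (Eq. (2.36)).

        log_a_tilde  : (J, J) array of log a~_ij
        log_emission : (T, J) array of log sum_k w~_jk b~_jk(O_e^t)
        log_alpha0   : (J,)   initial log forward probabilities
        Returns the (T, J) array of log alpha_{e,j}^t."""
        T, J = log_emission.shape
        log_alpha = np.empty((T, J))
        prev = log_alpha0
        for t in range(T):
            # alpha_j^t = sum_i alpha_i^{t-1} a~_ij * [sum_k w~_jk b~_jk(O^t)]
            log_alpha[t] = logsumexp(prev[:, None] + log_a_tilde, axis=0) + log_emission[t]
            prev = log_alpha[t]
        return log_alpha

The total log score log Σ_j α^{T_e}_{e,j}, needed for the normalizations in Eqs. (2.34) and (2.38) and for the last term of Eq. (2.41), is then logsumexp(log_alpha[-1]).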

2.3.3 VBEC objective function


This section discusses the VB objective function F^m for a whole acoustic model topology, i.e., the VBEC objective function, and provides general calculation results. The VBEC objective function is a criterion both for posterior distribution estimation and for model topology optimization in acoustic model construction. This section begins by focusing on one phoneme category. By substituting the VB posterior distributions obtained in Section 2.3.2, we obtain analytical results for F^m; therefore, this calculation also requires a VB iterative calculation based on the VB Baum-Welch algorithm used in the VB posterior calculation. We can separate F^m into two components: one is composed solely of q(S, V|O, m), whereas the other is mainly composed of q(Θ|O, m). Therefore, we define F^m_Θ and F^m_{S,V}, and represent F^m as follows:
F^{m} = \left\langle \log \frac{p(O,S,V|\Theta,m)\, p(\Theta|m)}{q(\Theta|O,m)} \right\rangle_{\tilde{q}(\Theta|O,m),\,\tilde{q}(S,V|O,m)} - \left\langle \log q(S,V|O,m) \right\rangle_{\tilde{q}(S,V|O,m)}
= F_{\Theta}^{m} - F_{S,V}^{m},   (2.39)
where
F_{\Theta}^{m} \equiv \left\langle \log \frac{p(O,S,V|\Theta,m)\, p(\Theta|m)}{q(\Theta|O,m)} \right\rangle_{\tilde{q}(\Theta|O,m)\,\tilde{q}(S,V|O,m)}
F_{S,V}^{m} \equiv \left\langle \log q(S,V|O,m) \right\rangle_{\tilde{q}(S,V|O,m)}   (2.40)

F^m_{S,V} includes the effect of the latent variables S and V, and is represented as follows:
F_{S,V}^{m} = \sum_{i,j} \bar{\zeta}_{ij}\left( \Psi(\phi_{ij}) - \Psi\!\left(\textstyle\sum_{j}\phi_{ij}\right) \right) + \sum_{j,k} \bar{\gamma}_{jk}\left( \Psi(\phi_{jk}) - \Psi\!\left(\textstyle\sum_{k}\phi_{jk}\right) \right)
- \frac{1}{2}\sum_{j,k}\left[ \bar{\gamma}_{jk}\left( D\log 2\pi + \frac{D}{\xi_{jk}} + \sum_{d}\left\{ \log\frac{R_{jk,d}}{2} - \Psi\!\left(\frac{\eta_{jk}}{2}\right) \right\} \right) + \eta_{jk}\sum_{e,t}\gamma_{e,jk}^{t}\,(O_{e}^{t}-\nu_{jk})'(R_{jk})^{-1}(O_{e}^{t}-\nu_{jk}) \right]
- \sum_{e} \log \sum_{j} \alpha_{e,j}^{T_{e}}.   (2.41)

The fourth term on the right-hand side of Eq. (2.41) is composed of the VB forward probabilities obtained in Eq. (2.36), and requires us to compute Eq. (2.35) iteratively using all frames of data. This part corresponds to the latent variable effect in the VBEC objective function. Next, F^m_Θ is represented as follows:
F_{\Theta}^{m} = \sum_{i} \log \frac{\Gamma\!\left(\sum_{j}\phi_{ij}^{0}\right)}{\Gamma\!\left(\sum_{j}\phi_{ij}\right)} \prod_{j} \frac{\Gamma(\phi_{ij})}{\Gamma(\phi_{ij}^{0})}
+ \sum_{j} \log \frac{\Gamma\!\left(\sum_{k}\phi_{jk}^{0}\right)}{\Gamma\!\left(\sum_{k}\phi_{jk}\right)} \prod_{k} \frac{\Gamma(\phi_{jk})}{\Gamma(\phi_{jk}^{0})}
+ \sum_{j,k} \left[ -\frac{\bar{\gamma}_{jk} D}{2}\log 2\pi + \frac{D}{2}\log\frac{\xi_{jk}^{0}}{\xi_{jk}} + \sum_{d}\left\{ \frac{\eta_{jk}^{0}}{2}\log\frac{R_{jk,d}^{0}}{2} - \frac{\eta_{jk}}{2}\log\frac{R_{jk,d}}{2} + \log\frac{\Gamma(\eta_{jk}/2)}{\Gamma(\eta_{jk}^{0}/2)} \right\} \right]   (2.42)

From Eq. (2.42), F^m_Θ can be calculated by using the statistics of the posterior distribution parameters given in Eq. (2.30). This part is equivalent to the objective function for model selection based on Akaike's Bayesian information criterion [36]. The whole F^m for all categories is obtained by simply summing up the F^m results obtained in this section for all categories, as in Eq. (2.11). Strictly speaking, this operation often requires a complicated summation because of the shared structure of the model parameters. Thus, the analytical result of the VBEC objective function F^m for acoustic model construction is provided. The VBEC objective function is derived analytically so that it retains the effects of the dependence among the model parameters and of the latent variables defined in the output distribution in Eq. (2.22), unlike the conventional Bayesian Information Criterion and Minimum Description Length (BIC/MDL) approaches. From this standpoint, we can regard the VBEC objective function as a global criterion for the selection of acoustic model topologies, i.e., the model topology that maximizes the VBEC objective function is globally optimal. Therefore, the VBEC objective function can compare any acoustic models with respect to all topological aspects and their combinations, e.g., contextual and temporal topologies in HMMs, the number of components per GMM in an HMM state, and the dimensional size of feature vectors, based on the following equation:

\hat{m} = \mathop{\mathrm{argmax}}_{m\,(\in \mathbf{T}\times\mathbf{S}\times\mathbf{G}\times\mathbf{D})} F^{m}   (2.43)


Here, T, S, G and D denote the search spaces of HMM-temporal, HMM-contextual, GMM and feature-vector topologies, respectively, as shown in Figure 2.2. Based on the discussion in Section 2.3, the following eight steps provide a VB training algorithm for acoustic modeling.
Step 1) Set the posterior parameters Φ[τ = 0] from the initialized transition probabilities ζ[τ = 0], occupation probabilities γ[τ = 0] and model structure m (the prior parameters Φ^0 are included) for each category.
Step 2) Compute ã[τ + 1], w̃[τ + 1] and b̃(O)[τ + 1] using Φ[τ]. (By Eq. (2.35))
Step 3) Update ζ[τ + 1] and γ[τ + 1] via the Viterbi algorithm or the forward-backward algorithm. (By Eqs. (2.34) and (2.38))
Step 4) Accumulate the sufficient statistics [τ + 1] using ζ[τ + 1] and γ[τ + 1]. (By Eq. (2.31))
Step 5) Compute Φ[τ + 1] using the sufficient statistics [τ + 1] and Φ^0. (By Eq. (2.30))
Step 6) Calculate the total F^m[τ + 1] for all categories. (By using Eqs. (2.42) and (2.41) and summing up all categories' F^m)
Step 7) If |(F^m[τ + 1] - F^m[τ])/F^m[τ + 1]| ≤ ε, then stop; otherwise set τ ← τ + 1 and go to Step 2.
Step 8) Calculate F^m for all possible m and find m̂ (= argmax_m F^m).
Here, τ denotes an iteration count, and ε denotes a threshold that checks whether F^m has converged. Thus, the posterior distribution estimation in the VBEC framework can be constructed effectively based on the VB Baum-Welch algorithm, which is analogous to the ML Baum-Welch algorithm. In addition, VBEC can realize model selection using the VB objective function, as shown in Step 8. Thus, VBEC can construct an acoustic model consistently based on the Bayesian approach. Note that if we change Φ → Θ̂ (a value with ˆ attached indicates an ML estimate) and F^m → L^m (where L^m means the log-likelihood for a model m), this algorithm becomes an ML-based framework, except for the model selection. Therefore, in the implementation phase, the VBEC framework can be realized in conventional acoustic model construction systems by adding the prior distribution setting and by changing the estimation procedure and the objective function calculation.
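The eight steps above can be summarized as a simple training loop. The sketch below is only a skeleton under the assumption that the E-step (Steps 2-4), the M-step (Step 5) and the objective function (Step 6) are available as callables supplied by the caller; it is not the original implementation, and the names are illustrative.

    def vb_train(e_step, m_step, objective, posterior0, max_iter=50, eps=1e-4):
        """Skeleton of the VB training loop (Steps 1-7).

        e_step    : callable posterior -> sufficient statistics (Eqs. (2.35), (2.34), (2.38), (2.31))
        m_step    : callable statistics -> updated posterior hyperparameters (Eq. (2.30))
        objective : callable (posterior, statistics) -> total F^m (Eqs. (2.41), (2.42))"""
        posterior = posterior0          # Step 1
        F_prev = None
        for _ in range(max_iter):
            stats = e_step(posterior)        # Steps 2-4
            posterior = m_step(stats)        # Step 5
            F = objective(posterior, stats)  # Step 6
            if F_prev is not None and abs((F - F_prev) / F) <= eps:
                break                        # Step 7: convergence check
            F_prev = F
        return posterior, F

Step 8 (model selection) is then a search over candidate structures m, keeping the one with the highest converged F^m.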

2.3.4 VB posterior based Bayesian predictive classification


This subsection deals with classification based on the Bayesian approach. Here, the phoneme category index c is restored because the category index is important when discussing speech classification. In recognition, the feature vector sequence x = {x^t ∈ R^D | t = 1, ..., T} of input speech is classified into the optimal phoneme category ĉ using a conditional probability function p(c|x, O) given input data x and training data O, as follows:
\hat{c} = \mathop{\mathrm{argmax}}_{c}\, p(c|x,O) = \mathop{\mathrm{argmax}}_{c}\, p(c|O)\, p(x|c,O) \cong \mathop{\mathrm{argmax}}_{c}\, p(c)\, p(x|c,O).   (2.44)


Here, p(c) is the prior distribution of phoneme category c obtained from the language and lexicon models. c is assumed to be independent of O (i.e., p(c|O) ≅ p(c)). p(x|c, O) is called the predictive distribution because this distribution predicts the probability of unknown data x conditioned on training data O. We focus on the predictive distribution p(x^t|c, i, j, O) of the t-th frame input data x^t at the HMM transition from i to j of category c. Then, by introducing an output distribution with a set of distribution parameters and a model structure m, p(x^t|c, i, j, O) is represented as follows:
p(x^{t}|c,i,j,O) = \sum_{m} \int p(x^{t}|c,i,j,\Theta,m)\, p(\Theta|c,i,j,O,m)\, p(m|O)\, d\Theta
= \sum_{m} \int p(x^{t}|\theta_{ij}^{(c)},m)\, p(\theta_{ij}^{(c)}|O,m)\, p(m|O)\, d\theta_{ij}^{(c)},   (2.45)
where θ^{(c)}_{ij} ≡ {a^{(c)}_{ij}, w^{(c)}_{jk}, μ^{(c)}_{jk}, Σ^{(c)}_{jk} | k = 1, ..., L} is the set of model parameters in category c. By selecting an appropriate model structure m, the predictive distribution is approximated as follows:
p(x^{t}|c,i,j,O) \cong \int p(x^{t}|\theta_{ij}^{(c)},m)\, p(\theta_{ij}^{(c)}|O,m)\, d\theta_{ij}^{(c)},   (2.46)

Therefore, by calculating the integral in Eq. (2.46), the accumulated score of a feature vector sequence x can be computed by summing up each frame score based on the Viterbi algorithm, which enables the input speech to be classified. This predictive distribution based approach, which involves considering the integrals and the true posterior distributions p(θ^{(c)}_{ij}|O, m) in Eq. (2.46), is called the Bayesian inference or Bayesian Predictive Classification (BPC) approach [19, 20]. After acoustic modeling in Section 2.3.3, the optimal VB posterior distributions q(Θ|O, m) are obtained for the optimal model structure. Therefore, VBEC can deal with the integrals in Eq. (2.46) by using the estimated VB posterior distributions q(θ^{(c)}_{ij}|O, m) as follows:
p(x^{t}|c,i,j,O,m) = \int p(x^{t}|\theta_{ij}^{(c)},m)\, q(\theta_{ij}^{(c)}|O,m)\, d\theta_{ij}^{(c)}.   (2.47)

The integral over θ^{(c)}_{ij} can be solved analytically by substituting Eqs. (2.22) and (2.27) into Eq. (2.47). If we consider the marginalization of all the parameters, the analytical result of the Right Hand Side (RHS) of Eq. (2.47) is found to be a mixture distribution based on the Student's t-distribution St(·), as follows:
\mathrm{RHS\ of\ Eq.\ (2.47)} = \frac{\phi_{ij}^{(c)}}{\sum_{j}\phi_{ij}^{(c)}} \sum_{k} \frac{\phi_{jk}^{(c)}}{\sum_{k}\phi_{jk}^{(c)}} \prod_{d} \mathrm{St}\!\left( x_{d}^{t} \,\middle|\, \nu_{jk,d}^{(c)},\ \frac{(1+\xi_{jk}^{(c)})\,R_{jk,d}^{(c)}}{\xi_{jk}^{(c)}\,\eta_{jk}^{(c)}},\ \eta_{jk}^{(c)} \right).   (2.48)

This approach is called VB posterior based BPC (VB-BPC). The use of VB-BPC makes VBEC a total Bayesian framework for speech recognition that possesses a consistent concept whereby all acoustic procedures (acoustic model construction and speech classification) are carried out based on posterior distributions, as shown in Figure 2.5. Figure 2.5 compares the VBEC framework with a conventional approach, ML-BIC/MDL, where the model parameter estimation, model selection and speech classification are based on ML, BIC/MDL and the conventional ML-based Classification (MLC), respectively.
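As an illustration of Eq. (2.48), the following is a minimal sketch of the per-frame VB-BPC log score for one HMM transition, assuming diagonal covariances; the second parameter of St(·) in Eq. (2.48) is treated as a squared scale, and the function name and argument layout are illustrative only.

    import numpy as np
    from scipy.stats import t as student_t

    def vb_bpc_frame_log_score(x, phi_ij, phi_ij_all, phi_jk, nu_jk, xi_jk, eta_jk, R_jk):
        """Log of Eq. (2.48) for one frame x (D,) and one transition i -> j.

        phi_ij     : scalar Dirichlet parameter of the transition i -> j
        phi_ij_all : (J,) Dirichlet parameters of all transitions from state i
        phi_jk     : (L,) Dirichlet parameters of the mixture weights in state j
        nu_jk      : (L, D) posterior means
        xi_jk, eta_jk : (L,) posterior hyperparameters
        R_jk       : (L, D) diagonal posterior scale parameters"""
        log_trans = np.log(phi_ij) - np.log(np.sum(phi_ij_all))
        log_weight = np.log(phi_jk) - np.log(np.sum(phi_jk))            # (L,)
        # Student's t scale per component and dimension (Eq. (2.48))
        scale = np.sqrt((1.0 + xi_jk)[:, None] * R_jk
                        / (xi_jk * eta_jk)[:, None])                    # (L, D)
        log_t = student_t.logpdf(x[None, :], df=eta_jk[:, None],
                                 loc=nu_jk, scale=scale)                # (L, D)
        log_mix = log_weight + log_t.sum(axis=1)                        # (L,)
        return log_trans + np.logaddexp.reduce(log_mix)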

Figure 2.5: Total speech recognition frameworks based on VBEC and ML-BIC/MDL.

2.4 Summary
This chapter described the difference between the Bayesian and conventional ML approaches, explained the general solution of the VB approach, and formulated the total Bayesian framework for speech recognition, VBEC. VBEC is based on four formulations, i.e., model setting, training, selection, and speech classification. Therefore, VBEC performs the model construction process, which includes model setting, training and selection (1st, 2nd and 3rd), and the classification process (4th) based on the Bayesian approach. Thus, we can say that VBEC is a total Bayesian framework for speech recognition that includes three Bayesian advantages, i.e., prior utilization, model selection, and robust classification. The following three chapters confirm the effectiveness of these Bayesian advantages using speech recognition experiments and explain their implementation.


Chapter 3 Bayesian acoustic model construction


3.1 Introduction
The accurate construction of acoustic models contributes greatly to improving speech recognition performance. The conventional framework for acoustic model construction is based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), where the statistical model parameters are estimated by the familiar Maximum Likelihood (ML) approach. The ML-based framework presents problems as regards the construction of acoustic models for real speech recognition tasks. The reasons are that ML often estimates parameters incorrectly when the amounts of training data assigned to the parameters are small (over-training), and that it cannot select an appropriate model structure in principle. This degrades speech recognition performance, and is especially true for spontaneous speech recognition tasks, which exhibit greater speech variability than read speech [18]. The Bayesian approach is based on posterior distribution estimation, which can utilize prior information via a prior distribution, and can select an appropriate model structure based on a posterior distribution for the model structure, which excludes over-trained models. Together, these two advantages totally mitigate the effects of over-training. By using these advantages, the Bayesian approach-based speech recognition framework is expected to solve the problems originating from the limitations of ML. The Variational Bayesian Estimation and Clustering for speech recognition (VBEC) framework for acoustic model construction includes prior setting, the estimation of the Variational Bayesian (VB) posterior distributions based on the VB Baum-Welch algorithm, and the selection of an appropriate model structure using the VBEC objective function, as discussed in Section 2.3.¹ Consequently, VBEC can totally mitigate the over-training effects when constructing acoustic models used in speech recognition. This chapter mainly introduces an implementation of Bayesian acoustic model construction within the VBEC framework. The implementation is discussed in relation to the following three aspects:
¹ VBEC also includes Bayesian predictive classification in the speech classification process, and this will be discussed in Chapter 5.



Section 3.2 describes the derivation of an efficient VB Baum-Welch algorithm designed to improve the computational efficiency of the original VB Baum-Welch algorithm in Section 2.3.2.
Section 3.3 reports the application of VBEC model selection to context-dependent HMM clustering based on the phonetic decision tree method.
Section 3.4 describes the application of VBEC model selection to the determination of the number of mixture components.

Section 3.5 confirms the effectiveness of Bayesian acoustic model construction based on the above implementations using speech recognition experiments. The experiments show the Bayesian advantages of prior utilization and model selection.

3.2 Efficient VB Baum-Welch algorithm


With acoustic model training, the VB forward-backward algorithm imposes the greatest computational load, which is proportional to the product of the number of HMM states I and the number of frames T, similar to the conventional ML forward-backward algorithm. However, the original VB forward-backward algorithm described in Section 2.3.2 requires much more computation than the ML forward-backward algorithm in each frame, and is not suitable for practical use. Therefore, this section introduces an efficient VB forward-backward algorithm that reduces the amount of computation required in each frame by summarizing in advance the factors that do not depend on the frame t. Consequently, a new VB Baum-Welch algorithm is proposed based on the efficient VB forward-backward algorithm.

VB transition probability
The VB forward-backward algorithm is obtained by computing the VB transition probability ζ and the VB occupation probability γ, as discussed in Section 2.3.2. From Eqs. (2.34) and (2.36), the VB transition probability is obtained by computing ã_ij Σ_k w̃_jk b̃_jk(O^t_e), defined in Eq. (2.35), for each frame t. By substituting Eq. (2.35) into ã_ij Σ_k w̃_jk b̃_jk(O^t_e), and by employing the logarithmic function, the concrete form of the equation is represented as follows:
\log \tilde{a}_{ij} \sum_{k} \tilde{w}_{jk}\, \tilde{b}_{jk}(O_{e}^{t})
= \Psi(\phi_{ij}) - \Psi\!\left(\textstyle\sum_{j}\phi_{ij}\right)
+ \log \sum_{k} \exp\left( \Psi(\phi_{jk}) - \Psi\!\left(\textstyle\sum_{k}\phi_{jk}\right) - \frac{D}{2}\log 2\pi - \frac{D}{2\xi_{jk}} + \frac{1}{2}\sum_{d}\left\{ \Psi\!\left(\frac{\eta_{jk}}{2}\right) - \log\frac{R_{jk,d}}{2} - \frac{\eta_{jk}(O_{e,d}^{t}-\nu_{jk,d})^{2}}{R_{jk,d}} \right\} \right)   (3.1)

Equation (3.1) is very complicated, and in addition, the computation of the digamma function Ψ(·) in Eq. (3.1) is very heavy [37]. Therefore, the factors that do not depend on frames are summarized for computation in advance in the initialization or the VB M-step, similar to the calculation of the normalization constant in the conventional ML Baum-Welch algorithm. Then, Eq. (3.1) is simplified as follows:

\log \tilde{a}_{ij} \sum_{k} \tilde{w}_{jk}\, \tilde{b}_{jk}(O_{e}^{t}) = H_{ij} + \log \sum_{k} \exp\left( A_{jk} + (O_{e}^{t}-G_{jk})'\, B_{jk}\, (O_{e}^{t}-G_{jk}) \right)   (3.2)

where
H_{ij} \equiv \Psi(\phi_{ij}) - \Psi\!\left(\textstyle\sum_{j}\phi_{ij}\right)
A_{jk} \equiv \Psi(\phi_{jk}) - \Psi\!\left(\textstyle\sum_{k}\phi_{jk}\right) - \frac{D}{2}\log 2\pi - \frac{D}{2\xi_{jk}} + \frac{1}{2}\sum_{d}\left\{ \Psi\!\left(\frac{\eta_{jk}}{2}\right) - \log\frac{R_{jk,d}}{2} \right\}
G_{jk} \equiv \nu_{jk}
B_{jk} \equiv -\frac{\eta_{jk}}{2}\,(R_{jk})^{-1}   (3.3)

{H, A, G, B} are the frame-invariant factors, which do not depend on frames. Therefore, by using {H, A, G, B} instead of Φ, ζ only requires the computation of Eq. (3.2) for each frame, which is equivalent to computing the likelihood of a mixture of Gaussians with the state transition in the conventional ML computation. Thus, the VB transition probability computation can be simplified to a conventional ML transition probability computation.

VB occupation probability
In a similar way to that used for the VB transition probability ζ, the VB occupation probability γ computation required in each frame can also be simplified to that of the conventional ML approach by using the frame-invariant factors {H, A, G, B}. The logarithmic function of ã_ij w̃_jk b̃_jk(O^t_e), which is required in Eq. (2.38), can be simplified by using these factors as follows:
\log \tilde{a}_{ij}\, \tilde{w}_{jk}\, \tilde{b}_{jk}(O_{e}^{t}) = H_{ij} + A_{jk} + (O_{e}^{t}-G_{jk})'\, B_{jk}\, (O_{e}^{t}-G_{jk}).   (3.4)

These efficient computation methods are also available for the VB Viterbi algorithm. Thus, an efficient VB forward-backward algorithm is realized based on the new computation of the VB transition probability ζ and the VB occupation probability γ, which is computed in the same way as the ML forward-backward algorithm. Therefore, an efficient VB Baum-Welch algorithm is also realized based on this efficient VB forward-backward algorithm, and it is computed in almost the same way as the ML Baum-Welch algorithm, since most of the computation time is used for the forward-backward algorithm. This procedure is shown in Figure 3.1.
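The gain from Eqs. (3.2)-(3.4) is that the digamma and logarithm terms are evaluated once per M-step rather than once per frame. The following is a minimal sketch of this idea for one state j, assuming diagonal covariances; the array shapes and function names are illustrative only.

    import numpy as np
    from scipy.special import digamma, logsumexp

    def precompute_frame_invariant_factors(phi_jk, nu_jk, xi_jk, eta_jk, R_jk):
        """Compute A_jk, G_jk, B_jk of Eq. (3.3) for the L components of one state.

        phi_jk, xi_jk, eta_jk : (L,) posterior hyperparameters
        nu_jk, R_jk           : (L, D) posterior mean and diagonal scale."""
        D = nu_jk.shape[1]
        A = (digamma(phi_jk) - digamma(phi_jk.sum())
             - 0.5 * D * np.log(2.0 * np.pi) - 0.5 * D / xi_jk
             + 0.5 * (digamma(eta_jk / 2.0)[:, None] - np.log(R_jk / 2.0)).sum(axis=1))
        G = nu_jk                                   # (L, D)
        B = -0.5 * eta_jk[:, None] / R_jk           # (L, D) diagonal of B_jk
        return A, G, B

    def log_state_emission(o_t, A, G, B):
        """Per-frame score log sum_k w~_jk b~_jk(o_t) used in Eq. (3.2)."""
        quad = (B * (o_t - G) ** 2).sum(axis=1)     # (o_t - G_jk)' B_jk (o_t - G_jk)
        return logsumexp(A + quad)

H_ij of Eq. (3.3) is a (J, J) table of digamma differences over the transition hyperparameters and can be precomputed in the same way.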

3.3 Clustering context-dependent HMM states using VBEC

Recently, a context-dependent HMM has been widely used as a standard acoustic model in speech recognition. The triphone HMM is often adopted as the context-dependent model, which considers the preceding and following phoneme contexts as well as the center phoneme. Since there are a large number of triphone contexts, it is almost impossible to collect a sufficient amount of training data to estimate all the parameters of the triphone HMM states, and this data insufficiency causes over-training.


Figure 3.1: Efficient VB Baum-Welch algorithm.

To solve these problems, there are certain methods for sharing parameters over several triphone HMM states by a clustering approach [38-41]. Clustering triphone HMM states corresponds to the appropriate selection of the sharing structure of the states and the total number of shared states. Therefore, this clustering can be regarded as model selection. Conventionally, the ML criterion has been used as the model selection criterion. However, the ML criterion requires the number of shared states or the likelihood gain to be set experimentally as a threshold. This is because the likelihood value increases monotonically as the number of model parameters increases, and always leads to the selection of the model structure with the largest number of parameters in the sense of ML. Therefore, the ML criterion is not suitable for such model structure selection. Information criterion approaches typified by the Akaike information criterion [42], the Minimum Description Length (MDL) [43] and the Bayesian Information Criterion (BIC) [44] can be used to select an appropriate model structure. In speech recognition, BIC/MDL based acoustic modeling approaches have been widely used [21, 45-48] to deal with acoustic model selection. However, these criteria are derived based on an asymptotic condition, and are only effective when there is a sufficient amount of training data. Practical acoustic modeling often encounters cases where the amount of training data is small, and therefore, a method is desired that is not limited by the amount of training data.


Figure 3.2: A set of all triphone HMM states */a/* in the i-th state sequence is clustered based on the phonetic decision tree method.

Figure 3.3: Splitting a set of triphone HMM states in node n into two sets, in the yes node n^Q_Y and the no node n^Q_N, by answering phonetic question Q.

On the other hand, VBEC includes prior utilization as well as model selection, as shown in Table 1.1, and can overcome the problems found with the conventional ML and BIC/MDL approaches. This section explains the application of VBEC model selection to triphone HMM clustering by using the VBEC objective function described in Section 2.3.3.

3.3.1 Phonetic decision tree clustering


The phonetic decision tree method has been widely used to cluster triphone HMM states effectively by utilizing a phonetic knowledge-based constraint and a binary-tree search. This section provides a general discussion of the conventional phonetic decision tree method. The original ideas behind phonetic decision tree construction are familiar because its first proposal used likelihood as the objective function [41].


Figure 3.4: Tree structure in each HMM state.

An appropriate choice of phonetic question at each node split allows a decision tree to grow properly, and appropriate state clusters become represented in its leaf nodes, as shown in Figure 3.2. A phonetic question concerns the preceding and following phoneme context, and is obtained from knowledge of phonetics. Table 3.1 shows example questions. When node n is split into yes (n^Q_Y) and no (n^Q_N) nodes according to question Q, as shown in Figure 3.3, an appropriate question Q̂(n) is chosen from a set of questions so that the split gives the largest gain in an arbitrary objective function H^m, i.e.,
\hat{Q}(n) = \mathop{\mathrm{argmax}}_{Q}\, \Delta H^{Q(n)},   (3.5)
where
\Delta H^{Q(n)} \equiv H^{n_{Y}^{Q}} + H^{n_{N}^{Q}} - H^{n}   (3.6)

is the overall gain in the objective function when node n is split by Q. Thus, a decision tree is produced specifically for each state in the sequence, and the trees are independent of each other, as shown in Figure 3.4. The arbitrary objective function H^n in node n is computed from the sufficient statistics Γ^n in node n by assuming the following constraint.
(C1) Data alignments for each state are fixed during the splitting process.

Table 3.1: Examples of questions for phoneme /a/
Question                        Yes                          No
Preceding phoneme is vowel?     {a, i, u, e, o}/a/{all}      otherwise
Following phoneme is plosive?   {all}/a/{p, b, t, d, k, g}   otherwise


Under this constraint, the 0th, 1st and 2nd order statistics of node n (Γ^n ≡ {O^n, M^n, V^n}) are computed by simply summing up the sufficient statistics of state j (O_j, M_j and V_j) using the following equation:
O^{n} = \sum_{j \in n} O_{j}, \qquad M^{n} = \sum_{j \in n} M_{j}, \qquad V^{n} = \sum_{j \in n} V_{j}   (3.7)
where j represents a non-clustered triphone HMM state, and is included in the set of triphone HMM states in node n. Here, O_j, M_j and V_j are calculated by using the occupation probability γ and the feature vectors O as follows:
O_{j} = \sum_{e,t} \gamma_{e,j}^{t}, \qquad M_{j} = \sum_{e,t} \gamma_{e,j}^{t}\, O_{e}^{t}, \qquad V_{j} = \sum_{e,t} \gamma_{e,j}^{t}\, O_{e}^{t} (O_{e}^{t})'   (3.8)
Therefore, once the statistics Γ_j ≡ {O_j, M_j, V_j} are prepared for all possible triphone HMM states, the statistics for any node can be easily calculated using Eq. (3.7) under constraint (C1). Here, the occupation probability γ^t_{e,j} is obtained by the forward-backward or Viterbi algorithm within the VB or ML framework. This reduces the computation time to a practical level. The following three sections derive the concrete form of the gain in the objective function ΔH^{Q(n)} based on the ML, BIC/MDL, and VBEC approaches.
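The following is a minimal sketch of this bottom-up accumulation, assuming that the per-state statistics of Eq. (3.8) are already available and that covariances are diagonal; the function and variable names are illustrative only.

    import numpy as np

    def node_statistics(state_stats, states_in_node):
        """Sum the per-state sufficient statistics (Eq. (3.8)) over the triphone
        HMM states contained in a node, as in Eq. (3.7).

        state_stats    : dict mapping state id -> (O_j, M_j, V_j), where O_j is a
                         scalar occupancy, M_j a (D,) vector and V_j a (D,) vector
                         of diagonal second-order statistics
        states_in_node : iterable of state ids belonging to the node"""
        O_n, M_n, V_n = 0.0, 0.0, 0.0
        for j in states_in_node:
            O_j, M_j, V_j = state_stats[j]
            O_n += O_j
            M_n += np.asarray(M_j, dtype=float)
            V_n += np.asarray(V_j, dtype=float)
        return O_n, M_n, V_n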

3.3.2 Maximum likelihood approach


The gain of log-likelihood ΔL^{Q(n)} is simply obtained using the sufficient statistics under the following two constraints:
(C2) The output distribution in a state is represented by a normal distribution.
(C3) The contribution of the state transition to the objective functions is disregarded.
Let O_j = {O^t_e ∈ R^D | e, t ∈ j} be a set of feature vectors that are assigned to state j by the Viterbi algorithm, and O^n = {O_j | j ∈ n} be a set of feature vectors that are assigned to node n. From the constraints, the log-likelihood L^n for a training data set assigned to a state set in node n is expressed by the following D-dimensional normal distribution:
L^{n} = \log p(O^{n} \mid \mu^{n}, \Sigma^{n}) = \log \prod_{j\in n}\prod_{e,t\in j} \mathcal{N}(O_{e}^{t} \mid \mu^{n}, \Sigma^{n})
= \log \prod_{j\in n}\prod_{e,t\in j} (2\pi)^{-\frac{D}{2}} |\Sigma^{n}|^{-\frac{1}{2}} \exp\left( -\frac{1}{2}(O_{e}^{t}-\mu^{n})'(\Sigma^{n})^{-1}(O_{e}^{t}-\mu^{n}) \right)   (3.9)
where μ^n and Σ^n denote a D-dimensional mean vector and a D × D diagonal covariance matrix for the data set O^n in n, respectively. From Eq. (3.9), the ML estimates of μ^n and Σ^n can be obtained using the sufficient statistics Γ^n in node n in Eq. (3.7) as follows:
\mu^{n} = \frac{M^{n}}{O^{n}}, \qquad \Sigma^{n} = \mathrm{diag}\!\left[ \frac{V^{n}}{O^{n}} - \frac{M^{n}}{O^{n}}\left(\frac{M^{n}}{O^{n}}\right)' \right]   (3.10)


Therefore, the gain of log-likelihood ΔL^{Q(n)} can be solved as follows [41]:
\Delta L^{Q(n)} = L^{n_{Y}^{Q}} + L^{n_{N}^{Q}} - L^{n} = l(n_{Y}^{Q}) + l(n_{N}^{Q}) - l(n).   (3.11)
Here l(n) in Eq. (3.11) is defined as:
l(n) = -\frac{1}{2} O^{n} \log |\Sigma^{n}| = -\frac{1}{2} O^{n} \log \left| \mathrm{diag}\!\left[ \frac{V^{n}}{O^{n}} - \frac{M^{n}}{O^{n}}\left(\frac{M^{n}}{O^{n}}\right)' \right] \right|   (3.12)
Equations (3.11) and (3.12) show that ΔL^{Q(n)} can be calculated using the sufficient statistics Γ^n in node n. Therefore, an appropriate question Q̂(n) for node n can be selected by:
\hat{Q}(n) = \mathop{\mathrm{argmax}}_{Q}\, \Delta L^{Q(n)}.   (3.13)

However, ΔL^{Q(n)} is always positive for any split under the ML criterion, which therefore always selects the model structure with the largest number of states; namely, no states are shared at all. To avoid this, the ML criterion requires the following threshold to be set manually to stop splitting:
\Delta L^{Q(n)} \leq \text{Threshold}.   (3.14)

There are other manual approaches that stop splitting by setting the total number of states or the maximum depth of the tree, as well as hybrid approaches. However, the effectiveness of the thresholds in all of these manual approaches has to be judged on the basis of experimental results.
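The split gain of Eqs. (3.11) and (3.12) can be computed directly from the node statistics of Eq. (3.7). The following is a minimal sketch under the diagonal-covariance constraint (C2); the small variance floor is added only for numerical safety and is not part of the original formulation, and the function names are illustrative.

    import numpy as np

    def ml_node_score(O_n, M_n, V_n, var_floor=1e-6):
        """l(n) of Eq. (3.12) for diagonal covariances (V_n holds diagonal elements)."""
        mean = M_n / O_n
        var = np.maximum(V_n / O_n - mean ** 2, var_floor)   # diagonal of Sigma^n
        return -0.5 * O_n * np.sum(np.log(var))

    def ml_split_gain(stats_yes, stats_no, stats_parent):
        """Delta L^{Q(n)} of Eq. (3.11); each argument is a tuple (O_n, M_n, V_n)."""
        return (ml_node_score(*stats_yes) + ml_node_score(*stats_no)
                - ml_node_score(*stats_parent))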

3.3.3 Information criterion approach


This subsection considers automatic model selection based on an information criterion using BIC/MDL, which is widely used as the model selection criterion for various aspects of statistical modeling. The gain of the objective function ΔL^{Q(n)}_{BIC/MDL} using BIC/MDL, obtained when splitting a state set by using question Q, is as follows:
\Delta L_{BIC/MDL}^{Q(n)} = \Delta L^{Q(n)} - \lambda\, \frac{\#(\Theta^{n})}{2} \log O^{r},   (3.15)
where λ is a tuning parameter in BIC/MDL, and O^r denotes the number of frames of data assigned to the root node. #(Θ^n) is the number of free parameters used in node n. From the constraints, the free parameters are a D-dimensional mean vector and a D × D diagonal covariance matrix, and therefore #(Θ^n) = 2D. Equation (3.15) suggests that the BIC/MDL objective function penalizes the gain in log-likelihood on the basis of the balance between the number of free parameters and the amount of training data, and the penalty can be controlled by varying λ (in the original definitions, λ = 1 [44], [43]).


Model structure selection is achieved according to the amount of training data by using ΔL^{Q(n)}_{BIC/MDL} instead of ΔL^{Q(n)} in Eq. (3.13), and by stopping splitting when:
\Delta L_{BIC/MDL}^{Q(n)} \leq 0.   (3.16)
Therefore, the BIC/MDL approach does not require a threshold, unlike the ML approach. However, BIC/MDL is an asymptotic criterion that is theoretically effective only when the amount of training data is sufficiently large. Therefore, for a small amount of training data, model selection does not perform well because of the uncertainty of the ML estimates. The next section aims at solving the problem caused by a small amount of training data by using VBEC.
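Eq. (3.15) only adds a penalty to the ML split gain of Eq. (3.11). The following is a minimal, self-contained sketch of this penalized gain; the function name is illustrative, and the ML gain is assumed to be computed separately (for instance with the sketch in the previous section).

    import math

    def bic_mdl_split_gain(delta_L, O_root, D, lam=1.0):
        """Eq. (3.15): penalize the ML split gain delta_L of Eq. (3.11) by
        lam * #(Theta^n)/2 * log(O^r), where #(Theta^n) = 2D for one Gaussian
        with a diagonal covariance.  Splitting stops when the returned value
        is <= 0 (Eq. (3.16))."""
        return delta_L - lam * D * math.log(O_root)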

3.3.4 VBEC approach


Similar to Eq. (3.13), an appropriate question at each split is chosen to increase the VBEC objective function F^m in the VBEC framework. When node n is split into the yes node (n^Q_Y) and the no node (n^Q_N) by question Q, the appropriate question Q̂(n) is chosen from a set of questions as follows:
\hat{Q}(n) = \mathop{\mathrm{argmax}}_{Q}\, \Delta F^{Q(n)},   (3.17)
where ΔF^{Q(n)} is the gain in the VBEC objective function when node n is split by Q. The question is chosen to maximize the gain in F^m obtained by splitting. The VBEC objective function for phonetic decision tree construction is also simply calculated under the same constraints as the ML approach ((C1), (C2), and (C3)). By using these conditions, the objective function is obtained without iterative calculations, which reduces the calculation time. Under condition (C1), the latent variable part of F^m can be disregarded, i.e.,
F^{m} \approx F_{\Theta}^{m}.   (3.18)
In the VBEC objective function of the model parameters F^m_Θ (Eq. (2.42)), the factors of the posterior parameters φ_ij and φ_jk can also be disregarded under conditions (C1) and (C2), where φ_ij and φ_jk are related to the transition probabilities and the weight factors, respectively. Therefore, the objective function F^n in node n for the assigned data set O^n can be obtained from F^m_Θ in Eq. (2.42) as follows:
F^{n} = \log\left[ (2\pi)^{-\frac{O^{n}D}{2}} \left(\frac{\xi^{n,0}}{\xi^{n}}\right)^{\frac{D}{2}} \left(\frac{\Gamma(\eta^{n}/2)}{\Gamma(\eta^{n,0}/2)}\right)^{D} \frac{|R^{n,0}/2|^{\frac{\eta^{n,0}}{2}}}{|R^{n}/2|^{\frac{\eta^{n}}{2}}} \right]   (3.19)

where {ξ^n, ν^n, η^n, R^n} (≡ Φ^n) is a subset of the posterior parameters in Eq. (2.30), and is represented by:
\xi^{n} = \xi^{n,0} + O^{n}
\nu^{n} = \frac{\xi^{n,0}\nu^{n,0} + M^{n}}{\xi^{n,0} + O^{n}}
\eta^{n} = \eta^{n,0} + O^{n}
R^{n} = \mathrm{diag}\!\left[ R^{n,0} + V^{n} - \frac{1}{O^{n}}M^{n}(M^{n})' + \frac{\xi^{n,0}O^{n}}{\xi^{n,0}+O^{n}}\left(\frac{M^{n}}{O^{n}}-\nu^{n,0}\right)\left(\frac{M^{n}}{O^{n}}-\nu^{n,0}\right)' \right]   (3.20)


O^n, M^n and V^n are the sufficient statistics in node n, as defined in Eq. (3.7). Here, {ξ^{n,0}, ν^{n,0}, η^{n,0}, R^{n,0}} (≡ Φ^{n,0}) is a set of prior parameters. In our experiments, the prior parameters ν^{n,0} and R^{n,0} are set by using the monophone (root node) HMM state statistics (O^r, M^r and V^r) as follows:
\nu^{n,0} = \frac{M^{r}}{O^{r}}, \qquad R^{n,0} = \eta^{n,0}\left( \frac{V^{r}}{O^{r}} - \nu^{n,0}(\nu^{n,0})' \right)   (3.21)

The other parameters ξ^{n,0} and η^{n,0} are set manually. By substituting Eq. (3.20) into Eq. (3.19), the gain ΔF^{Q(n)} obtained when n is split into n^Q_Y and n^Q_N by question Q is
\Delta F^{Q(n)} = f(\Phi^{n_{Y}^{Q}}) + f(\Phi^{n_{N}^{Q}}) - f(\Phi^{n}) - f(\Phi^{n,0}),   (3.22)
where f(·) is defined by:
f(\Phi) \equiv -\frac{D}{2}\log\xi - \frac{\eta}{2}\log|R| + D\log\Gamma\!\left(\frac{\eta}{2}\right).   (3.23)

Note that the terms that do not contribute to F Q(n) are disregarded. Node splitting stops when the condition: F Q(n) 0, (3.24) is satised similar to the BIC/MDL approach. A model structure based on the VBEC framework can be obtained by executing this construction for all trees, resulting in the maximization of total F m . This implementation based on the phonetic decision tree method does not require iterative calculations, and can construct clustered-state HMMs efciently. There is another major method for the construction of clustered-state HMMs that uses successive state splitting algorithm, and which does not remove latent variables in HMMs [39, 40]. Therefore, this requires the VB BaumWelch algorithm and the calculation of latent variable part of the VBEC objective function for each splitting. This is realized as the VB SSS algorithm by [49]. The relationship between the VBEC model selection and the conventional BIC/MDL model selection based on Eqs. (3.22) and (3.15), respectively, is discussed. Based on the condition of a sufciently large data set, n , n On , n Mn /On , and Rn V n Mn (Mn ) /On , in addition, log ( n /2) (On /2) log(On /2) On /2 in Eq.(3.22). Then, an asymptotic form of Eq.(3.22) is composed of a log-likelihood gain term and a penalty term depending on the number of free parameters and the amount of training data i.e. the asymptotic form becomes the BIC/MDLtype objective function form 2 . Therefore, VBEC theoretically involves the BIC/MDL objective function and the BIC/MDL is asymptotically equivalent to VBEC, which displays the advantages of VBEC, especially for small amounts of training data.

3.4 Determining the number of mixture components using VBEC


Once the clustered-state model structure is obtained, the acoustic model selection is completed by determining the number of mixture components per state. GMMs include latent variables, and their determination requires the VB Baum-Welch algorithm and the computation of the latent variable part of the VBEC objective function, unlike the clustering of triphone HMM states in Section 3.3.

Figure 3.5: Acoustic model selection of VBEC: two-phase procedure (first phase: clustering triphone HMM states with a single Gaussian per state; second phase: determining the number of mixture components, either (A) fixed-number or (B) varying-number).

Therefore, this section deals with the determination of the number of GMM components per state by considering the latent variable effects. This is the first research to apply VB model selection to GMMs in speech recognition,³ which corresponds to the first research showing the effectiveness of VB model selection in latent variable models in speech recognition [52, 53]. Later, the effectiveness of VB model selection in latent variable models was confirmed in [49] for the successive state splitting algorithm, and the effectiveness of VB model selection for GMMs was re-confirmed in [54]. In general, there are two methods for determining the number of mixture components, as shown in Figure 3.5. With the first method, the number of mixture components per state is the same for all states. The objective function F^m is calculated for each number of mixture components, and the number of mixture components that maximizes the total F^m is determined to be the appropriate one (fixed-number GMM method). With the second method, the number of mixture components per state can vary by state; here, Gaussians are split and merged to increase F^m and determine the number of mixture components in each state (varying-number GMM method). A model obtained by the varying-number GMM method is expected to be more accurate than one obtained by the fixed-number GMM method, although the varying-number GMM method requires more computation time.
³ Other applications for determining the number of mixture components using VB have already been proposed in [50, 51].
We require the VBEC objective function for each state to determine the number of mixture components. In this case, the state alignments vary and the states are expressed as GMMs. Therefore, the model includes latent variables and the component F^m_{S,V} cannot be disregarded, unlike the case of triphone HMM state clustering. However, since the number of mixture components is determined for each state and the state alignments do not change greatly, the contribution of the state transitions to the objective function is expected to be small and can be ignored. Therefore, the objective function F^m for a particular state j is represented as follows:
(F^{m})_{j} = (F_{\Theta}^{m})_{j} - (F_{V}^{m})_{j}   (3.25)

where
(F_{\Theta}^{m})_{j} = \log \frac{\Gamma(L\phi^{0})\prod_{k=1}^{L}\Gamma(\phi_{jk})}{\Gamma\!\left(\sum_{k=1}^{L}\phi_{jk}\right)\,\Gamma(\phi^{0})^{L}} + \sum_{k}\left[ -\frac{\bar{\gamma}_{jk}D}{2}\log 2\pi + \frac{D}{2}\log\frac{\xi_{jk}^{0}}{\xi_{jk}} + \sum_{d}\left\{ \frac{\eta_{jk}^{0}}{2}\log\frac{R_{jk,d}^{0}}{2} - \frac{\eta_{jk}}{2}\log\frac{R_{jk,d}}{2} + \log\frac{\Gamma(\eta_{jk}/2)}{\Gamma(\eta_{jk}^{0}/2)} \right\} \right]   (3.26)

and
(F_{V}^{m})_{j} = \sum_{k} \bar{\gamma}_{jk} \log \tilde{w}_{jk} + \sum_{k}\sum_{e,t} \gamma_{e,jk}^{t} \log \tilde{b}_{jk}(O_{e}^{t}).   (3.27)

Therefore, with the fixed-number GMM method, the total F^m is obtained by summing up (F^m)_j over all states, which determines the number of mixture components per state. With the varying-number GMM method, the change in (F^m)_j per state is compared after merging or splitting Gaussians, which also determines the number of mixture components. The number of mixture components can also be determined automatically by using the BIC/MDL objective function [47], [48]. However, the BIC/MDL objective function is based on the asymptotic condition, and VBEC theoretically involves this function, which parallels the discussion in the last paragraph of Section 3.3.4.
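The fixed-number GMM selection can be written as a simple search over candidate mixture sizes. The sketch below assumes that training a model whose states all have L components and evaluating its total objective F^m (the sum of Eqs. (3.26) and (3.27) over states, combined with the sign of Eq. (3.25)) are available as a single callable; the names are illustrative only.

    def select_fixed_number_gmm(train_and_score, candidate_sizes):
        """Pick the number of mixture components per state that maximizes the
        total VBEC objective F^m (fixed-number GMM method).

        train_and_score : callable L -> F^m for a model with L components per state
        candidate_sizes : iterable of mixture sizes to try, e.g. range(1, 31)"""
        scores = {L: train_and_score(L) for L in candidate_sizes}
        best_L = max(scores, key=scores.get)
        return best_L, scores

The varying-number GMM method replaces this global search with per-state split and merge moves that are accepted only when they increase (F^m)_j.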

3.5 Experiments
This section deals experimentally with the two Bayesian advantages realized by VBEC. Sections 3.5.1 and 3.5.2 examine prior utilization for acoustic model construction. Sections 3.5.3, 3.5.4, and 3.5.5 examine the validity of model selection by comparing the value of the VBEC objective function with the recognition performance.

Experimental conditions
Two sets of speech recognition tasks were used in the experiments. The first set was an Isolated Word Recognition (IWR) task, as shown in Table 3.2, where ASJ continuous speech was used as the training data and the JEIDA 100 city name speech corpus was used as the test data. For the IWR task, the model parameters were trained based on the ML approach, i.e., the IWR task only adopted the prior setting and the model selection for acoustic model construction, to evaluate the sole effectiveness of the VBEC model selection. The second set was a Large Vocabulary Continuous Speech Recognition (LVCSR) task, as shown in Table 3.3, where Japanese Newspaper Article Sentences (JNAS) were used as the training and test data. Unlike the IWR task, the LVCSR task adopted all the Bayesian acoustic model procedures (setting, training, and selection) for the model construction.

Table 3.2: Experimental conditions for the isolated word recognition task
Sampling rate                  16 kHz
Quantization                   16 bit
Feature vector                 12-order MFCC + 12-order ΔMFCC (24 dimensions)
Window                         Hamming
Frame size/shift               25/10 ms
Number of HMM states           3 (left-to-right HMM)
Number of phoneme categories   27
Training data                  ASJ continuous speech sentences, 3,000 sentences (male 30)
Test data                      JEIDA 100 city names, 2,500 words (male 25)
ASJ: Acoustical Society of Japan
JEIDA: Japan Electronic Industry Development Association

Prior parameter setting
The prior parameters ν^0 and R^0 were set from the mean and covariance statistics of the monophone HMM state, respectively. The other parameters were set as φ^0 = ξ^0 = 0.01 and η^0 = 2.0, as shown in Table 3.4. The tuning of the prior parameters is discussed in Section 3.5.2, and the above setting was used in the other experimental sections.

3.5.1 Prior utilization


Two experiments using the IWR and LVCSR tasks were conducted to compare the VBEC model selection with the conventional BIC/MDL selection for various amounts of training data. Several subsets were randomly extracted from the training data set, and each of the subsets was used to construct a set of clustered triphone HMMs. Then, the selected models were trained by using the ML Baum-Welch algorithm for IWR, and by using the VB Baum-Welch algorithm for LVCSR. All output distributions were organized as single Gaussians, so that the effect of model selection could be evaluated exclusively. Figure 3.6 shows the recognition rate and Figure 3.7 shows the total number of states in a set of clustered triphone HMMs, according to the amount of training data in the IWR task. Figures 3.8 and 3.9 show the same information for the LVCSR task. In the IWR task, as shown in Figure 3.6, when the number of training sentences was less than 60, our method greatly outperformed the ML-BIC/MDL (λ = 1) method by as much as 50%. With such a small amount of training data, the total number of clustered states differed greatly between the VBEC and BIC/MDL model selections (Figure 3.7). Similar behavior was also confirmed from Figures 3.8 and 3.9. This suggests that VBEC model selection determined the model structure more appropriately than ML-BIC/MDL (λ = 1) selection. Next, in the IWR task, the penalty term of ML-BIC/MDL in Eq. (3.15) was adjusted to λ = 4

Table 3.3: Experimental conditions for the LVCSR (read speech) task
Sampling rate                  16 kHz
Quantization                   16 bit
Feature vector                 12-order MFCC + ΔMFCC + Energy + ΔEnergy (26 dimensions)
Window                         Hamming
Frame size/shift               25/10 ms
Number of HMM states           3 (left to right)
Number of phoneme categories   43
Training data                  JNAS 20,000 utterances, 34 hours (male)
Test data                      JNAS 100 utterances, 1,583 words (male)
Language model                 Standard trigram (10 years of newspapers)
Vocabulary size                20,000
Perplexity                     64.0
JNAS: Japanese Newspaper Article Sentences

Table 3.4: Prior distribution parameters. O^r, M^r and V^r denote the 0th, 1st, and 2nd order statistics of the root node (monophone HMM state), respectively.
Prior parameter   Setting value
η^0               2.0
φ^0               0.01
ξ^0               0.01
ν^0               Mean statistics in root node r (M^r/O^r)
R^0               Variance statistics in root node r: η^0 (V^r/O^r − ν^0(ν^0)')

so that the total numbers of states and recognition rates with a small amount of data were as close as possible to those of VBEC model selection. Nevertheless, VBEC model selection resulted in an improvement of about 2% in the word recognition rates when the number of training sentences was 25 to 1,500, as shown in the enlarged view in Figure 3.6. This is because BIC/MDL (λ = 4) selected a smaller number of shared states due to the higher penalty, and the model structure was less precise than with VBEC model selection. In fact, Figure 3.7 shows that there is a great difference between the numbers of states for the VBEC and BIC/MDL (λ = 4) model selections. Similarly, in the LVCSR task, the tuning parameter in ML-BIC/MDL was altered from λ = 1 to λ = 4, which tuned ML-BIC/MDL to select the total number of states closest to that of VBEC and to score the highest word accuracy for small amounts of data (fewer than 1,000 utterances). Although the tuned ML-BIC/MDL improved slightly for small amounts of data (fewer than 600 utterances), VBEC still scored about 10 points higher. This suggests that VBEC could cluster HMM states more appropriately than ML-BIC/MDL, i.e., VBEC selected appropriate questions in Eq. (3.17), even when the resultant total numbers of clustered states were similar.

Figure 3.6: The left figure shows recognition rates according to the amount of training data based on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure shows an enlarged view of the left figure for 25 to 1,500 utterances. The horizontal axis is scaled logarithmically.

Figure 3.7: Number of splits according to the amount of training data (2-3,000 sentences).

Consequently, VBEC enabled the automatic selection of triphone HMM state clustering with any amount of training data, unlike the ML-BIC/MDL methods, and exhibited considerable superiority especially with small amounts of training data. This provides experimental support for the relationship between VBEC and ML-BIC/MDL, whereby VBEC theoretically involves ML-BIC/MDL and ML-BIC/MDL is asymptotically equivalent to VBEC; this guarantees that VBEC is theoretically superior to ML-BIC/MDL in model selection because of the prior utilization advantage. The small-data superiority of VBEC should be effective for acoustic model adaptation [55] and for extremely large recognition tasks, where the amount of training data per acoustic model parameter would be small because of the large speech variability.


Figure 3.8: The left figure shows recognition rates according to the amount of training data based on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The right figure shows an enlarged view of the left figure for more than 1,000 utterances. The horizontal axis is scaled logarithmically.

Figure 3.9: Total number of clustered states according to the amounts of training data based on VBEC, ML-BIC/MDL and tuned ML-BIC/MDL. The horizontal and vertical axes are scaled logarithmically.

3.5.2 Prior parameter dependence


The dependence of the recognition rate on the prior parameter values was examined experimentally. The IWR task was used for the experiment because its amount of training data was smaller than that of the LVCSR task, making it easier to examine various prior parameter values. ν^0 and R^0 can be obtained from the mean vector and the covariance matrix statistics in the root node, and heuristics for their setting were thus removed. However, ξ^0 and η^0 remain as heuristic parameters. Therefore, it is important to examine how robustly the triphone HMMs are clustered by VBEC model selection against changes in the prior parameter values with various amounts of training data. The values of the prior parameters ξ^0 and η^0 were varied from 10^-5 to 10, and the word recognition rates were examined in two typical cases: one where the amount of data was very small (10 sentences) and another where it was fairly large (150 sentences). Tables 3.5 and 3.6 show the recognition rates for each combination of prior parameters. The prior parameter values that yield acceptable performance seem to be distributed broadly for both very small and fairly large amounts of


training data. Here, approximately 20 of the best recognition rates are highlighted in each table. The combinations of prior parameter values that yielded the best recognition rates were alike for the two different amounts of training data. Namely, appropriate combinations of prior parameter values can consistently achieve high performance regardless of the amount of training data. In summary, the values of the prior parameters do not greatly influence the quality of the clustered triphone HMMs. This suggests that it is not necessary to be very careful when selecting the values of the prior parameters when the VBEC objective function is applied to speech recognition.

Table 3.5: Recognition rates for each prior distribution parameter. The model was trained using data consisting of 10 sentences.
η^0 \ ξ^0   10^1    10^0    10^-1   10^-2   10^-3   10^-4   10^-5
10^1        1.0     2.0     1.5     1.7     1.7     1.7     1.7
10^0        1.0     1.0     2.2     31.2    60.3    66.5    66.1
10^-1       47.2    66.3    65.9    66.1    66.2    66.6    66.7
10^-2       66.8    65.9    66.2    66.5    66.7    66.3    66.1
10^-3       65.6    66.5    66.7    66.3    66.1    65.5    65.5
10^-4       66.3    66.1    66.1    65.5    65.5    64.6    64.6
10^-5       66.1    65.5    66.5    64.6    64.6    64.6    64.6

Table 3.6: Recognition rates for each prior distribution parameter. The model was trained using data consisting of 150 sentences.
η^0 \ ξ^0   10^1    10^0    10^-1   10^-2   10^-3   10^-4   10^-5
10^1        14.7    17.2    17.3    17.5    18.3    21.3    23.4
10^0        23.3    22.0    49.3    83.5    92.5    94.1    94.0
10^-1       94.7    93.5    94.3    94.4    93.8    93.2    93.4
10^-2       94.0    94.0    93.9    93.2    93.3    92.3    92.5
10^-3       94.0    93.1    93.3    92.3    92.5    92.3    92.4
10^-4       93.3    92.3    92.5    92.3    92.4    92.2    92.2
10^-5       92.1    92.2    92.4    92.2    92.2    92.3    92.3

3.5.3 Model selection for HMM states

Since VBEC selects the model that gives the highest value of the VBEC objective function F^m, the validity of the model selection can be evaluated by examining the relationship between F^m and recognition performance. The validity was examined with the IWR task (Figure 3.10) and the LVCSR task (Figure 3.11). Figure 3.10 (a) plots the recognition rates and the summation of F^m over all phonemes (hereafter, F^m is used to indicate the summation of F^m over all phonemes) for several sets of clustered-state triphone HMMs in the IWR task. Here, a set of HMMs was obtained by controlling the maximum tree depth uniformly for all trees without arresting the splitting, and all output pdfs were organized by single Gaussians, so that the effect of clustering could be evaluated exclusively.


Figure 3.10: Objective functions and recognition rates according to the number of clustered states: (a) VBEC, (b) ML.

The results clearly showed that F^m and the word accuracy behaved very similarly, i.e., both continued to increase until they reached their peaks at around 1,500 states and then decreased. The same type of examination was carried out for the log-likelihood and the recognition rate (Figure 3.10 (b)). The log-likelihood continued to increase monotonically while the word accuracy decreased after reaching its peak at around 1,500 states, and the log-likelihood could not provide any information for the automatic selection of an appropriate structure. This was due to the nature of the likelihood measure, i.e., the more complicated the model structure is, the higher the likelihood becomes. Similar behavior was obtained using the LVCSR task (Figure 3.11). These results indicate that the VBEC objective function is valid for model selection as regards clustering the triphone HMM states.

3.5.4 Model selection for Gaussian mixtures


The above experiments showed the effectiveness of triphone HMM state clustering using VBEC. The next experiment shows the effectiveness of another aspect of acoustic model selection, i.e., the determination of the number of mixture components using VBEC. In the following sections, only the LVCSR task was used for the experiments, because the determination of the number of mixture components requires the VB Baum-Welch algorithm, as discussed in Section 3.4, which was only used in the LVCSR task. In VBEC, the number of mixture components is determined so that F^m has the highest value, similar to the triphone HMM state clustering described in Section 3.5.3. Therefore, it is important to examine the relationship between F^m and recognition performance for each number of mixture components.


Figure 3.11: Objective functions and word accuracies according to the increase in the total number of clustered triphone HMM states: (a) VBEC, (b) ML.

30 sets of GMMs were prepared with the same clustered-state structure obtained using all the training data in Section 3.5.1. Figure 3.12 shows the word accuracy and the objective function F^m for each number of mixture components. In this experiment, the number of mixture components was the same for all clustered states (fixed-number GMM method). From Figure 3.12, the behavior of F^m and the word accuracy had similar contours, i.e., from 1 to 9 Gaussians, both increased gradually as the number of mixture components increased; at 9 Gaussians, both peaks were identical; and finally, for more than 9 Gaussians, both decreased gradually as the number of mixture components increased, due to the over-training effects. Therefore, VBEC could also select an appropriate model structure automatically for GMMs, and the selected model scored a word accuracy of 91.1 points, which is sufficient for practical use. The varying-number GMM method is a promising approach to more accurate modeling, and takes full advantage of automatic selection using the VBEC objective function. Namely, it is almost impossible to obtain manually an appropriate combination of varying-number GMMs, each of which can have a different number of components. The varying-number GMM method was applied to the same clustered-state triphone HMMs. The total number of mixture components was then 20,226, while the total number was 39,725 for the fixed-number GMM, and the word accuracy improved by 0.4 point to 91.5. The varying-number GMM thus improved the performance with a smaller total number of mixture components.

Figure 3.12: Total objective function F m and word accuracy according to the increase in the number of mixture components per state.

Table 3.7: Word accuracies for total numbers of clustered states and Gaussians per state. The contour graph on the right is obtained from these results. The recognition result obtained with the best manual tuning with ML was 92.0 and that obtained automatically with VBEC was 91.1.

# G \ # S    129    500    1,000   2,000   3,000   4,000   6,000   8,000
40           79.7   90.9   91.0    91.5    89.6    88.7    85.0    81.5
30           78.6   90.6   91.7    92.0    91.4    90.1    86.5    84.5
20           77.3   90.4   91.7    91.9    91.4    91.6    88.5    86.6
16           75.4   90.0   90.8    91.7    91.8    91.2    88.1    87.7
12           74.7   89.5   91.3    91.5    91.5    90.4    90.3    89.0
8            71.4   88.4   91.2    90.7    90.8    90.9    90.1    89.6
4            67.2   87.7   88.9    91.2    90.6    90.4    90.7    89.4
1            49.9   82.1   84.3    85.5    84.8    86.0    85.7    85.5
# G: number of mixture components per state, # S: total number of clustered states
3.5.5 Model selection over HMM states and Gaussian mixtures


Sections 3.5.3 and 3.5.4 confirmed that VBEC could construct appropriate clustered-state triphone HMMs with GMMs. The interest then turned to how closely an automatically selected VBEC model over clustered states and Gaussian mixtures would approach the best manually obtained model provided by the ML framework. Here, the automatic selection was carried out with a two-phase procedure: triphone HMM states were clustered, and then the number of mixture components per state was determined. Although this two-phase procedure might select a locally optimum structure, since the model structure was only optimized at each phase, it was a realistic solution for selecting a model structure in a practical computation time. The ML framework examined word accuracies for several combinations of state cluster and mixture size and obtained good accuracies, as summarized in Table 3.7. Comparative results corresponding to Table 3.7 are shown on a contour map.

From this experiment, the contour map divides the word accuracy into black, gray and white areas, scoring less than 90 points, from 90 to 91.5 points, and more than 91.5 points, respectively. Models that score more than 90.0 points would be regarded as sufficient for practical use in this task. The selected VBEC model (5,675 states and 9 Gaussians) scored 91.1 points, and this score was within the high performance area (more than 90.0 points). Even if we include the selected model structure and the other model structures that scored within the 5-best F m values in Figure 3.12, the average accuracy was 90.5, i.e., it was also within the high performance area. These results confirmed that the model structures with high F m values provided high levels of performance compared with the ML results.

However, the model structure that VBEC selected (5,675 states and 9 Gaussians per state) could not match the best manually obtained model (2,000 states and 30 Gaussians per state), and the VBEC model could not match the highest accuracy (92.0). The main reason for this inferiority was the two-phase model selection procedure, as shown in Table 3.7. In this case, in the first phase, the selected model was appropriate for a single Gaussian per state, but not for multiple Gaussians per state, and it had too many states, which caused the degradation in performance. Therefore, triphone HMM state clustering with multiple Gaussians is required to select the optimum model structure. Consequently, although the word accuracy did not reach the highest value obtained with manual tuning, the automatically selected VBEC model could provide a satisfactory performance for practical use for the LVCSR task. This suggests the successful construction of a Bayesian acoustic model, which includes model setting, training, and selection by using VBEC.
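For reference, the two-phase procedure discussed in this section can be written as a short search skeleton. The train_and_score() function below is a hypothetical stand-in for building a model and evaluating the VBEC objective function; the toy objective only mimics the local-optimum behaviour described above.

```python
# A schematic sketch of the two-phase selection procedure; not the thesis's actual
# training code.  train_and_score() is a placeholder for model construction plus
# evaluation of the objective function.

def two_phase_search(state_candidates, mixture_candidates, train_and_score):
    # Phase 1: cluster triphone HMM states with a single Gaussian per state and
    # keep the number of clustered states that maximizes the objective function.
    best_states = max(state_candidates,
                      key=lambda n_states: train_and_score(n_states, n_mix=1))
    # Phase 2: fix the clustered-state structure and select the number of
    # mixture components per state in the same way.
    best_mix = max(mixture_candidates,
                   key=lambda n_mix: train_and_score(best_states, n_mix))
    return best_states, best_mix

if __name__ == "__main__":
    def toy_score(n_states, n_mix):
        # toy objective: prefers roughly 13,500 Gaussians in total; for single
        # Gaussians this pushes the state count up, mimicking the local-optimum
        # behaviour discussed in the text
        return -abs(n_states * n_mix - 13500) / 1000.0

    states, mix = two_phase_search([500, 1000, 1500, 3000, 6000],
                                   [1, 3, 6, 9, 12, 20], toy_score)
    print(states, mix)   # phase 1 favors many states, phase 2 compensates with few components
```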

3.6 Summary
This chapter introduced the implementation of VBEC for acoustic model construction. VBEC includes prior utilization and model selection, which can automatically select an appropriate model structure in clustered-state HMMs and in GMMs according to the VBEC objective function, with any amount of training data. In particular, when the amounts of training data were small, the VBEC model significantly outperformed the ML-BIC/MDL model, and as the amount of training data increased, the performance of VBEC and ML-BIC/MDL converged. This superiority of VBEC is based on the prior utilization advantage over the BIC/MDL criterion. Furthermore, VBEC could determine appropriate sizes for Gaussian mixture models, as well as the clustered triphone HMM structure. Thus, VBEC mitigates the effect of over-training, which may occur when the amount of training data per Gaussian is small, by utilizing prior information and by selecting an appropriate model structure.

There is room for further improvement in selecting the model structure. The two-phase model selection procedure employed in the experiments, i.e., first clustering triphone HMM states and then determining the number of mixture components per state, could only locally optimize the model structure. The VBEC performance is expected to improve if the selection procedure is improved so as to globally optimize the model structure. Therefore, the next chapter considers such a model optimization for acoustic models.

Chapter 4 Determination of acoustic model topology


4.1 Introduction
The acoustic model has a very complicated structure: a category is expressed by a set of clustered-state triphone Hidden Markov Models (HMMs) that possess an output distribution represented by a Gaussian Mixture Model (GMM). Therefore, only experts who understand this complicated model structure (model topology) well can design the models. Although certain algorithms have been proposed for dealing with the model topology [38-41, 56], these algorithms require heuristic tuning since they are based on the Maximum Likelihood (ML) criterion, which cannot determine the model topology because ML increases monotonically as the number of model parameters increases. If we are to eliminate the need for heuristic tuning we must find a way to determine the acoustic model topology automatically. Some partially successful approaches to the automatic determination of the acoustic model topology have been reported that use such information criteria as the Minimum Description Length or Bayesian Information Criterion (BIC/MDL). However, the BIC/MDL approach cannot theoretically determine the total acoustic model topology since the acoustic model includes latent variables. Therefore, these approaches only determine the model by simplifying the model topology determination to a single Gaussian clustering under the constraint that the acoustic model has no latent variables [21, 45-47].

VBEC can theoretically determine a complicated model topology by using the VBEC objective function, which corresponds to the VB posterior for a model structure, even when latent variables are included. In the previous chapter, automatic determination using VBEC was confirmed by clustering triphone HMM states with a single Gaussian model, and then determining the number of components per state while fixing the clustered-state topology, where the latent variables exist, as shown by the dashed lines and boxes in Figure 4.1. This procedure is called the VBEC 2-phase search, or simply the 2-phase search, in this chapter. Although this procedure is capable of determining the model topology within a practical computation time, the determined topology is only locally optimized at each phase and the obtained performance is not the best. The goal of this chapter is to reach an optimum topology area without falling into a local optimum area by using VBEC. Two characteristics of the acoustic model topology are utilized to reach this goal.

Figure 4.1: Distributional sketch of the acoustic model topology (number of components per state vs. number of clustered states).

The first characteristic is that the appropriate model would be distributed in a band where the total number of model parameters (approximately the total number of Gaussians) is almost constant because the amount of data is fixed, as shown by the inversely proportional band in Figure 4.1. The second characteristic is that an optimum model topology area would lie in the band and be nearly unimodal, as shown in Figure 4.1. These characteristics of the acoustic model were experimentally confirmed in [31] by using an isolated word speech recognition task. Therefore, by constructing a number of acoustic models in the band, and then selecting the most appropriate of the in-band models, namely the one that maximizes the VBEC objective function, we can determine an optimum model topology efficiently. This search algorithm is called the in-band model search. To obtain in-band models, GMM-based HMM state clustering is employed using the phonetic decision tree method. Although the construction of the GMM-based decision tree is also automatically determined within the original VBEC framework, as described in Chapter 2, the construction requires an unrealistic number of computations because the VBEC objective function is obtained by a VB iterative calculation using all frames of data for each clustering. To reduce the number of computations to a practical level, this chapter proposes new approaches for realizing the GMM-based decision tree method within a VBEC framework by utilizing monophone HMM state statistics as priors.

4.2 Determination of acoustic model topology using VBEC


4.2.1 Strategy for reaching optimum model topology
This section describes how to determine the acoustic model topology automatically by using VBEC. In acoustic modeling, the specifications of the model topology are often represented by the number of clustered states and the number of GMM components per state, as shown in Figure 4.2.

Figure 4.2: Optimum model search for an acoustic model.

Then, the good models that provide good performance would be distributed in the inverse-proportion band where the total number of distribution parameters (approximately equal to the total number of Gaussians) is constant because the amount of data is fixed. Moreover, there would be a unimodal optimum area in the band where the model topologies are represented by an appropriate number of pairs of clustered states and components, as shown in Figure 4.2. In order to reach an optimum model topology, these two characteristics of the acoustic model, the inverse-proportion band and the unimodality, are utilized. By preparing a number of acoustic models in the band, and by choosing the model that has the best VBEC objective function score, an optimum model topology can be determined, as shown in Figure 4.2 (in-band model search).

There are at least two conceivable approaches for constructing in-band models: one involves increasing the number of mixture components from single Gaussian based triphone models, and the other involves increasing the number of clustered triphone HMM states from GMM based monophone models. The former is obviously an extension of the 2-phase search described in Chapter 3 and Section 4.1. Whereas the original 2-phase search determines only the one state clustering topology that has the best VBEC objective function (F m) score in its first phase, the extended approach retains a number of single Gaussian based triphone models as candidates for a globally optimum topology. Then, the number of mixture components is increased for each candidate, so that each of the triphone models reaches the inverse-proportion band at the best F m (state-first approach). This produces a number of in-band models with several topologies (see the dashed arrows in Figure 4.2). The latter approach proceeds in a way that is symmetrical with respect to the state-first approach. Namely, it prepares a number of GMM based monophone models as candidates with several numbers of mixture components in its first phase. Then, the number of clustered states is increased for each candidate, so that each of the triphone models reaches the inverse-proportion band at the best F m (mixture-first approach). This can also produce

a number of in-band models (see the solid arrows in Figure 4.2) in the same way as the state-first approach. Here, we note the potential of the mixture-first approach for obtaining accurate state clusters, which comes from a precise representation of the output distributions not by single Gaussians but by GMMs during clustering, i.e., GMM based state clustering. Even if two acoustic models produced separately by the state-first approach and the mixture-first approach have the same quantitative specifications as regards the numbers of states and mixture components, the triphone HMM states might be clustered differently by the two approaches due to the difference in the representation of the output distributions. In general, the mixture-first approach is advantageous for accurate clustering because the precise representation of the output distribution is expected to achieve proper decisions in the clustering process [56]. Accordingly, this chapter employs the mixture-first approach to construct the in-band models. The original VBEC framework, as formulated in Chapter 2, already involves the theoretical realization of GMM based state clustering. However, the straightforward implementation of this realization requires an impractical computation time. Therefore, a key technique for the construction of accurate in-band models involves reducing the computation time needed for GMM based state clustering to a practical level.
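The in-band model search itself reduces to a simple selection once the candidate topologies and their exact objective scores are available. A minimal sketch, assuming the approximate band limits quoted later in the text (roughly 10^4 to 10^5 Gaussians in total) and placeholder scores:

```python
# A minimal sketch of the in-band model search.  Each candidate topology is described
# by its number of clustered states, number of components per state, and an already
# computed exact objective score.  Band limits and scores are illustrative placeholders.

def in_band_search(candidates, band_low=1e4, band_high=1e5):
    """candidates: list of (n_states, n_components, objective_score) tuples."""
    in_band = [c for c in candidates if band_low <= c[0] * c[1] <= band_high]
    if not in_band:
        raise ValueError("no candidate falls inside the band")
    # the unimodality assumption lets us simply take the best-scoring in-band model
    return max(in_band, key=lambda c: c[2])

if __name__ == "__main__":
    candidates = [
        (7790, 1, -5.20),    # too few Gaussians in total: outside the band
        (878, 40, -4.95),
        (912, 30, -4.97),
        (2000, 16, -4.99),
        (129, 1000, -5.50),  # hypothetical point far outside the band on the other side
    ]
    print(in_band_search(candidates))
```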

4.2.2 HMM state clustering based on Gaussian mixture model


This subsection first describes the phonetic decision tree method for clustering context-dependent HMM states, as explained in Section 3.3. The success of the phonetic decision tree is due largely to the efficient binary tree search under constraints of phonetic-knowledge based questions, and to the following three additional constraints:

(C1) Data alignments for each state are fixed during the splitting process.
(C2) The output distribution in a state is represented by a normal distribution.
(C3) The contribution of the state transition to the objective functions is disregarded.

These constraints were already introduced in Section 3.3. The two constraints (C1) and (C2) play a role in eliminating the latent variables involved in an acoustic model. Therefore, under these constraints, the 0-th, 1st and 2nd order statistics of node n (O^n, M^n and V^n, respectively) are computed simply by summing up the sufficient statistics of each state j (O_j, M_j and V_j), as shown in Eq. (3.7), where j represents a non-clustered triphone HMM state included in node n. Therefore, once we have prepared the statistics O_j, M_j and V_j for all possible triphone HMM states by using Eq. (3.8), we can easily calculate the statistics for each node, and this reduces the computation time to a practical level.

In contrast, the goal of GMM based state clustering is to obtain even more accurate state clusters at the expense of losing the benefits of constraint (C2). As a result, the splitting process has to proceed with the latent variables that remain in the model. Consequently, we require the GMM statistics of each component k, O_k^n, M_k^n and V_k^n, which are obtained by the VB iterative calculation described in Sections 2.3.2 and 2.3.3, to calculate the VB posteriors and the gain of the VBEC objective function ΔF^{Q(n)} in order to examine all possible combinations of node n and question Q. It is inevitable that the overall computation time needed to construct phonetic decision trees will become huge and impractical.
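The cheapness of the single-Gaussian case comes entirely from this summation of precomputed state statistics. A small sketch, with hypothetical state names and random toy frames, of how node statistics are pooled under constraints (C1) and (C2):

```python
import numpy as np

# A toy sketch of Eq. (3.7)-style pooling: once the 0th-, 1st- and 2nd-order sufficient
# statistics (O_j, M_j, V_j) of every non-clustered triphone state j are available, the
# statistics of any node n are obtained by summation, without touching frame-level data
# again.  State names and numbers are placeholders.

def node_statistics(states, members):
    """states: {j: (O_j, M_j, V_j)}; members: iterable of state ids in node n."""
    O_n = sum(states[j][0] for j in members)
    M_n = sum(states[j][1] for j in members)
    V_n = sum(states[j][2] for j in members)
    return O_n, M_n, V_n

if __name__ == "__main__":
    D = 3  # toy feature dimension
    rng = np.random.default_rng(0)
    states = {}
    for j in ("a-k+a", "i-k+a", "o-k+a"):
        X = rng.normal(size=(50, D))            # frames aligned to state j (constraint C1)
        states[j] = (len(X), X.sum(axis=0),
                     (X[:, :, None] * X[:, None, :]).sum(axis=0))
    O_n, M_n, V_n = node_statistics(states, ["a-k+a", "i-k+a"])
    print(O_n, M_n.shape, V_n.shape)
```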

Figure 4.3: Estimation of inheritable GMM statistics during the splitting process (a parent node is split into yes/no child nodes by a phonetic question).

4.2.3 Estimation of inheritable node statistics


In order to avoid the VB iterative calculation for the GMM statistics O_k^n, M_k^n and V_k^n during the splitting process, the component ratio of the sufficient statistics is assumed to be conserved across nodes. That is to say, the ratio of the 0-th order statistics O_k^n for a node n is related to the ratios of the 0-th order statistics of its yes-node n^Y_Q and no-node n^N_Q for a question Q as follows:

\frac{O_k^{n^Y_Q}}{\sum_{k'} O_{k'}^{n^Y_Q}} = \frac{O_k^{n^N_Q}}{\sum_{k'} O_{k'}^{n^N_Q}} = \frac{O_k^{n}}{\sum_{k'} O_{k'}^{n}} .    (4.1)

Employing the relation (4.1) successively for the upper nodes in a phonetic tree, this assumption implies that the ratio at each node is equivalent to the ratio O_k^r / \sum_{k'} O_{k'}^r (\equiv w_k^r, a weighting factor) of the root node statistics in the tree (i.e., the ratio of the monophone HMM state statistics):

\frac{O_k^{n}}{\sum_{k'} O_{k'}^{n}} = \frac{O_k^{r}}{\sum_{k'} O_{k'}^{r}} \equiv w_k^r \quad \text{for any node } n,    (4.2)

where the suffix r indicates the root node. Therefore, from Eq. (4.2), the 0-th order statistics O_k^n of component k in node n are estimated as follows:

O_k^n = w_k^r \sum_{k'} O_{k'}^n = w_k^r O^n .    (4.3)

This approach is based on the knowledge that the monophone state statistics are phonetically similar to the clustered state statistics. Similarly, the 1st and 2nd order statistics of component k in node n are estimated as follows:

M_k^n = w_k^r M^n , \qquad V_k^n = w_k^r V^n .    (4.4)

Thus, we can estimate the GMM statistics of each node, O_k^n, M_k^n and V_k^n, without the VB iterative calculation, using only the component ratio w_k^r of the monophone statistics. We call this the estimation of inheritable node statistics, because the ratio w_k^r is passed from a parent node to its child nodes, as shown in Figure 4.3. Consequently, the VB posteriors and the VBEC objective function can also be calculated without a VB iterative calculation during the splitting process. The concrete form of the VB posterior parameters is derived by substituting Eqs. (4.3) and (4.4) into the general solution Eq. (2.30), as follows:

\tilde{\Phi}^n_k :
\begin{cases}
\tilde{\phi}^n_k = \phi^0 + w^r_k O^n \\
\tilde{\xi}^n_k = \xi^0 + w^r_k O^n \\
\tilde{\eta}^n_k = \eta^0 + w^r_k O^n \\
\tilde{\nu}^n_k = \dfrac{\xi^0 \nu^{n,0}_k + w^r_k M^n}{\xi^0 + w^r_k O^n} \\
\tilde{R}^n_k = \operatorname{diag}\!\left[ R^{n,0}_k + V^n - \dfrac{M^n (M^n)^{\top}}{O^n} + \dfrac{\xi^0 O^n}{\xi^0 + O^n}\left( \dfrac{M^n}{O^n} - \nu^{n,0}_k \right)\left( \dfrac{M^n}{O^n} - \nu^{n,0}_k \right)^{\top} \right]
\end{cases}    (4.5)

This equation omits the state transition part because the contribution of the state transition to the objective functions is disregarded (condition (C3)). For a similar reason, the state suffix j is dropped. In addition, this chapter assumes \phi^0_k, \xi^0_k and \eta^0_k to be constants for any k, i.e., \{\phi^0_k, \xi^0_k, \eta^0_k\} \to \{\phi^0, \xi^0, \eta^0\}.

Based on the above VB posteriors \tilde{\Phi}^n \equiv \{\tilde{\Phi}^n_k \,|\, k = 1, ..., L\} and the constraints (C1), (C2) and (C3), \Delta F^{Q(n)} is derived from Eqs. (2.39), (2.41) and (2.42) as follows:

\Delta F^{Q(n)} = f(\tilde{\Phi}^{n^Y_Q}) + f(\tilde{\Phi}^{n^N_Q}) - f(\tilde{\Phi}^{n}) - f(\Phi^{n,0}) - \sum_k w^r_k \log w^r_k ,    (4.6)

where

f(\Phi) \equiv \log \Gamma\Bigl(\sum_k \phi_k\Bigr) + \sum_k \Bigl[\, \log \Gamma(\phi_k) - \frac{D}{2} \log \xi_k - \frac{\eta_k}{2} \log |R_k| + D \log \Gamma\Bigl(\frac{\eta_k}{2}\Bigr) \Bigr] .    (4.7)

To calculate \Delta F^{Q(n)}, we must estimate w_k^r and set the prior parameters \Phi^0 appropriately.
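The inheritance step of Eqs. (4.2)-(4.4) is only a rescaling of the node totals by the root-node component ratios. A minimal numerical sketch, assuming the ratios w_k^r are already available (toy values):

```python
import numpy as np

# A sketch of the inheritable-statistics estimation: the component ratios w_k^r taken
# from the root (monophone) state are reused at every node, so the per-component GMM
# statistics of a node follow from its single-Gaussian totals (O^n, M^n, V^n) without
# any VB iteration.  All numbers here are placeholders.

def inherit_statistics(w_root, O_n, M_n, V_n):
    """w_root: array of root-node component weights w_k^r summing to one."""
    O_k = w_root * O_n                     # Eq. (4.3)
    M_k = w_root[:, None] * M_n            # Eq. (4.4), 1st-order statistics
    V_k = w_root[:, None, None] * V_n      # Eq. (4.4), 2nd-order statistics
    return O_k, M_k, V_k

if __name__ == "__main__":
    w_root = np.array([0.5, 0.3, 0.2])     # 3-component ratio at the monophone root
    O_n = 120.0
    M_n = np.array([10.0, -4.0])
    V_n = np.array([[30.0, 1.0], [1.0, 25.0]])
    O_k, M_k, V_k = inherit_statistics(w_root, O_n, M_n, V_n)
    print(O_k)             # component occupancies, summing back to O_n
    print(M_k.shape, V_k.shape)
```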

4.2.4 Monophone HMM statistics estimation


MMIXTURE
In this section, the weighting factor w_k^r of component k in the root node r is estimated from monophone HMM statistics. w_k^r is needed to estimate the inheritable GMM statistics of each node and to calculate the gain of the VBEC objective function \Delta F^{Q(n)}. The GMM statistics of the monophone HMM,

O_k^r, M_k^r and V_k^r, can be obtained by the VB iterative calculation without much computation, as follows:

O_k^r = \sum_{e,t} \gamma_{e,t}^{\,j=r,k} , \qquad
M_k^r = \sum_{e,t} \gamma_{e,t}^{\,j=r,k}\, O_e^t , \qquad
V_k^r = \sum_{e,t} \gamma_{e,t}^{\,j=r,k}\, O_e^t (O_e^t)^{\top} ,    (4.8)

where \gamma_{e,t}^{\,j,k} is an occupation probability obtained by the forward-backward or Viterbi algorithm within the VB or ML framework. Therefore, w_k^r is estimated by O_k^r / \sum_{k'} O_{k'}^r. Moreover, M_k^r and V_k^r are used to set the prior parameters \nu_k^{n,0} and R_k^{n,0}. Then, w_k^r, \nu_k^{n,0} and R_k^{n,0} are represented by O_k^r, M_k^r and V_k^r as follows:

w_k^r = \frac{O_k^r}{\sum_{k'} O_{k'}^r} , \qquad
\nu_k^{n,0} = \frac{M_k^r}{O_k^r} , \qquad
R_k^{n,0} = \eta^0 \operatorname{diag}\!\left( \frac{V_k^r}{O_k^r} - \nu_k^{n,0} (\nu_k^{n,0})^{\top} \right) .    (4.9)

This approach utilizes the Gaussian mixture statistics of the monophone HMM to obtain O_k^r, M_k^r and V_k^r, so it is called MMIXTURE.

MSINGLE
This section introduces another approach for obtaining w_k^r, \nu_k^{n,0} and R_k^{n,0} more easily, which was first proposed in [31]. This approach assumes that w_k^r is the same for all the components in an L-component GMM, and is represented by w_k^r = 1/L, instead of calculating the GMM statistics of the monophone HMM. In addition, single Gaussian statistics of the monophone HMM are employed to set \nu_k^{n,0} and R_k^{n,0}. These are easily computed by summing up the sufficient statistics O_j, M_j and V_j over all triphone HMM states. Then, w_k^r, \nu_k^{n,0} and R_k^{n,0} are represented as follows:

w_k^r = \frac{1}{L} , \qquad
\nu_k^{n,0} = \frac{\sum_{j \in r} M_j}{\sum_{j \in r} O_j} , \qquad
R_k^{n,0} = \eta^0 \operatorname{diag}\!\left( \frac{\sum_{j \in r} V_j}{\sum_{j \in r} O_j} - \nu_k^{n,0} (\nu_k^{n,0})^{\top} \right) .    (4.10)

This approach utilizes only single Gaussian statistics of monophone HMM, so this approach is called MSINGLE. It is easier to realize MSINGLE than MMIXTURE because the former does not require the preparation of the GMM statistics of monophone HMM. However, because of the rough estimation, the MSINGLE approach would be less accurate than MMIXTURE. Thus, we can construct a number of in-band model topologies by using MMIXTURE or MSINGLE, i.e., we realize the solid arrows seen in Figure 4.2. Finally, in order to determine an appropriate model from the in-band models (in-band model search), the exact VBEC objective function is calculated by dropping the constraints (C1), (C2), and (C3) in Section 4.2.2 and the inheritable statistics assumption in Section 4.2.3. Then, the calculation is performed by VB iteration as described in Sections 2.3 and 3.2, unlike the non-iterative approximation described in Eqs. (4.6) and (4.7).
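As a rough illustration of the difference between the two settings, the following sketch computes the weighting factors and Gaussian prior parameters from toy monophone statistics. The placement of η^0 follows the reconstructed Eqs. (4.9) and (4.10) above, and all numbers are placeholders.

```python
import numpy as np

# MMIXTURE uses per-component monophone statistics (O_k^r, M_k^r, V_k^r); MSINGLE uses
# only the pooled single-Gaussian statistics of the monophone state.  eta0 stands for
# the prior hyper-parameter written eta^0 in the text; toy values throughout.

def mmixture_priors(O_r, M_r, V_r, eta0=0.1):
    w = O_r / O_r.sum()                                     # w_k^r
    nu0 = M_r / O_r[:, None]                                # per-component prior means
    R0 = np.stack([eta0 * np.diag(np.diag(V_r[k] / O_r[k] - np.outer(nu0[k], nu0[k])))
                   for k in range(len(O_r))])
    return w, nu0, R0

def msingle_priors(O_j, M_j, V_j, L, eta0=0.1):
    w = np.full(L, 1.0 / L)                                 # equal weights, w_k^r = 1/L
    nu0 = M_j.sum(axis=0) / O_j.sum()                       # pooled prior mean
    R0 = eta0 * np.diag(np.diag(V_j.sum(axis=0) / O_j.sum() - np.outer(nu0, nu0)))
    return w, nu0, R0

if __name__ == "__main__":
    L, D = 3, 2
    # per-component monophone statistics for MMIXTURE (toy values)
    O_r = np.array([60.0, 30.0, 10.0])
    M_r = np.array([[6.0, 3.0], [3.0, -1.5], [1.0, 0.5]])
    V_r = np.stack([np.eye(D) * O_r[k] + np.outer(M_r[k], M_r[k]) / O_r[k] for k in range(L)])
    print(mmixture_priors(O_r, M_r, V_r)[0])
    # pooled single-Gaussian statistics of three triphone states for MSINGLE
    O_j = np.array([40.0, 50.0, 10.0])
    M_j = np.array([[4.0, 2.0], [5.0, -1.0], [1.0, 0.0]])
    V_j = np.stack([np.eye(D) * O_j[k] for k in range(L)])
    print(msingle_priors(O_j, M_j, V_j, L)[0])
```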

Figure 4.4: Model evaluation test using Test1 (a) and Test2 (b). The contour maps denote word accuracy distributions for the total number of clustered states and the number of components per state. The horizontal and vertical axes are scaled logarithmically.

As regards the setting of the prior parameters, we follow the statistics-based setting for \nu^0 and R^0 as described in this section, and the constant-value setting for the remaining parameters. Both settings are widely used in the Bayesian community (e.g., [15, 24-26]), and their effectiveness has already been confirmed by the speech recognition experiments described in Chapter 3.

4.3 Preliminary experiments


4.3.1 Maximum likelihood manual construction
The effectiveness of the proposed method is proved in Section 4.4. First, this section describes preliminary experiments that were conducted to show how the performance is distributed over the numbers of states and components per state, in order to confirm the two acoustic model characteristics, the inverse-proportion band and the unimodality, described in Section 4.2.1. The recognition performance of conventional ML-based acoustic models with manually varied model topologies is also examined to provide baselines with which to compare the performance of the automatically determined model topology. Several topologies were produced under manually varied conditions as regards the number of states, i.e., the sizes of the phonetic decision trees, and the number of components per state. A total of 10 (1, 4, 8, 12, 16, 20, 30, 40, 50 and 100 components) x 8 (129¹, 500, 1,000, 2,000, 3,000, 4,000, 6,000, 8,000 clustered states) = 80 acoustic models was obtained.

¹ The number of monophone HMM states.

In order to examine topologies with widely distributed specifications, the numbers of clustered states and components were arranged at irregular intervals, i.e., wider intervals between the larger numbers. Then, the examined points along the inversely proportional band were located at nearly regular intervals, so that the search for an appropriate model topology would be carried out evenly over the band. The experimental conditions are summarized in Table 4.1. The training data consisted of about 20,000 Japanese sentences (34 hours) spoken by 30 males, and two test sets (Test1 and Test2) were prepared from the Japanese Newspaper Article Sentences (JNAS) corpus². One was used as a performance criterion for obtaining an appropriate model from the various models (model evaluation test) and the other was used to measure the performance of the obtained model (performance test). By exchanging the roles of the two test sets, two sets of results were obtained, and these were utilized to support the certainty of the discussion. The test sets each consisted of 100 Japanese sentences spoken by 10 males and taken from JNAS (a total of 1,898 and 1,897 words, respectively), as shown in Table 4.1.

Figures 4.4 (a) and (b), respectively, are contour maps that show the results of the model evaluation tests for the examined word accuracy (WACC) obtained using Test1 and Test2, where the horizontal and vertical axes are scaled logarithmically. We can see a high performance area along a negative-slope band in both maps (an inversely proportional band in linear-scale maps). The band satisfied the relationship whereby the product of the numbers of states and components per state ranged approximately from 10^4 to 10^5. Therefore, the results confirmed the first acoustic model characteristic, namely that a high performance area was distributed in the inverse-proportion band where the total number of distribution parameters (approximately equal to the total number of Gaussians) is constant.

Next, we focus on the other characteristic, namely the unimodality in the band. The top scores were 91.1 and 91.6 WACC for the model evaluation test using Test1 and Test2, respectively, where the numbers of states and components per state were {1,000, 30} and {2,000, 40}, respectively ({ , } denotes the model topology by {the number of clustered states, the number of components per state}). From Figure 4.4, we can see that high performance areas were distributed across the regions around the top-scoring topologies, and the unimodality of the performance distributions was confirmed experimentally. Thus, the two characteristics of the acoustic model were confirmed, which indicates the feasibility of the proposed in-band model search. Finally, a performance test was undertaken, and 91.0 and 91.4 WACC were obtained by recognizing the Test1 and Test2 data using the Test2-obtained model {2,000, 40} and the Test1-obtained model {1,000, 30}, respectively. Since both Test1 and Test2 scored more than 91.0 points when using the manually obtained models, our goal is to reach a score of more than 91.0 points when using automatically determined models.

² Although this task is almost the same as the LVCSR task described in Chapter 3, the test sets are different, which makes the experimental results slightly different, e.g., recognition performance.
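The band characteristic observed above can be checked mechanically. The sketch below enumerates the 10 x 8 grid of topologies used in these experiments and flags those whose total number of Gaussians falls in the approximate 10^4 to 10^5 range; it is an illustration, not part of the experimental software.

```python
# Enumerate the manually built topologies and mark the in-band ones.  The grid matches
# the experimental setup; the band limits are the approximate values quoted in the text.

components = [1, 4, 8, 12, 16, 20, 30, 40, 50, 100]
states = [129, 500, 1000, 2000, 3000, 4000, 6000, 8000]

in_band = [(s, g) for s in states for g in components if 1e4 <= s * g <= 1e5]
print(f"{len(in_band)} of {len(states) * len(components)} topologies lie in the band")
for s, g in in_band[:5]:
    print(f"  {{{s}, {g}}} -> {s * g} Gaussians in total")
```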


Table 4.1: Experimental conditions for LVCSR (read speech) task
Sampling rate                 16 kHz
Quantization                  16-bit
Feature vector                12-order MFCC + Δ MFCC + Energy + Δ Energy (26 dimensions)
Window                        Hamming
Frame size/shift              25/10 ms
Number of HMM states          3 (left to right)
Number of phoneme categories  43
Training data                 JNAS: 20,000 utterances, 34 hours (122 males)
Test1 data                    JNAS: 100 utterances, 1,898 words (10 males)
Test2 data                    JNAS: 100 utterances, 1,897 words (10 males)
Language model                Standard trigram (10 years of Japanese newspapers)
Vocabulary size               20,000
Perplexity                    Test1: 94.8 (OOV rate: 2.1 %), Test2: 96.3 (OOV rate: 2.4 %)
JNAS: Japanese Newspaper Article Sentences, OOV: Out Of Vocabulary words

4.3.2 VBEC automatic construction based on 2-phase search


This subsection demonstrates another automatic determination method, which does not require a model evaluation test using test data. The method is based on the conventional 2-phase search, which is the same procedure as that demonstrated in Section 3.5.5. The 2-phase search was proposed as a way to determine acoustic models within a practical computation time by clustering triphone HMM states with a single Gaussian model, and then determining the number of components per state, as described in Chapter 3. The prior parameters \nu^0 and R^0/\eta^0 were set from the single Gaussian statistics of the monophone HMM states, because the monophone HMM states have a sufficient amount of data and we can estimate their statistics accurately without worrying about over-training. For the other parameters, the conventional Bayesian settings of 2.0 for the two Dirichlet hyperparameters (state transition probabilities and mixture weights) and \xi^0 = \eta^0 = 0.1 for the Normal-Gamma distribution were employed. It is noted that the acoustic model construction is not particularly sensitive to these four hyperparameter values in terms of performance, as discussed in Section 3.5.2. These prior parameter settings were used for all the VBEC experiments, and finally, Section 4.4.3 examines the dependence on the prior parameter settings. In addition, here and below in the VBEC framework, conventional ML-based decoding was employed for recognition rather than the Bayesian Predictive Classification (BPC) based decoding of VBEC, VB-BPC. This made it possible to evaluate the pure effect of the automatic determination of the model topology using the proposed determination technique.

The triphone states were automatically clustered, yielding 7,790 clustered states, which had the best VBEC objective function F m in the 1st-phase procedure. Then, 10 acoustic models were made by using VBEC, where each model had a fixed clustered-state topology and 1, 4, 8, 12, 16, 20, 30, 40, 50 and 100 components. The acoustic model was finally determined automatically as

{7,790, 8}, which had the best VBEC objective function F m in the 2nd-phase procedure. The dashed lines in Figure 4.4 overlap the same model topology determined by the 2-phase search onto the two ML contour maps. The determined model topology {7,790, 8} scored 88.9 points in Test1 and 89.2 points in Test2. The model topology {7,790, 8} determined by the 2-phase search could not match the best manually obtained models ({# S, # G} = {1,000, 30}, {2,000, 40}), and the VBEC model could not reach the performance goal obtained with the ML-manual approach (91.0 points). The main reason for this inferiority was the local optimality of the model topology in the 2-phase search. That is to say, in the first phase of state clustering, the selected model was appropriate for a single Gaussian per state, but had too many states for multiple Gaussians per state. In short, the model topology determined by the 2-phase search was a local optimum, and this caused the degradation in performance. Optimum performance cannot be achieved by the conventional 2-phase search procedure, in agreement with the results described in Section 3.5.5.

4.4 Experiments
This section describes experiments conducted to prove the effectiveness of the proposals. There are three subsections. First, Section 4.4.1 confirms the automatic determination of the model topology using the proposed approach and compares its performance with conventional approaches in large vocabulary continuous speech recognition (LVCSR) tasks. Section 4.4.2 describes experiments designed to compare the computation time needed for the proposed approach, an ML-manual approach, a 2-phase search and the straightforward method of GMM phonetic tree clustering with the VB iterative calculation. Section 4.4.3 examines the prior parameter dependence of the proposals and discusses the difference between the two proposals, MSINGLE and MMIXTURE. The second and third experiments use relatively small isolated word recognition tasks so that the topology search can be examined as thoroughly as in the previous experiments, because the straightforward method of GMM phonetic tree clustering requires a huge computation time (Section 4.4.2) and the examination of prior parameter dependence requires an extra search space for the prior parameter setting (Section 4.4.3).

4.4.1 Determination of acoustic model topology using VBEC


This subsection conducts experiments to prove the effectiveness of the proposed procedure, namely an in-band model search using GMM state clustering, which is determined by VBEC. The experimental conditions were the same as those described in Section 4.3. The prior parameters \nu^0 and R^0/\eta^0 and the weighting factors w_k^r were set in accordance with the MSINGLE and MMIXTURE algorithms described in Section 4.2.4, and the other prior parameters were set in the same way as described in Section 4.3.2. First, the in-band models were constructed using the GMM decision tree clustering proposed in Sections 4.2.2, 4.2.3 and 4.2.4, and it was examined whether the constructed model topologies were in the band. The two proposed clustering algorithms, MSINGLE and MMIXTURE, were each used to produce a set of clustered-state triphone HMMs, which made a total of 10 sets of clustered-state

Figure 4.5: Determined model topologies and their recognition rates (MSINGLE). The horizontal and vertical axes are scaled logarithmically.

Figure 4.6: Determined model topologies and their recognition rates (MMIXTURE). The horizontal and vertical axes are scaled logarithmically.
HMMs (1, 4, 8, 12, 16, 20, 30, 40, 50 and 100 components). The determined model topologies of MSINGLE and MMIXTURE are plotted with crosses in Figures 4.5 and 4.6, respectively, where the plots are overlaid on the contour maps of Figure 4.4. All the determined models were located on the band along a negative-slope line. Therefore, these experiments confirmed that the determined models were located in the band. Moreover, a speech recognition test was undertaken by using the in-band models to measure how well the determined topologies performed. The obtained word accuracies are also plotted in Figures 4.5 and 4.6 for each determined model topology. Almost all the WACC values were more than 90.0 points, which supports the view that the model topologies determined using MSINGLE and MMIXTURE were good. Thus, it was confirmed that each of the model topologies was selected appropriately using MSINGLE and MMIXTURE, because they were located in a band where the product of the numbers of states and components per state was constant, and almost all the WACC values were above 90.0 points. This indicates the validity of the approximations of MSINGLE and MMIXTURE.

The proposed procedure was finalized by selecting, from the in-band models, the one with the highest F m value as an optimum acoustic model. This was accomplished by utilizing the unimodal characteristic, without seeing the results of the model evaluation test using test data. Since VBEC selects the model that gives the highest value of the VBEC objective function, the validity of the model selection can be evaluated by examining the relation between F m and recognition performance. Figure 4.7 shows the VBEC objective function F m values and WACCs for both Test1 and Test2 for MSINGLE along a line connecting the points of the determined topologies in Figure 4.5, where the horizontal axis is the number of components per state, scaled logarithmically. Figure 4.8 shows the same values for MMIXTURE. With both MSINGLE (Figure 4.7) and MMIXTURE (Figure 4.8), WACC and F m behaved similarly, which suggests that the proposed search algorithm worked well. In fact, the VBEC objective function and WACC behaved almost unimodally for both MSINGLE and MMIXTURE. This indicates that VBEC could determine the appropriate in-band model by using the VBEC objective function, which supports the effectiveness of this proposal, i.e., the in-band model search.

Next, experiments were undertaken that focused on the suitability of the finally determined model topology obtained using the in-band model search. From Figure 4.7, MSINGLE determined the model topology {912, 40}, which obtained 91.2 for Test1 and 91.7 for Test2. Similarly, from Figure 4.8, MMIXTURE determined the model topology {878, 40}, and the WACCs obtained using the performance test were 91.4 for Test1 and 91.7 for Test2. They all exceed not only the values of 88.9 and 89.2 obtained by the conventional 2-phase search, but also the performance goal of 91.0 points, and so we can say that MSINGLE and MMIXTURE can provide high levels of performance. As regards the model topology (the number of clustered states and GMM components), the MSINGLE and MMIXTURE models were similar to each other, and matched one of the best manually obtained models, {1,000, 30}, described in Section 4.3.1. In contrast, the MSINGLE and MMIXTURE models were different from the other best manually obtained model, {2,000, 40}. However, the topology of the manually obtained models varies depending on the test set data, and therefore the determined models do not always have to correspond to the manually obtained models. In fact, the performance of MSINGLE and MMIXTURE reached the goal of 91.0
points, and therefore we can also say that MSINGLE and MMIXTURE are capable of determining an optimum model topology. Furthermore, the total numbers of Gaussians in MSINGLE and MMIXTURE were smaller than those obtained with ML-manual, and the determined models were more compact, which could improve the decoding speed. Thus, these experiments proved that the proposed method can automatically determine an optimum acoustic model topology with the highest performance. In these experiments, the two proposed algorithms, MSINGLE and MMIXTURE, determined similar topologies and achieved similar performance levels, which indicates that there was no great difference between the two proposals in this experimental situation. Section 4.4.3 comments on the difference.

Figure 4.7: Word accuracies and objective functions using GMM state clustering (MSINGLE). The horizontal axis is scaled logarithmically.

Figure 4.8: Word accuracies and objective functions using GMM state clustering (MMIXTURE). The horizontal axis is scaled logarithmically.

4.4.2 Computational efficiency


The previous experiments confirmed that MSINGLE and MMIXTURE enabled GMM based state clustering and that the in-band model search could determine an optimum model topology with the highest performance levels. GMM based state clustering can also be realized with the original VBEC framework, as described in Chapter 2, even if there are latent variables, without using MSINGLE or MMIXTURE. However, as mentioned in Section 4.2, the VB iterative calculations require an impractical amount of computation time to obtain the true F m values. The advantage of MSINGLE and MMIXTURE is that they can determine an optimum model topology within a practical computation time. Therefore, to emphasize the advantages of MSINGLE and MMIXTURE compared with the straightforward implementation, we have to consider the computation time needed for constructing the acoustic model. This section examines the effectiveness of MSINGLE and MMIXTURE in comparison with the straightforward VBEC implementation, the VBEC 2-phase search and the ML manual method. Here we compare these approaches not only in terms of the model topology and performance described in the previous section but also in terms of computation time. To consider the experiments in more detail, isolated word recognition tasks were used to test various situations. The experimental conditions are summarized in Table 4.2.

The training data consisted of about 3,000 Japanese sentences (4.1 hours) spoken by 30 males. Two test sets were prepared, as with the previous LVCSR experiments, each consisting of 100 Japanese city names spoken by 25 males (a total of 1,200 words each), as shown in Table 4.2³.

First, the straightforward implementation of VBEC is described, which uses the VB iterative calculation within the original VBEC framework to prepare in-band models. In this experiment, the VBEC iterative method was approximated by fixing the frame-to-state alignments during the splitting process and by using a phonetic decision tree construction, as with MSINGLE and MMIXTURE. Even in this situation, the full version of the iterative algorithm is unrealistic because of the VB iterative calculation in the GMM. So, a restricted version was examined that was implemented as ideally as possible by using brute-force computation. Namely, 45 personal computers with state-of-the-art specifications were used, so that the computation for all the phonetic decision trees could be carried out in parallel (this VBEC iterative method within the original VBEC framework is called VBEC AMP (Acoustic Model Plant) because it is finally realized by such a large number of computers). Moreover, in order to reduce the computation time needed for the iterative calculation, we employed an approximation to reduce the number of decision branches when choosing the appropriate phonetic question. The 10 best questions were derived from 44 questions by applying all the questions to a state splitting with a single Gaussian based state clustering method, which did not require any iterative calculations. Then, the iterative calculations were performed only for the derived 10 best questions. A trial suggested that the questions selected when using the 10 best questions covered about 95 % of those selected when using all the questions, and were sufficient when carrying out iterative calculations for all the GMMs to construct a set of clustered-state triphone HMMs. Finally, an optimum model was also determined from the in-band models using the in-band model search, as with MSINGLE and MMIXTURE.

As with the LVCSR experiments, a total of 10 (1, 5, 10, 15, 20, 25, 30, 35, 40 and 50 components) x 6 (100, 500, 1,000, 1,500, 2,000, 3,000 clustered states) = 60 acoustic models was prepared for the ML manual method, and a total of 10 sets of clustered-state HMMs (1, 5, 10, 15, 20, 25, 30, 35, 40 and 50 components) was prepared for the VBEC automatic methods. The obtained model topology, performance and computation time needed for constructing the acoustic models are listed in Table 4.3, and are discussed in order.

Model topology: The model topologies determined using MSINGLE, MMIXTURE and AMP were almost the same as those obtained using ML-manual. This supports the view that the MSINGLE and MMIXTURE approximations of the VBEC objective function work well in GMM state clustering and can construct an appropriate model topology even when compared with the more exact AMP method.

Recognition rate: The performance of MSINGLE, MMIXTURE and AMP was almost the same. This also supports the validity of the MSINGLE and MMIXTURE approximations of the VBEC objective function. In addition, it was also confirmed that the performance was comparable to the ML-manual performance (MSINGLE, MMIXTURE, AMP and ML-manual all scored more than 97.0 %) and was higher than the conventional 2-phase search performance, even for a different task from that described in Section 4.4.1.
³ Similar to the discussion for the LVCSR task, although this isolated word recognition task is almost the same as that in Chapter 3, the test sets are different, which makes the experimental results slightly different, e.g., recognition performance.

Computation time: MSINGLE and MMIXTURE both took about 30 hours, and, as expected, this was much faster than AMP, which took 1,150 hours to finalize the calculation, even though the amount of training data (4.1 hours) was relatively small. Therefore, we can say that these approaches are very effective ways to construct models because they obtain comparable recognition performance and construct models far more rapidly than AMP. In comparison with the conventional ML-manual approach, MSINGLE and MMIXTURE also took a relatively short computation time. The reason is that these methods do not need an extra search over the number of clustered states (i.e., the number of search combinations was reduced to 1/6, because 6 x 10 search combinations were reduced to 1 x 10). Moreover, in LVCSR, the difference between the proposals (MSINGLE and MMIXTURE) and ML-manual as regards computation time would become larger, because the model evaluation test in ML-manual requires more computation time in LVCSR than in isolated word recognition. Focusing on the difference between MSINGLE and MMIXTURE, we can see that MSINGLE took slightly less time than MMIXTURE. The difference between them resulted from the monophone GMM training required by MMIXTURE. Thus, we can conclude that MSINGLE and MMIXTURE can also determine an optimum model topology while maintaining the highest level of performance even for an isolated word recognition task, and that, as regards computation time, MSINGLE and MMIXTURE can construct acoustic models more rapidly than ML-manual or AMP.
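The question pre-selection used for AMP above can be expressed as a small two-stage search. Both gain functions below are hypothetical stand-ins (a cheap single-Gaussian criterion and an expensive iterative one); the sketch only shows the shortlisting idea, not the real clustering code.

```python
# Rank all questions with the cheap single-Gaussian split gain, then run the expensive
# (iterative) GMM-based evaluation only on the N best questions.

def split_node(node, questions, cheap_gain, exact_gain, n_best=10):
    # rank all questions with the non-iterative single-Gaussian criterion
    ranked = sorted(questions, key=lambda q: cheap_gain(node, q), reverse=True)
    shortlist = ranked[:n_best]
    # run the expensive GMM-based evaluation only on the shortlist
    return max(shortlist, key=lambda q: exact_gain(node, q))

if __name__ == "__main__":
    questions = [f"Q{i}" for i in range(44)]
    cheap = lambda node, q: -abs(int(q[1:]) - 17)         # toy: Q17 looks best cheaply
    exact = lambda node, q: -abs(int(q[1:]) - 20)         # toy: Q20 is truly best
    print(split_node("node-0", questions, cheap, exact))  # Q20 survives the shortlist
```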

Table 4.2: Experimental conditions for isolated word recognition


Sampling rate                 16 kHz
Quantization                  16 bit
Feature vector                12-order MFCC + 12-order Δ MFCC (24 dimensions)
Window                        Hamming
Frame size/shift              25/10 ms
Number of HMM states          3 (left-to-right HMM)
Number of phoneme categories  27
Training data                 ASJ: 3,000 utterances, 4.1 hours (male)
Test data 1                   JEIDA: 100 city names, 1,200 words (male)
Test data 2                   JEIDA: 100 city names, 1,200 words (male)
ASJ: Acoustical Society of Japan, JEIDA: Japan Electronic Industry Development Association

Table 4.3: Comparison of iterative and non-iterative state clustering

                            ML-manual             2-phase search  AMP          MSINGLE      MMIXTURE
Model topology ({#S, #G})   {500, 30}, {500, 35}  {2,642, 5}      {548, 30}    {253, 35}    {224, 35}
Recognition rate (%)        97.1, 97.6            96.3, 94.9      97.3, 98.0   97.6, 98.2   97.8, 97.8
Time (hour)                 244                   56              1,150        30           37
# S: number of clustered states, # G: number of mixture components per state

4.4.3 Prior parameter dependence


A fixed combination of prior parameter values, \xi^0 = \eta^0 = 0.1 for the Normal-Gamma distribution and 2.0 for the Dirichlet distributions, was used throughout the construction of the model topology and the estimation of the posterior distributions. In the small-scale experiments conducted in previous research [24-26], the selection of such values was not a major concern. However, when the scale of the target application is large, the selection of the prior parameter values might affect model quality; namely, the best or better values might differ greatly. Moreover, estimating appropriate prior parameters for acoustic models takes so much time that it is impractical for speech recognition. Therefore, this subsection examines how robustly the acoustic models constructed by the proposed method perform against changes in the prior parameter values. Here, \xi^0 and \eta^0 were examined, which correspond to the prior parameters of the mean and covariance of a Gaussian. This is because the effects of \xi^0 and \eta^0 on speech recognition would be stronger than those of the Dirichlet hyperparameters, which correspond to the prior parameters of the state transition probabilities and weight factors, just as the mean and covariance of a Gaussian matter more for speech recognition than the state transition probabilities and weight factors. The values of the prior parameters \xi^0 and \eta^0 were varied from 0.01 to 10, and the speech recognition rates were examined. Table 4.4 shows the recognition rates for each combination of prior parameters. We can see that almost all the models for the given range of prior parameter values provide high levels of performance (more than 97.0 %).

Here, we found that there was a difference between MSINGLE and MMIXTURE as regards model topology and performance. The model topologies of MSINGLE were inappropriate when \xi^0 = \eta^0 = 1.0, and the performance significantly degraded when \xi^0 = \eta^0 = 10, where the number of clustered states increased greatly. In these cases, the approximate GMM statistics approached the prior statistics of the monophone HMM too closely in Eq. (4.5), which amplified the effect of the rough prior parameter setting in MSINGLE. Therefore, these experiments confirm that MMIXTURE was more robust than MSINGLE to the prior parameters, especially for large prior parameter values, because the prior parameter setting in MMIXTURE using GMM statistics is more appropriate than that in MSINGLE using single Gaussian statistics. Thus, although MMIXTURE is more complicated to use than MSINGLE, because it requires the preparation of the GMM statistics of the monophone HMM states, MMIXTURE is more stable than MSINGLE when the prior parameters are varied.

Table 4.4: Prior parameter dependence

\xi^0 = \eta^0   MSINGLE topology ({#S, #G})   MSINGLE rec. rate (%)   MMIXTURE topology ({#S, #G})   MMIXTURE rec. rate (%)
0.01             {325, 15}                     97.6, 97.5              {308, 15}                      97.5, 97.4
0.05             {278, 25}                     97.1, 98.1              {204, 35}                      97.6, 97.3
0.1              {253, 35}                     97.6, 98.2              {224, 35}                      97.8, 97.8
0.5              {343, 40}                     97.5, 97.4              {373, 25}                      97.8, 97.4
1.0              {962, 25}                     97.1, 97.1              {371, 30}                      98.3, 97.6
10.0             {7,149, 5}                    94.1, 94.3              {257, 40}                      97.8, 97.2

4.5 Summary
This chapter proposed the automatic determination of an optimum topology for an acoustic model by using GMM-based phonetic decision tree clustering and an efficient model search algorithm that utilizes the characteristics of the acoustic model. The proposal was realized by expanding the VBEC model selection function used in the Bayesian acoustic model construction of Chapter 3. Experiments showed that the proposed approach could determine an optimum topology within a practical computation time, and that the performance was comparable to the best recognition performance provided by the conventional maximum likelihood approach with manual tuning. The effectiveness of the proposed methods has also been shown for various tasks, such as a lecture speech recognition task and an English read speech recognition task in [57], as shown in Table 4.5. Thus, by using the proposed method, VBEC can automatically and rapidly determine an acoustic model topology with the highest performance, enabling us to dispense with manual tuning procedures when constructing acoustic models. The next chapter focuses on the last Bayesian advantage, namely robust classification by marginalizing model parameters, obtained by using Bayesian prediction in VBEC.

Table 4.5: Robustness of acoustic model topology determined by VBEC for different speech data sets.
Task                      VBEC   ML-manual
Japanese read             91.7   91.4
Japanese isolated word    97.9   98.1
Japanese lecture          74.5   74.2
English read              91.3   91.3

Chapter 5 Bayesian speech classification


5.1 Introduction
The performance of statistical speech recognition is severely degraded when it encounters previously unseen environments. This is because statistical speech recognition is conventionally based on Maximum Likelihood (ML) approaches, which often over-train model parameters to fit a limited amount of training data (the sparse data problem). On the other hand, a Bayesian framework is expected to deal even with unseen environments since it has the following three important advantages: effective utilization of prior knowledge, appropriate selection of model structure, and robust classification of unseen speech, each of which works to mitigate the effects of the sparse data problem. Recently, we and others proposed Variational Bayesian Estimation and Clustering for speech recognition (VBEC), which includes all the above Bayesian advantages [28]. Previous VBEC studies have mainly examined and proven the effectiveness derived from the first two advantages [28, 32]. This chapter focuses on the third Bayesian advantage, the robust classification of unseen speech.

When we use a conventional classification method based on the ML approach (MLC), we prepare a probability function (for example, f(x; θ) with a set of model parameters θ), which represents the distribution of features for a classification category, e.g., a phoneme category, and point-estimate θ using labeled speech data. However, the parameter θ is often estimated incorrectly because of the sparseness of the training data, which results in a mismatch between the training and input data. Therefore, MLC might be seriously affected by the incorrect estimation when classifying unseen speech data. In contrast, a classification method based on the Bayesian approach does not use the point-estimated value of the parameter, but assumes that the value itself also has a probability distribution represented by a function (for example, g(θ)). Then, by taking the expectation of f(x; θ) with respect to g(θ), we obtain a distribution of x, and can robustly predict the behavior of unseen data in the Bayesian approach; this is the so-called marginalization of model parameters [20]. Some previous studies proposed classification methods based on the predictive distribution (Bayesian Predictive Classification, referred to as BPC) for speech recognition, and proved that they were capable of providing a much more robust classification than MLC [22, 58].
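The effect of marginalization can be illustrated numerically. The sketch below compares a plug-in Gaussian likelihood with a Monte Carlo estimate of the marginalized predictive density for a toy one-dimensional model with known variance and a flat prior on the mean; it is a conceptual illustration, not the classifier used in this thesis.

```python
import numpy as np

# Toy comparison between a plug-in likelihood f(x; mu_hat) and the marginalized
# predictive density obtained by averaging f(x; mu) over a posterior g(mu).

rng = np.random.default_rng(0)
train = rng.normal(loc=1.0, scale=1.0, size=5)        # sparse training data
mu_hat = train.mean()                                  # ML point estimate of the mean

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# posterior over the mean (known unit variance, flat prior): N(mu_hat, 1/N)
post_mu = rng.normal(loc=mu_hat, scale=1.0 / np.sqrt(len(train)), size=20000)

x = 3.5                                                # a "mismatched" test point
plug_in = gauss(x, mu_hat, 1.0)                        # MLC-style likelihood
predictive = gauss(x, post_mu, 1.0).mean()             # Monte Carlo marginalization
print(f"plug-in: {plug_in:.4f}   marginalized: {predictive:.4f}")
```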

A major problem with BPC is how to provide g(θ). In [22, 58], the g(θ) of the mean parameters of f(x; θ) is prepared by assuming that θ is distributed according to a constrained uniform posterior whose location is set by an ML or Maximum A Posteriori (MAP) estimate obtained from the training data, and whose scaling parameter value is determined by setting prior parameters. Then, the predictive distribution has a flat peak shape because of the scaling parameter, so that the distribution can cover a peak where the unseen speech feature might be distributed. Here, the coverage of the predictive distribution depends on the hyper-parameter setting. On the other hand, VBEC provides g(θ) from the training data for all the model parameters of f(x; θ), since the VBEC framework is designed to deal consistently with a posterior distribution of θ, which is a direct realization of g(θ), by variational Bayes (VB posteriors) [25, 26]. As a result, the predictive distribution is analytically derived as the Student's t-distribution. The tail of the Student's t-distribution is wider than that of the Gaussian distribution, which can also cover the distribution of the unseen speech feature. The tail width depends on the training data, i.e., the tail becomes wider as the training data become sparser. Note that, in the VBEC framework, an appropriate coverage by the predictive distribution is automatically determined from the training data, and mitigation of the mismatch between the training and input speech is achieved without setting the hyper-parameters.

VB posteriors based on the VBEC framework are introduced briefly in Section 2.3. Then, in Section 5.2, a generalized formulation is provided for the conventional MLC, MAP and BPC and for VB-BPC, so that they form a family of BPCs. Section 5.3 describes two experiments. The first aims to show the role of VB-BPC in the total Bayesian framework VBEC for the sparse data problem. Namely, we show how BPC using the VB posteriors (VB-BPC) contributes to solving the sparse data problem in association with the other Bayesian advantages provided by VBEC. The second experiment aims to compare the effectiveness of VB-BPC with conventional Bayesian approaches. There, we apply VB-BPC to a supervised speaker adaptation task within a direct parameter adaptation scheme as a practical example of the sparse data problem, and examine the effectiveness of VB-BPC compared with the conventional MAP and BPC approaches [15, 58].
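The widening of the tails can also be seen directly by comparing the two predictive shapes. The sketch below contrasts a plug-in Gaussian with a Student's t-distribution whose degrees of freedom are tied (as a simplifying assumption) to the number of training samples; the parameter values are illustrative only.

```python
import numpy as np
from scipy.stats import norm, t

# Toy comparison of a plug-in Gaussian and a Student's t predictive distribution whose
# degrees of freedom shrink as the training data become sparser, so its tails widen.

mu, sigma = 0.0, 1.0
x = 4.0                                   # a test value far in the tail
for n_train in (5, 50, 500):
    dof = n_train - 1                     # fewer data -> fewer degrees of freedom (assumption)
    scale = sigma * np.sqrt(1.0 + 1.0 / n_train)
    p_gauss = norm.pdf(x, loc=mu, scale=sigma)
    p_t = t.pdf(x, df=dof, loc=mu, scale=scale)
    print(f"N={n_train:4d}  Gaussian: {p_gauss:.2e}  Student-t: {p_t:.2e}")
```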

5.2 Bayesian predictive classification using VBEC


5.2.1 Predictive distribution

This section provides a generalized formulation for the ML/MAP based classification, the uniform distribution based BPC and VB-BPC so that they form a family of BPCs. To focus on the BPC, the model structure m is omitted in this section. As discussed in Section 2.3.4, by calculating the integral in Eq. (2.46), the predictive distribution is obtained and BPC is realized. However, in general, a true posterior distribution $p(\Theta_{ij}^{(c)}|O)$ is difficult to obtain analytically. On the other hand, the numerical approach requires a very long computation time and is impractical for use in speech recognition. Therefore, it is important to find a way to approximate true posteriors appropriately if we are to realize BPC effectively. The following three types of BPCs are categorized according to the method used to approximate the true posteriors: Dirac posteriors, uniform posteriors and VB posteriors.

Dirac posterior based BPC (δBPC)

Conventional ML-based classification is interpreted as an extremely simplified BPC that only utilizes the location parameter to represent a posterior. That is, we consider a Dirac posterior, which shrinks the model parameters so that they have deterministic values provided by ML estimates, $\hat{\Theta}_{ij}^{(c)} \equiv \{\hat{a}_{ij}^{(c)}, \hat{w}_{jk}^{(c)}, \hat{\mu}_{jk}^{(c)}, \hat{\Sigma}_{jk}^{(c)} \mid k = 1, ..., L\}$. Then, the true posterior is approximated as:

$$p(x_t|c,i,j,O) = \int p\left(x_t\middle|\Theta_{ij}^{(c)}\right)\delta\left(\Theta_{ij}^{(c)} - \hat{\Theta}_{ij}^{(c)}\right)d\Theta_{ij}^{(c)}, \qquad (5.1)$$

where $\delta(y - z)$ is a Dirac function defined as $\int g(y)\,\delta(y - z)\,dy = g(z)$. Obviously, the Right Hand Side (RHS) in Eq. (5.1) can be reduced to a common likelihood function:

$$\text{RHS in Eq. (5.1)} = p\left(x_t\middle|\hat{\Theta}_{ij}^{(c)}\right) = \hat{a}_{ij}^{(c)}\sum_k \hat{w}_{jk}^{(c)}\,\mathcal{N}\!\left(x_t\,\middle|\,\hat{\mu}_{jk}^{(c)}, \hat{\Sigma}_{jk}^{(c)}\right). \qquad (5.2)$$

Therefore, from Eq. (5.2), MLC is represented as a mixture of Gaussian distributions.

MAP estimates are available as an alternative to ML estimates. The MAP estimates mitigate the sparse data problem by smoothing the ML estimates with reliable statistics obtained from a sufficient amount of data. In this case, the ML estimates $\hat{\Theta}_{ij}^{(c)}$ in the Dirac function in Eq. (5.1) are replaced with MAP estimates $\check{\Theta}_{ij}^{(c)} \equiv \{\check{a}_{ij}^{(c)}, \check{w}_{jk}^{(c)}, \check{\mu}_{jk}^{(c)}, \check{\Sigma}_{jk}^{(c)} \mid k = 1, ..., L\}$ as follows:

$$p(x_t|c,i,j,O) = \int p\left(x_t\middle|\Theta_{ij}^{(c)}\right)\delta\left(\Theta_{ij}^{(c)} - \check{\Theta}_{ij}^{(c)}\right)d\Theta_{ij}^{(c)}. \qquad (5.3)$$

Therefore, the analytical result of the RHS in Eq. (5.3) is shown as follows:

$$\text{RHS in Eq. (5.3)} = p\left(x_t\middle|\check{\Theta}_{ij}^{(c)}\right) = \check{a}_{ij}^{(c)}\sum_k \check{w}_{jk}^{(c)}\,\mathcal{N}\!\left(x_t\,\middle|\,\check{\mu}_{jk}^{(c)}, \check{\Sigma}_{jk}^{(c)}\right). \qquad (5.4)$$

MAP classification is also represented as a mixture of Gaussian distributions. These classifications using ML and MAP estimates are based on point estimation simply obtained from training data via the Dirac posterior, and therefore, they cannot deal with mismatches between training and testing conditions. They are referred to as δBPC (these methods are often referred to as plug-in approaches, e.g., [20, 22, 58]).
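To make the plug-in nature of Eqs. (5.1)-(5.4) concrete, the following is a minimal sketch, not taken from the thesis, of how a δBPC acoustic score for one state could be computed: the point-estimated (ML or MAP) GMM parameters are simply plugged into a Gaussian-mixture log-likelihood. The function name and toy values are illustrative assumptions.

    import numpy as np
    from scipy.stats import multivariate_normal

    def dbpc_log_score(x, a_ij, weights, means, covs):
        # delta-BPC (plug-in) score of Eq. (5.2)/(5.4): transition probability times
        # a diagonal-covariance GMM likelihood at frame x, computed in the log domain.
        comp = [np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=np.diag(c))
                for w, m, c in zip(weights, means, covs)]
        return np.log(a_ij) + np.logaddexp.reduce(comp)

    # Toy 2-component, 2-dimensional example with point-estimated parameters.
    x = np.array([0.2, -1.0])
    print(dbpc_log_score(x, a_ij=0.9,
                         weights=[0.6, 0.4],
                         means=[np.zeros(2), np.ones(2)],
                         covs=[np.ones(2), 2 * np.ones(2)]))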

Uniform posterior based BPC (UBPC)

The mismatch problem caused by the point estimation in speech recognition was first dealt with by introducing the distribution of a constrained uniform posterior for $p(\mu_{jk}^{(c)}|O)$, which is also regarded as an approximation of the true posterior within the BPC formulation [22, 58]. Their methods are based on the prior knowledge that the mismatch between two spectral coefficients of dimension d is experimentally bounded by $C\rho^{d-1}\varepsilon_d$,

where C and ρ are hyper-parameters. Therefore, they assume $p(\mu_{jk}^{(c)}|O)$ to be the constrained uniform posterior that has a location parameter set by MAP estimates $\check{\mu}_{jk}^{(c)}$ (ML estimates can also be used) and a scaling parameter set by the hyper-parameters as $C\rho^{d-1}\varepsilon_d$. The other parameters are distributed as Dirac posteriors, as in δBPC. Thus, the true posterior is approximated as follows:

$$p(x_t|c,i,j,O) = \int p\left(x_t\middle|\Theta_{ij}^{(c)}\right)\delta\left(a_{ij}^{(c)} - \check{a}_{ij}^{(c)}\right)\prod_k\delta\left(w_{jk}^{(c)} - \check{w}_{jk}^{(c)}\right)\delta\left(\Sigma_{jk}^{(c)} - \check{\Sigma}_{jk}^{(c)}\right)\prod_d U\!\left(\mu_{jk,d}^{(c)}\,\middle|\,\check{\mu}_{jk,d}^{(c)} - C\rho^{d-1}\varepsilon_d,\ \check{\mu}_{jk,d}^{(c)} + C\rho^{d-1}\varepsilon_d\right)d\Theta_{ij}^{(c)}. \qquad (5.5)$$

Although a normal approximation for the integral calculation of the RHS is used in [22], in [58] the integral with respect to $\Theta_{ij}^{(c)}$ of the RHS in Eq. (5.5) is analytically solved as follows:

$$\text{RHS in Eq. (5.5)} = \check{a}_{ij}^{(c)}\sum_k \check{w}_{jk}^{(c)}\prod_d f_{jk,d}\!\left(x_{t,d}\,\middle|\,\check{\mu}_{jk,d}^{(c)}, \check{\Sigma}_{jk,d}^{(c)}, C, \rho, \varepsilon_d\right), \qquad (5.6)$$

where $f_{jk,d}$ is defined as follows:

$$f_{jk,d}\!\left(x_{t,d}\,\middle|\,\check{\mu}_{jk,d}^{(c)}, \check{\Sigma}_{jk,d}^{(c)}, C, \rho, \varepsilon_d\right) \equiv \frac{1}{2C\rho^{d-1}\varepsilon_d}\left[\Phi\!\left(\frac{\check{\mu}_{jk,d}^{(c)} - x_{t,d} + C\rho^{d-1}\varepsilon_d}{\sqrt{\check{\Sigma}_{jk,d}^{(c)}}}\right) - \Phi\!\left(\frac{\check{\mu}_{jk,d}^{(c)} - x_{t,d} - C\rho^{d-1}\varepsilon_d}{\sqrt{\check{\Sigma}_{jk,d}^{(c)}}}\right)\right]. \qquad (5.7)$$

$\Phi$ is the cumulative distribution function of the standard Gaussian distribution defined as $\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-y^2/2}\,dy$. Thus, [58] obtained the predictive distribution by considering the marginalization of the Gaussian mean parameter using the uniform posterior, and we refer to this BPC approach as Uniform posterior based BPC (UBPC).
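Under the reconstruction of Eq. (5.7) given above, the per-dimension UBPC term is simply a difference of standard Gaussian CDFs divided by the width of the uniform interval. The following sketch, with illustrative variable names and not taken from [22, 58], evaluates this term and checks it against a direct numerical marginalization of the mean.

    import numpy as np
    from scipy.stats import norm

    def ubpc_dim(x, mu_map, var_map, C=3.0, rho=0.9, eps=1.0, d=1):
        # Per-dimension UBPC term of Eq. (5.7): a Gaussian in x whose mean has been
        # marginalized over a uniform interval of half-width C * rho**(d-1) * eps.
        half_width = C * rho ** (d - 1) * eps
        sigma = np.sqrt(var_map)
        upper = (mu_map - x + half_width) / sigma
        lower = (mu_map - x - half_width) / sigma
        return (norm.cdf(upper) - norm.cdf(lower)) / (2.0 * half_width)

    # Sanity check against direct numerical marginalization of the mean (d = 1).
    x, mu, var, half = 0.3, 0.0, 1.0, 3.0
    grid = np.linspace(mu - half, mu + half, 20001)
    dg = grid[1] - grid[0]
    numeric = np.sum(norm.pdf(x, loc=grid, scale=np.sqrt(var))) * dg / (2 * half)
    print(ubpc_dim(x, mu, var), numeric)   # both approximately 0.166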
VB posterior based BPC (VB-BPC)

In VBEC, after the acoustic modeling described in Section 2.3, we obtain the appropriate VB posterior distributions $q(\Theta|O)$. Therefore, VBEC can deal with the integrals in Eq. (2.46) by using the estimated VB posterior distributions $q(\Theta_{ij}^{(c)}|O)$ as follows:

$$p(x_t|c,i,j,O) = \int p\left(x_t\middle|\Theta_{ij}^{(c)}\right)q\left(\Theta_{ij}^{(c)}\middle|O\right)d\Theta_{ij}^{(c)}. \qquad (5.8)$$

The integral over $\Theta_{ij}^{(c)}$ can be solved analytically by substituting Eqs. (2.22) and (2.27) into Eq. (5.8). Similar to UBPC, if we only consider the marginalization of the mean parameter, the analytical result of the RHS in Eq. (5.8) is found to be a mixture of Gaussian distributions, as follows:

$$\text{RHS in Eq. (5.8)} = \frac{\tilde{\phi}_{ij}^{(c)}}{\sum_{j'}\tilde{\phi}_{ij'}^{(c)}}\sum_k\frac{\tilde{\varphi}_{jk}^{(c)}}{\sum_{k'}\tilde{\varphi}_{jk'}^{(c)}}\,\mathcal{N}\!\left(x_t\,\middle|\,\tilde{\nu}_{jk}^{(c)},\ \frac{(1+\tilde{\xi}_{jk}^{(c)})\tilde{R}_{jk}^{(c)}}{\tilde{\xi}_{jk}^{(c)}\tilde{\eta}_{jk}^{(c)}}\right). \qquad (5.9)$$

This corresponds to BPC(MAP) with a rescaled variance factored by $(1+\tilde{\xi}_{jk}^{(c)})/\tilde{\xi}_{jk}^{(c)}$, and this is referred to as VB-BPC-MEAN. If we consider the marginalization of all the parameters, the analytical result of the RHS in Eq. (5.8) is found to be a mixture distribution based on the Student's t-distribution St(·), as follows:

$$\text{RHS in Eq. (5.8)} = \frac{\tilde{\phi}_{ij}^{(c)}}{\sum_{j'}\tilde{\phi}_{ij'}^{(c)}}\sum_k\frac{\tilde{\varphi}_{jk}^{(c)}}{\sum_{k'}\tilde{\varphi}_{jk'}^{(c)}}\prod_d \mathrm{St}\!\left(x_{t,d}\,\middle|\,\tilde{\nu}_{jk,d}^{(c)},\ \frac{(1+\tilde{\xi}_{jk}^{(c)})\tilde{R}_{jk,d}^{(c)}}{\tilde{\xi}_{jk}^{(c)}\tilde{\eta}_{jk}^{(c)}},\ \tilde{\eta}_{jk}^{(c)}\right). \qquad (5.10)$$

The details of the derivation of Eqs. (5.9) and (5.10) are discussed in Appendix A.4, and the properties of the Student's t-distribution are described in Section 5.2.2. This approach is called Bayesian Predictive Classification using VB posterior distributions (VB-BPC). VB-BPC completes VBEC as a total Bayesian framework for speech recognition that possesses a consistent concept whereby all procedures (acoustic modeling and speech classification) are carried out based on posterior distributions, as shown in Figure 2.5. VBEC mitigates the sparse data problem by using the full potential of the Bayesian approach that is drawn out by this consistent concept, and VB-BPC contributes greatly as one of its components.
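Assuming the hyper-parameter notation used in the reconstruction of Eq. (5.10) above (the location, precision-count, degree-of-freedom and scale hyper-parameters of one Gaussian component, written as nu, xi, eta and R below), the per-component factor of the VB-BPC predictive score can be sketched as a product of per-dimension Student's t densities. This is an illustrative sketch rather than the VBEC implementation.

    import numpy as np
    from scipy.stats import t as student_t

    def vb_bpc_log_score(x, nu, xi, eta, R):
        # Log of the per-component VB-BPC predictive density: a product of
        # per-dimension Student's t densities following the reconstructed Eq. (5.10).
        # x, nu, R are D-dimensional vectors; xi and eta are scalars.
        lam = (1.0 + xi) * R / (xi * eta)      # per-dimension squared scale
        return np.sum(student_t.logpdf(x, df=eta, loc=nu, scale=np.sqrt(lam)))

    # Illustrative values: sparse data (small eta) versus dense data (large eta).
    x = np.array([1.5, -0.8])
    nu = np.zeros(2)
    R = np.array([2.0, 2.0])
    print(vb_bpc_log_score(x, nu, xi=1.0, eta=2.0, R=R))      # wide-tailed score
    print(vb_bpc_log_score(x, nu, xi=100.0, eta=200.0, R=R))  # close to a Gaussian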

5.2.2 Student's t-distribution


One of the specifications of VB-BPC is that its classification function is represented by a non-Gaussian Student's t-distribution. Therefore, VB-BPC is related to the study of non-Gaussian distribution based speech recognition. Here, we discuss the Student's t-distribution to clarify the difference between the Gaussian (Gauss(x)), the UBPC distribution (UBPC(x)), the variance-rescaled Gaussian (Gauss2(x)) and the Student's t-distributions (St1(x) and St2(x)), as shown in Figure 5.1 (a), where the distribution parameters corresponding to mean and variance are the same. Figure 5.1 (b) employs a logarithmic scale for the vertical axes of the linear scale sketched in Figure 5.1 (a) to emphasize the behavior of the distribution tail. In speech recognition, acoustic scores are calculated on the logarithmic scale, and therefore, the behavior in Figure 5.1 (b) contributes greatly to the acoustic score and is important to discuss. The Student's t-distribution is defined as follows:

$$\mathrm{St}(x|\nu, \lambda, \eta) = C_{St}\left\{1 + \frac{(x-\nu)^2}{\eta\lambda}\right\}^{-\frac{\eta+1}{2}}, \qquad (5.11)$$

where

$$C_{St} = \frac{\Gamma\!\left(\frac{\eta+1}{2}\right)}{\Gamma\!\left(\frac{\eta}{2}\right)}\left(\frac{1}{\pi\eta\lambda}\right)^{\frac{1}{2}}. \qquad (5.12)$$


Figure 5.1: (a) shows the Gaussian (Gauss(x)) derived from δBPC, the uniform distribution based predictive distribution (UBPC(x)) derived from UBPC in Eq. (5.6), the variance-rescaled Gaussian (Gauss2(x)) derived from VB-BPC-MEAN in Eq. (5.9), and two Student's t-distributions (St1(x) and St2(x)) derived from VB-BPC in Eq. (5.10). (b) employs the logarithmic scale of the vertical axes in (a) to emphasize the behavior of each distribution tail. The parameters corresponding to mean and variance are the same for all distributions. The hyper-parameters of UBPC are set at C = 3.0 and ρ = 0.9. The rescaling parameter of Gauss2(x) (ξ̃) is 1. The degrees of freedom (DoF) of the Student's t-distributions (η = η̃) are 1 for St1(x) and 100 for St2(x). Here ν and λ correspond to the mean and variance of the Gaussian, respectively.

The Student's t-distribution has an additional parameter η, which is referred to as a degree of freedom. This parameter represents the width of the distribution tail, as shown in Figure 5.1 (b). If η is small, the distribution tail becomes wider than the Gaussian, and if η is large, it approaches the Gaussian. From Eq. (5.10), η = η̃_jk, and η̃_jk is approximately proportional to the training data occupation count of the Gaussian given by Eq. (2.30), which is obtained for each Gaussian appropriately based on the VB Baum-Welch algorithm. Therefore, with dense training data, η = η̃_jk becomes large and VB-BPC approaches the Gaussian-based δBPC, as shown by St2(x) and Gauss(x) in Figure 5.1 (a) and (b), which is theoretically proved in [20]. On the other hand, when the training data is sparse, η = η̃_jk becomes small, and the distribution tail becomes wider, as in St1(x) of Figure 5.1 (b). This behavior is effective in solving the mismatch problem because a wider distribution can cover regions where unseen speech might be distributed. Consequently, VB-BPC mitigates the effects of the mismatch problem. This property shows that VB-BPC can automatically change the distribution tail via η = η̃_jk in the Student's t-distribution according to the amount of training data, which is the advantage of VB-BPC over the other BPCs. Although UBPC(x) has a flatter peak than Gauss(x) in Figure 5.1 (a), the tail behavior is less flexible than that of the Student's t-distribution, and tends to be similar to that of Gauss2(x), which corresponds to VB-BPC-MEAN, as seen in Figure 5.1 (b). This similar behavior is probably based on the fact that both UBPC and VB-BPC-MEAN marginalize only the mean parameter of the output

distribution.
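The tail behavior described above can be checked numerically. The following small example, which is illustrative and not part of the thesis, compares the log-density of a Gaussian and of Student's t-distributions with increasing degrees of freedom at a point far from the mean; the t log-score is much larger for small DoF and converges to the Gaussian score as the DoF grows.

    import numpy as np
    from scipy.stats import norm, t as student_t

    x, mean, var = 5.0, 0.0, 1.0          # a point far out in the tail
    gauss = norm.logpdf(x, loc=mean, scale=np.sqrt(var))
    for dof in (1, 10, 100, 1000):
        st = student_t.logpdf(x, df=dof, loc=mean, scale=np.sqrt(var))
        print(f"eta={dof:5d}  log St={st:8.3f}  log Gauss={gauss:8.3f}")
    # Small eta gives a much larger (less negative) log-density in the tail,
    # i.e. a wider tail; large eta converges to the Gaussian score.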


5.2.3 Relationship between Bayesian prediction approaches


Lastly, the relationships between members of the BPC family are summarized as follows.

δBPC is actually equivalent to classifications that utilize point-estimated values of model parameters, and does not have any explicit capability of mitigating the mismatch problem because δBPC does not marginalize model parameters. δBPC is an extremely simplified case of BPCs, i.e., any other BPC approaches δBPC if its scale-parameter posterior value approaches zero.

UBPC and VB-BPC-MEAN are similar in the sense that both marginalize only the mean parameters of models. UBPC provides a predictive distribution with a flat-peak shape depending on the hyper-parameter setting, while VB-BPC-MEAN provides a Gaussian predictive distribution whose variance is rescaled so that it spreads as the training data becomes sparse. The mitigation effect on the mismatch comes from the flat-peak shape of the distribution for UBPC, and from the spread variance for VB-BPC-MEAN, which are determined by hyper-parameters and from the training data, respectively.

VB-BPC provides a non-Gaussian and wide-tailed predictive distribution. Since the variance parameters of models are also marginalized by VB-BPC, the wide-tailed shape of its predictive distribution, which is analytically derived as the Student's t-distribution, is obtained. In VB-BPC, the shape of the distribution is automatically determined from the training data, i.e., the tail becomes wider as the training data becomes sparser, unlike with UBPC where the flat-peak shape is determined by hyper-parameter tuning.

The relationship between BPCs is summarized in Table 5.1.

5.3 Experiments
Two experiments were conducted to show the effectiveness of VB-BPC. The first experiment was designed to show the role of VB-BPC in the total Bayesian framework VBEC for the sparse data problem. Namely, it is shown how VB-BPC contributes to solving the sparse data problem in association with the other Bayesian advantages provided by VBEC. The second experiment was designed to compare the effectiveness of VB-BPC with the conventional Bayesian approaches.

Table 5.1: Relationship between BPCs
                          δBPC       UBPC                  VB-BPC-MEAN    VB-BPC
Posterior distribution    Dirac      Constrained uniform   Gaussian       Normal-Gamma
Mean parameter            -          Marginalized          Marginalized   Marginalized
Variance parameter        -          -                     -              Marginalized
Predictive distribution   Gaussian   Error function        Gaussian       Student's t



Table 5.2: Experimental conditions for isolated word recognition task

Sampling rate                  16 kHz
Quantization                   16 bit
Feature vector                 12-order MFCC + 12-order ΔMFCC (24 dimensions)
Window                         Hamming
Frame size/shift               25/10 ms
Number of HMM states           3 (left-to-right HMM)
Number of phoneme categories   27
Number of GMM components       16
Training data                  ASJ: 10,709 utterances, 10.2 hours (44 male)
Test data                      JEIDA: 100 city names, 7,200 words (75 male)

ASJ: Acoustical Society of Japan
JEIDA: Japan Electronic Industry Development Association

Table 5.3: Prior distribution parameters

Prior parameter    Setting value
$\phi^0_{jk}$      $\gamma_{jk} \times 0.01$
$\xi^0_{jk}$       $\gamma_{jk} \times 0.01$
$\nu^0_{jk}$       mean vector of state j
$\eta^0_{jk}$      $\gamma_{jk} \times 0.01$
$R^0_{jk}$         $\eta^0_{jk} \times$ covariance matrix of state j

VB-BPC is applied to speaker adaptation within the direct parameter adaptation scheme as a practical example of solving the sparse data problem, and the effectiveness of VB-BPC is examined by comparison with the conventional BPC(MAP) and UBPC approaches [15, 58]. All the experiments were performed using the SOLON speech recognition toolkit [4].

5.3.1 Bayesian predictive classification in total Bayesian framework


Isolated word recognition experiments were conducted, and the fully implemented version of VBEC that includes VB-BPC was compared with other partially implemented versions. The experimental conditions are summarized in Table 5.2. The setting of the prior parameters for the VB and MAP training is shown in Table 5.3. The training data consisted of 10,709 Japanese utterances (10.2 hours) spoken by 44 males. The test data consisted of 100 Japanese city names spoken by 75 males (a total of 7,200 words). Several subsets of different sizes were randomly extracted from the training data set, and each of the subsets was used to construct a set of acoustic models. The acoustic models were represented by context-dependent HMMs and each HMM state had a 16-component GMM. As a result, 36 sets of acoustic models were prepared for various amounts of training data.


Figure 5.2: Recognition rate for various amounts of training data. The horizontal axis is scaled logarithmically.

Table 5.4 summarizes approaches that use a combination of methods for model selection (context-dependent HMM state clustering), training and classification, for each of which we employ either VB or other approaches. The combination determines how well the method includes the Bayesian advantages, i.e., effective utilization of prior knowledge, appropriate selection of model structure and robust classification of unseen speech. Here BIC/MDL indicates model selection using the minimum description length criterion (or the Bayesian information criterion), which we should recognize as a kind of ML-based approach [45], and BPC(MAP) is regarded as a partial implementation of the Bayesian prediction approach, as discussed in Section 5.2. Note that all of the combinations, except for Full-VBEC, include an ML or a merely partial implementation of the Bayesian approach, and that the approaches are listed in the order of how well the Bayesian advantages are included.

Figure 5.2 shows recognition results obtained using the combinations. We can see that the more the Bayesian advantages were included, the more robustly speech was recognized. In particular, when the training data were sparse (fewer than 100 utterances), Full-VBEC significantly outperformed the other combinations. In addition, when the training data were even sparser (fewer than 50 utterances), Full-VBEC was better than VBEC-MAP by 2 to 9%. Note that the only difference between them was the classification algorithm, i.e., VB-BPC or BPC(MAP). This improvement is clearly due to the effectiveness of VB-BPC, and perhaps due to the synergistic effect that results from exploiting the full potential of the Bayesian approach by incorporating all its advantages.




Table 5.4: Configuration of VBEC and ML based approaches

              Model selection   Training   Classification
Full-VBEC     VB                VB         VB-BPC
VBEC-MAP      VB                VB         BPC(MAP)
VBEC-ML       VB                ML         BPC(MLC)
BIC/MDL-ML    BIC/MDL           ML         BPC(MLC)


5.3.2 Supervised speaker adaptation


The effectiveness of VB-BPC for supervised speaker adaptation was examined as a practical application of VB-BPC, which can demonstrate its superiority in terms of solving the sparse data problem. The improvements in accuracy by the adaptation are compared using VB-BPC (Eq. (5.10)), VB-BPC-MEAN (Eq. (5.9)), UBPC (Eq. (5.6)) and BPC(MAP) (Eq. (5.4)), each of which belongs to the direct HMM parameter adaptation scheme. Table 5.5 summarizes the experimental conditions. The initial (prior) acoustic model was constructed from read sentences, and this model was adapted using 10 lectures given by 10 males and their labels [59]. In this task, the mismatch between training and adaptation data is caused not only by the speakers, but also by the difference in speaking styles between read speech and lectures. The total training data for the initial models consisted of 10,709 Japanese utterances spoken by 44 males. In the initial model training, 1,000 speaker-independent context-dependent HMM states were constructed using a phonetic decision tree method. The output distribution in each state was represented by a 16-component GMM, and the model parameters were trained based on conventional ML estimation. Each lecture was divided in half based on the utterance units; the first half of the lecture was used as adaptation data and the second half was used as recognition data. The total adaptation data consisted of more than 60 utterances for each male, and 1, 2, 4, 8, 16, 32, 40, 48 and 60 utterances were used as adaptation data. As a result, about nine sets of adapted acoustic models for several amounts of adaptation data were prepared for each male. The prior parameter settings are shown in Table 5.6, and were used to estimate the MAP parameters in BPC(MAP) and UBPC, and the VB posteriors in VB-BPC-MEAN and VB-BPC. When setting the UBPC hyper-parameters, they were preliminarily optimized by trying eight combinations of C = 2, 3, 4 and 5 and ρ = 0.7 and 0.9 with reference to the result in [58], and the combination {C = 3, ρ = 0.9} that provided the best average word accuracy was adopted. Throughout this experiment we used a beam search algorithm with a sufficient beam width and a sufficient number of hypotheses to avoid search errors in decoding. The language model weight used in this experiment was optimized by the word accuracy of each result.

Figure 5.3 compares the recognition results obtained with VB-BPC, VB-BPC-MEAN, UBPC and BPC(MAP) for several amounts of adaptation data with the baseline performance for the non-adapted speaker-independent model (62.9% word accuracy). Table 5.7 shows the results of Figure 5.3 in detail by including the word accuracy scores and the time of the adaptation data for each speaker.

Table 5.5: Experimental conditions for LVCSR speaker adaptation task

Sampling rate                  16 kHz
Quantization                   16 bit
Feature vector                 12-order MFCC with energy + Δ + ΔΔ (39 dimensions)
Window                         Hamming
Frame size/shift               25/10 ms
Number of HMM states           3 (left-to-right)
Number of phoneme categories   43
Number of GMM components       16
Initial training data          ASJ: 10,709 utterances, 10.2 hours (44 males)
Adaptation data                CSJ: 1st-half lectures (10 males)
Test data                      CSJ: 2nd-half lectures (10 males)
Language model                 Standard trigram (made from CSJ transcriptions)
Vocabulary size                30,000
Perplexity                     82.2
OOV rate                       2.1%

CSJ: Corpus of Spontaneous Japanese


Table 5.6: Prior distribution parameters

Prior parameter    Setting value
$\phi^0_{jk}$      10
$\xi^0_{jk}$       10
$\nu^0_{jk}$       SI mean vector of Gaussian k in state j
$\eta^0_{jk}$      10
$R^0_{jk}$         $\eta^0_{jk} \times$ SI covariance matrix of Gaussian k in state j

SI: Speaker Independent

First, we focus on the effectiveness of the marginalization of the model parameters in BPCs for the sparse data problem, as discussed in Section 5.2.3. Namely, the results of VB-BPC, VB-BPC-MEAN and UBPC were compared with that of BPC(MAP), which does not marginalize model parameters at all. Figure 5.3 shows that for a small amount of adaptation data (fewer than 8 adaptation utterances), VB-BPC, VB-BPC-MEAN and UBPC were better than BPC(MAP), which confirms the effectiveness of the marginalization of the model parameters. A more detailed examination of the results in this region showed that VB-BPC was better than UBPC by 0.7 to 1.5 points, and that VB-BPC-MEAN and UBPC behaved similarly. This suggests the effectiveness of the wide-tail property of the Student's t-distribution discussed in Section 5.2.3, which is obtained by the marginalization of the variance parameters in addition to the mean parameters. In particular, the results of the one-utterance adaptation in Table 5.7, where VB-BPC scored the best for 9 of 10 speakers, support the above suggestion of the effectiveness of the VB-BPC marginalization when the mismatch would be very large due to extreme data sparseness.


Figure 5.3: Word accuracy for various amounts of adaptation data. The horizontal axis is scaled logarithmically.

Second, for any given amount of adaptation data, VB-BPC and VB-BPC-MEAN achieved comparable or better performance than UBPC, which required hyper-parameter (C and ρ) optimization. Therefore, we can say that VB-BPC and VB-BPC-MEAN could determine the shapes of their distributions automatically and appropriately from the adaptation data without tuning the hyper-parameters, as discussed in Section 5.2.3. Finally, VB-BPC was the best for almost all amounts of adaptation data. VB-BPC approached the BPC(MAP) performance asymptotically, and provided the highest word accuracy score of 72.9% for this task (the benchmark score obtained by a speaker-independent acoustic model trained using lectures is about 72.0% word accuracy [59]). This confirms the steady improvement of the performance using VB-BPC. Thus, the effectiveness of VB-BPC based on the Student's t-distribution for the sparse data problem has been shown in this as well as the previous experiments.

5.3.3 Computational efficiency


This subsection comments on the computation time needed for the speech classification process. Full-VBEC took six times as long as the other approaches (VBEC-MAP, VBEC-ML and BIC/MDL-ML), mainly because more computation time was needed for the acoustic score calculation of the Student's t-distribution in VB-BPC than with the Gaussian-based BPCs.


The current implementation of the Student's t-distribution computation requires an additional logarithm computation for each feature dimension compared with the Gaussian computation. In addition, the speed of the Gaussian computation has been increased in a number of ways in our decoder (e.g., by utilizing cache memories), and the speed of the Student's t-distribution computation must be increased similarly to reduce the difference in computation time.
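As an illustration of the kind of speed-up suggested here (a sketch under the reconstructed density of Eq. (5.11), not the SOLON decoder code), the data-independent part of the Student's t log-density can be cached per Gaussian so that scoring one frame costs roughly one log1p call per dimension.

    import numpy as np
    from scipy.special import gammaln

    class StudentTScorer:
        # Caches the data-independent parts of the per-dimension Student's t
        # log-density (location nu, squared scale lam, DoF eta) so that scoring
        # one frame needs only one log1p per dimension plus multiplies/adds.
        def __init__(self, nu, lam, eta):
            self.nu = np.asarray(nu)
            self.inv_eta_lam = 1.0 / (eta * np.asarray(lam))
            self.half_eta_p1 = 0.5 * (eta + 1.0)
            self.log_const = np.sum(
                gammaln(0.5 * (eta + 1.0)) - gammaln(0.5 * eta)
                - 0.5 * np.log(np.pi * eta * np.asarray(lam)))

        def log_score(self, x):
            z = (x - self.nu) ** 2 * self.inv_eta_lam
            return self.log_const - self.half_eta_p1 * np.sum(np.log1p(z))

    scorer = StudentTScorer(nu=np.zeros(39), lam=np.ones(39), eta=10.0)
    print(scorer.log_score(np.random.randn(39)))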

5.4 Summary
This chapter introduced a method of Bayesian Predictive Classification using Variational Bayesian posteriors (VB-BPC) in speech recognition. An analytical solution is obtained based on the Student's t-distribution. The effectiveness of this approach is confirmed experimentally by the recognition rate improvement obtained when VB-BPC is used in a total Bayesian framework. In speaker adaptation experiments in the direct HMM parameter adaptation scheme, VB-BPC is more effective than the conventional maximum a posteriori and uniform distribution based Bayesian prediction approaches. Thus, we show the effectiveness of VB-BPC based on the Student's t-distribution for the sparse data problem. This approach is related to the study of non-Gaussian distribution based speech recognition [60, 61], since it successfully applied the Student's t-distribution to large vocabulary continuous speech recognition. By considering Bayesian prediction for the various parametric models in speech recognition (e.g., the conventional Bayesian prediction approach has been applied to the transformation-based parametric adaptation approach [62]), the next step is to study the application of other non-Gaussian distributions to speech recognition.


Table 5.7: Experimental results of the model adaptation experiments for each speaker based on VB-BPC, VB-BPC-MEAN, UBPC and BPC(MAP). The best scores among the four methods are highlighted with a bold font.
Speaker ID time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) time(s) VB-BPC VB-BPC-MEAN UBPC BPC(MAP) 1 1.7 84.2 83.4 83.0 82.5 5.3 75.9 75.6 73.0 73.8 4.3 68.6 68.5 68.9 68.3 1.5 50.2 48.8 48.9 47.1 4.4 76.1 75.1 73.6 73.7 2.2 48.8 45.6 45.0 44.5 6.5 71.7 70.9 70.5 70.0 2.1 68.0 66.0 66.4 65.4 2.5 63.7 63.7 63.2 62.7 2.2 63.5 62.6 63.5 61.4 3.3 65.7 64.5 64.2 63.4 2 25.8 84.7 85.0 84.6 85.0 7.1 75.6 76.9 75.0 74.8 7.0 69.4 68.6 69.0 68.6 8.8 52.1 50.8 51.0 49.7 9.3 76.1 75.3 74.1 74.0 7.3 48.3 46.4 46.4 45.5 7.5 71.7 71.2 71.2 70.1 6.1 68.7 66.8 66.8 66.6 15.7 64.1 65.1 65.7 64.9 12.9 66.6 66.1 67.3 65.1 10.8 66.4 65.7 65.7 65.0 Amount of adaptation data (utterance) 4 8 16 32 40 37.6 58.9 107.4 198.1 240.2 85.6 85.6 87.2 87.7 88.8 86.0 85.5 86.7 87.5 89.3 85.4 85.5 86.7 87.9 88.2 85.8 85.8 86.4 87.7 89.2 18.2 32.1 64.2 152.7 184.6 75.9 78.6 84.2 85.7 86.8 76.9 78.6 83.7 85.7 87.0 76.3 78.7 82.4 84.5 86.5 76.3 78.3 83.4 84.8 86.8 18.4 33.4 64.3 132.7 180.2 69.9 69.7 71.8 73.6 74.5 69.6 69.5 72.4 74.1 74.0 69.7 69.0 71.3 73.3 73.2 68.5 69.0 72.4 74.2 74.0 13.3 24.1 48.0 95.1 128.3 52.4 53.1 55.0 56.4 55.7 50.9 51.9 53.2 55.5 55.1 51.9 52.3 52.2 54.9 54.5 49.9 51.1 52.4 55.3 54.5 16.1 35.1 64.1 116.4 147.4 76.5 77.2 79.3 79.5 80.5 76.5 78.4 79.8 79.2 80.2 75.2 76.4 79.4 78.8 79.5 76.0 77.7 79.7 79.0 79.6 13.9 25.1 64.9 109.7 134.5 48.9 50.4 51.2 54.6 55.9 46.9 48.7 50.9 54.7 55.4 48.0 48.0 50.7 54.3 54.6 47.3 47.9 51.2 54.4 55.2 13.3 34.7 69.4 159.5 196.0 72.2 73.5 73.3 75.3 75.6 71.2 73.2 75.0 76.5 77.3 71.6 72.6 73.8 76.4 77.3 70.2 72.8 73.7 76.3 76.6 8.0 38.3 64.4 137.1 180.6 68.9 70.8 70.4 70.4 71.5 67.1 68.2 68.7 70.1 70.6 66.8 67.7 67.3 70.3 70.0 66.8 67.5 67.7 69.8 69.4 26.3 48.4 89.3 170.1 218.1 65.2 66.2 68.3 69.7 71.2 65.3 67.4 69.3 70.0 71.8 65.5 66.5 68.8 70.7 70.5 65.1 66.8 69.0 70.4 70.8 23.9 46.7 89.7 163.7 201.1 66.5 68.4 70.3 71.2 71.5 66.7 67.4 69.7 71.4 71.6 66.5 68.0 69.5 71.0 71.0 65.1 66.7 68.9 70.9 70.5 18.9 37.7 72.6 143.5 181.1 66.9 68.0 69.5 70.8 71.6 66.2 67.4 69.3 70.8 71.6 66.2 67.0 68.6 70.6 70.8 65.7 66.9 68.8 70.6 70.9 48 296.7 88.3 89.2 89.5 89.2 228.2 88.6 88.0 87.3 88.1 209.2 74.2 75.5 74.5 75.0 161.1 56.6 57.0 56.9 56.9 175.8 80.3 80.8 79.8 80.3 166.8 56.1 55.5 55.0 55.5 246.2 76.0 76.8 76.5 77.3 214.0 72.5 71.3 71.0 71.0 261.1 71.8 71.3 71.3 71.7 251.6 72.4 71.4 71.9 70.5 221.0 72.0 72.0 71.7 71.8 60 355.0 89.2 89.7 89.9 89.7 281.4 89.8 90.0 89.8 89.6 248.3 74.3 75.7 75.6 75.6 193.1 57.1 57.5 56.6 57.4 224.9 81.1 81.5 80.5 81.3 215.9 56.7 55.6 55.1 55.3 297.0 77.9 77.6 76.8 77.5 265.5 72.4 72.4 71.9 71.1 329.8 73.7 74.1 74.1 73.2 312.4 74.1 73.6 73.1 71.9 272.3 72.9 73.0 72.5 72.5

A01M0097 Baseline accuracy = 79.0

A01M0110 Baseline accuracy = 75.3

A01M0137 Baseline accuracy = 67.7

A03M0106 Baseline accuracy = 47.2

A03M0112 Baseline accuracy = 73.9

A03M0156 Baseline accuracy = 43.9

A04M0051 Baseline accuracy = 69.1

A04M0121 Baseline accuracy = 64.7

A04M0123 Baseline accuracy = 62.9

A05M0011 Baseline accuracy = 60.7

Average Baseline accuracy = 62.9

Chapter 6 Conclusions
6.1 Review of work
The aim of this thesis was to overcome the lack of robustness in current speech recognition systems based on the Maximum Likelihood (ML) approach by introducing a Bayesian approach, both theoretically and practically. The thesis has achieved the following objectives by applying the Variational Bayesian (VB) approach to speech recognition:

The formulation of Variational Bayesian Estimation and Clustering for speech recognition (VBEC), as a total framework for speech recognition, which covers both acoustic model construction and speech classification by consistently using the VB posteriors (Chapter 2).

Bayesian acoustic model construction by consistently using VB based Bayesian formulations such as the VB Baum-Welch algorithm and VB model selection within the VBEC framework (Chapter 3).

The automatic determination of acoustic model topologies by expanding the above Bayesian acoustic model construction (Chapter 4).

Bayesian speech classification based on Bayesian predictive classification using the Student's t-distribution within the VBEC framework (Chapter 5).

This thesis confirms the three Bayesian advantages (prior utilization, model selection and robust classification) over ML through the use of speech recognition experiments. Thus, VBEC totally mitigates the effect of over-training in speech recognition. In addition, the automatic determination provided by VBEC enables us to dispense with manual tuning procedures when constructing acoustic models. Thus, this thesis achieves Bayesian speech recognition through the realization of the total Bayesian speech recognition framework VBEC.

6.2 Related work

VB is a key technique in this thesis. Table 6.1 summarizes the technical trend in VB-applied speech information processing. Note that although the first applications of VB to speech recognition were


limited to the topics of feature extraction and acoustic models, recent applications have covered spoken language modeling. Therefore, VB has been widely applied to speech recognition and other forms of speech processing. Given such a trend, this work plays an important role in pioneering the main formulation and implementation of VB based speech recognition, which is a core technology in this field. As regards Bayesian speech recognition, there have been many studies that did not employ VB based approaches. Although this thesis mainly compares the maximum a posteriori, Bayesian information criterion, and Bayesian prediction approaches, a serious discussion should also be conducted of another major realization of Bayesian speech recognition, the quasi-Bayes approaches [16, 17, 63].

6.3 Future work


Each summary section in the previous chapters suggests future work related to the technique described in that chapter. This section provides the global future directions triggered by this thesis. A major study must be undertaken to expand VBEC to deal with recent topics in speech recognition such as discriminative training, the full covariance model, and feature extraction. Future work will also concentrate on advanced topics in relation to acoustic model adaptation techniques such as on-line adaptation [71] and structural Bayes [72] by utilizing all the Bayesian advantages based on VBEC. To realize these approaches, the total Bayesian framework must be expanded to deal with the sequential updating of prior distributions and the model structure according to the time evolution. In addition, the prior distribution setting must be carefully considered. Finally, new modeling that extends beyond the standard acoustic model (the hidden Markov model, the Gaussian mixture model, and the current phoneme unit) must be studied. Recent progress in speech recognition is mainly due to advanced training methods, which are typified by discriminative and Bayesian analysis beyond ML. Although many attempts at new modeling have been unsuccessful (e.g., segment models [73]), these training methods can provide a breakthrough with respect to the new modeling problem.

Table 6.1: Technical trend of speech recognition using variational Bayes

Topic                                        References         Date
Feature extraction                           [64, 65]           2002
Clustering context-dependent HMM states      [29, 30, 49]       2002
Formulation of Bayesian speech recognition   [27, 28]           2002
Selection of number of GMM components        [52-54]            2002
Acoustic model adaptation                    [34, 55, 66, 67]   2003
Determination of acoustic model topology     [31, 32]           2003
Gaussian reduction                           [68]               2004
Bayesian prediction                          [33, 34]           2005
Language modeling                            [69, 70]           2003 -


6.4 Summary
This thesis dealt with Bayesian speech recognition, and realized the total Bayesian speech recognition framework VBEC, both theoretically and practically. This thesis represents pioneering work with respect to the main formulation and implementation of VB based speech recognition, which is a core technology in the Bayesian speech recognition field. The VBEC framework will be improved to deal with model adaptation techniques and new modeling in speech recognition and other forms of speech information processing. I shall be content if this thesis contributes to advances in worldwide studies of speech information processing through the further progress of Bayesian speech recognition.

ACKNOWLEDGMENTS
It gives me great pleasure to receive my doctorate from Waseda University, from which I received my master's degree five years ago. I would like to thank Professor Tetsunori Kobayashi, who was my main supervisor, and Professors Yasuo Matsuyama, Toshiyasu Matsushima, and Yoshinori Sagisaka, who acted as vice supervisors, for giving me this opportunity, as well as for their generous teaching during my PhD coursework. I particularly want to thank Professor Kobayashi for noticing my work in its early stages, and for offering me much advice, both general and detailed, as I pursued my research and constructed this thesis.

I started my research career during my time in the Department of Physics at Waseda University. From the 4th year of my bachelor's degree to the 2nd year of my master's degree I studied at the Ohba-Nakazato Laboratory, where I established my research style of seeking out theories and uniqueness. I want to thank Professor Ichiro Ohba, Professor Hiromichi Nakazato, Dr. Hiroki Nakamura, and the other senior researchers for their valuable advice and direction, which stays with me to this day. I will never forget their seminars on scattering and neutrino theory. I have also been stimulated by many colleagues, especially Dr. Tsuyoshi Otobe and Dr. Gen Kimura (currently at Tohoku University), even after my graduation. I hope we can continue this relationship.

This thesis was conducted while belonging to NTT Communication Laboratories, NTT Corporation. My continuous five-year research on speech information processing has been supported and allowed to continue by Dr. Shigeru Katagiri, Dr. Shoji Makino, and Dr. Masato Miyoshi as executive managers and group leaders, thanks to their understanding of my work. In this period, my research has developed greatly through various research discussions and communications within the Speech Open Laboratory and the Signal Processing Research Group. Members of our speech recognition team, Dr. Erik McDermott, Dr. Mike Schuster, Dr. Daniel Willet (currently at TEMIC SDS), Dr. Takaaki Hori, Kentaro Ishizuka, and Takanobu Oba, have always provided me with valuable technical knowledge with regard to speech recognition. In addition to these colleagues, members of the technical support staff, and in particular Ken Shimomura, have engaged in the development of the speech recognition research platform SOLON, which is a basic tool of my research. I am extremely grateful to all of them. Other members of my laboratory, such as Dr. Tomohiro Nakatani, Dr. Chiori Hori (currently at Carnegie Mellon University), Dr. Parham Zolfaghari (currently at BNP Paribas), Dr. Masashi Inoue (currently at the National Institute of Informatics), and Keisuke Kinoshita, have given me great pleasure through valuable discussions on speech processing and statistical learning theory. Each person has provided a different viewpoint on speech recognition, and has also encouraged me along the way. I am also grateful for the support and discussion given to my work by Atsushi Sako of Kobe University, Toshiaki Kubo of


Waseda University, Wesley Arnquist of the University of Washington, and Hirokazu Kameoka of the University of Tokyo through their internship programs at NTT, and by David Meacock, who has refined the English of most of my work, including this thesis. Dr. Satoshi Takahashi, Yoshikazu Yamaguchi (currently at NTT IT Corporation), and Atsunori Ogawa of NTT Cyber Space Laboratories have all given me valuable advice as researchers with experience in real applications of speech recognition. The part of my work dealing with automatic determination was emboldened by Yoshikazu Yamaguchi, as he always pointed out the importance of this topic and its need on the development side, and encouraged me to work on it. To continue such work, I would like to maintain this good relationship between the research and development arms of NTT.

I started the work described in this thesis with Dr. Naonori Ueda, a specialist in statistical learning theory, Dr. Atsushi Nakamura and Dr. Yasuhiro Minami, specialists in speech recognition, and myself, who, at the time, knew nothing of statistical learning theory or speech recognition. The hard work during that first research period and their strict teaching has a treasured place in my memory. Their teaching has covered many aspects such as research and social postures, technical and business writing, and how to conduct research, as well as research activities. They also showed me their own kindness and consideration, which was especially encouraging when I was still young in my research life. I truly feel it would have been hard to find such a fortunate and advantageous learning environment with such knowledgeable supervisors. Finally, I would like to thank all of my friends, Professor Yoshiji Horikoshi at Waseda University and his family, as they treat me like a family member, and my family for their wonderful support throughout these years.


Bibliography
[1] K. F. Lee, H. W. Hon, and R. Reddy. An overview of the SPHINX speech recognition system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38:35-45, 1990.
[2] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young. Large vocabulary continuous speech recognition using HTK. In Proc. ICASSP1994, volume 2, pages 125-128, 1994.
[3] T. Kawahara, A. Lee, T. Kobayashi, K. Takeda, N. Minematsu, K. Itou, A. Ito, M. Yamamoto, A. Yamada, T. Utsuro, and K. Shikano. Japanese dictation toolkit - 1997 version -. Journal of the Acoustical Society of Japan (E), 20:233-239, 1999.
[4] T. Hori. NTT Speech recognizer with OutLook On the Next generation: SOLON. In Proc. NTT Workshop on Communication Scene Analysis, volume 1, SP-6, 2004.
[5] T. Hori, C. Hori, and Y. Minami. Fast on-the-fly composition for weighted finite-state transducers in 1.8 million-word vocabulary continuous speech recognition. In Proc. ICSLP2004, volume 1, pages 289-292, 2004.
[6] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1976.
[7] F. Jelinek. Continuous speech recognition by statistical methods. In Proc. IEEE, volume 64(4), pages 532-556, 1976.
[8] X. D. Huang, Y. Ariki, and M. A. Jack. Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990.
[9] X. D. Huang, A. Acero, and H. W. Hon. Spoken Language Processing, a Guide to Theory, Algorithm, and System Development. Prentice Hall PTR, 2001.
[10] P. Brown. The acoustic-modeling problem in automatic speech recognition. PhD thesis, Carnegie Mellon University, 1987.
[11] S. Katagiri, B-H. Juang, and C-H. Lee. Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40:3043-3054, 1992.
[12] E. McDermott. Discriminative training for speech recognition. PhD thesis, Waseda University, 1997.
[13] D. Povey. Discriminative training for large vocabulary speech recognition. PhD thesis, Cambridge University, 2003.
[14] C. H. Lee, C. H. Lin, and B-H. Juang. A study on speaker adaptation of the parameters of continuous density hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 39:806-814, 1991.
[15] J-L. Gauvain and C-H. Lee. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2:291-298, 1994.
[16] Q. Huo, C. Chan, and C.-H. Lee. On-line adaptation of the SCHMM parameters based on the segmental quasi-Bayes learning for speech recognition. IEEE Transactions on Speech and Audio Processing, 4:141-144, 1996.
[17] J. T. Chien. Quasi-Bayes linear regression for sequential learning of hidden Markov models. IEEE Transactions on Speech and Audio Processing, 10:268-278, 2002.
[18] S. Furui. Recent advances in spontaneous speech recognition and understanding. In Proc. SSPR2003, pages 1-6, 2003.
[19] J. O. Berger. Statistical Decision Theory and Bayesian Analysis, Second Edition. Springer-Verlag, 1985.
[20] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley & Sons Ltd, 1994.
[21] K. Shinoda and T. Watanabe. MDL-based context-dependent subword modeling for speech recognition. Journal of the Acoustical Society of Japan (E), 21:79-86, 2000.
[22] Q. Huo and C-H. Lee. A Bayesian predictive classification approach to robust speech recognition. IEEE Transactions on Speech and Audio Processing, 8:200-204, 2000.
[23] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1997.
[24] S. Waterhouse, D. MacKay, and T. Robinson. Bayesian methods for mixtures of experts. NIPS 7, MIT Press, 1995.
[25] H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proc. Uncertainty in Artificial Intelligence (UAI) 15, 1999.
[26] N. Ueda and Z. Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15:1223-1241, 2002.
[27] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Application of variational Bayesian approach to speech recognition. NIPS 2002, MIT Press, 2002.
[28] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Variational Bayesian estimation and clustering for speech recognition. IEEE Transactions on Speech and Audio Processing, 12:365-381, 2004.
[29] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Constructing shared-state hidden Markov models based on a Bayesian approach. In Proc. ICSLP2002, volume 4, pages 2669-2672, 2002.
[30] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Selection of shared-states hidden Markov model structure using Bayesian criterion. IEICE Transactions on Information and Systems, J86-D-II:776-786, 2003. (in Japanese).
[31] S. Watanabe, A. Sako, and A. Nakamura. Automatic determination of acoustic model topology using variational Bayesian estimation and clustering. In Proc. ICASSP2004, volume 1, pages 813-816, 2004.
[32] S. Watanabe, A. Sako, and A. Nakamura. Automatic determination of acoustic model topology using variational Bayesian estimation and clustering for large vocabulary continuous speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14, 2006. (in press).
[33] S. Watanabe and A. Nakamura. Effects of Bayesian predictive classification using variational Bayesian posteriors for sparse training data in speech recognition. In Proc. Interspeech 2005 - Eurospeech, pages 1105-1108, 2005.
[34] S. Watanabe and A. Nakamura. Speech recognition based on Student's t-distribution derived from total Bayesian framework. IEICE Transactions on Information and Systems, E89-D:970-980, 2006.
[35] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.
[36] H. Akaike. Likelihood and the Bayes procedure. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, editors, Bayesian Statistics, pages 143-166. University Press, Valencia, Spain, 1980.
[37] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, 1993.
[38] S. Sagayama. Phoneme environment clustering for speech recognition. In Proc. ICASSP1989, volume 1, pages 397-400, 1989.
[39] J. Takami and S. Sagayama. A successive state splitting algorithm for efficient allophone modeling. In Proc. ICASSP1992, volume 1, pages 573-576, 1992.
[40] M. Ostendorf and H. Singer. HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language, 11:17-41, 1997.
[41] J. J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.
[42] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716-723, 1974.
[43] J. Rissanen. Universal coding, information, prediction and estimation. IEEE Transactions on Information Theory, 30:629-636, 1984.
[44] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461-464, 1978.
[45] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech1997, volume 1, pages 99-102, 1997.
[46] W. Chou and W. Reichl. Decision tree state tying based on penalized Bayesian information criterion. In Proc. ICASSP1999, volume 1, pages 345-348, 1999.
[47] S. Chen and R. Gopinath. Model selection in acoustic modeling. In Proc. Eurospeech1999, volume 3, pages 1087-1090, 1999.
[48] K. Shinoda and K. Iso. Efficient reduction of Gaussian components using MDL criterion for HMM-based speech recognition. In Proc. ICASSP2001, volume 1, pages 869-872, 2001.
[49] T. Jitsuhiro and S. Nakamura. Automatic generation of non-uniform HMM structures based on variational Bayesian approach. In Proc. ICASSP2004, volume 1, pages 805-808, 2004.
[50] H. Attias. A variational Bayesian framework for graphical models. NIPS 2000, MIT Press, 2000.
[51] P. Somervuo. Speech modeling using variational Bayesian mixture of Gaussians. In Proc. ICSLP2002, volume 2, pages 1245-1248, 2002.
[52] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Application of variational Bayesian approach to speech recognition. In Proc. Fall Meeting of ASJ 2002, volume 1, pages 127-128, 2002. (in Japanese).
[53] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Bayesian acoustic modeling for spontaneous speech recognition. In Proc. SSPR2003, pages 47-50, 2003.
[54] F. Valente and C. Wellekens. Variational Bayesian GMM for speech recognition. In Proc. Eurospeech2003, pages 441-444, 2003.
[55] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda. Application of variational Bayesian estimation and clustering to acoustic model adaptation. In Proc. ICASSP2003, volume 1, pages 568-571, 2003.
[56] T. Kato, S. Kuroiwa, T. Shimizu, and N. Higuchi. Efficient mixture Gaussian synthesis for decision tree based state tying. In Proc. ICASSP2001, volume 1, pages 493-496, 2001.
[57] S. Watanabe and A. Nakamura. Robustness of acoustic model topology determined by variational Bayesian estimation and clustering for speech recognition for different speech data sets. In Proc. IEICE International Workshop of Beyond HMM, SP2004-90, pages 55-60, 2004.
[58] H. Jiang, K. Hirose, and Q. Huo. Robust speech recognition based on a Bayesian prediction approach. IEEE Transactions on Speech and Audio Processing, 7:426-440, 1999.
[59] T. Kawahara, H. Nanjo, T. Shinozaki, and S. Furui. Benchmark test for speech recognition using the Corpus of Spontaneous Japanese. In Proc. SSPR2003, pages 135-138, 2003.
[60] A. Nakamura. Acoustic modeling for speech recognition based on a generalized Laplacian mixture distribution. IEICE Transactions on Information and Systems, J83-D-II:2118-2127, 2000. (in Japanese).
[61] S. Basu, C. A. Micchelli, and P. Olsen. Power exponential densities for the training and classification of acoustic feature vectors in speech recognition. Journal of Computational & Graphical Statistics, 10:158-184, 2001.
[62] A. C. Surendran and C-H. Lee. Transformation-based Bayesian prediction for adaptation of HMMs. Speech Communication, 34:159-174, 2001.
[63] U. E. Makov and A. F. M. Smith. A quasi-Bayes unsupervised learning procedure for priors. IEEE Transactions on Information Theory, 23:761-764, 1977.
[64] O. Kwon, T.-W. Lee, and K. Chan. Application of variational Bayesian PCA for speech feature extraction. In Proc. ICASSP2002, volume 1, pages 825-828, 2002.
[65] F. Valente and C. Wellekens. Variational Bayesian feature selection for Gaussian mixture models. In Proc. ICASSP2004, volume 1, pages 513-516, 2004.
[66] S. Watanabe and A. Nakamura. Acoustic model adaptation based on coarse-fine training of transfer vectors and its application to speaker adaptation task. In Proc. ICSLP2004, volume 4, pages 2933-2936, 2004.
[67] K. Yu and M. J. F. Gales. Bayesian adaptation and adaptively trained systems. In Proc. Automatic Speech Recognition and Understanding Workshop (ASRU) 2005, pages 209-214, 2005.
[68] A. Ogawa, Y. Yamaguchi, and S. Takahashi. Reduction of mixture components using new Gaussian distance measure. In Proc. Fall Meeting of ASJ 2004, volume 1, pages 81-82, 2004. (in Japanese).
[69] T. Mishina and M. Yamamoto. Context adaptation using variational Bayesian learning for ngram models based on probabilistic LSA. IEICE Transactions on Information and Systems, J87-D-II:1409-1417, 2004. (in Japanese).
[70] Y.-C. Tam and T. Schultz. Dynamic language model adaptation using variational Bayes inference. In Proc. Interspeech 2005 - Eurospeech, pages 5-8, 2005.
[71] Q. Huo and C-H. Lee. On-line adaptive learning of the correlated continuous density hidden Markov models for speech recognition. IEEE Transactions on Speech and Audio Processing, 6:386-397, 1998.
[72] K. Shinoda and C-H. Lee. A structural Bayes approach to speaker adaptation. IEEE Transactions on Speech and Audio Processing, 9:276-287, 2001.
[73] M. Ostendorf, V. Digalakis, and O. A. Kimball. From HMMs to segment models. IEEE Transactions on Speech and Audio Processing, 4:360-378, 1996.

LIST OF WORK
Journal papers
[J1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Selection of shared-state hidden Markov model structure using Bayesian criterion, (in Japanese), IEICE Transactions on Information and Systems, vol. J86-D-II, no. 6, pp. 776-786, (2003) (received the best paper award from the IEICE).
[J2] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Variational Bayesian estimation and clustering for speech recognition, IEEE Transactions on Speech and Audio Processing, vol. 12, pp. 365-381, (2004).
[J3] S. Watanabe, A. Sako and A. Nakamura, Automatic determination of acoustic model topology using variational Bayesian estimation and clustering for large vocabulary continuous speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, (in press).
[J4] S. Watanabe and A. Nakamura, Speech recognition based on Student's t-distribution derived from total Bayesian framework, IEICE Transactions on Information and Systems, vol. E89-D, pp. 970-980, (2006).

Letters
[L1] S. Watanabe and A. Nakamura, Acoustic model adaptation based on coarse/fine training of transfer vectors, (in Japanese), Information Technology Letters.

International conferences
[IC1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Constructing shared-state hidden Markov models based on a Bayesian approach, In Proc. ICSLP02, vol. 4, pp. 2669-2672, (2002).
[IC2] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Application of variational Bayesian approach to speech recognition, In Proc. NIPS15, MIT Press, (2002).


[IC3] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Application of variational Bayesian estimation and clustering to acoustic model adaptation, In Proc. ICASSP03, vol. 1, pp. 568-571, (2003).
[IC4] S. Watanabe, A. Sako and A. Nakamura, Automatic determination of acoustic model topology using variational Bayesian estimation and clustering, In Proc. ICASSP04, vol. 1, pp. 813-816, (2004).
[IC5] P. Zolfaghari, S. Watanabe, A. Nakamura and S. Katagiri, Bayesian modelling of the speech spectrum using mixture of Gaussians, In Proc. ICASSP04, vol. 1, pp. 553-556, (2004).
[IC6] S. Watanabe and A. Nakamura, Acoustic model adaptation based on coarse-fine training of transfer vectors and its application to speaker adaptation task, In Proc. ICSLP04, vol. 4, pp. 2933-2936, (2004).
[IC7] S. Watanabe and A. Nakamura, Effects of Bayesian predictive classification using variational Bayesian posteriors for sparse training data in speech recognition, In Proc. Interspeech 2005 - Eurospeech, pp. 1105-1109, (2005).

International workshops
[IW1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Bayesian acoustic modeling for spontaneous speech recognition, In Proc. SSPR03, pp. 47-50, (2003).
[IW2] P. Zolfaghari, H. Kato, S. Watanabe and S. Katagiri, Speech spectral modelling using mixture of Gaussians, In Proc. Special Workshop In Maui (SWIM), (2004).
[IW3] S. Watanabe and A. Nakamura, Robustness of acoustic model topology determined by VBEC (Variational Bayesian Estimation and Clustering for speech recognition) for different speech data sets, In Proc. Workshop on Statistical Modeling Approach for Speech Recognition - Beyond HMM, pp. 55-60, (2004).

Domestic conferences (in Japanese)


[DC1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Application of variational Bayesian method to speech recognition, In Proc. Fall Meeting of ASJ 2002, 1-9-23, pp. 45-46, (2002.9) (received the Awaya prize from the ASJ).
[DC2] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Application of variational Bayesian estimation and clustering to acoustic model adaptation, In Proc. Spring Meeting of ASJ 2003, 3-3-12, pp. 127-128, (2003.3).
[DC3] S. Watanabe, T. Hori, N-gram language modeling using Bayesian approach, In Proc. Fall Meeting of ASJ 2003, 2-6-10, pp. 79-80, (2003.9).


[DC4] P. Zolfaghari, S. Watanabe and S. Katagiri, Bayesian modeling of the spectrum using Gaussian mixtures, In Proc. Fall Meeting of ASJ 2003, 2-Q-10, pp. 331-332, (2003.9).
[DC5] S. Watanabe, A. Sako, A. Nakamura, Automatic determination of acoustic model topology using variational Bayesian estimation and clustering, In Proc. Spring Meeting of ASJ 2004, 1-8-6, pp. 11-12, (2004.3).
[DC6] S. Watanabe, T. Hori, E. McDermott, Y. Minami, A. Nakamura, An evaluation of the speech recognition system SOLON using the corpus of spontaneous Japanese, In Proc. Spring Meeting of ASJ 2004, 2-8-7, pp. 73-74, (2004.3).
[DC7] S. Watanabe, A. Nakamura, A supervised acoustic model adaptation based on coarse/fine training of transfer vectors, In Proc. Fall Meeting of ASJ 2004, 2-4-11, pp. 107-108, (2004.9).
[DC8] S. Watanabe, T. Hori, A perplexity for spoken language processing using joint probabilities of HMM states and words, In Proc. Spring Meeting of ASJ 2005, 1-5-23, pp. 45-46, (2005.3).

Domestic workshops (in Japanese)


[DW1] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Training of shared states in hidden Markov model based on Bayesian approach, In Technical Report of IEICE, SP2002-14, pp. 43-48, (2002).
[DW2] S. Watanabe, [Invited talk] VBEC: robust speech recognition based on a Bayesian approach, In Proc. 5th Young Researcher Meeting of ASJ Kansai section, I-2, (2003).
[DW3] T. Hori, S. Watanabe, E. McDermott, Y. Minami, A. Nakamura, Evaluation of the speech recognizer SOLON using the corpus of spontaneous Japanese, In Proc. Workshop of Spontaneous Speech Science and Engineering, pp. 85-92, (2004).
[DW4] S. Watanabe, [Tutorial talk] Speech recognition based on a Bayesian approach, In Technical Report of IEICE, SP2004-74, pp. 13-20, (2004).
[DW5] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, [Talk celebrating the best paper award] Selection of shared-state hidden Markov model structure using Bayesian criterion, In Technical Report of IEICE, SP2004-149, pp. 25-30, (2005).
[DW6] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Speech recognition using variational Bayes, In Proc. 8th Workshop of Information-Based Induction Sciences (IBIS2005), pp. 269-274, (2005).


Others
[O1] The Awaya prize from the ASJ in 2003.
[O2] The best paper award from the IEICE in 2004.
[O3] S. Watanabe, Y. Minami, A. Nakamura, and N. Ueda, Selection of shared-state hidden Markov model structure using Bayesian criterion, (English translation paper of IEICE Transactions on Information and Systems, vol. J86-D-II, no. 6, pp. 776-786 [J1]) IEICE Transactions on Information and Systems, vol. E88-D, no. 1, pp. 1-9, (2005).

APPENDICES
A.1 Upper bound of Kullback-Leibler divergence for posterior distributions

This section derives the upper bound of the KL divergence between a true posterior distribution and an arbitrary distribution. Jensen's inequality is important as regards deriving the upper bound. For a continuous function, if f is a concave function and p is a distribution function (\int p(x) dx = 1), then

f \left( \int g(x) p(x) dx \right) \geq \int f(g(x)) p(x) dx.   (A.1)

Here we set f(x) = \log x and g(x) = h(x)/p(x). Then,

\log \int h(x) dx = \log \int \frac{h(x)}{p(x)} p(x) dx \geq \int p(x) \log \frac{h(x)}{p(x)} dx.   (A.2)

Similarly, for a discrete function, if f is a concave function and p is a distribution function (\sum_l p_l = 1), then

\log \sum_l h_l = \log \sum_l p_l \frac{h_l}{p_l} \geq \sum_l p_l \log \frac{h_l}{p_l}.   (A.3)

These inequalities are used to derive the upper bounds of the KL divergences. In this section we simplify the arbitrary posterior distributions q(\Theta|O, m), q(Z|O, m), p(\Theta|O, m), and p(Z|O, m) as q(\Theta), q(Z), p(\Theta) and p(Z) to avoid complicated equation forms.
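As a quick numerical illustration of the discrete bound in Eq. (A.3), the following sketch (not part of the original derivation; it only uses generic NumPy routines with arbitrary values) draws random positive values h_l and a random distribution p_l and confirms that \log \sum_l h_l never falls below \sum_l p_l \log(h_l/p_l), with equality when p_l is proportional to h_l.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.random(10) + 0.1          # arbitrary positive values h_l
p = rng.random(10); p /= p.sum()  # arbitrary distribution p_l (sums to 1)

lhs = np.log(h.sum())             # log sum_l h_l
rhs = np.sum(p * np.log(h / p))   # sum_l p_l log(h_l / p_l)
assert lhs >= rhs                 # Jensen's inequality, Eq. (A.3)

p_opt = h / h.sum()               # equality holds when p_l is proportional to h_l
assert np.isclose(lhs, np.sum(p_opt * np.log(h / p_opt)))
print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}")
```

The gap between the two sides is exactly the KL divergence between p_l and h_l / \sum_l h_l, which is why maximizing the lower bound F^m below is equivalent to minimizing the KL divergences in Eqs. (A.7), (A.11) and (A.15).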

A.1.1 Model parameter

We first consider the KL divergence between an arbitrary posterior distribution for model parameters q(\Theta^{(c)}) and the true posterior distribution for model parameters p(\Theta^{(c)}), where \Theta^{(\bar{c})} denotes the model parameters of the categories other than c:

KL[q(\Theta^{(c)}) \| p(\Theta^{(c)})] \equiv \int q(\Theta^{(c)}) \log \frac{q(\Theta^{(c)})}{p(\Theta^{(c)})} d\Theta^{(c)}.   (A.4)

Substituting Eq. (2.4) into Eq. (A.4), the KL divergence is rewritten as follows:

KL[q(\Theta^{(c)}) \| p(\Theta^{(c)})]
= \int q(\Theta^{(c)}) \log \frac{q(\Theta^{(c)})}{\frac{1}{p(O|m)} \int \sum_Z p(O, Z|\Theta, m) p(\Theta|m) d\Theta^{(\bar{c})}} d\Theta^{(c)}
= \log p(O|m) - \int q(\Theta^{(c)}) \log \frac{\int \sum_Z p(O, Z|\Theta, m) p(\Theta|m) d\Theta^{(\bar{c})}}{q(\Theta^{(c)})} d\Theta^{(c)}.   (A.5)


Then applying the continuous Jensen's inequality Eq. (A.2) to Eq. (A.5), the following inequality is obtained:

KL[q(\Theta^{(c)}) \| p(\Theta^{(c)})]
\leq \log p(O|m) - \int \sum_Z q(\Theta^{(c)}) q(\Theta^{(\bar{c})}) q(Z) \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta^{(c)}) q(\Theta^{(\bar{c})}) q(Z)} d\Theta^{(c)} d\Theta^{(\bar{c})}
= \log p(O|m) - \int \sum_Z q(\Theta) q(Z) \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta) q(Z)} d\Theta
= \log p(O|m) - \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta) q(Z)} \right\rangle_{q(\Theta) q(Z)}.   (A.6)

In the second equality, we use the definition d\Theta^{(c)} d\Theta^{(\bar{c})} \equiv d\Theta and the relation q(\Theta^{(c)}) q(\Theta^{(\bar{c})}) = q(\Theta), which is derived from Eq. (2.7). Using the F^m definition in Eq. (2.10), the upper bound of the KL divergence is represented based on F^m as follows:

KL[q(\Theta^{(c)}) \| p(\Theta^{(c)})] \leq \log p(O|m) - F^m[q(\Theta), q(Z)].   (A.7)

This inequality corresponds to Eq. (2.9) if the omitted notations are recovered.

A.1.2 Latent variable


Similar to Section A.1.1, we consider the KL divergence between an arbitrary posterior distribution for latent variables q(Z^{(c)}) and the true posterior distribution for latent variables p(Z^{(c)}):

KL[q(Z^{(c)}) \| p(Z^{(c)})] \equiv \sum_{Z^{(c)}} q(Z^{(c)}) \log \frac{q(Z^{(c)})}{p(Z^{(c)})}.   (A.8)

Substituting Eq. (2.5) into Eq. (A.8), the KL divergence is rewritten as follows:

KL[q(Z^{(c)}) \| p(Z^{(c)})] = \log p(O|m) - \sum_{Z^{(c)}} q(Z^{(c)}) \log \frac{\sum_{Z^{(\bar{c})}} \int p(O, Z|\Theta, m) p(\Theta|m) d\Theta}{q(Z^{(c)})}.   (A.9)

Then by applying Jensen's inequality Eq. (A.3) to Eq. (A.9), the following inequality is obtained:

KL[q(Z^{(c)}) \| p(Z^{(c)})] \leq \log p(O|m) - \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta) q(Z)} \right\rangle_{q(\Theta) q(Z)}.   (A.10)

To derive Eq. (A.10), we use the definition \sum_{Z^{(c)}} \sum_{Z^{(\bar{c})}} \equiv \sum_Z and the relation q(Z^{(c)}) q(Z^{(\bar{c})}) = q(Z), which is derived from Eq. (2.7). Using the F^m definition in Eq. (2.10), the upper bound of the KL divergence is represented based on F^m as follows:

KL[q(Z^{(c)}) \| p(Z^{(c)})] \leq \log p(O|m) - F^m[q(\Theta), q(Z)].   (A.11)

This inequality corresponds to Eq. (2.14) if the omitted notations are recovered. From Eqs. (A.7) and (A.11), we find that the KL divergences of the model parameters and latent variables have the same upper bound \log p(O|m) - F^m[q(\Theta), q(Z)]. This guarantees that the arbitrary posterior distributions for model parameters and latent variables (q(\Theta^{(c)}) and q(Z^{(c)})) have the same objective functional F^m[q(\Theta), q(Z)].


A.1.3 Model structure


Similar to Sections A.1.1 and A.1.2, we consider the KL divergence between an arbitrary posterior distribution for model structure q(m|O) and the true posterior distribution for model structure p(m|O):

KL[q(m|O) \| p(m|O)] \equiv \sum_m q(m|O) \log \frac{q(m|O)}{p(m|O)}.   (A.12)

Substituting Eq. (2.6) into Eq. (A.12), the KL divergence is rewritten as follows:

KL[q(m|O) \| p(m|O)]
= \sum_m q(m|O) \log \frac{q(m|O)}{\frac{p(m)}{p(O)} \int \sum_Z p(O, Z|\Theta, m) p(\Theta|m) d\Theta}
= \log p(O) + \sum_m q(m|O) \log \frac{q(m|O)}{p(m)} - \sum_m q(m|O) \log \int \sum_Z p(O, Z|\Theta, m) p(\Theta|m) d\Theta.   (A.13)

Then by applying Jensen's inequality Eq. (A.2) to Eq. (A.13), the following inequality is obtained:

KL[q(m|O) \| p(m|O)] \leq \log p(O) + \sum_m q(m|O) \log \frac{q(m|O)}{p(m)} - \sum_m q(m|O) \int \sum_Z q(\Theta) q(Z) \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta) q(Z)} d\Theta.   (A.14)

Using the F^m definition in Eq. (2.10), the upper bound of the KL divergence is represented based on F^m as follows:

KL[q(m|O) \| p(m|O)] \leq \log p(O) + \left\langle \log \frac{q(m|O)}{p(m)} - F^m[q(\Theta), q(Z)] \right\rangle_{q(m|O)}.   (A.15)

This inequality corresponds to Eq. (2.18) if the omitted notations are recovered.

A.2 Variational calculation for VB posterior distributions

Functional differentiation is a technique for obtaining an extremal function based on a variational calculation, and is defined as follows:

Continuous function case:
\frac{\delta H[g(x)]}{\delta g(y)} = \lim_{\varepsilon \to 0} \frac{H[g(x) + \varepsilon \delta(x - y)] - H[g(x)]}{\varepsilon}   (A.16)

Discrete function case:
\frac{\partial H[g_n]}{\partial g_l} = \lim_{\varepsilon \to 0} \frac{H[g_n + \varepsilon \delta_{nl}] - H[g_n]}{\varepsilon}   (A.17)

In this section we simplify the arbitrary posterior distributions q(\Theta|O, m) and q(Z|O, m) as q(\Theta) and q(Z), and the objective function F^m[q] as F^m, to avoid complicated equation forms, and omit the category index c.


A.2.1 Model parameter


If we consider the constraint \int q(\Theta) d\Theta = 1, the functional differentiation in Eq. (2.12) is represented by substituting F^m and q(\Theta) into H and g(y) in Eq. (A.16), respectively, as follows:

\frac{\delta}{\delta q(\Theta')} \left[ F^m[q(\Theta), q(Z)] + K \left( \int q(\Theta) d\Theta - 1 \right) \right]
= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \Bigg[ \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{(q(\Theta) + \varepsilon \delta(\Theta - \Theta')) q(Z)} \right\rangle_{q(\Theta) + \varepsilon \delta(\Theta - \Theta'), q(Z)} + K \left( \int (q(\Theta) + \varepsilon \delta(\Theta - \Theta')) d\Theta - 1 \right)
\quad - F^m - K \left( \int q(\Theta) d\Theta - 1 \right) \Bigg],   (A.18)

where K is a Lagrange's undetermined multiplier. We focus on the first term in the brackets in the second line of Eq. (A.18). By expanding the expectation, the first term is represented as follows:

\left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{(q(\Theta) + \varepsilon \delta(\Theta - \Theta')) q(Z)} \right\rangle_{q(\Theta) + \varepsilon \delta(\Theta - \Theta'), q(Z)}
= \int (q(\Theta) + \varepsilon \delta(\Theta - \Theta')) \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{(q(\Theta) + \varepsilon \delta(\Theta - \Theta')) q(Z)} \right\rangle_{q(Z)} d\Theta
= \int (q(\Theta) + \varepsilon \delta(\Theta - \Theta')) \left[ \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta) q(Z)} \right\rangle_{q(Z)} - \log \left( 1 + \frac{\varepsilon \delta(\Theta - \Theta')}{q(\Theta)} \right) \right] d\Theta.   (A.19)

By expanding the logarithmic term in Eq. (A.19) with respect to \varepsilon, Eq. (A.19) is represented as:

Equation (A.19)
= \int (q(\Theta) + \varepsilon \delta(\Theta - \Theta')) \left[ \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta) q(Z)} \right\rangle_{q(Z)} - \frac{\varepsilon \delta(\Theta - \Theta')}{q(\Theta)} \right] d\Theta + o(\varepsilon^2)
= F^m + \varepsilon \left[ -1 + \langle \log p(O, Z|\Theta', m) \rangle_{q(Z)} - \langle \log q(Z) \rangle_{q(Z)} + \log p(\Theta'|m) - \log q(\Theta') \right] + o(\varepsilon^2),   (A.20)

where o(\varepsilon^2) denotes a set of terms of more than the 2nd power of \varepsilon. Therefore, by substituting Eq. (A.20) into Eq. (A.18), Eq. (A.18) is represented as:

Equation (A.18)
= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \left[ \varepsilon \left( -1 + \langle \log p(O, Z|\Theta', m) \rangle_{q(Z)} - \langle \log q(Z) \rangle_{q(Z)} + \log p(\Theta'|m) - \log q(\Theta') + K \right) + o(\varepsilon^2) \right]
= -1 + \langle \log p(O, Z|\Theta', m) \rangle_{q(Z)} - \langle \log q(Z) \rangle_{q(Z)} + \log p(\Theta'|m) - \log q(\Theta') + K.   (A.21)


Therefore, the optimal posterior (VB posterior) q(\Theta) satisfies the relation whereby Eq. (A.21) = 0 for every \Theta' from Eq. (2.12), and (relabeling \Theta' as \Theta) is obtained as:

\log q(\Theta) = -1 + \langle \log p(O, Z|\Theta, m) \rangle_{q(Z)} - \langle \log q(Z) \rangle_{q(Z)} + \log p(\Theta|m) + K.   (A.22)

By disregarding the normalization constant, the optimal VB posterior is finally derived as:

q(\Theta) \propto p(\Theta|m) \exp \left( \langle \log p(O, Z|\Theta, m) \rangle_{q(Z)} \right),   (A.23)

which corresponds to Eq. (2.13) if the omitted notations are recovered.
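The following sketch (not taken from the thesis) makes Eq. (A.23) concrete for the simplest conjugate case: a single multinomial parameter \theta with a Dirichlet prior, for which \langle \log p(O, Z|\theta) \rangle_{q(Z)} reduces to expected counts times \log \theta. The hyperparameter values and the expected counts below are arbitrary illustrative numbers.

```python
import numpy as np
from scipy.special import gammaln

# Toy instance of Eq. (A.23): multinomial parameter theta with a Dirichlet prior.
phi0 = np.array([1.0, 2.0, 1.5])          # prior hyperparameters (illustrative values)
exp_counts = np.array([4.2, 0.8, 7.0])    # expected counts accumulated under q(Z)

# <log p(O,Z|theta)>_q(Z) = sum_l exp_counts[l] * log(theta[l]) up to theta-free terms,
# so q(theta) \propto p(theta) exp(<log p(O,Z|theta)>) is Dirichlet(phi0 + exp_counts).
phi_post = phi0 + exp_counts

def log_dirichlet_pdf(theta, phi):
    return (gammaln(phi.sum()) - gammaln(phi).sum()
            + np.sum((phi - 1.0) * np.log(theta)))

def log_unnormalized(theta):
    # log of the right-hand side of Eq. (A.23), without its normalization constant
    return log_dirichlet_pdf(theta, phi0) + np.sum(exp_counts * np.log(theta))

theta_a = np.array([0.3, 0.1, 0.6])
theta_b = np.array([0.2, 0.5, 0.3])
# Proportionality check: the log-ratio between two points must be identical.
assert np.isclose(log_dirichlet_pdf(theta_a, phi_post) - log_dirichlet_pdf(theta_b, phi_post),
                  log_unnormalized(theta_a) - log_unnormalized(theta_b))
```

This is exactly the mechanism behind the VB Baum-Welch updates in Section A.3.1: exponentiating the expected complete-data log-likelihood turns prior hyperparameters into posterior hyperparameters by adding expected sufficient statistics.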

A.2.2 Latent variable


Similar to Section A.2.1, if we consider the constraint \sum_Z q(Z) = 1, the functional differentiation is represented by substituting F^m and q(Z) into H and g_n in Eq. (A.17), respectively, as follows:

\frac{\partial}{\partial q(Z')} \left[ F^m[q(\Theta), q(Z)] + K \left( \sum_Z q(Z) - 1 \right) \right]
= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \Bigg[ \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{(q(Z) + \varepsilon \delta_{Z Z'}) q(\Theta)} \right\rangle_{q(Z) + \varepsilon \delta_{Z Z'}, q(\Theta)} + K \left( \sum_Z (q(Z) + \varepsilon \delta_{Z Z'}) - 1 \right)
\quad - F^m - K \left( \sum_Z q(Z) - 1 \right) \Bigg],   (A.24)

where K is a Lagrange's undetermined multiplier. We focus on the first term in the brackets in the second line of Eq. (A.24). By expanding the expectation, the first term is represented as follows:

\left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{(q(Z) + \varepsilon \delta_{Z Z'}) q(\Theta)} \right\rangle_{q(Z) + \varepsilon \delta_{Z Z'}, q(\Theta)}
= \sum_Z (q(Z) + \varepsilon \delta_{Z Z'}) \left[ \left\langle \log \frac{p(O, Z|\Theta, m) p(\Theta|m)}{q(\Theta) q(Z)} \right\rangle_{q(\Theta)} - \log \left( 1 + \frac{\varepsilon \delta_{Z Z'}}{q(Z)} \right) \right].   (A.25)

By expanding the logarithmic term in Eq. (A.25) with respect to \varepsilon, Eq. (A.25) is represented as:

Equation (A.25) = F^m + \varepsilon \left[ -1 + \langle \log p(O, Z'|\Theta, m) \rangle_{q(\Theta)} + \left\langle \log \frac{p(\Theta|m)}{q(\Theta)} \right\rangle_{q(\Theta)} - \log q(Z') \right] + o(\varepsilon^2).   (A.26)

Therefore, by substituting Eq. (A.26) into Eq. (A.24), Eq. (A.24) is represented as:

Equation (A.24) = -1 + \langle \log p(O, Z'|\Theta, m) \rangle_{q(\Theta)} + \left\langle \log \frac{p(\Theta|m)}{q(\Theta)} \right\rangle_{q(\Theta)} - \log q(Z') + K.   (A.27)


Therefore, the optimal posterior (VB posterior) q(Z) satisfies the relation whereby Eq. (A.27) = 0 for every Z' from Eq. (2.12), and (relabeling Z' as Z) is obtained as:

\log q(Z) = -1 + \langle \log p(O, Z|\Theta, m) \rangle_{q(\Theta)} + \left\langle \log \frac{p(\Theta|m)}{q(\Theta)} \right\rangle_{q(\Theta)} + K.   (A.28)

By disregarding the normalization constant, the optimal VB posterior is finally derived as:

q(Z) \propto \exp \left( \langle \log p(O, Z|\Theta, m) \rangle_{q(\Theta)} \right),   (A.29)

which corresponds to Eq. (2.15) if the omitted notations are recovered.

A.2.3 Model structure


Similar to Sections A.2.1 and A.2.2, if we consider the constraint \sum_m q(m|O) = 1, the functional differentiation is represented by substituting F^m and q(m|O) into H and g_n in Eq. (A.17), respectively, as follows:

\frac{\partial}{\partial q(m'|O)} \left[ \left\langle F^m - \log \frac{q(m|O)}{p(m)} \right\rangle_{q(m|O)} + K \left( \sum_m q(m|O) - 1 \right) \right]
= \lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \Bigg[ \left\langle F^m - \log \frac{q(m|O) + \varepsilon \delta_{m m'}}{p(m)} \right\rangle_{q(m|O) + \varepsilon \delta_{m m'}} + K \left( \sum_m (q(m|O) + \varepsilon \delta_{m m'}) - 1 \right)
\quad - \left\langle F^m - \log \frac{q(m|O)}{p(m)} \right\rangle_{q(m|O)} - K \left( \sum_m q(m|O) - 1 \right) \Bigg],   (A.30)

where K is a Lagrange's undetermined multiplier. We focus on the first term in the brackets in the second line of Eq. (A.30). By expanding the expectation, the first term is represented as follows:

\left\langle F^m - \log \frac{q(m|O) + \varepsilon \delta_{m m'}}{p(m)} \right\rangle_{q(m|O) + \varepsilon \delta_{m m'}}
= \sum_m (q(m|O) + \varepsilon \delta_{m m'}) \left[ F^m - \log \frac{q(m|O)}{p(m)} - \log \left( 1 + \frac{\varepsilon \delta_{m m'}}{q(m|O)} \right) \right].   (A.31)

By expanding the logarithmic term in Eq. (A.31) with respect to \varepsilon, Eq. (A.31) is represented as:

Equation (A.31) = \left\langle F^m - \log \frac{q(m|O)}{p(m)} \right\rangle_{q(m|O)} - \varepsilon \left[ \log \frac{q(m'|O)}{p(m')} - F^{m'} + 1 \right] + o(\varepsilon^2).   (A.32)

Therefore, by substituting Eq. (A.32) into Eq. (A.30), Eq. (A.30) is represented as:

Equation (A.30) = - \log \frac{q(m'|O)}{p(m')} + F^{m'} - 1 + K.   (A.33)


Therefore, the optimal posterior (VB posterior) q(m|O) satisfies the relation whereby Eq. (A.33) = 0 for every m', and (relabeling m' as m) is obtained as:

\log \frac{q(m|O)}{p(m)} - F^m + 1 - K = 0.   (A.34)

By disregarding the normalization constant, the optimal VB posterior is finally derived as:

q(m|O) \propto p(m) \exp(F^m),   (A.35)

which corresponds to Eq. (2.19) if the omitted notations are recovered.
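Eq. (A.35) is what makes Bayesian model selection in VBEC a simple post-processing step: given the optimized objective values F^m for a set of candidate model structures, the structure posterior is a softmax of F^m weighted by the structure prior. The sketch below is illustrative only; the free-energy values and the uniform prior are made up.

```python
import numpy as np

# Optimized lower bounds F^m for, e.g., three candidate HMM/GMM topologies (made-up values).
free_energy = np.array([-12345.6, -12310.2, -12330.9])
log_prior = np.log(np.full(3, 1.0 / 3.0))   # uniform structure prior p(m)

# q(m|O) \propto p(m) exp(F^m), Eq. (A.35); subtract the max for numerical stability.
log_post = log_prior + free_energy
log_post -= np.max(log_post)
q_m = np.exp(log_post) / np.exp(log_post).sum()

best = int(np.argmax(q_m))
print(f"structure posterior q(m|O) = {q_m}, selected model index = {best}")
```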

A.3 VB posterior calculation

A.3.1 Model parameter

From the output distribution and prior distribution in Section 2.3.1, we can obtain the optimal VB posterior distribution for model parameters q(\Theta|O, m). We first derive the concrete form of the expectation of the logarithmic output distribution \log p(O, S, V|\Theta, m) with respect to the VB posterior for latent variables, \langle \log p(O, S, V|\Theta, m) \rangle_{q(S, V|O, m)}, as follows:

\langle \log p(O, S, V|\Theta, m) \rangle_{q(S, V|O, m)} = \sum_{i,j} \bar{\xi}_{ij} \log a_{ij} + \sum_{j,k} \bar{\gamma}_{jk} \log w_{jk} + \sum_{j,k} \sum_{e,t} \gamma^t_{e,jk} \log b_{jk}(O^t_e).   (A.36)

This is obtained by substituting Eq. (2.22) into Eq. (A.36) and regrouping the terms in the summations according to state transitions and observed symbols. By using Eq. (A.36), q(\Theta|O, m) is represented as follows:

q(\Theta|O, m) \propto p(\Theta|m) \exp \left( \langle \log p(O, S, V|\Theta, m) \rangle_{q(S, V|O, m)} \right)
= p(\Theta|m) \exp \left( \sum_{i,j} \bar{\xi}_{ij} \log a_{ij} + \sum_{j,k} \bar{\gamma}_{jk} \log w_{jk} + \sum_{j,k} \sum_{e,t} \gamma^t_{e,jk} \log b_{jk}(O^t_e) \right),   (A.37)

where

\xi^t_{e,ij} \equiv q(s^{t-1}_e = i, s^t_e = j | O, m), \quad \bar{\xi}_{ij} \equiv \sum_{e=1}^{E} \sum_{t=1}^{T_e} \xi^t_{e,ij},
\gamma^t_{e,jk} \equiv q(s^t_e = j, v^t_e = k | O, m), \quad \bar{\gamma}_{jk} \equiv \sum_{e=1}^{E} \sum_{t=1}^{T_e} \gamma^t_{e,jk}.   (A.38)

Here, \xi^t_{e,ij} is a VB transition posterior distribution, which denotes the transition probability from state i to state j at a frame t of an example e, and \gamma^t_{e,jk} is a VB occupation posterior distribution, which denotes the occupation probability of mixture component k in state j at a frame t of an example e, with the VB approach.


From Eqs. (2.24) and (A.37), the optimal VB posterior distribution is decomposed into factors, each of which depends on a_{ij}, w_{jk} or b_{jk}(O^t_e), as follows:

q(\{a_{ij}\}_{j=1}^{J} | O, m) \propto p(\{a_{ij}\}_{j=1}^{J} | m) \prod_j (a_{ij})^{\bar{\xi}_{ij}},
q(\{w_{jk}\}_{k=1}^{L} | O, m) \propto p(\{w_{jk}\}_{k=1}^{L} | m) \prod_k (w_{jk})^{\bar{\gamma}_{jk}},
q(b_{jk} | O, m) \propto p(b_{jk} | m) \prod_{e,t} (b_{jk}(O^t_e))^{\gamma^t_{e,jk}}.   (A.39)

Therefore we can derive the concrete form of each factor.

State transition probability a

By focusing on a term that depends on a probabilistic variable a_{ij}, the concrete form of q(\{a_{ij}\}_{j=1}^{J} | O, m) can be calculated from Eqs. (2.25) and (A.39) as follows:

q(\{a_{ij}\}_{j=1}^{J} | O, m) \propto \prod_j (a_{ij})^{\phi_{ij} - 1},   (A.40)

where \phi_{ij} \equiv \phi^0_{ij} + \bar{\xi}_{ij}. Therefore, by considering the normalization constant, q(\{a_{ij}\}_{j=1}^{J} | O, m) is obtained as follows:

q(\{a_{ij}\}_{j=1}^{J} | O, m) = C_D(\{\phi_{ij}\}_{j=1}^{J}) \prod_j (a_{ij})^{\phi_{ij} - 1}
= \mathcal{D}(\{a_{ij}\}_{j=1}^{J} | \{\phi_{ij}\}_{j=1}^{J}),   (A.41)

where

C_D(\{\phi_{ij}\}_{j=1}^{J}) \equiv \frac{\Gamma(\sum_{j=1}^{J} \phi_{ij})}{\prod_{j=1}^{J} \Gamma(\phi_{ij})}.   (A.42)

Weight factor w

Similarly, the concrete form of q(\{w_{jk}\}_{k=1}^{L} | O, m) is obtained from Eqs. (2.25) and (A.39) as follows:

q(\{w_{jk}\}_{k=1}^{L} | O, m) = C_D(\{\omega_{jk}\}_{k=1}^{L}) \prod_k (w_{jk})^{\omega_{jk} - 1}
= \mathcal{D}(\{w_{jk}\}_{k=1}^{L} | \{\omega_{jk}\}_{k=1}^{L}),   (A.43)

where \omega_{jk} \equiv \omega^0_{jk} + \bar{\gamma}_{jk} and

C_D(\{\omega_{jk}\}_{k=1}^{L}) \equiv \frac{\Gamma(\sum_{k=1}^{L} \omega_{jk})}{\prod_{k=1}^{L} \Gamma(\omega_{jk})}.   (A.44)

Gaussian parameters \mu and \Sigma

Finally, the concrete form of q(b_{jk}|O, m) can be derived from Eqs. (2.25) and (A.39). Since the calculation is more complicated than the two previous calculations, the indexes j and k are removed to simplify the derivation:

q(b|O, m) \propto \prod_d (\Sigma_d^{-1})^{\frac{1}{2}} \exp \left( -\frac{\xi}{2} \Sigma_d^{-1} (\mu_d - \nu_d)^2 \right) (\Sigma_d^{-1})^{\frac{\eta}{2} - 1} \exp \left( -\frac{R_d}{2 \Sigma_d} \right),   (A.45)

where

\xi \equiv \xi^0 + \bar{\gamma}, \quad \nu \equiv \frac{\xi^0 \nu^0 + \sum_{e,t} \gamma^t_e O^t_e}{\xi}, \quad \eta \equiv \eta^0 + \bar{\gamma}, \quad R_d \equiv R^0_d + \xi^0 (\nu^0_d - \nu_d)^2 + \sum_{e,t} \gamma^t_e (O^t_{e,d} - \nu_d)^2.

Consequently, by considering normalization constants, the concrete form of q(b|O, m) is obtained as follows:

q(b|O, m) = C_N(\xi) \prod_d (\Sigma_d^{-1})^{\frac{1}{2}} \exp \left( -\frac{\xi}{2} \Sigma_d^{-1} (\mu_d - \nu_d)^2 \right) C_G(\eta, R_d) (\Sigma_d^{-1})^{\frac{\eta}{2} - 1} \exp \left( -\frac{R_d}{2 \Sigma_d} \right)
= \mathcal{N}(\mu | \nu, \xi^{-1} \Sigma) \prod_d \mathcal{G}(\Sigma_d^{-1} | \eta, R_d),   (A.46)

where

C_N(\xi) \equiv \left( \frac{\xi}{2 \pi} \right)^{\frac{D}{2}}, \quad C_G(\eta, R_d) \equiv \frac{(R_d / 2)^{\frac{\eta}{2}}}{\Gamma(\eta / 2)}.   (A.47)

Thus, the VB posterior distributions for the model parameters are analytically obtained as Eqs. (2.27), (2.28) and (2.30) by summarizing the calculation results.
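The posterior hyperparameter updates in Eqs. (A.41), (A.43) and (A.45) are prior hyperparameters plus accumulated VB occupation statistics. The sketch below is not from the thesis; the array shapes, variable names and diagonal-covariance layout are illustrative assumptions. It shows the update for one HMM state with an L-component GMM over D-dimensional features.

```python
import numpy as np

def vb_posterior_update(gamma, O, omega0, xi0, nu0, eta0, R0):
    """One-state VB update (Eqs. (A.43) and (A.45), diagonal covariance).

    gamma : (T, L) VB occupation posteriors gamma^t_k for T frames, L components
    O     : (T, D) observation vectors
    """
    gamma_bar = gamma.sum(axis=0)                    # (L,) accumulated occupancies
    omega = omega0 + gamma_bar                       # Dirichlet weight update, Eq. (A.43)
    xi = xi0 + gamma_bar                             # precision-scaling counts
    eta = eta0 + gamma_bar                           # degrees of freedom
    first = gamma.T @ O                              # (L, D) sum_t gamma^t_k O^t
    nu = (xi0[:, None] * nu0 + first) / xi[:, None]  # posterior means, Eq. (A.45)
    second = gamma.T @ (O ** 2)                      # (L, D) sum_t gamma^t_k (O^t)^2
    # R = R0 + xi0 (nu0 - nu)^2 + sum_t gamma^t (O^t - nu)^2, expanded per dimension
    R = (R0 + xi0[:, None] * (nu0 - nu) ** 2
         + second - 2.0 * nu * first + gamma_bar[:, None] * nu ** 2)
    return omega, xi, nu, eta, R

# Tiny example with made-up statistics: T=5 frames, L=2 components, D=3 dimensions.
rng = np.random.default_rng(1)
gamma = rng.dirichlet(np.ones(2), size=5)
O = rng.normal(size=(5, 3))
omega, xi, nu, eta, R = vb_posterior_update(
    gamma, O, omega0=np.full(2, 1.0), xi0=np.full(2, 1.0),
    nu0=np.zeros((2, 3)), eta0=np.full(2, 1.0), R0=np.ones((2, 3)))
```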

A.3.2 Latent variable

From the output distribution and prior distribution in Section 2.3.1, the optimal VB posterior distribution for latent variables q(S, V|O, m) is represented by substituting Eqs. (2.22) and (2.27) into Eq. (2.16) as follows:

q(S, V|O, m) \propto \prod_{e,t} \exp \left( \langle \log a_{s^{t-1}_e s^t_e} \rangle_{q(\{a_{ij}\}_{j=1}^{J}|O,m)} \right) \exp \left( \langle \log w_{s^t_e v^t_e} \rangle_{q(\{w_{jk}\}_{k=1}^{L}|O,m)} \right) \exp \left( \langle \log b_{s^t_e v^t_e}(O^t_e) \rangle_{q(b_{jk}|O,m)} \right).   (A.48)

We calculate each factor in this equation, changing the suffix s^t_e to i or j, and the suffix v^t_e to k, to simplify the derivation.

State transition probability a

First, the integral over a_{ij} is solved from Eq. (A.41) by differentiating the normalization constant with respect to its hyperparameter, and \tilde{a}_{ij}, which denotes the state transition probability from state i to state j in the VB approach, is defined as follows:

\langle \log a_{ij} \rangle_{q(\{a_{ij}\}_{j=1}^{J}|O,m)} = C_D(\{\phi_{ij}\}_{j=1}^{J}) \int \log a_{ij} \prod_j (a_{ij})^{\phi_{ij} - 1} da_{ij}
= C_D(\{\phi_{ij}\}_{j=1}^{J}) \frac{\partial}{\partial \phi_{ij}} \frac{1}{C_D(\{\phi_{ij}\}_{j=1}^{J})}
= \Psi(\phi_{ij}) - \Psi \left( \sum_{j'=1}^{J} \phi_{ij'} \right) \equiv \log \tilde{a}_{ij},   (A.49)

where \Psi(y) is the digamma function defined as \Psi(y) \equiv \partial / \partial y \, \log \Gamma(y).

Weight factor w

In a way similar to that used for a_{ij}, the integral over w_{jk} is solved from Eq. (A.43), and \tilde{w}_{jk}, which denotes the k-th weight factor of the Gaussian mixture for state j in the VB approach, is defined as follows:

\langle \log w_{jk} \rangle_{q(\{w_{jk}\}_{k=1}^{L}|O,m)} = \Psi(\omega_{jk}) - \Psi \left( \sum_{k'=1}^{L} \omega_{jk'} \right) \equiv \log \tilde{w}_{jk}.   (A.50)

Gaussian parameters \mu and \Sigma

Finally, the integrals over b_{jk} (= \{\mu_{jk}, \Sigma^{-1}_{jk}\}) are solved from Eqs. (2.23) and (A.46), and \tilde{b}_{jk}(O^t_e) is defined. Since the calculation is more complicated than the two previous calculations, the indexes j and k are removed to simplify the derivation:

\langle \log b(O^t_e) \rangle_{q(b|O,m)}
= \sum_d \int \int \mathcal{N}(\mu_d | \nu_d, \xi^{-1} \Sigma_d) \mathcal{G}(\Sigma_d^{-1} | \eta, R_d) \left( -\frac{1}{2} \right) \left( \log 2\pi - \log(\Sigma_d^{-1}) + \Sigma_d^{-1} (O^t_{e,d} - \mu_d)^2 \right) d\mu_d \, d(\Sigma_d^{-1})
= -\frac{1}{2} \left( D \log 2\pi + \frac{D}{\xi} \right) + \frac{1}{2} \sum_d \int \mathcal{G}(\Sigma_d^{-1} | \eta, R_d) \left( \log(\Sigma_d^{-1}) - \Sigma_d^{-1} (O^t_{e,d} - \nu_d)^2 \right) d(\Sigma_d^{-1})
= -\frac{1}{2} \left( D \log 2\pi + \frac{D}{\xi} \right) + \frac{1}{2} \sum_d \left( \Psi \left( \frac{\eta}{2} \right) - \log \frac{R_d}{2} - \frac{\eta}{R_d} (O^t_{e,d} - \nu_d)^2 \right)
\equiv \log \tilde{b}(O^t_e).   (A.51)
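Eqs. (A.49)-(A.51) are what the VB forward-backward (VB Baum-Welch) E-step actually evaluates: geometric-mean parameters obtained from digamma functions instead of point estimates. A small sketch follows (illustrative shapes and values only, using SciPy's digamma); note that \log \tilde{a}_{ij} never exceeds the log of the Dirichlet mean \phi_{ij} / \sum_{j'} \phi_{ij'}.

```python
import numpy as np
from scipy.special import digamma

phi = np.array([3.0, 1.5, 8.0])                   # Dirichlet hyperparameters of one state row
log_a_tilde = digamma(phi) - digamma(phi.sum())   # Eq. (A.49)

def log_b_tilde(o, xi, nu, eta, R):
    """Expected log Gaussian, Eq. (A.51), diagonal covariance; o, nu, R are length-D arrays."""
    D = o.shape[0]
    return (-0.5 * (D * np.log(2 * np.pi) + D / xi)
            + 0.5 * np.sum(digamma(eta / 2) - np.log(R / 2) - (eta / R) * (o - nu) ** 2))

o = np.array([0.3, -1.2])
print(log_a_tilde, log_b_tilde(o, xi=5.0, nu=np.zeros(2), eta=7.0, R=np.array([4.0, 6.0])))
```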

A.4 Student's t-distribution using VB posteriors


In this section, we explain the derivation of the Student's t-distribution using the VB posteriors in Eq. (5.10). The indexes of phoneme category c and frame t are removed to simplify the derivation.


Substituting the VB posteriors of Eq. (2.27) into Eq. (5.10), we can obtain the following equation:

\int p(x | \Theta_{ij}) q(\Theta_{ij} | O) d\Theta_{ij}
= \int a_{ij} \sum_k w_{jk} \mathcal{N}(x | \mu_{jk}, \Sigma_{jk}) \, \mathcal{D}(\{a_{ij}\}_j | \{\phi_{ij}\}_j) \, \mathcal{D}(\{w_{jk}\}_k | \{\omega_{jk}\}_k) \, \mathcal{N}(\mu_{jk} | \nu_{jk}, \xi_{jk}^{-1} \Sigma_{jk}) \prod_d \mathcal{G}(\Sigma_{jk,d}^{-1} | \eta_{jk}, R_{jk,d}) \, da_{ij} \, dw_{jk} \, d\mu_{jk} \, d(\Sigma_{jk}^{-1}).   (A.52)
The concrete forms of the posterior distribution functions are shown in Eq. (2.28). Therefore, we substitute the factors in Eq. (2.28) that depend on the integral variables into Eq. (A.52). Integrating with respect to a_{ij} and w_{jk} and grouping the arguments of the exponential functions, the equation is represented as follows:

\int p(x | \Theta_{ij}) q(\Theta_{ij} | O) d\Theta_{ij}
\propto \left( \int a_{ij} \prod_j (a_{ij})^{\phi_{ij} - 1} da_{ij} \right) \sum_k \left( \int w_{jk} \prod_k (w_{jk})^{\omega_{jk} - 1} dw_{jk} \right)
\quad \times \int \int \prod_d (\Sigma_{jk,d}^{-1})^{\frac{1}{2}} \exp \left( -\frac{\Sigma_{jk,d}^{-1}}{2} (x_d - \mu_{jk,d})^2 \right) (\Sigma_{jk,d}^{-1})^{\frac{1}{2}} \exp \left( -\frac{\xi_{jk} \Sigma_{jk,d}^{-1}}{2} (\mu_{jk,d} - \nu_{jk,d})^2 \right) (\Sigma_{jk,d}^{-1})^{\frac{\eta_{jk}}{2} - 1} \exp \left( -\frac{R_{jk,d} \Sigma_{jk,d}^{-1}}{2} \right) d\mu_{jk} \, d(\Sigma_{jk}^{-1})
\propto \frac{\phi_{ij}}{\sum_{j'} \phi_{ij'}} \sum_k \frac{\omega_{jk}}{\sum_{k'} \omega_{jk'}} \int \int \prod_d (\Sigma_{jk,d}^{-1})^{\frac{\eta_{jk}}{2}} \exp \left( -\frac{\Sigma_{jk,d}^{-1}}{2} \left( (x_d - \mu_{jk,d})^2 + \xi_{jk} (\mu_{jk,d} - \nu_{jk,d})^2 + R_{jk,d} \right) \right) d\mu_{jk} \, d(\Sigma_{jk}^{-1}).   (A.53)

We focus on the integral with respect to \mu_{jk,d} and s_{jk,d} \equiv \Sigma_{jk,d}^{-1}. The indexes of state i, j and mixture component k are removed to simplify the derivation. In addition, we adopt a diagonal covariance matrix, which does not consider the correlation between dimensions, so that the integration can be performed for each feature dimension independently; therefore we also remove the index of dimension d. First, we focus on the integration with respect to \mu, completing the square with respect to \mu. Then, by integrating with respect to \mu and arranging the equation, the following equation is obtained:

\int s^{\frac{\eta}{2}} \exp \left( -\frac{s}{2} \left( (x - \mu)^2 + \xi (\mu - \nu)^2 + r \right) \right) d\mu
= \int s^{\frac{\eta}{2}} \exp \left( -\frac{s}{2} \left( (1 + \xi) \left( \mu - \frac{x + \xi \nu}{1 + \xi} \right)^2 - \frac{(x + \xi \nu)^2}{1 + \xi} + x^2 + \xi \nu^2 + r \right) \right) d\mu
\propto s^{\frac{\eta + 1}{2} - 1} \exp \left( -s \, \frac{\xi (x - \nu)^2 + (1 + \xi) r}{2 (1 + \xi)} \right).   (A.54)

Here we discuss the case when the VB posterior for the variance is the Dirac delta function, as in the VB-BPC-MEAN discussion in Section 5.2.1, where the argument of the Dirac delta function is the maximum value of the VB posterior. Then the result of the integration with respect to s is obtained by replacing s with \eta r^{-1} in Eq. (A.54). Therefore, by substituting it into Eq. (A.53) and adding the removed indexes i, j, k, c, t and d, the following equation is obtained:

\int p(x^t | \Theta_{ij}) q(\Theta_{ij} | O) d\Theta_{ij} \approx \frac{\phi^{(c)}_{ij}}{\sum_{j'} \phi^{(c)}_{ij'}} \sum_k \frac{\omega^{(c)}_{jk}}{\sum_{k'} \omega^{(c)}_{jk'}} \prod_d \mathcal{N} \left( x^t_d \, \Big| \, \nu^{(c)}_{jk,d}, \frac{(1 + \xi^{(c)}_{jk}) R^{(c)}_{jk,d}}{\xi^{(c)}_{jk} \eta^{(c)}_{jk}} \right).   (A.55)

This is the analytical result of the predictive distribution for VB-BPC-MEAN based on the mixture of Gaussian distributions.

In Eq. (A.54), by integrating with respect to s and arranging the equation, the following equation is obtained:

\int s^{\frac{\eta + 1}{2} - 1} \exp \left( -s \, \frac{\xi (x - \nu)^2 + (1 + \xi) r}{2 (1 + \xi)} \right) ds
= \Gamma \left( \frac{\eta + 1}{2} \right) \left( \frac{\xi (x - \nu)^2 + (1 + \xi) r}{2 (1 + \xi)} \right)^{-\frac{\eta + 1}{2}}
\propto \left( 1 + \frac{\xi (x - \nu)^2}{(1 + \xi) r} \right)^{-\frac{\eta + 1}{2}}.   (A.56)

Here we refer to the concrete form of the Student's t-distribution given in Eq. (5.11). The location, scale and degrees-of-freedom parameters of the Student's t-distribution correspond to those of the above equation as follows:

\text{location} = \nu, \quad \text{scale} = \frac{(1 + \xi) r}{\xi \eta}, \quad \text{degrees of freedom} = \eta.   (A.57)

Thus, the result of integrating with respect to \mu and s is represented as the Student's t-distribution:

\mathrm{St} \left( x \, \Big| \, \nu, \frac{(1 + \xi) r}{\xi \eta}, \eta \right).   (A.58)

Finally, by substituting Eq. (A.58) into Eq. (A.53) and adding the removed indexes i, j, k, c, t and d, we can obtain the analytical result of the predictive distribution using the VB posteriors as follows:

\int p(x^t | \Theta_{ij}) q(\Theta_{ij} | O) d\Theta_{ij} = \frac{\phi^{(c)}_{ij}}{\sum_{j'} \phi^{(c)}_{ij'}} \sum_k \frac{\omega^{(c)}_{jk}}{\sum_{k'} \omega^{(c)}_{jk'}} \prod_d \mathrm{St} \left( x^t_d \, \Big| \, \nu^{(c)}_{jk,d}, \frac{(1 + \xi^{(c)}_{jk}) R^{(c)}_{jk,d}}{\xi^{(c)}_{jk} \eta^{(c)}_{jk}}, \eta^{(c)}_{jk} \right).   (A.59)

Thus, the predictive distribution for VB-BPC is obtained analytically based on the mixture of the Student's t-distributions.
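As a sanity check on Eq. (A.59) for a single dimension and a single Gaussian component, the sketch below (illustrative hyperparameter values; SciPy's t distribution is parameterized by df, loc and scale, where scale is the square root of the scale parameter above) compares the closed-form Student's t predictive density with a Monte Carlo estimate that samples (\mu, \Sigma^{-1}) from the Normal-Gamma VB posterior and averages the Gaussian likelihood.

```python
import numpy as np
from scipy import stats

# Illustrative VB posterior hyperparameters for one dimension of one mixture component.
xi, nu, eta, R = 8.0, 1.2, 12.0, 30.0
x = 0.5

# Closed-form predictive, Eqs. (A.58)/(A.59): St(x | nu, (1+xi)R/(xi*eta), eta).
scale2 = (1.0 + xi) * R / (xi * eta)
pdf_closed = stats.t.pdf(x, df=eta, loc=nu, scale=np.sqrt(scale2))

# Monte Carlo marginalization over the Normal-Gamma VB posterior of (mu, 1/Sigma).
rng = np.random.default_rng(0)
prec = rng.gamma(shape=eta / 2.0, scale=2.0 / R, size=200000)  # G(s | eta, R): shape eta/2, rate R/2
mu = rng.normal(loc=nu, scale=np.sqrt(1.0 / (xi * prec)))      # N(mu | nu, (xi * s)^{-1})
pdf_mc = np.mean(stats.norm.pdf(x, loc=mu, scale=np.sqrt(1.0 / prec)))

print(f"closed form: {pdf_closed:.5f}, Monte Carlo: {pdf_mc:.5f}")
```

The two values should agree up to Monte Carlo error, since marginalizing the Gaussian likelihood over the Normal-Gamma posterior is exactly the Student's t-distribution derived above.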
