System            Parameters   % error, perplexity 998   % error, perplexity 60
Y0 Baseline       155,717      36.1                      12.8
CI-DECIPHER       125,769      44.7                      14.0
CI-MLP-DECIPHER   155,717      30.1                      7.8
8.4 Summary
In this chapter, we have extended and tested our basic hybrid HMM/MLP approach in two ways:
1. Using continuous acoustic vectors and dynamic features (Section 8.2).
As for standard HMMs, this significantly improves performance. Moreover, the resulting HMM/MLP performance has been better than what was achieved with the equivalent HMM system.(9)

(9) Recent improvements to Y0 also reduced the error rate from 12.8% to 7.1%, despite a reliance on single pronunciations and a single density per phone. This recent work was done as part of a collaboration between the authors and Cambridge University researchers Tony Robinson, Steve Renals, and Mike Hochberg. The biggest source of improvement was getting new people to find our old bugs.
Since the major weakness of the HMM/MLP approach seems to be the lim-
itation to context-independent phoneme models, it is shown in the next chapter
how to circumvent this problem and to use an MLP to estimate probabilities of
thousands of context-dependent models.
Chapter 9
CONTEXT-DEPENDENT
MLPs
It’s a poor sort of memory that only works back-
wards.
– Lewis Carroll –
9.1 Introduction
Chapters 7 and 8 have shown the ability of Multilayer Perceptrons (MLPs) to estimate emission probabilities for Hidden Markov Models (HMMs). In these chapters, we have shown that these estimates led to improved performance over standard estimation techniques when a fairly simple HMM was used. However, current state-of-the-art continuous speech recognizers require HMMs with greater complexity, e.g., multiple densities per phone and/or context-dependent phone models. Will the consistent improvement we have seen in these tests be washed out in systems with more detailed models?
One difficulty with more complex models is that many more parameters must be estimated with the same limited amount of data. Brute-force application of our earlier techniques would result in an output layer with many thousands of units, and a network with many millions of connections. This network would be impractical to train, both in terms of computation and learnability, using current-sized public databases. In each of our earlier studies, a simple context-independent trained network used a single output unit for each phone. As described in the previous chapter, for our recent Resource Management tests, we used 69 of these units. Were one to consider the coarticulatory effects from the right only, this number would expand out to 69², or over 4000. Considering both right and left context, we would require 69³ units, or about 328,000. With a typical hidden layer of 500 units, we would have over 10⁸ connections, which is far too many for a practical system.
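The combinatorics above can be checked with a few lines of arithmetic; this is a minimal sketch using the 69 context-independent units and the 500-unit hidden layer mentioned in the text:

```python
# Parameter blow-up for brute-force context-dependent output layers.
# Figures follow the text: 69 context-independent phone units and
# a typical hidden layer of 500 units.
n_phones = 69
n_hidden = 500

right_only = n_phones ** 2           # one output per (phone, right context)
both_sides = n_phones ** 3           # one output per full triphone
hidden_to_output = n_hidden * both_sides  # connections from hidden layer alone

print(right_only)        # 4761 -> "over 4000"
print(both_sides)        # 328509 -> "about 328,000"
print(hidden_to_output)  # 164254500 -> over 10^8 connections
```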
• p(c_l | q_k, c_r, X_{n-c}^{n+c}) can be estimated by the MLP represented in Figure 9.1 (referred to as MLP1), in which the output units are associated with the left phonemes of the triphones and in which the input field is constituted by the current acoustic vector x_n (possibly extended to its left and right contexts), the current state, and the right phonetic contexts in the triphones (which are known during training).
[Figure 9.1: MLP estimator for left phonetic context, given input, current state, and right phonetic context. Inputs: q_k, c_r, and X_{n-c}^{n+c}; the output layer ranges over the left contexts c_l.]
• p(c_r | q_k, X_{n-c}^{n+c}) can be estimated by the MLP represented in Figure 9.2 (referred to as MLP2), in which the output units are associated with the right phonemes and in which the input field is constituted by the current acoustic vector x_n (possibly extended to its left and right contexts) and the current state (associated with x_n).
• p(q_k | x_n) is estimated by the same MLP as the one used for modeling context-independent phonemes (as done in the previous chapters, and referred to as MLP3), where the input field contains the current acoustic vector only (possibly extended to its left and right contexts) and the output units are associated with the current labels.
• p(c_l | q_k, c_r) can be estimated by an MLP in which the output units are associated with the left phonemes of the triphones and where the input field represents the current state and the right phonemes. This provides
[Figure 9.2: MLP estimator for right phonetic context, given input and current state. Inputs: q_k and X_{n-c}^{n+c}; the output layer ranges over the right contexts c_r, estimating p(c_r | q_k, X_{n-c}^{n+c}).]
The neural networks represented in Figures 9.1 and 9.2 are simply standard MLPs with a restriction on the hidden-layer topology. The reason for this restriction will be explained in Section 9.4.
By training different neural networks and using their output activation values in (9.5), it is possible to estimate, without any particular assumptions, the probability p(c_l, q_k, c_r | X_{n-c}^{n+c}) that is required for modeling triphone probabilities.
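As a concrete illustration of the factorization in (9.5), here is a sketch only, in which the layer sizes, random weights, and helper names are hypothetical rather than those used in the experiments: three small softmax networks stand in for MLP1, MLP2, and MLP3, and their outputs are multiplied into the full triphone probability. Because each conditional distribution sums to one, the products over all (c_l, q_k, c_r) triples also sum to one:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# Hypothetical sizes: K states, L left contexts, R right contexts,
# a D-dimensional acoustic vector. A random linear layer stands in
# for each trained MLP.
K, L, R, D = 4, 3, 3, 10
x = rng.normal(size=D)

W1 = rng.normal(size=(L, D + K + R))  # MLP1: p(c_l | q_k, c_r, x)
W2 = rng.normal(size=(R, D + K))      # MLP2: p(c_r | q_k, x)
W3 = rng.normal(size=(K, D))          # MLP3: p(q_k | x)

p_q = softmax(W3 @ x)  # p(q_k | x) for all k

total = 0.0
for k in range(K):
    p_r = softmax(W2 @ np.concatenate([x, one_hot(k, K)]))
    for r in range(R):
        p_l = softmax(W1 @ np.concatenate([x, one_hot(k, K), one_hot(r, R)]))
        for l in range(L):
            # p(c_l, q_k, c_r | x) = p(c_l | q_k, c_r, x) p(c_r | q_k, x) p(q_k | x)
            total += p_l[l] * p_r[r] * p_q[k]

print(round(total, 6))  # 1.0: the factored joint is a valid distribution
```

The check at the end is the point of the factorization: no joint distribution over triphones is ever stored, yet the three conditional estimators combine into one.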
output layer was replicated for each of 8 possible broad phonetic contexts
(8 for the left and 8 for the right, each associated with one state of a 3-state
model). These experiments showed significant improvements over the simpler
context-independent approach, eliminating roughly one-fourth of the errors in
a speaker-independent Resource Management test set. Thus, statistical factor-
ization of MLP probabilistic estimators appears to have practical significance.
Ongoing research is exploring the use of different architectures and more de-
tailed phonetic models.
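For a rough sense of why broad classes help, the output-layer sizes can be compared directly; this is a sketch under the illustrative assumption that, for one context side, the replicated layer is one full copy of the 69 phone units per broad class (the 69 phones and 8 broad classes come from the text):

```python
# Output-layer sizes: full phonetic context vs. 8 broad context classes,
# for a single context side. Treating the replicated layer as one 69-unit
# copy per broad class is an illustrative assumption, not the exact design.
n_phones = 69
n_broad = 8

full_context_outputs = n_phones ** 2        # 4761: one unit per (phone, context)
broad_context_outputs = n_phones * n_broad  # 552: one phone layer per class

print(full_context_outputs)   # 4761
print(broad_context_outputs)  # 552
```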
9.7 Summary
The early form of our hybrid HMM/MLP approach focused on HMMs that are independent of phonetic context, and for these models the MLP approaches have consistently provided significant improvements (once we learned how to use them). In this chapter, we extended this approach to context-dependent models, which has been shown in standard HMMs to be an important feature for robust recognizers. However, a brute-force application of our initial approach would result in an MLP with a huge number of output units and, consequently, an excessive number of parameters to estimate in comparison with the (limited) amount of available training data. To overcome this problem, a general approach to split any big net used for pattern classification ("1-from-K" classification) into smaller nets has been presented. However, while this