
IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 24, NO. 5, MAY 2016

Turbo Automatic Speech Recognition


Simon Receveur, Robin Weiß, and Tim Fingscheidt, Senior Member, IEEE

Abstract: Performance of automatic speech recognition (ASR) systems can significantly be improved by integrating further sources of information such as additional modalities, acoustic channels, or acoustic models. Given the arising problem of information fusion, striking parallels to problems in digital communications are exhibited, where the discovery of the turbo codes by Berrou et al. was a groundbreaking innovation. In this paper, we show ways how to successfully apply the turbo principle to the domain of ASR and thereby provide solutions to the abovementioned information fusion problem. The contribution of our work is fourfold: First, we review the turbo decoding forward-backward algorithm (FBA), giving detailed insights into turbo ASR, and providing a new interpretation and formulation of the so-called extrinsic information being passed between the recognizers. Second, we present a real-time capable turbo-decoding Viterbi algorithm suitable for practical information fusion and recognition tasks. Then we present simulation results for a multimodal example of information fusion. Finally, we prove the suitability of both our turbo FBA and turbo Viterbi algorithm also for a single-channel multi-model recognition task obtained by using two acoustic feature extraction methods. On a small vocabulary task (challenging, since spelling is included), our proposed turbo ASR approach outperforms even the best reference system on average over all SNR conditions and investigated noise types by a relative word error rate (WER) reduction of 22.4% (audio-visual task) and 18.2% (audio-only task), respectively.

Index Terms: Speech recognition, iterative decoding, hidden Markov models, robustness, multimedia systems.

Manuscript received June 02, 2015; revised December 14, 2015; accepted January 12, 2016. Date of publication February 12, 2016; date of current version March 23, 2016. The material in Sections III-A and V-B-a) was presented in part at the 5th International Workshop on Spoken Dialog Systems, Napa, CA, USA, January 2014 [48] and at the 39th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, May 2014 [49], solely addressing the (turbo) FBA in the context of audio-visual ASR. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dong Yu. The authors are with the Institute for Communications Technology, Technische Universität Braunschweig, Braunschweig D-38106, Germany (e-mail: s.receveur@tu-bs.de; robin.weiss@tu-bs.de; t.fingscheidt@tu-bs.de). Digital Object Identifier 10.1109/TASLP.2016.2520364

Fig. 1. A four-state convolutional encoder with states $s = (s^{(1)}\, s^{(2)})$, a random input bit sequence $\mathbf{x}$ governing the state transitions via the generator polynomials $G_r$ and $G_1$, systematic output bits $\mathbf{x}^{(s)}$, parity output bits $\mathbf{y}^{(s)}$, and multiplexed output bits $\tilde{\mathbf{y}}^{(s)}$.

Fig. 2. A four-state hidden Markov model consisting of two random processes: the first governs the temporal sequence of states via state transition matrix $\mathbf{A}$; the second random process provides the observation $\mathbf{o}(i)$ given state $s = i$, with $i \in \mathcal{S} = \{1, 2, 3, 4\}$.

I. INTRODUCTION

WHY SHOULD an expert in automatic speech recognition (ASR) read a publication with strong ties to digital communications? Simply because in digital communications a revolution in robustness has been seen in the past 20 years, while accuracy of ASR systems still lags far behind human performance [1]. Indeed, we can learn something from the Communications Society: In 1993, Berrou et al. introduced the so-called turbo codes [2] for forward-error correction (FEC), providing unprecedented robustness to digital communications over error-prone transmission channels. By means of a simple parallel concatenated iterative decoding scheme a performance very close to the theoretical limits as given by the Shannon bound could be achieved. Nowadays, turbo codes are a fundamental component of many digital transmission systems in practical use [3]. In the following we will compare some basic aspects of digital communications and ASR.

A. Analogies: Digital Communications vs. ASR

Upon closer analysis, digital communications and automatic speech recognition reveal many parallels. Given the fundamental communication diagram by Shannon consisting of a source and a transmitter, a (noisy) channel, and a receiver [4], each of the components has its analogy both in digital communications and ASR. These analogies are particularly striking for so-called convolutional codes [5], [6], which are an important class of FEC codes, and which are typically decoded by the Viterbi algorithm. As shown in Fig. 1, such a convolutional encoder is built of shift registers marked by D and adders in the Galois field GF(2), mapping binary inputs to a binary output ($0 \oplus 0 = 0$, $0 \oplus 1 = 1$, $1 \oplus 0 = 1$, $1 \oplus 1 = 0$). There are in total $T'$ input bits $\mathbf{x} = \mathbf{x}_1^{T'}$ to be transmitted, while the output $\tilde{\mathbf{y}}^{(s)} = (\tilde{y}^{(s)})_1^{T^{(s)}}$ is a multiplex between the so-called systematic bits $\mathbf{x}^{(s)} = \mathbf{x}$ and the parity bits $\mathbf{y}^{(s)} = (y^{(s)})_1^{T'}$. The input connections of the adders are typically described by so-called generator polynomials $G$ ($G_r$ = recursive polynomial (111), $G_1 = (101)$). In the example of Fig. 1 the number of output bits¹ is $T^{(s)} = T'/r^{(s)}$ with the so-called code rate $r^{(s)} = \frac{1}{2}$.

¹ without regarding a so-called trellis termination
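As an illustration of the encoder type shown in Fig. 1, the following minimal Python sketch implements a rate-1/2 recursive systematic convolutional (RSC) encoder with the two generator polynomials given above. The state labeling, bit ordering, and function name are illustrative assumptions only and are not taken from the figure itself.

```python
def rsc_encode(x):
    """Rate-1/2 recursive systematic convolutional encoder,
    feedback polynomial Gr = (111), feedforward polynomial G1 = (101)."""
    s1, s2 = 0, 0                      # two shift-register cells -> four states
    systematic, parity = [], []
    for bit in x:
        a = bit ^ s1 ^ s2              # feedback adder (Gr = 1 + D + D^2)
        p = a ^ s2                     # parity adder   (G1 = 1 + D^2)
        systematic.append(bit)         # systematic bit x^(s) = x
        parity.append(p)               # parity bit y^(s)
        s1, s2 = a, s1                 # shift the register
    # multiplexed output: T^(s) = 2 T' bits for code rate 1/2
    return [b for pair in zip(systematic, parity) for b in pair]

print(rsc_encode([1, 0, 1, 1]))        # -> [1, 1, 0, 1, 1, 0, 1, 0]
```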


Fig. 3. A parallel turbo encoder with a random input bit sequence $\mathbf{x}$ governing state transitions, involving a bit interleaver $\Pi$. After encoding, a puncturing scheme is applied and the remaining bits are multiplexed. A transmission channel introduces random additive noise, providing the sequence of observations $\mathbf{z}$. The demodulator provides sequences of log-likelihood ratios (LLRs) separately for both subsequent convolutional decoders (s) and (r).

Fig. 4. A parallel turbo decoder with two input LLR streams $\mathbf{Z}^{(s)}$ and $\mathbf{Z}^{(r)}$, and iteration index $z = 1, 2, \ldots$, starting with decoder (s); time index $t$ omitted. Note that both decoders provide results in the bit order of BCJR decoder (s), i.e., $\hat{\mathbf{x}} = \hat{\mathbf{x}}^{(s)}$. Omitting the deinterleaver $\Pi^{-1}$ in the lower right of the figure would yield, after applying the sign($\cdot$) function, the decoding result $\hat{\mathbf{x}}^{(r)}$.

Let us denote the source signal to be the information we are actually interested in. In ASR this can be regarded to be the random hidden Markov model (HMM) state sequence $\mathbf{s}_1^T$ (see Fig. 2), while in digital communications this is the random sequence of bits $\mathbf{x}_1^{T'}$ which shall be transmitted. In ASR, if we knew the state transitions (which are actually governed by the first of the two random processes assumed in an HMM), we knew the sequence of states, and vice versa. This also holds for digital communications: Looking into the structure of the convolutional encoder in Fig. 1, given the input bit sequence $\mathbf{x}_1^{T'}$, a (time) sequence of states $\mathbf{s}_1^{T'}$ follows, where each state consists of two binary register contents $s = (s^{(1)}, s^{(2)})$, resulting in a four-state encoder. Given such a sequence of states, also the respective input bit sequence can be fully reconstructed from a trellis diagram. The transmitter in Fig. 1 is obviously the convolutional encoder with its particular structure as determined by the generator polynomials $G_r$, $G_1$, and the state initialization (e.g., $s = (0\ 0)$). The analogy to ASR is the state transition matrix $\mathbf{A}$ and some initial state probability vector $\boldsymbol{\pi}$ as part of the HMM.

The channel in ASR is everything contributing to the second of the HMM random processes linking states and observations: articulation, reverberation, acoustic noise, sensor frequency responses, etc. The channel in digital communications on a very high level is everything which might incur bit or symbol errors during transmission, and is consequently also modeled by a random process, e.g., additive Gaussian noise.

In both worlds, the receiver may consist of a Viterbi algorithm [7], [8] that deduces the optimal sequence of states from the noisy observations. In ASR, the sequence of states then yields the sequence of recognized words, while in digital communications the sequence of the convolutional coder's states can be easily mapped to a decoded sequence of bits by use of the trellis diagram. Alternatively, in both worlds, the forward-backward algorithm (FBA) can be applied for decoding, providing the sequence of most probable states and reliable soft outputs (in ASR called: confidences) [8]-[10]. The reliable confidence scores based on true a posteriori probabilities are the reason why we will start our turbo ASR presentation later on based on the FBA, and will proceed to the turbo Viterbi algorithm only in a second step. Note that the FBA is often called BCJR algorithm [9] in digital communications.

A main prerequisite of the discovery and application of the turbo principle in digital communications was the availability of soft-output (channel) decoders providing not only decoded bits, but also the confidence of such decision. While the BCJR algorithm naturally provides such confidences, a soft-output Viterbi algorithm had to be developed (see Hagenauer [11] and Huber [12]). On the ASR side we are in the same situation: While the FBA naturally provides confidences, the topic of Viterbi ASR decoding with confidence output is a research field on its own [13], [14].

B. Iterative Decoding and Information Fusion in Digital Communications

The turbo principle fundamentally relies on the use of two (or more) encoders, and iterative decoding by two (or more) decoders, and in that respect provides a solution to information fusion on the decoder (in ASR: classifier) level. The two decoders process the observations (called intrinsic information) in an alternating fashion, exchanging so-called extrinsic information. Such extrinsic information is a kind of confidence information and can be obtained from (estimates of) posteriors with the intrinsic information being taken out in order not to use it again in later iterations; for helpful details see [15] and Sec. III.

In Fig. 3 a turbo encoder consisting of two parallel convolutional encoders, each, e.g., according to Fig. 1, is shown. The input bit sequence $\mathbf{x} = \mathbf{x}_1^{T'}$ is offered to encoder (r) in pseudorandomized order $\mathbf{x}^{(r)}$ by means of interleaver $\Pi$. Since after encoding the input bits $\mathbf{x}$ and both parity outputs $\mathbf{y}^{(s)}$ and $\mathbf{y}^{(r)}$ are typically of a bit rate too high, they are subject to some regular bit stealing (called puncturing), resulting, after multiplexing, in a coded bit sequence $\mathbf{y} = \mathbf{y}_1^T$. A transmission channel may add noise (typically on a modulation symbol level) yielding noisy non-binary observations $\mathbf{z} = \mathbf{z}_1^T$. Please note that these channel outputs have their analogy in ASR in the sequence of observed feature vectors $\mathbf{o}_1^T$ (cf. Fig. 2). A demodulator computes log-likelihood ratios (LLRs) $Z = \log\frac{p(z\,|\,y=1)}{p(z\,|\,y=0)}$ and, after demultiplexing, provides two LLR streams $\mathbf{Z}^{(s)}$ and $\mathbf{Z}^{(r)}$, each suited for decoding by the respective decoder (see Fig. 4). Note that in ASR LLRs are not used, since the channel input is not binary, but consists of any of the states $s$, see Fig. 2. Therefore likelihoods $p(\mathbf{o}\,|\,s = i)$ or log-likelihoods are employed (see Fig. 5).

In Fig. 4 the turbo decoder is shown, where in the first iteration ($z = 1$) decoder (s) processes LLRs $\mathbf{Z}^{(s)}$ assuming some flat prior $Q^{(s)} = 0$ (simply meaning $\log\frac{P(x^{(s)}=1)}{P(x^{(s)}=0)} = 0$). Decisions could be taken on the basis of computed posteriors $D^{(s)} = \log\frac{P(x^{(s)}=1\,|\,\mathbf{Z}^{(s)})}{P(x^{(s)}=0\,|\,\mathbf{Z}^{(s)})}$ by simple decision for the sign. A so-called extrinsic information $E^{(s)}$ is computed, and after interleaving it is given to the next decoder (r) as non-flat prior $Q^{(r)}$. In iteration $z = 2$ decoder (r) processes it along with its own LLRs (intrinsic information) $\mathbf{Z}^{(r)}$. This process of an alternating call of the decoders can now be repeated until convergence of the decoding results $\hat{\mathbf{x}}$, which can be obtained any time from any of the decoders by deciding for the sign of the posteriors $D$ with, e.g., $D^{(r)} = \log\frac{P(x^{(r)}=1\,|\,\mathbf{Z}^{(s)},\mathbf{Z}^{(r)})}{P(x^{(r)}=0\,|\,\mathbf{Z}^{(s)},\mathbf{Z}^{(r)})}$. Note that if $z$ is even, a deinterleaver $\Pi^{-1}$ is required (see Fig. 4). For further details on turbo decoding, the interested reader is referred to [2], [3] and particularly to [15].
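To make the alternating exchange concrete, the following minimal Python sketch outlines one possible iteration schedule. The function bcjr_decode() is a hypothetical stand-in for a soft-output decoder returning posterior LLRs $D$ for the systematic bits, and the extrinsic relation $E = D - Z_{\mathrm{sys}} - Q$ is the usual one for systematic codes; both are stated here as assumptions for illustration, not as a description of Fig. 4.

```python
import numpy as np

def turbo_iterations(Z_sys, Z_par_s, Z_par_r, interleave, deinterleave, n_iter=8):
    """Alternating extrinsic information exchange between two decoders.
    bcjr_decode() is assumed to return posterior LLRs D of the systematic bits."""
    Q = np.zeros_like(Z_sys)                      # flat prior for decoder (s), z = 1
    for z in range(1, n_iter + 1):
        if z % 2 == 1:                            # decoder (s): natural bit order
            D = bcjr_decode(Z_sys, Z_par_s, Q)
            E = D - Z_sys - Q                     # take out intrinsic + a priori parts
            Q = interleave(E)                     # becomes a priori info of decoder (r)
        else:                                     # decoder (r): interleaved bit order
            D = bcjr_decode(interleave(Z_sys), Z_par_r, Q)
            E = D - interleave(Z_sys) - Q
            Q = deinterleave(E)                   # back to the bit order of decoder (s)
    D = deinterleave(D) if n_iter % 2 == 0 else D
    return (D > 0).astype(int)                    # hard decision by the sign of D
```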
C. Iterative Decoding and Information Fusion in ASR

In ASR literature, iterative decoding has often been investigated seeking to improve the recognition robustness in adverse acoustic conditions. Considering the feature level, research on iterative decoding focused mainly on denoising and de-reverberation of speech features with feedback of a previous recognition pass into the feature extraction. While Faubel et al. used the best hypothesis as feedback [16], Yan et al. employed a complete word graph to clean up features [17]. Aiming at feature vector enhancement by means of an iterative linearizing approximation and compensation, Deng et al. [18] took significant profit from an iterative procedure. In terms of a model-based approach, Windmann and Haeb-Umbach computed the HMM state posteriors for feedback to a model-based speech feature enhancement [19]. Moreover, at decision level, iterative approaches achieve improvements by combining ASR and machine translation techniques [20].

Another approach of improving ASR performance in adverse acoustic conditions is based on information fusion. Here, further information sources such as additional acoustic channels [21], [22], modalities [23]-[28], or models [29], [30] are exploited by the speech recognizer. Inherently, the success of such methods is closely linked to the level of information fusion within the recognition process and the employed fusion method. In feature level fusion systems, commonly the various input features are combined into a single representation, e.g., by ordinary concatenation, where the classifier learns the statistics of the joint observations [25]-[27]. In contrast, in decision level fusion systems separate classifier output hypotheses (and confidence scores) are combined to achieve a joint decision (cf. ROVER [31], confusion networks [32]). While being very flexible in terms of amount and choice of the individual classifiers, this fusion approach also enables a control of the relative influence of each classifier output by a voting scheme, e.g., according to its confidence. However, in decision level fusion the separate classification processes do not take profit from each other. This interaction during classification is a primary element of classifier level fusion systems, which gather the input information sources at a slightly higher level than feature fusion systems, often in the form of so-called likelihood streams. In particular, in classifier level fusion systems learning and classifying of the input streams is employed independently, but under a temporal dependence assumption between the observation likelihood streams during classification [33]. Employing for instance so-called coupled HMMs (CHMMs) [25], [34], this interaction also might involve a proper weighting scheme of different information sources, e.g., according to their reliability [35]. Often being optimized beforehand on training data, these weights are subsequently applied during recognition while computing the joint observation likelihoods [34], [36], [37]; recently also real-time weight updates have been proposed [38].

For a multi-modal ASR application, Shivappa et al. proposed, to our knowledge for the first time in ASR, an iterative fusion of two parallel audio-visual likelihood streams on the classifier level [39]-[41]. Later on they adapted their parallel recognition scheme to speaker localization and tracking [42], [43]. Their approach comes with the advantage of two separately trained HMMs for each modality instead of a joint one (as in feature-level fusion). However, during recognition the iterative decoding is controlled by a heuristic rate parameter modeling and re-estimation of the observation likelihood distributions. Moreover, within the feedback loop the modified fed-back a posteriori probabilities still contain intrinsic information, contradicting the principles of turbo decoding [44]-[47].

Originating from Shivappa's altered FBA approach, we showed in our previous work [48]-[51] that the unmodified FBA is already suitable for iterative recognition. This can be achieved just by a redefinition of the likelihood term. Moreover, it was shown that a definition of an extrinsic information (for information exchange between ASR decoders) is possible with the intrinsic information being taken out, exactly as required by the turbo principle.

In this paper, we will first revisit all important aspects of the turbo FBA. Then we complement our transfer of the turbo principle to the domain of automatic speech recognition by introducing also the turbo-decoding Viterbi algorithm, which will allow for real-time implementations in a practical context. We then show applications in two fields: First in audio-visual ASR as a well-known representative of multi-modal ASR. Second, we present simulation results of both turbo FBA and the new turbo Viterbi algorithm for a single-channel unimodal ASR, where two different feature extractions are employed. We also briefly discuss likelihood stream weighting aspects of our approaches. Moreover, our proposed approach is discussed both with respect to the required complexity, as well as with respect to the influence of latency in extrinsic information computation. These aspects are of particular interest when judging the applicability of turbo ASR for large vocabulary continuous speech recognition (LVCSR).

The paper is organized as follows: For later reference, first we introduce notations and briefly recapitulate the baseline FBA and the Viterbi algorithm in Section II. Section III gives an
outline of the new turbo ASR based on the FBA and the Viterbi algorithm, while Section IV briefly sketches the employed features and all investigated information fusion ASR reference approaches. The performance of the presented approaches is evaluated in Section V in the audio-visual and the audio-only task. The paper is concluded with its most important results in Section VI.

II. NOTATIONS AND BASELINE ALGORITHMS

In order to prepare grounds for presentation of the new turbo ASR approach, this section outlines the notations used hereafter and briefly reviews the baseline algorithms, i.e., the forward-backward algorithm (FBA) [9] and the Viterbi algorithm [7].

A. Notations

Let $\mathbf{x}_1^T = \mathbf{x}_1, \ldots, \mathbf{x}_T$ be a sequence of $d_o$-dimensional feature vectors with values $\mathbf{x}_t = \mathbf{o}_t \in \mathbb{R}^{d_o}$ for each frame index $t = 1, \ldots, T$. The continuous density hidden Markov model (HMM) parameters for processing $\mathbf{x}_1^T$ are the vector $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_N]^{\mathrm{T}}$ of prior probabilities $\pi_i = P(s_1 = i)$ of all states $i \in \mathcal{S} = \{1, \ldots, N\}$, the matrix $\mathbf{A} = \{a_{j,i}\}_{j,i \in \mathcal{S}}$ of state transition probabilities $a_{j,i} = P(s_t = i \,|\, s_{t-1} = j)$, and the set $\mathcal{B} = \{b_i(\mathbf{x}_t)\}_{i \in \mathcal{S}}$ of $d_o$-variate emission probability density functions (pdfs) $b_i(\mathbf{x}_t) = p(\mathbf{x}_t \,|\, s_t = i)$. Please note that we distinguish probabilities $P(\cdot)$ and pdfs (or their values) $p(\cdot)$. In order to ease the understanding of our turbo ASR approaches in Section III, we will now briefly recapitulate the well-known forward-backward algorithm and the Viterbi algorithm, which will also both serve as baseline approaches in later simulations.

B. The Forward-Backward Algorithm (FBA)

Given an observation sequence $\mathbf{o}_1^T$, at each time instant $t = 1, \ldots, T$ a hidden state $s_t = i \in \mathcal{S}$ is assigned with the posterior probability $\gamma_t(i) = P(s_t = i \,|\, \mathbf{o}_1^T, \lambda^{(s)})$. The state-level maximum-a-posteriori (MAP) recognizer [8], [10] provides the sequence $(\mathbf{s}^*)_1^T = s_1^*, \ldots, s_T^*$ of most likely states by

$s_t^* = \arg\max_{i \in \mathcal{S}} \gamma_t(i), \quad t = 1, \ldots, T.$   (1)

Applying the FBA, the state posteriors $\gamma_t(i)$ are obtained by

$\gamma_t(i) = \frac{1}{C_t}\,\alpha_t(i)\,\beta_t(i), \quad t = 1, \ldots, T,$   (2)

$\alpha_t(i) = b_i(\mathbf{o}_t) \sum_{j \in \mathcal{S}} a_{j,i}\,\alpha_{t-1}(j), \quad t = 2, \ldots, T,$   (3)

$\beta_t(i) = \sum_{j \in \mathcal{S}} b_j(\mathbf{o}_{t+1})\,a_{i,j}\,\beta_{t+1}(j), \quad t = T-1, \ldots, 1,$   (4)

where $\alpha_t(i) = p(\mathbf{o}_1^t, s_t = i)$ and $\beta_t(i) = p(\mathbf{o}_{t+1}^T \,|\, s_t = i)$ for all $i \in \mathcal{S}$ denote the forward and backward variables, respectively. These variables are initialized to $\alpha_1(i) = \pi_i\,b_i(\mathbf{o}_1)$ and $\beta_T(i) = 1$; subsequently, computation is carried out recursively according to (3) and (4). The constant $C_t$ ensures the posterior state distribution to fulfill $\sum_{i=1}^{N} \gamma_t(i) = 1$. Note that any state-independent factor with respect to state $i$ will cancel out in (3) and (4), once (2) is applied.
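As a minimal sketch of (1)-(4), the following Python/NumPy function computes the state posteriors from a matrix of precomputed emission likelihoods; the per-frame normalization of $\alpha$ and $\beta$ is added purely for numerical stability and, being state-independent, cancels out in (2) as noted above.

```python
import numpy as np

def fba_posteriors(B, A, pi):
    """B: (T, N) emission likelihoods b_i(o_t); A: (N, N) transitions a_{j,i};
    pi: (N,) priors. Returns (T, N) state posteriors gamma_t(i), eqs. (2)-(4)."""
    T, N = B.shape
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()                    # state-independent scaling only
    for t in range(1, T):                         # forward recursion (3)
        alpha[t] = B[t] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):                # backward recursion (4)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                          # eq. (2), up to the constant C_t
    return gamma / gamma.sum(axis=1, keepdims=True)

def map_states(gamma):
    return gamma.argmax(axis=1)                   # eq. (1): state-level MAP decision
```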
C. The Viterbi Algorithm

Given an observation sequence $\mathbf{o}_1^T$, at each time instant $t = 1, \ldots, T$ a score

$\delta_t(i) = \max_{s_1, \ldots, s_{t-1}} p(\mathbf{o}_1^t, s_1, \ldots, s_t = i \,|\, \lambda^{(s)})$   (5)

for state $s_t = i$ is computed, based on the observation sequence $\mathbf{o}_1^t = \mathbf{o}_1, \ldots, \mathbf{o}_t$. Using the Viterbi algorithm [7], [8], the scores (5) are obtained recursively by

$\delta_t(i) = \max_{j \in \mathcal{S}}\left[\delta_{t-1}(j)\,a_{j,i}\right]\, b_i(\mathbf{o}_t), \quad t = 2, \ldots, T,$   (6)

$\psi_t(i) = \arg\max_{j \in \mathcal{S}}\left[\delta_{t-1}(j)\,a_{j,i}\right], \quad t = 2, \ldots, T,$   (7)

whereas so-called backtracking pointers $\psi_t(i)$ indicate the optimal predecessor state $s_{t-1}(s_t = i) = j$ for every corresponding $\delta_t(i)$. Being initialized by $\delta_1(i) = \pi_i\,b_i(\mathbf{o}_1)$ and $\psi_1(i) = 0$, subsequently these variables are employed to deliver the most likely state sequence $(\mathbf{s}^*)_1^T = s_1^*, \ldots, s_T^*$ by applying

$s_T^* = \arg\max_{i \in \mathcal{S}} \delta_T(i),$   (8)

$s_t^* = \psi_{t+1}(s_{t+1}^*), \quad t = T-1, T-2, \ldots, 1,$   (9)

in reversed chronological order. Note that any factor that is constant with respect to state $i$ has no effect on the result of the maximum decision (8) and thus can generally be omitted.
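A compact sketch of (6)-(9) in the same NumPy setting as above; in practice the recursion would be carried out in the log domain, but the product form is kept here to mirror the equations directly.

```python
import numpy as np

def viterbi(B, A, pi):
    """B: (T, N) emission likelihoods; A: (N, N) transitions a_{j,i}; pi: (N,) priors.
    Returns the most likely state sequence, eqs. (6)-(9)."""
    T, N = B.shape
    delta = np.zeros((T, N)); psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[0]
    for t in range(1, T):
        trans = delta[t - 1, :, None] * A         # trans[j, i] = delta_{t-1}(j) * a_{j,i}
        psi[t] = trans.argmax(axis=0)             # eq. (7): best predecessor per state i
        delta[t] = trans.max(axis=0) * B[t]       # eq. (6)
    s = np.zeros(T, dtype=int)
    s[T - 1] = delta[T - 1].argmax()              # eq. (8)
    for t in range(T - 2, -1, -1):                # eq. (9): backtracking
        s[t] = psi[t + 1, s[t + 1]]
    return s
```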
III. THE TURBO ASR APPROACH

The strength of the turbo principle as known from digital communications lies in the exchange of reliability (or so-called extrinsic) information, enabling the various (typically two) involved decoders to improve their estimates of the transmitted information instead of simply decoding it. Adopting the turbo principle to the domain of automatic speech recognition, this section outlines the turbo-decoding FBA and the new turbo Viterbi algorithm. Note that although we only present two likelihood streams and decoders, we are not at all restricted concerning the number of input streams in practice, since the turbo approach only specifies the information flow from one decoder to the next, i.e., two decoders are always involved at a time.

Besides feature vector sequence $\mathbf{x}_1^T = \mathbf{o}_1^T$, let there be another observation sequence $\mathbf{u}_1^T$ of the same length² $T$ as $\mathbf{o}_1^T$, but from a different feature space $\mathbf{u}_t \in \mathbb{R}^{d_u}$. Note that the two feature vector sequences $\mathbf{o}_1^T$, $\mathbf{u}_1^T$ may stem from two different modalities (e.g., as in audio-visual ASR), or from two different sensors of the same modality (e.g., as in multi-channel ASR), or even from the same microphone (but different feature extraction techniques are employed), or even we may have $\mathbf{o}_t = \mathbf{u}_t$ (but both feature vectors are later subject to two different acoustic models).

² Note that we have assumed an equal length of observation sequences only for clarity of presentation. This implicitly means that there is an identical frame shift in computing both $\mathbf{o}_t$ and $\mathbf{u}_t$.

Fig. 5. Turbo (FBA) speech recognizer with two streams of input likelihoods $p(\mathbf{o}\,|\,s = i)$ and $p(\mathbf{u}\,|\,r = k)$, time index $t$ omitted, and iteration index $z = 1, 2, \ldots$, starting with the component recognizer (s). Different to Fig. 4, the posterior of component recognizer (r) is not subject to a transformation matrix $\mathbf{T}^{(s)(r)}$, simply because the $\arg\max[\cdot]$ function of component recognizer (r) is drawn to deliver the sequence of optimal states $\mathbf{r}^*$ (and not $\mathbf{s}^*$).

Furthermore, let there be two state-level speech recognizers concatenated in parallel as shown in Figure 5. Each of these component recognizers (CRs) shall process one of the observation sequences $\mathbf{o}_1^T$, $\mathbf{u}_1^T$; accordingly, each CR employs an
HMM as acoustic model $\lambda = \{\boldsymbol{\pi}; \mathbf{A}; \mathcal{B}\}$ matching the respective observations. For distinction, the two individually trained sets of HMMs shall be assigned by $\lambda^{(s)}$ and $\lambda^{(r)}$, where the superscripts (s) and (r) refer to the respective state index spaces $\mathcal{S} = \{1, \ldots, N\}$ and $\mathcal{R} = \{1, \ldots, M\}$ of the CRs. Throughout this work, we will adopt these superscripts for other symbols as well, wherever helpful for clarification. Without loss of generality, CR (s) shall process the observation sequence $\mathbf{o}_1^T$ and CR (r) the observation sequence $\mathbf{u}_1^T$, or the streams of likelihoods $p(\mathbf{o}\,|\,s = i)$, $p(\mathbf{u}\,|\,r = k)$, respectively, as shown in Fig. 5.

A. Turbo Forward-Backward Decoder

Given two state-level recognizers concatenated in parallel (Fig. 5), each CR receives both the respective observation sequence and some state information $\bar{\boldsymbol{\gamma}}_t$ from the other CR. As in turbo decoding, these so-called extrinsic probabilities are related, but not exactly equal to the state posteriors $\gamma_t$. Without loss of generality, we consider an $M$-dimensional vector $\bar{\boldsymbol{\gamma}}_t^{(r)} = [\bar{\gamma}_t^{(r)}(1), \ldots, \bar{\gamma}_t^{(r)}(M)]^{\mathrm{T}}$ of extrinsic probabilities (obtained from the previous processing of CR (r)) to be integrated into the posterior computation of CR (s); the other direction is analogous.

Assuming conditional independence between $\mathbf{o}_t$ and $\bar{\boldsymbol{\gamma}}_t^{(r)}$ given state $s_t = i$, the emission function of each state $i \in \mathcal{S}$ of the HMM $\lambda^{(s)}$ may be modified according to

$b_i^{(s)}(\mathbf{o}_t, \bar{\boldsymbol{\gamma}}_t^{(r)}) = p(\mathbf{o}_t, \bar{\boldsymbol{\gamma}}_t^{(r)} \,|\, s_t = i) = p(\mathbf{o}_t \,|\, s_t = i)\, p(\bar{\boldsymbol{\gamma}}_t^{(r)} \,|\, s_t = i) = b_i^{(s)}(\mathbf{o}_t)\, g_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)}),$   (10)

which leads to a simple scaling of the observation likelihoods³ by the so-called extrinsic likelihoods $g_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)})$. In consequence, and in contrast to the presentation in [40], the baseline FBA (Sec. II-B) does actually not require a modification at all; it is merely supplied with the modified observation likelihoods.

³ Note that (10) can be considered as a special case of a multi-stream HMM [37], where both likelihood stream weights are set to unity.

Following again [40], the extrinsic likelihoods may be written as a marginalization result according to

$g_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)}) = \sum_{k \in \mathcal{R}} p(\bar{\boldsymbol{\gamma}}_t^{(r)} \,|\, r_t = k)\, P(r_t = k \,|\, s_t = i),$   (11)

assuming conditional independence between $\bar{\boldsymbol{\gamma}}_t^{(r)}$ and $s_t$, given state $r_t = k$. With $P(r_t = k \,|\, s_t = i)$ being assumed stationary, Shivappa et al. [40] proposed a heuristic model regarding $p(\bar{\boldsymbol{\gamma}}_t^{(r)} \,|\, r_t = k)$, whose parameters are re-estimated iteratively during recognition. In contrast to that, we provide an analytical solution by means of Bayes' rule (see App. A).

So far, we assumed equal HMM state index spaces within each CR. However, the respective state index spaces $\mathcal{S}$, $\mathcal{R}$ may differ in multimodal ASR systems, e.g., audio-visual speech recognition. We will consider this in the following by merely assuming a stationary known prior co-occurrence probability for all HMM states $i \in \mathcal{S}$ and $k \in \mathcal{R}$, in the form

$T_{i,k}^{(s)(r)} = \frac{P(r_t = k \,|\, s_t = i)}{P(r_t = k)} = \frac{P(r_t = k, s_t = i)}{P(r_t = k)\, P(s_t = i)}.$   (12)

As we will see, $T_{i,k}^{(s)(r)}$ represents a linear transformation of extrinsic information from state index space $\mathcal{R}$ to $\mathcal{S}$. Note that in a topological sense this transformation resembles the so-called (de-)interleaving carried out by the (de-)interleavers of a parallel-concatenated turbo decoder [2] (compare Figs. 4 and 5), although their motivation is essentially different.

With (32) from App. A, we obtain a solution to compute the extrinsic likelihood in (10):

$g_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)}) \propto \sum_{k \in \mathcal{R}} T_{i,k}^{(s)(r)}\, \bar{\gamma}_t^{(r)}(k) = \bar{g}_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)}).$   (13)

In vectorial notation, $\bar{\mathbf{g}}_t^{(s)} = [\bar{g}_1^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)}), \ldots, \bar{g}_N^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)})]^{\mathrm{T}}$ denotes an $N$-dimensional vector being the linearly transformed $M$-dimensional extrinsic probability vector $\bar{\boldsymbol{\gamma}}_t^{(r)} = [\bar{\gamma}_t^{(r)}(1), \ldots, \bar{\gamma}_t^{(r)}(M)]^{\mathrm{T}}$ from state space $\mathcal{R}$, letting

$\bar{\mathbf{g}}_t^{(s)} = \mathbf{T}^{(s)(r)}\, \bar{\boldsymbol{\gamma}}_t^{(r)},$   (14)

where $\mathbf{T}^{(s)(r)} = \{T_{i,k}^{(s)(r)}\}_{i \in \mathcal{S}, k \in \mathcal{R}} = [\mathbf{T}^{(r)(s)}]^{\mathrm{T}}$.

Inserting (14) into (10), we now obtain modified forward and backward variables (3), (4):

$\tilde{\alpha}_t^{(s)}(i) = \underbrace{b_i^{(s)}(\mathbf{o}_t)\, \bar{g}_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)})}_{\text{new}} \sum_{j \in \mathcal{S}} a_{j,i}^{(s)}\, \tilde{\alpha}_{t-1}^{(s)}(j), \quad t = 2, \ldots, T,$   (15)

$\tilde{\beta}_t^{(s)}(i) = \sum_{j \in \mathcal{S}} \underbrace{b_j^{(s)}(\mathbf{o}_{t+1})\, \bar{g}_j^{(s)}(\bar{\boldsymbol{\gamma}}_{t+1}^{(r)})}_{\text{new}}\, a_{i,j}^{(s)}\, \tilde{\beta}_{t+1}^{(s)}(j), \quad t = T-1, \ldots, 1,$   (16)

with appended observation likelihoods, whereby the initializations of the baseline FBA (Sec. II-B) can still be used, with $\bar{g}^{(s)}(\cdot) = 1$. Note that any state-independent proportionality factors have been dismissed. To conclude, the modified posterior probability $\tilde{\gamma}_t^{(s)}(i)$ can be obtained from (15) and (16) in analogy to (2).

In turbo decoding, extrinsic information in the form of a modified a posteriori probability is exchanged between the various (typically two) CRs seeking to converge towards the same decision on the information of interest. However, in order not to overemphasize the observation likelihoods during the decoding iterations, intrinsic information (i.e., the observation likelihoods) being already exploited by one of these CRs should not be used more than once [15], [52]. Following [15], we therefore dissect the modified posterior probability $\tilde{\gamma}_t^{(s)}(i)$ into so-called a priori information, channel or intrinsic information, and extrinsic information, letting

$\tilde{\gamma}_t^{(s)}(i) = \frac{1}{\tilde{C}_t}\, \underbrace{b_i^{(s)}(\mathbf{o}_t)}_{\text{intrinsic}}\, \underbrace{\bar{g}_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)})}_{\text{a priori}}\, \underbrace{\sum_{j \in \mathcal{S}} a_{j,i}^{(s)}\, \tilde{\alpha}_{t-1}^{(s)}(j)\, \tilde{\beta}_t^{(s)}(i)}_{\text{extrinsic}},$   (17)

with $\tilde{C}_t$ ensuring the stochastic constraint.

Only the extrinsic probability $\bar{\gamma}_t^{(s)}(i)$ is passed on between the CRs as new a priori information in each iteration. Thus, the risk of re-using information is minimized. Note that due to the dissection (17) also the assumed conditional independence of $\mathbf{o}_t$ and $\bar{\boldsymbol{\gamma}}_t^{(r)}$ is maintained, ensuring the factorization in (10) to be valid. Finally, with a normalization factor $\bar{C}_t$ for fulfillment of the stochastic constraint, we obtain the extrinsic probability by

$\bar{\gamma}_t^{(s)}(i) = P(s_t = i \,|\, \ldots) = \frac{1}{\bar{C}_t} \sum_{j \in \mathcal{S}} a_{j,i}^{(s)}\, \tilde{\alpha}_{t-1}^{(s)}(j)\, \tilde{\beta}_t^{(s)}(i),$   (18)

differing from [40] in the explicitly neglected intrinsic information, which turns out to be an important point. Note that the ellipsis $\ldots$ in (18) represents the source of extrinsic state information $\{\mathbf{o}_1^T, \mathbf{u}_1^T\}$ with the actual intrinsic and a priori information at time instant $t$ in (17) being taken out.

In summary, we introduce an iterative recognition scheme by modifying the observation likelihoods of the unmodified forward-backward algorithm to allow injection of information from a previous iteration. Moreover, our analytically derived solution required no heuristic modeling of observation likelihood distributions. For purpose of clarity, Fig. 6 introduces a compact formulation of the turbo forward-backward algorithm [50]. Here, the forward and backward variables are denoted in vectorial notations $\tilde{\boldsymbol{\alpha}}_t = [\tilde{\alpha}_t^{(s)}(1), \ldots, \tilde{\alpha}_t^{(s)}(N)]^{\mathrm{T}}$ and $\tilde{\boldsymbol{\beta}}_t = [\tilde{\beta}_t^{(s)}(1), \ldots, \tilde{\beta}_t^{(s)}(N)]^{\mathrm{T}}$. Moreover, also the emission pdfs may be expressed in vectorial notation as $\mathbf{b}_t = [b_1(\mathbf{x}_t), \ldots, b_N(\mathbf{x}_t)]^{\mathrm{T}}$, with $[\cdot]^{\mathrm{T}}$ being the transpose. Subsequently, by using an iterative recognition scheme, the vector of state posteriors $\tilde{\boldsymbol{\gamma}}_t^{(s)} = [\tilde{\gamma}_t^{(s)}(1), \ldots, \tilde{\gamma}_t^{(s)}(N)]^{\mathrm{T}}$ is obtained by means of the Hadamard product of these forward and backward vectors (entry-wise multiplication), which is marked by the $(\odot)$ operator. Note that the intrinsic and a priori information of the current time instant $t$ is explicitly excluded from the fed-back extrinsic information vector $\bar{\boldsymbol{\gamma}}_t$.

Fig. 6. Vector-matrix notation of the turbo forward-backward algorithm for the purpose of clarity; Hadamard product $(\odot)$ denotes the entry-wise multiplication of two vectors resulting in a vector; CR means component recognizer; all symbols without CR identifier refer to CR (s).
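The following sketch puts (14)-(18) together for one pass of one component recognizer. It is an illustrative reading of the equations with self-chosen function and variable names, it leaves out the likelihood weights of Sec. III-C, and the handling of the first frame of the extrinsic output is an assumption made only for this sketch.

```python
import numpy as np

def turbo_fba_pass(B, A, pi, T_sr, gamma_bar_other):
    """One CR pass of the turbo FBA.
    B: (T, N) intrinsic likelihoods b_i(o_t); A: (N, N); pi: (N,);
    T_sr: (N, M) transformation matrix of eq. (12);
    gamma_bar_other: (T, M) extrinsic probabilities from the other CR
    (uniform in the very first iteration).
    Returns modified posteriors (17) and new extrinsic probabilities (18)."""
    T, N = B.shape
    G = gamma_bar_other @ T_sr.T                  # eq. (14) for every frame -> (T, N)
    Bmod = B * G                                  # eq. (10): modified emissions
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * Bmod[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):                         # eq. (15)
        alpha[t] = Bmod[t] * (alpha[t - 1] @ A)
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):                # eq. (16)
        beta[t] = A @ (Bmod[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta                           # eq. (17) up to the constant
    post /= post.sum(axis=1, keepdims=True)
    ext = np.empty_like(post)                     # eq. (18): b_i * g_i taken out
    ext[0] = pi * beta[0]                         # assumed initialization for t = 1
    for t in range(1, T):
        ext[t] = (alpha[t - 1] @ A) * beta[t]
    ext /= ext.sum(axis=1, keepdims=True)
    return post, ext
```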
B. Turbo Viterbi Decoder

In analogy to the turbo FBA speech recognition scheme shown in Fig. 5, we again assume two state-level component recognizers (CRs) concatenated in parallel. In addition to the respective observation sequence, each CR shall receive so-called extrinsic probabilities $\bar{\boldsymbol{\gamma}}_t$ from the other recognizer. Inspired by the well-known soft-output Viterbi algorithm (SOVA) [11], these extrinsic probabilities represent a reliability value to be used in the subsequent decoding of the most likely sequence. Without loss of generality, we consider the integration of an $M$-dimensional vector $\bar{\boldsymbol{\gamma}}_t^{(r)} = [\bar{\gamma}_t^{(r)}(1), \ldots, \bar{\gamma}_t^{(r)}(M)]^{\mathrm{T}}$ of extrinsic probabilities from the previous processing of CR (r) into the Viterbi decoding of CR (s); the other direction is analogous.

Assuming conditional independence between $\mathbf{o}_t$ and $\bar{\boldsymbol{\gamma}}_t^{(r)}$ given an HMM state $s_t = i$, the emission function of each HMM state $i \in \mathcal{S}$ of HMM $\lambda^{(s)}$ may be modified in the same
way as with the turbo FBA (Sec. III-A). By doing so, we now derive a Viterbi recursion (6) with a modified score:

$\tilde{\delta}_t^{(s)}(i) = \max_{j \in \mathcal{S}}\left[\tilde{\delta}_{t-1}^{(s)}(j)\, a_{j,i}^{(s)}\right] \underbrace{b_i^{(s)}(\mathbf{o}_t)\, \bar{g}_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)})}_{\text{new}}, \quad t = 2, \ldots, T,$   (19)

with appended observation likelihoods, and the backtracking pointers $\tilde{\psi}_t^{(s)}(i)$ in (7) modified to employ $\tilde{\delta}_{t-1}^{(s)}(j)$. Moreover, the initializations of the baseline Viterbi algorithm can still be employed (Sec. II-C). The extrinsic likelihood $\bar{g}_i^{(s)}(\bar{\boldsymbol{\gamma}}_t^{(r)})$ again follows (14). To conclude, the most likely state sequence $(\mathbf{s}^*)_1^T$ can be obtained from (14), (19), and (7), (8), and (9) being used in analogy.

In order not to overemphasize the observation likelihoods during the decoding iterations, we again strictly follow the turbo principle and dissect the modified scores $\tilde{\delta}_t^{(s)}(i)$ into a priori information, intrinsic information, and extrinsic information; only the latter is passed on between the CRs as new a priori information in each iteration. Assuming that all survivor paths at time instant $t$ contain essential information, we obtain the extrinsic probabilities by

$\bar{\gamma}_{t'}^{(s)}(i) = \frac{\sum_{\ell \in H_{i,t'}} \tilde{\delta}_t^{(s)}(\ell)\, \big/\, \big(b_i^{(s)}(\mathbf{o}_{t'})\, \bar{g}_i^{(s)}(\bar{\boldsymbol{\gamma}}_{t'}^{(r)})\big)}{\sum_{j \in \mathcal{S}} \sum_{\ell \in H_{j,t'}} \tilde{\delta}_t^{(s)}(\ell)\, \big/\, \big(b_j^{(s)}(\mathbf{o}_{t'})\, \bar{g}_j^{(s)}(\bar{\boldsymbol{\gamma}}_{t'}^{(r)})\big)}.$   (20)

Postponing the definition of the state sets $H_{i,t'}$, $H_{j,t'}$ for the moment, let's first explain the involved time instances $t'$ and $t$. As can be seen in Fig. 7, there is a most recent time $t$, where Viterbi scores $\tilde{\delta}_t^{(s)}(i)$ shall be available. Among other contributions, they depend on the intrinsic and a priori information $b_i^{(s)}(\mathbf{o}_{t'})$, $\bar{g}_i^{(s)}(\bar{\boldsymbol{\gamma}}_{t'}^{(r)})$, respectively, which are a multiplicative contribution at an earlier time $t'$ in the final score $\tilde{\delta}_t^{(s)}(i)$ (cf. (19)). Note that $\Delta t = t - t'$ specifies a necessary computational delay for the purpose of providing a Viterbi confidence output: Eq. (20) provides such a confidence output $\bar{\gamma}_{t'}^{(s)}(i)$ for time instant $t' = t - \Delta t$. Without loss of generality, just for ease of presentation, we assume $\Delta t$ to be a constant here. The choice of any such delay influences the fidelity of the extrinsic information. For the purpose of clarity, Fig. 7 illustrates an example state trellis diagram for $\Delta t = 2$ and $N = 4$.

Fig. 7. Example state trellis diagram for $t - t' = \Delta t = 2$ and $N = 4$ emitting states. Survivor paths are drawn with solid lines.

Next, the multiplicative contribution $b_i\, \bar{g}_i$ at time $t'$ has to be taken out of the final score $\tilde{\delta}_t$ at time $t$ (same principle as in (18) compared to (17)). Now it is useful to understand that (20) is effectively a ratio of some such modified scores being added up, divided by all such modified scores. Such a ratio is usually known in the literature as homogeneity score [53].

In order to specify which scores are among the "some", we introduce a set $H_{i,t'} \subseteq \mathcal{S}$ of final states at time $t = t' + \Delta t$ of all survivor paths through state $i$ at time $t'$ as

$H_{i,t'} = \begin{cases} G_{i,t'+1}, & \text{if } t' = t-1, \\ \bigcup_{\ell \in G_{i,t'+1}} H_{\ell,t'+1}, & \text{if } t' < t-1. \end{cases}$   (21)

Here, we used $G_{i,t'+1} = \{\ell \,|\, \tilde{\psi}_{t'+1}(\ell) = i,\ \ell \in \mathcal{S}\}$ being the set of all states $\ell$ at time $t'+1$ which are connected to state $i$ at time $t'$ by survivor paths. The union $\bigcup_{\ell \in G_{i,t'+1}}$ joins all states at time $t$ which are connected to state $i$ at time instant $t' < t-1$ by survivor paths.

To explain this, in the example of Fig. 7, the set of those states at time $t$ ("some" states) which are linked to state $i = 2$ at time $t'$ is⁴ $H_{2,t'} = \{1, 2, 3\}$. On the other hand, the double sum in the denominator of (20) simply joins the state sets at time $t$, resulting in $H_{2,t'} \cup H_{3,t'} = \{1, 2, 3, 4\} = \mathcal{S}$ (all states).

⁴ At time $t'-1$ all survivor paths pass through state $i = 1$, resulting in $H_{1,t'-1} = \{1, 2, 3, 4\} = \mathcal{S}$.

For purpose of clarity and to summarize, Fig. 8 introduces a compact formulation of the turbo Viterbi algorithm.

Fig. 8. Vector-matrix notation of the turbo Viterbi algorithm for the purpose of clarity; Hadamard product $(\odot)$ denotes the entry-wise multiplication of two vectors resulting in a vector, $\|\cdot\|_1$ being the L1 norm; CR means component recognizer; all symbols without CR identifier refer to CR (s).

Here, the state scores $\tilde{\boldsymbol{\delta}}_t$ and backtracking pointers $\tilde{\boldsymbol{\psi}}_t$ are denoted in vectorial notations $\tilde{\boldsymbol{\delta}}_t = [\tilde{\delta}_t^{(s)}(1), \ldots, \tilde{\delta}_t^{(s)}(N)]^{\mathrm{T}}$ and $\tilde{\boldsymbol{\psi}}_t = [\tilde{\psi}_t^{(s)}(1), \ldots, \tilde{\psi}_t^{(s)}(N)]^{\mathrm{T}}$. A column vector $\mathbf{a}_i$ of matrix $\mathbf{A}^{(s)}$ is given by $\mathbf{A}^{(s)} = (\mathbf{a}_1, \ldots, \mathbf{a}_i, \ldots, \mathbf{a}_N)$. Note that $\{\max[\cdot]\}_{i \in \mathcal{S}}$ denotes a column vector with index $i$, with $\max[\cdot]$ delivering the maximum of the elements of its vectorial argument. The $(\odot)$ operator again marks the Hadamard product, an entry-wise product of vectors.
entry-wise product of vectors.
the observation likelihood streams [33]. We compared our turbo
ASR approaches to the widely known coupled HMM (CHMM)
approach [25], [34], which serves as classifier level fusion
C. Employing Weights
reference. As commonly practiced in CHMMs, the coupled
While turbo decoders in practice may regularly converge stationary state transition probability
to a common estimate of the information of interest, neither
convergence to the correct solution nor even convergence to a A(s)(r) = {aj,i a,k }j,iS, ,kR , (22)
stable solution can be guaranteed, particularly at a low signal-
as well as the coupled emission
to-noise ratio (SNR) [54]. However, inspired by the weighting
schemes commonly applied in multi-stream HMMs [35], the (s)(r) o u
bi,k (ot , ut ) = p(ot | st = i) p(ut | rt = k)
inclusion of the a priori information in (15), (16), or (19) and
also the intrinsic likelihood may be controlled beneficially. As (i, k) S R, (23)
shown in Fig. 5, we therefore introduce four individual expo-
nents o , u , s , r on the intrinsic and extrinsic likelihoods, can be gathered from the two marginal HMMs (s) and (r) ,
respectively, controlling the dominance of either observation respectively. The two parameters o and u in (23) are opti-
or extrinsic information for each CR over iteration instant z. mized dependent on the SNR as shown later.
Those four exponents can also be called weights on the respec- c) Iterative Reference (-S): Considering an iterative decod-
tive logarithmic entities, or in brief: likelihood weights. For ing reference, we implemented the audio-visual ASR approach
further details considering the employed weights, please refer presented by Shivappa et al. [39][41]. In compliance with their
to Appendix B. proposed procedure, we employed a heuristic model for the
(r)
likelihood p(t | rt = k) in [39, eq. (4)], letting
(r) (r)
 (r)
IV. E VALUATED ASR R EFERENCE A PPROACHES t | rt = k) = f (1
p( k,t ; ) f (
,t ; ). (24)
=k
We compared our proposed turbo ASR approaches to one
representative of each of the three levels of information However, in order to improve fairness of comparison to
fusion respectively: feature level, classifier level, and decision our turbo ASR approaches partly employing a weighting
level fusion [33]. In both an audio-visual and an audio-only scheme (III-C), we further optimized the exponential dis-
speech recognition task the following feature representations tribution f (; ) used in (24) by introducing an additional
were examined: For visual speech representation, we extracted SNR-dependent scaling factor SNR > 0, with
shape-based features of order 11 for each speaker at the SNR
visual frontend (cf. [50]) respectively. As acoustic features 1
e (z) , 0,
we employed 13 MFCC coefficients according to the ETSI f (; ) = (z) (25)
Advanced Front-End (AFE) Recommendation [55], plus 1st- 0, < 0.
and 2nd-order derivatives and an additional log-energy param- According to [39], we computed and updated the rate param-
eter. For a second acoustic decoder, Gabor features of order 311 1
eter (z) as the estimated variance of the likelihood values
were extracted [56]. Gabor features are particularly interest- (r)
ing in information fusion, since spectro-temporal features are p(t | rt = k) during recognition at each iteration. The same
(s)
reported to contain complementary information to MFCCs [56]. was done for p( t | st = i). The parameter SNR is optimized
Moreover, they are very strong in high SNR and relatively weak dependent on the SNR as shown later.
in low SNR, which intentionally poses a challenge on informa- d) Weighted ROVER: In decision level fusion systems, the
tion fusion systems in such conditions. In general, all feature final classifier outputs obtained from separate classifiers are
representations were obtained by applying a 25 ms window combined to achieve a joint recognition hypothesis. Hence,
with 10 ms shift. the individual recognition processes are completely indepen-
a) Feature Concatenation (CONCAT): In feature level dent of each other. Considering the combination of N = 2
fusion, the incoming feature vectors are believed to be directly word sequences of length R (r) , we employed a
(s) and R
related and synchronous [25], [33]. On this basis, the classifier weighted version of the well-known recognition output voting
learns and recognizes the statistics of a joint feature represen- error reduction (ROVER) approach as decision level fusion ref-
tation, which is usually achieved by an ordinary concatenation erence [31], [57]. After aligning the two individual classifier
of the incoming feature vectors. Employing common feature outputs at word-level by dynamic programming (resulting in a
word sequence of length $\bar{R}$), for each word instance $w$ in the aligned word sequence a simple voting scheme is conducted, given by

$\hat{w}_\rho = \arg\max_{w} \left[ \eta\, \frac{N(w, \rho)}{N} + (1 - \eta)\, CS^{(\mathrm{CR})}(w, \rho) \right], \quad \rho = 1, 2, \ldots, \bar{R}.$   (26)

Here, $\hat{w}_\rho$ denotes the joint word-level recognition hypothesis, while $N(w, \rho) \in \{1, \ldots, N\}$ is the number of models recognizing a word $w$ at position $\rho$. Note that (26) implies a linear interpolation between a simple majority voting scheme (i.e., $\eta = 1$) and an unweighted confidence score combination scheme (i.e., $\eta = 0$). Note also that the confidence score $CS^{(\mathrm{CR})}(w, \rho)$ directly corresponds to the respective word posterior $P(w \,|\, \mathbf{o}, \lambda^{(\mathrm{CR})})$. Since the word frequency-based ROVER (i.e., $\eta = 1$) is not capable of exploiting the inherent diversity of an ensemble of CRs [58], [59], we computed in our experiments the maximum confidence per word (i.e., $\eta = 0$) as the scoring metric. To do so, we computed the confidence score by multiplying the state posterior probabilities of all states belonging to the currently recognized word $w$ as

$CS^{(\mathrm{CR})}(w) = \prod_{\tau = t}^{t + N_s(w) - 1} \gamma_\tau^{(\mathrm{CR})}\big(s_\tau^{(w)}\big),$   (27)

where $N_s(w)$ is the number of states $s^{(w)}$ a word $w$ starting at frame $t$ is composed of. The state posterior probabilities of both CRs were gathered as follows: For the FBA decoder we carried out a baseline FBA (cf. Sec. II-B), whereas for the Viterbi decoder we employed a baseline Viterbi algorithm (cf. Sec. II-C), letting

$\gamma_{t'}^{(s)}(i) = \frac{\sum_{\ell \in H_{i,t'}} \delta_t^{(s)}(\ell)}{\sum_{\ell \in \mathcal{S}} \delta_t^{(s)}(\ell)},$   (28)

and using the state set $H_{i,t'}$ as being defined in (21), and $t = t' + \Delta t$ with $\Delta t = 100$ frames. The confidence weights $\eta^{(\mathrm{CR})}$ in (27) are optimized dependent on the SNR as shown later.
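A small sketch of the weighted voting rule (26) for one alignment slot, together with the word confidence (27); the data layout of the aligned word net is an assumption made only for this illustration.

```python
def word_confidence(gamma, word_states, t_start):
    """Eq. (27): product of state posteriors gamma (T, N) over the states
    word_states of a word hypothesis starting at frame t_start."""
    conf = 1.0
    for offset, state in enumerate(word_states):
        conf *= gamma[t_start + offset, state]
    return conf

def rover_vote(candidates, eta, n_models=2):
    """Eq. (26) for one slot rho. candidates: dict word -> (count, confidence)."""
    return max(candidates,
               key=lambda w: eta * candidates[w][0] / n_models
                             + (1.0 - eta) * candidates[w][1])

# usage: rover_vote({'blue': (2, 0.7), 'bin': (1, 0.9)}, eta=0.0) -> 'bin'
```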
V. SIMULATION RESULTS

A. Experimental Setup

We apply the presented (turbo) ASR approaches to a speaker-dependent audio-visual ASR task as well as to a speaker-independent audio-only ASR task. All experiments are based on the GRID audio-visual speech corpus [60] (downsampled to 8 kHz) containing audio and video recordings of 34 native English speakers (18 male and 16 female), uttering 1000 sentences of a fixed syntax, respectively. Each of these uttered sentences is exactly 3 sec long and consists of a six-word sequence such as "bin blue at F 9 now", following the form <command: 4> <color: 4> <preposition: 4> <letter: 25> <digit: 10> <adverb: 4>. Here, the integers after each sentence component indicate the number of different choices, leading to a vocabulary of in total 51 words.

To mirror adverse acoustic conditions during recognition, the audio recordings were interfered with white Gaussian noise and three real noise types (i.e., car, train station, and babble noise) from the AURORA-2 database [61] at fixed SNRs from 0 dB to 30 dB active speech level (5 dB steps, according to ITU-T P.56 [62]).

For all experiments reported here, out of the 34 speakers we selected 20 (10 male and 10 female) evaluation speakers, whereas 2 (1 male and 1 female) additional speakers were employed for parameter training. Moreover, for each selected evaluation speaker, the speech data was randomly divided into 800 training and 200 test files. Concerning the parameter training subset, 200 randomly chosen speech files were selected for each of the two speakers. For the (speaker-dependent) audio-visual task, we trained speaker-dependent HMMs on the respective clean speech training files of each evaluation speaker separately for each CR (video or undisturbed audio). During test, these speaker-dependent HMMs were evaluated on the 200 test files of the respective evaluation speaker. To maintain a comparable evaluation setup for the (speaker-independent) audio-only task, i.e., 200 test files of in total 20 evaluation speakers and 800 training files per speaker (as with the audio-visual task), respectively, we pooled 800 randomly chosen clean speech files per speaker from the so far unseen 12 GRID speakers, respectively, and trained separately two speaker-independent HMMs for each CR (MFCC or Gabor features of undisturbed audio) based on those 9600 files. To set up the (speaker-independent) audio-only evaluation data, we pooled the respective 200 test files from each of the selected 20 evaluation speakers, leading to in total 4000 evaluation speech files. After training, each HMM set comprised 51 word HMMs according to the GRID vocabulary. We utilized a linear HMM topology employing a rule of four emitting states per phoneme. The state emission pdfs were modeled with Gaussian mixture models of order 5 and diagonal covariance matrices.

As a performance measure, we applied the word recognition accuracy in percent, given by $\mathrm{ACC} = \frac{N - D - I - S}{N}$, where $N$, $D$, $S$, $I$ denote the number of reference labels, deletions, substitutions, and insertions, respectively. For this measure to be applicable, we converted the recognized state sequences to word sequences using the surjective relation between an HMM state and the respective word identity of its containing word HMM: first, each state in the recognized sequence was allocated to the respective word identity and subsequently strings of consecutive identical words were merged.
dependent audio-visual ASR task as well as to a speaker- ASR fusion approaches take advantage of a few control vari-
independent audio-only ASR task. All experiments are based ables (i.e., CHMM: o , u , iterative reference: SNR , weighted
on the GRID audio-visual speech corpus [60] (downsam- ROVER: (s) , turbo ASR: o , u , s , r , s , r ). All these
pled to 8 kHz) containing audio and video recordings of control variables were optimized on the pooled test files of
34 native English speakers (18 male and 16 female) utter- the (only) two parameter training speakers being interfered with
ing 1000 sentences of a fixed syntax respectively. Each of white Gaussian noise at the same SNR as the evaluation test
these uttered sentences is exactly 3 sec long and consists of a data; during recognition, the obtained parameters were adopted
six word sequence such as bin blue at F 9 now, following for the test data of the 20 evaluation speakers, also for the
the form < command : 4 >< color : 4 >< preposition : 4 > AURORA-2 noises.
< letter : 25 >< digit : 10 >< adverb : 4 >. Here, the inte- To rate the current weights during the weight optimiza-
gers after each sentence component indicate the number of tion procedure for our turbo ASR approach for each of the
different choices, leading to a vocabulary of in total 51 words. three tasks (audio-visual FBA, audio-only FBA, and Viterbi),
we applied a pattern search algorithm [63] maximizing an accuracy-based figure of merit of both CRs, given by

$\mathrm{ACC}_{\mathrm{FoM}} = \kappa \left( \mathrm{ACC}_{z=8}^{(s)} + \mathrm{ACC}_{z=8}^{(r)} \right) - (1 - \kappa) \left| \mathrm{ACC}_{z=8}^{(s)} - \mathrm{ACC}_{z=8}^{(r)} \right|,$   (29)

where $\mathrm{ACC}_{z=8}^{(\mathrm{CR})}$ denotes the obtained accuracy of either CR (s) or CR (r) after the (arbitrarily chosen) 8th iteration. As can be seen, both accuracies shall be high, while the accuracy difference shall be low. While $\kappa$ was set to 0.45 for all optimizations, the actual optimization procedure was carried out in two steps: first, the four task-dependent extrinsic likelihood weight parameters (cf. App. B) were optimized on a multi-condition parameter training subset of only one of the parameter training speakers⁵ disturbed with white noise (in total 600 files, pooled from the respective speech files of the parameter training subset being interfered at an SNR of 0 dB, 15 dB, 30 dB), governing the task-dependent a priori information influence of each CR. Subsequently, these extrinsic likelihood weights were kept constant, and only the intrinsic likelihood weights $\lambda_o$, $\lambda_u$ (cf. App. B) were adapted on the 400 speech files of the parameter training subset according to the SNR, adjusting the influence of the intrinsic information in a fully SNR-dependent manner. However, since the video data is not affected by acoustic noise, in the audio-visual task also the video emission weight $\lambda_u$ was kept constant for all SNRs.

⁵ Note that preliminary investigations showed that using limited data of only one parameter training speaker is fully sufficient for the optimization of these four turbo ASR parameters.

For the CHMM reference method (cf. Sec. IV-b), we employed the very same pattern search algorithm [63] on the abovementioned 400 files of the parameter training subset, optimizing the two control variables $\lambda_o$, $\lambda_u$ in (23) by maximizing the accuracy. As CHMMs imply an elementary single-step classifier level fusion, these two SNR-dependent weights were found after simply applying a baseline FBA (cf. Sec. II-B) or Viterbi algorithm (cf. Sec. II-C). For the audio-visual iterative reference approach (cf. Sec. IV-c), the iterative rate parameter $\eta_{\mathrm{SNR}}$ (25) was optimized separately on the very same 400 speech files of the parameter training subset by employing the same pattern search algorithm [63] to maximize (29). Note that $\kappa = 0.45$ was also chosen as optimal value here. For the weighted ROVER (cf. Sec. IV-d, [57]), we introduced a confidence weight $\eta^{(r)} = 1$ for CR (r) operating on video or Gabor features, and, for fairness in later comparisons, an SNR-individually optimized $\eta^{(s)}$ for CR (s) operating on MFCCs (27), being also optimized by means of a simple grid search on the 400 parameter training files maximizing accuracy.

Back to the turbo ASR approaches, we inferred stationarity of $T_{i,k}^{(s)(r)}$, and estimated it by means of a baseline FBA (cf. Sec. II-B), computing the state posteriors $\gamma^{(s)}(i)$, $\gamma^{(r)}(k)$ on the respective training data instances and subsequent estimation of a joint probability distribution

$P(r_t = k, s_t = i) = \frac{1}{C} \sum_{\tau} \gamma_\tau^{(s)}(i)\, \gamma_\tau^{(r)}(k), \quad (i, k) \in \mathcal{S} \times \mathcal{R},$   (30)

whereas the probabilities $P(r_t = k)$ and $P(s_t = i)$ in (12) are obtained by marginalization of (30).
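A sketch of this estimation under the stationarity assumption, with the per-frame posteriors from the two baseline FBAs stacked over all training frames; the normalization to a proper joint probability and the subsequent construction of (12) are written out explicitly.

```python
import numpy as np

def estimate_T(gamma_s, gamma_r, eps=1e-12):
    """gamma_s: (T_train, N), gamma_r: (T_train, M) baseline FBA posteriors
    accumulated over the training data. Returns T^{(s)(r)} of eq. (12)."""
    joint = gamma_s.T @ gamma_r                   # eq. (30) before normalization
    joint /= joint.sum()                          # 1/C: make it a joint pmf
    p_s = joint.sum(axis=1, keepdims=True)        # P(s = i) by marginalization
    p_r = joint.sum(axis=0, keepdims=True)        # P(r = k) by marginalization
    return joint / (p_s * p_r + eps)              # eq. (12)
```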
For each SNR, we carried out $z_{\max} = 8$ turbo iterations and computed the output posteriors, or scores, respectively, of each CR. Except where otherwise stated, in all conducted turbo Viterbi experiments presented hereafter $\Delta t$ was set to 100 frames.

B. Results and Discussion

Tables I-XIII and Figs. 9-11 illustrate the results of our recognition experiments in white Gaussian noise, as well as train station, car, and babble noise taken from the AURORA-2 database [61]. In the figures, the dotted lines with triangular markers show the single-channel baselines (suffix -B) for MFCC and video (or Gabor, respectively), using either an FBA or Viterbi baseline algorithm (cf. Sec. II-B and II-C). Further dotted lines depict the feature concatenation reference (CONCAT; cf. Sec. IV-a), whereas the dashed lines plot the CHMM reference (cf. Sec. IV-b). Moreover, the ROVER reference (cf. Sec. IV-d) is also indicated by dashed lines. The remaining curves indicate the recognition results of Shivappa's iterative reference (dashed, with suffix -S, cf. Sec. IV-c, [39]) and the herein presented turbo recognition approaches (solid, with suffix -T; cf. Secs. III-A, III-B): one curve was obtained by starting with the MFCC CR in the first iteration and then examining the output of both CRs in an alternating fashion. Analogously, the other curve was generated by starting with the video (or Gabor, respectively) CR.

a) Audio-Visual Task (FBA): Applying a baseline FBA (cf. Sec. II-B) to an audio-visual speech recognition task in white Gaussian noise, the following single-modality accuracies were achieved (Tab. I): 53.5% on the video-only test corpus, while the audio-only recognition results vary from 33.4% at 0 dB SNR to 94.1% in undisturbed (i.e., clean) conditions. The MFCC baseline (MFCC-B) is the best among all reference schemes for SNR > 15 dB. In comparison, the audio-visual CHMM approach yields accuracies between 54.0% and 94.1%, providing the best audio-visual reference algorithm on average over all SNR conditions. The audio-visual feature concatenation approach (CONCAT) yields accuracies of 48.7% up to 89.1%, being the best reference method at about 5 dB SNR. Nevertheless, the susceptibility of feature level fusion (as in CONCAT) to strongly differing performance of the modalities due to the joint feature representation becomes visible at 0 dB SNR and at high SNRs. The audio-visual joint recognition hypothesis of decision level fusion by ROVER in this task yields, except for one condition (5 dB), a performance in between the MFCC and the video baselines, mostly closer to the better one. This might be due to some actually incorrect words with high confidence, caused by the approximation of word confidence scores by multiplication of state posteriors (27), and the setting of the ROVER alignment module (Sec. IV-d) to include all occurring output words into the aligned word net, whereby all arising insertions are also considered for voting (26).

TABLE I: AUDIO-VISUAL RECOGNITION RESULTS IN WORD ACCURACY (% ACC) VS. SNR (dB) IN WHITE GAUSSIAN NOISE. ALL APPROACHES ARE BASED ON FBA RECOGNITION.

TABLE II: AUDIO-VISUAL RECOGNITION RESULTS IN WORD ACCURACY (% ACC) VS. SNR (dB) IN TRAIN STATION NOISE. ALL APPROACHES ARE BASED ON FBA RECOGNITION.

TABLE III: AUDIO-VISUAL RECOGNITION RESULTS IN WORD ACCURACY (% ACC) VS. SNR (dB) IN CAR NOISE. ALL APPROACHES ARE BASED ON FBA RECOGNITION.

TABLE IV: AUDIO-VISUAL RECOGNITION RESULTS IN WORD ACCURACY (% ACC) VS. SNR (dB) IN BABBLE NOISE. ALL APPROACHES ARE BASED ON FBA RECOGNITION.

selectively the most probable state), but does not perform as


good as the strong CHMM reference at high SNRs (particularly
in undisturbed conditions), which might be due to a high depen-
dency on the chosen features6 and the intrinsic information,
which is still contained in the fed back extrinsic probabilities
(cf. Sec. III-A). At 0 dB and 15 dB SNR, however, Shivappas
approach starting with MFCC decoding (MFCC-S) provides
the best reference algorithm.
Taking a look at the audio-visual turbo FBA simulation results, we observe the following: Starting with the MFCC features and ending in the 8th iteration with the video features (MFCC-T) performs best at a very low SNR (0 dB). Starting with the video features and ending with the MFCC features (Video-T), however, works best at low to high SNRs ≥ 5 dB. At all SNRs, both turbo ASR approaches exceed the performance of any of the reference approaches.
Fig. 9. Audio-visual recognition results in word accuracy (% ACC) vs. the number of iterations z at an SNR of 5 dB in white Gaussian noise. All approaches are based on FBA recognition.

Fig. 9 depicts the recognition results in word accuracy (% ACC) vs. iteration z at 5 dB SNR in white Gaussian noise. Except the iterative reference (acronym -S, Sec. IV-c), all reference approaches operate in a non-iterative fashion; therefore they deliver flat curves in Fig. 9 with performance values as given in Tab. I. The two turbo FBA schemes, on the other hand, show a significant improvement of the word accuracy from 56.8% (53.5%) to about 72.0% in the 8th iteration. Interestingly, at the second use of the video emission probabilities (z = 3 or z = 4, respectively), both turbo FBA curves drop severely; however, they later recover completely
due to increasing emphasis on the extrinsic information (cf. App. B). According to our experience, this behaviour is due to the instantaneous intrinsic likelihood weight switch from unity to fixed attenuated values and does not constitute a robustness problem: If the number of iterations is not too small (z_max ≥ 8), the optimization of the extrinsic likelihood weights given a preset z_max takes care of final recovery in test conditions and even leads to improved performance (cf. figure of merit (29) aiming at an optimal performance at z_max = 8). In contrast, please note that the iterative references (suffix -S) reveal hardly any improvement after the second iteration. Regarding the best performing reference system in this condition, CONCAT yields 67.1% accuracy.

TABLE V. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in white Gaussian noise, operating on MFCC and Gabor features. All approaches are based on FBA recognition.
TABLE VI. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in train station noise, operating on MFCC and Gabor features. All approaches are based on FBA recognition.
TABLE VII. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in car noise, operating on MFCC and Gabor features. All approaches are based on FBA recognition.
TABLE VIII. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in babble noise, operating on MFCC and Gabor features. All approaches are based on FBA recognition.
Tabs. II–IV display the audio-visual recognition results in more realistic noise environments (AURORA-2 noises). The main findings observed in the white Gaussian noise experiments of Tab. I seem to be generally valid also with these noise types, except that in train station noise MFCC-B is the best reference approach on average over all SNRs, while in the other 3 noise types CHMM is the best reference. The overall strength of CHMM can be explained by its capability to deal with asynchronous input sequences for the two modalities. Again both turbo ASR approaches exceed all reference approaches in all noises and SNR conditions. Averaged over all four noise types and SNRs, Video-T is ahead of the overall best reference approach CHMM by an absolute 3.9%, which corresponds to a relative WER reduction of 22.4%.
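For clarity, the mapping from an absolute accuracy gain to the quoted relative WER reduction can be checked in a few lines; the accuracy values below are merely illustrative round numbers chosen to be consistent with the quoted 3.9% absolute and 22.4% relative figures, not actual table entries:

    # Relative WER reduction from word accuracies (ACC = 100% - WER).
    def relative_wer_reduction(acc_ref, acc_turbo):
        wer_ref, wer_turbo = 100.0 - acc_ref, 100.0 - acc_turbo
        return 100.0 * (wer_ref - wer_turbo) / wer_ref

    # Example: +3.9% absolute accuracy on top of a reference at 82.6% ACC:
    # (17.4 - 13.5) / 17.4 = 22.4% relative WER reduction.
    print(relative_wer_reduction(82.6, 86.5))   # -> approx. 22.4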
Fig. 10. Audio-only recognition results in word accuracy (% ACC) vs. the number of iterations z at an SNR of 20 dB in white Gaussian noise. All approaches are based on FBA recognition.

b) Audio-Only Task (FBA): Applying a baseline FBA (cf. Sec. II-B) to an audio-only speech recognition task in white Gaussian noise, the following single-model accuracies were achieved (Tab. V): The MFCC baseline results vary from
34.1% at 0 dB SNR to 89.6% in clean conditions⁷, while the noise-sensitive Gabor baseline yields a poor 1.6% at 0 dB SNR up to a strong 95.9% in clean conditions. The MFCC baseline obviously provides the best reference performance at low SNR. Regarding information fusion strategies, the CONCAT approach provides accuracies of 1.5% to 95.7%. Reflecting the bad recognition performance of the Gabor features in low SNR, the feature concatenation apparently benefits from the information fusion in high SNR (≥ 15 dB): Where MFCC-B and Gabor-B yield an almost equal recognition performance, CONCAT serves as the best reference in noisy conditions at SNR ≥ 15 dB. Looking at decision level fusion, the weighted ROVER yields a performance in between the MFCC and the Gabor baselines, close to the better one. This strong recognition performance is particularly underlined by the fact that ROVER provides the best audio-only reference algorithm on average over all SNR conditions. Among the turbo FBA schemes, MFCC-T exceeds all reference approaches in all SNR conditions, while Gabor-T is even better in very low SNR.

⁷ Please note the somewhat lower MFCC baseline quality in this task (compared to the audio-visual task). This is due to the speaker-independent HMMs in the audio-only tasks.

TABLE IX. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in white Gaussian noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.
TABLE X. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in train station noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.
TABLE XI. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in car noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.
TABLE XII. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) in babble noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.

Fig. 11. Audio-only recognition results in word accuracy (% ACC) vs. the number of iterations z at an SNR of 20 dB in white Gaussian noise. All approaches are based on Viterbi recognition.

Looking at Fig. 10 (SNR = 20 dB), the MFCC baseline (84.2%) and the Gabor baseline (85.1%) yield a similar recognition accuracy. Regarding information fusion strategies, the feature concatenation approach slightly improves recognition performance (85.7%), while ROVER provides a recognition result of only 84.1%. As in audio-visual ASR, again the turbo FBA schemes show a significant improvement of the word accuracy, providing recognition results of up to 88.9% in the 8th iteration. Note that mainly the inherent iterative information transfer of the turbo approach has a large share of the total performance gain, since introducing an individually optimized weighting of the MFCC or Gabor baseline FBA observation
likelihoods only yields accuracies at best of 85.1% (Gabor) in this 20 dB SNR condition.

TABLE XIII. Audio-only recognition results in word accuracy (% ACC) vs. SNR (dB) for different confidence computational delays Δt. Turbo Viterbi approaches operating on MFCC and Gabor features in white Gaussian noise.

Considering again the AURORA-2 noise environments (Tabs. VI–VIII), the recognition results reflect the previous findings with white Gaussian noise, except that in babble noise, averaged over all SNRs, CONCAT and CHMM perform slightly better than ROVER. Again, MFCC-T turbo ASR outperforms all reference approaches in all noises and SNR conditions. Moreover, both turbo ASR approaches exceed the performance of any of the reference approaches on average over all SNR conditions in any of the investigated noise types; averaged over all four noise types and SNRs, MFCC-T is ahead of the overall best reference approach ROVER by an absolute 3.8%, which is a relative reduction in WER of 18.2%.

c) Audio-Only Task (Viterbi): Applying the baseline Viterbi algorithm (cf. Sec. II-C) to an audio-only speech recognition task in white Gaussian noise, the following single-model accuracies were achieved (Tab. IX): The MFCC baseline results vary from 38.7% at 0 dB SNR to 90.6% in clean conditions, while the Gabor baseline yields only 1.6% at 0 dB SNR but up to 95.9% in clean conditions. As with the FBA audio-only results, the MFCC baseline again is the best reference approach for low SNR. Again, the feature concatenation approach very much follows the noise-sensitive Gabor baseline recognition results, serving as a sound reference at high SNRs (> 15 dB). Still, again the weighted ROVER provides a strong recognition performance in between the MFCC and the Gabor baselines, close to the better one. We observe that ROVER offers the best audio-only reference on average over all SNR conditions.

As with the FBA-based turbo schemes, both audio-only Viterbi turbo approaches perform better on average over all SNR conditions than any of the reference approaches, with MFCC-T being the best among these. Only in the clean condition do the Gabor baseline (Gabor-B) as well as the weighted ROVER reference yield slightly better word accuracies. In this condition, the somewhat weaker performance of the turbo Viterbi schemes might be owed to the use of not globally optimal weights on the extrinsic likelihoods, impeding the feedback between the two individual CRs.⁸

⁸ Note that the undisturbed (i.e., clean) condition is not part of the multi-condition setup for the extrinsic likelihood weight optimization (cf. Sec. V-A).

Looking at Fig. 11 (SNR = 20 dB), given almost equal recognition accuracies of the MFCC (85.4%) and the Gabor baseline (85.1%), the feature concatenation approach slightly improves recognition performance (85.8%). The turbo Viterbi schemes, on the other hand, clearly take profit from the iterative and weighted information fusion. Both show a clear improvement of the word accuracy from 85.4% (85.1%) to 88.2% (87.4%) in the 8th iteration.

Taking a look at the AURORA-2 noises (Tabs. X–XII), to some extent we observe a similar behavior as with white Gaussian noise. Besides a local 10 dB weakness of the turbo approaches in car noise, we find a 0.1% clean condition advantage of the Gabor baseline and the again best reference approach ROVER vs. the MFCC-T turbo approach. Also at SNR = 25 dB (train station and babble noise) we find that CONCAT performs a bit better than the turbo approaches. Upon closer analysis, this might be owed to somewhat wrong or unreliable extrinsic information, strongly impeding the interaction between the individual CRs, particularly at very high SNRs. Again, however, on average over all SNR conditions, both turbo ASR approaches exceed the performance of any of the reference approaches in each of the noise types. Moreover, averaged over all SNRs and all four noise types, the turbo approach MFCC-T is ahead of the best reference approach (ROVER) by an absolute 1.2%, which corresponds to a relative WER reduction of 6.5%.

As observed with the AURORA-2 noises, the extrinsic probabilities (Viterbi confidence information) play a crucial role in the potential performance of the turbo Viterbi; approaches other than (20) may even perform better. Note also that the turbo Viterbi performance was obtained by means of a real-time decoding approach with an adjustable computational delay of Δt frames for the purpose of providing a confidence output. Note that we do not actively influence the inherent decision latency of the Viterbi algorithm. Tab. XIII depicts a comparison of three different delays Δt, allowing a clear view on the inherent trade-off between a low confidence output latency (Δt = 10 frames) and the more reliable confidence (20) obtained at Δt = 100 frames. For all earlier turbo Viterbi experiments reported here, we employed Δt = 100 frames (cf. Sec. III-B), but even with Δt = 20 frames the performance on average over all SNR conditions would have been better than the best reference scheme ROVER (cf. Tab. IX). Note that recognition results achieved in batch file mode (i.e., evaluating at t = T and thus with Δt = T − t, T being the last frame in the file) even exceeded the herein reported results for Δt = 100 frames.
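As a generic illustration of this latency/reliability trade-off (explicitly not the confidence definition (20) of this work), one may measure how strongly the surviving Viterbi paths have already merged Δt frames in the past; a small numpy sketch under these assumptions:

    # Generic delayed-decision illustration: agreement of survivor paths at t - delta_t.
    import numpy as np

    def traceback_agreement(backptr, scores_t, delta_t):
        """backptr: (T, N) ints, backptr[tau, j] = best predecessor of state j at frame tau;
        scores_t: (N,) accumulated log path scores at the current frame.
        Returns the score-weighted agreement on the state at frame T-1-delta_t."""
        T, N = backptr.shape
        states = np.arange(N)
        for tau in range(T - 1, T - 1 - delta_t, -1):   # trace delta_t steps back
            states = backptr[tau, states]
        w = np.exp(scores_t - scores_t.max())
        w /= w.sum()                                     # pseudo-probabilities of survivors
        votes = np.bincount(states, weights=w, minlength=N)
        return float(votes.max())   # close to 1.0 once all survivor paths have merged

The larger Δt, the more likely all survivor paths have merged and the more reliable such a delayed measure becomes, at the price of output latency.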
Let us close with some final considerations on complexity. What are the additional requirements of the turbo ASR approach on top of some existing Viterbi-based ASR system? In broad terms, each iteration z requires an execution of the Viterbi algorithm (19). The additional multiplication with the extrinsic likelihood term in (19) is negligible compared to the max[·] operation. Next, the computation of the extrinsic information (20) is required. This does not consume more computational power than typical confidence computations, particularly since it turns out that the performance loss of omitting the two small denominators in (20) (very much like in (28)) is small, and this omission spares the bookkeeping of former products b_i^(s) g_i^(s). Apart from
the negligible effort of employing intrinsic and extrinsic likelihood weights (cf. Fig. 5), there is indeed effort needed in computing the multiplication (14) with matrix T^(s)(r), or T^(r)(s), respectively. While the matrix itself requires N·M words of memory (the number of states N, M can easily be about 10,000 in LVCSR systems), the execution of z_max iterations (z_max = 8 in our simulations) requires (z_max − 1)·N·M multiply-accumulate operations per frame. Since this may still be considered a computational obstacle towards the use of turbo ASR in practical LVCSR applications, the role and structure and even the necessity of the state transition matrix T are of major interest for further investigations.
VI. CONCLUSIONS

In this paper, we transferred the famous turbo principle from digital communications to the domain of ASR, providing an elegant solution to classifier-level information fusion. First, we reviewed our turbo-decoding forward-backward algorithm (FBA), discussing differences to other prior art. Then we presented the new turbo Viterbi algorithm for ASR, showing that actually no severe modification of the Viterbi algorithm is required, providing a real-time capable solution for turbo ASR in practice. We showed simulation results both in a multi-modal (audio-visual) ASR task and in a single-channel unimodal ASR task (audio-only with two different feature extractions). The experimental results prove the significant benefit of turbo ASR approaches over both iterative and conventional methods for information fusion on different levels, illustrated by outperforming even the best reference system on average by a relative WER reduction of 22.4% and 18.2%, respectively.

ACKNOWLEDGMENT

The authors would like to thank the unknown reviewers for their numerous helpful comments on an earlier draft of this article, and also Peter Transfeld for valuable discussions on likelihood stream weighting aspects.

APPENDIX A
COMPUTING THE FBA EXTRINSIC LIKELIHOOD

Applying Bayes' rule to (11), the likelihood p(ξ_t^(r) | r_t = k) may be dissected further:

    p(ξ_t^(r) | r_t = k) = P(r_t = k | ξ_t^(r)) p(ξ_t^(r)) / P(r_t = k)
                         = P(r_t = k | ξ_t^(r)(1), ..., ξ_t^(r)(M)) p(ξ_t^(r)) / P(r_t = k)
                         = ξ_t^(r)(k) p(ξ_t^(r)) / P(r_t = k).                              (31)

Note that ξ_t^(r) in (31) represents a vector of extrinsic probabilities ξ_t^(r)(k) = P(r_t = k | ...) of all states r_t = k ∈ R, given the entirety of exploited information "..." of the current and preceding iterations, which is basically the sequences {o_1^T, u_1^T} with the actual intrinsic and a priori information at time instant t in (17) being taken out. Given evidence [64] that the output a posteriori probabilities (APPs) of a decoder already provide a sufficient statistic of the received sequences, i.e., all exploited information, the probability of state r_t = k given the exploited information can accordingly be replaced by the (APP-related) extrinsic probability ξ_t^(r)(k) of the respective state k (31). With p(ξ_t^(r)) being omitted in (31) due to its state-independence, we derive a simple formulation of the extrinsic likelihood in (10):

    g_i^(s)(ξ_t^(r)) = Σ_{k ∈ R} [ P(r_t = k | s_t = i) / P(r_t = k) ] ξ_t^(r)(k).           (32)
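Read as a matrix–vector product, (32) can be evaluated for all states i at once; a minimal numpy sketch with illustrative array names:

    # Extrinsic likelihoods (32) for all states i of CR^(s) at once.
    import numpy as np

    def extrinsic_likelihood(P_r_given_s, P_r, xi_r):
        """P_r_given_s: (N, M), entry [i, k] = P(r_t = k | s_t = i);
        P_r: (M,) prior P(r_t = k); xi_r: (M,) extrinsic probabilities of CR^(r).
        Returns the (N,) vector of g_i^(s)(xi_t^(r)) according to (32)."""
        return (P_r_given_s / P_r) @ xi_r   # row-wise division, then weighted sum over k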
APPENDIX B
ON THE USE OF WEIGHTS IN TURBO ASR

Eqs. (17), (19) reveal both the presence of the intrinsic information b_i^(s)(o_t) and the respective fed-back a priori information g_i^(s)(ξ_t^(r)). They complement each other and ideally, the a priori information serves to sharpen the decoder's observation likelihoods, enabling it to converge to a correct estimate. As illustrated in Fig. 5, first we introduce two exponents 0 ≤ β_o, β_u ≤ 1 for the observation likelihoods (multiplicative weights for the log-likelihoods), respectively, to settle the influence of the intrinsic information. As in a multi-stream HMM, these two intrinsic weights also serve to compensate for a constant bias in the reliability of a respective observation likelihood [25], e.g., depending on the signal-to-noise ratio (SNR) [34], [36]. Initially ensuring non-iterative reference algorithm behavior, the two intrinsic weights are set to unity in the first two iterations z ∈ {1, 2} and, from the third iteration on, they are separately set to an attenuated fixed value.

Second, two weights on the extrinsic log-likelihoods are employed, determining the influence of the a priori information by adjusting its likelihood peakedness. To obtain an increasing influence of the a priori information with ongoing iterations, we employ two extrinsic weights α_s, α_r that grow dynamically according to a logistic function

    α(z) = 1 / ( 1 + (1/α(2) − 1) e^(−λ(z−2)) ),   z = 2, 3, . . . ,                        (33)

with α(z) ∈ {α_s(z), α_r(z)}. Here, α(2) ∈ {α_s(2), α_r(2)} and λ ∈ {λ_s, λ_r} mark the initial extrinsic weight and the logistic proportionality constant, respectively. Hence, beginning from a given initial value, the extrinsic weights approach unity as the number of iterations z increases. Note that the exact definition of the extrinsic weight growing (33) from some start value α(2) towards 1 is not performance-critical, as long as it is monotonously increasing.
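A small sketch of this weight schedule, using the notation introduced above (the parameter values are illustrative, not the tuned ones):

    # Extrinsic weight schedule (33): starts at alpha(2) and approaches 1.
    import math

    def extrinsic_weight(z, alpha_init=0.3, lam=1.0):
        """alpha_init = alpha(2), lam = logistic proportionality constant."""
        if z < 2:
            raise ValueError("the schedule is defined for z >= 2")
        return 1.0 / (1.0 + (1.0 / alpha_init - 1.0) * math.exp(-lam * (z - 2)))

    print([round(extrinsic_weight(z), 3) for z in range(2, 9)])
    # -> [0.3, 0.538, 0.76, 0.896, 0.959, 0.985, 0.994]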
REFERENCES

[1] R. P. Lippmann, "Speech recognition by machines and humans," Speech Commun., vol. 22, no. 1, pp. 1–15, Jul. 1997.
[2] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1064–1070.
[3] C. Berrou, R. Pyndiah, P. Adde, C. Douillard, and R. Le Bidan, "An overview of turbo codes and their applications," in Proc. IEEE Eur. Conf. Wireless Technol., Paris, France, Oct. 2005, pp. 1–9.
[4] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379–423, Jul. 1948.
[5] S. Lin and D. J. Costello, Jr., Error Control Coding. Englewood Cliffs, NJ, USA: Prentice-Hall, 1983.
[6] R. Johannesson and K. S. Zigangirov, Fundamentals of Convolutional Coding. Hoboken, NJ, USA: Wiley/IEEE Press, 1999.
[7] G. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, no. 3, pp. 268–278, Mar. 1973.
[8] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[9] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284–287, Mar. 1974.
[10] L. Bahl, F. Jelinek, and R. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-5, no. 2, pp. 179–190, Mar. 1983.
[11] J. Hagenauer and P. Hoeher, "A Viterbi algorithm with soft-decision outputs and its applications," in Proc. GLOBECOM, Dallas, TX, USA, Nov. 1989, pp. 1680–1686.
[12] J. Huber and A. Rüppel, "Zuverlässigkeitsschätzung für die Ausgangssymbole von Trellis-Decodern," AEÜ, vol. 44, no. 1, pp. 8–21, Jan. 1990 (in German).
[13] H. Jiang, "Confidence measures for speech recognition: A survey," Speech Commun., vol. 45, pp. 455–470, 2005.
[14] F. Wessel, R. Schlüter, K. Macherey, and H. Ney, "Confidence measures for large vocabulary continuous speech recognition," IEEE Trans. Speech Audio Process., vol. 9, no. 3, pp. 288–298, Mar. 2001.
[15] J. Hagenauer, "The turbo principle: Tutorial introduction and state of the art," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 1997, pp. 1–11.
[16] F. Faubel and M. Wölfel, "Coupling particle filters with automatic speech recognition for speech feature enhancement," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Pittsburgh, PA, USA, Sep. 2006, pp. 37–40.
[17] Z.-J. Yan, F. Soong, and R.-H. Wang, "Word graph based feature enhancement for noisy speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Honolulu, HI, USA, Apr. 2007, vol. 4, pp. IV-373–IV-376.
[18] L. Deng, J. Droppo, and A. Acero, "Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 568–580, Nov. 2003.
[19] S. Windmann and R. Haeb-Umbach, "Approaches to iterative speech feature enhancement and recognition," IEEE Trans. Audio Speech Lang. Process., vol. 17, no. 5, pp. 974–984, Jul. 2009.
[20] M. Paulik, S. Stüker, C. Fügen, T. Schultz, T. Schaaf, and A. Waibel, "Speech translation enhanced automatic speech recognition," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Cancún, Mexico, Nov. 2005, pp. 121–126.
[21] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 426–429.
[22] H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on partially corrupted speech," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 462–465.
[23] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Amer., vol. 26, no. 2, pp. 212–215, Mar. 1954.
[24] D. G. Stork, M. E. Hennecke, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. New York, NY, USA: Springer, 1996.
[25] C. Neti et al., "Audio-visual speech recognition," Center Lang. Speech Process., Johns Hopkins Univ., Baltimore, MD, USA, Tech. Rep. EPFL-Report-82633, IDIAP, 2000.
[26] G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, "Large-vocabulary audio-visual speech recognition by machines and humans," in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 1027–1030.
[27] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: An overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds. Cambridge, MA, USA: MIT Press, 2004, pp. 356–396.
[28] J. Kratt, F. Metze, R. Stiefelhagen, and A. Waibel, "Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit," in Proc. DAGM-Symp., Tübingen, Germany, Aug. 2004, pp. 488–495.
[29] U. Jain et al., "Recognition of continuous broadcast news with multiple unknown speakers and environments," in Proc. ARPA Speech Recog. Workshop, Harriman, NY, USA, Feb. 1996, pp. 61–66.
[30] J. Ming, P. Hanna, D. Stewart, M. Owens, and F. J. Smith, "Improving speech recognition performance by using multi-model approaches," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Phoenix, AZ, USA, Mar. 1999, pp. 161–164.
[31] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Santa Barbara, CA, USA, Dec. 1997, pp. 347–352.
[32] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Comput. Speech Lang., vol. 14, no. 4, pp. 373–400, Oct. 2000.
[33] S. Lucey, T. Chen, S. Sridharan, and V. Chandran, "Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition," IEEE Trans. Multimedia, vol. 7, no. 3, pp. 495–506, Jun. 2005.
[34] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP J. Appl. Signal Process., vol. 11, pp. 1–15, 2002.
[35] A. Garg, G. Potamianos, C. Neti, and T. S. Huang, "Frame-dependent multi-stream reliability indicators for audio-visual speech recognition," in Proc. Int. Conf. Multimedia Expo (ICME), Baltimore, MD, USA, Jul. 2003, pp. 605–608.
[36] D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, "Audiovisual speech recognition with missing or unreliable data," in Proc. Auditory Visual Speech Process. (AVSP), Norwich, U.K., Sep. 2009, pp. 117–122.
[37] J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Salt Lake City, UT, USA, May 2001, pp. 169–172.
[38] A. Abdelaziz, S. Zeiler, and D. Kolossa, "Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition," IEEE Trans. Audio Speech Lang. Process., vol. 23, no. 5, pp. 863–876, Mar. 2015.
[39] S. Shivappa, B. Rao, and M. Trivedi, "An iterative decoding algorithm for fusion of multimodal information," EURASIP J. Adv. Signal Process., vol. 2008, pp. 1–10, 2008.
[40] S. Shivappa, B. Rao, and M. Trivedi, "Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Las Vegas, NV, USA, Apr. 2008, pp. 2241–2244.
[41] S. Shivappa, M. Trivedi, and B. Rao, "Audiovisual information fusion in human-computer interfaces and intelligent environments: A survey," Proc. IEEE, vol. 98, no. 10, pp. 1692–1715, Oct. 2010.
[42] S. Shivappa, M. Trivedi, and B. Rao, "Person tracking with audio-visual cues using the iterative decoding framework," in Proc. IEEE 5th Int. Conf. Adv. Video Signal Based Surveillance (AVSS), Santa Fe, NM, USA, Sep. 2008, pp. 260–267.
[43] S. Shivappa, B. Rao, and M. Trivedi, "Audiovisual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation," IEEE J. Sel. Topics Signal Process., vol. 4, no. 5, pp. 882–894, Oct. 2010.
[44] D. Divsalar and F. Pollara, "Turbo codes for deep-space communications," Jet Propul. Lab., Pasadena, CA, USA, Telecommun. Data Acquis. Progress Rep. 42-120, Feb. 1995.
[45] R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inf. Theory, vol. 8, no. 1, pp. 21–28, Jan. 1962.
[46] J. Lodge, R. Young, P. Hoeher, and J. Hagenauer, "Separable MAP filters for the decoding of product and concatenated codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1740–1745.
[47] S. ten Brink, "Convergence behaviour of iteratively decoded parallel concatenated codes," IEEE Trans. Commun., vol. 49, no. 10, pp. 1727–1737, Oct. 2001.
[48] D. Scheler, S. Walz, and T. Fingscheidt, "On iterative exchange of soft state information in two-channel automatic speech recognition," in Proc. ITG-Fachtagung Sprachkommunikation, Sep. 2012, pp. 55–58.
[49] S. Receveur and T. Fingscheidt, "A turbo-decoding weighted forward-backward algorithm for multimodal speech recognition," in Proc. Int. Workshop Spoken Dialog Syst. (IWSDS), Napa Valley, CA, USA, Jan. 2014, pp. 4–15.
[50] S. Receveur and T. Fingscheidt, "A compact formulation of turbo audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Florence, Italy, May 2014, pp. 5554–5558.
[51] S. Receveur, R. Weiss, and T. Fingscheidt, "Multimodal ASR by turbo decoding vs. feature concatenation: Where to perform information integration?" in Proc. 11th ITG Conf. Speech Commun., Erlangen, Germany, Sep. 2014, pp. 21–24.
[52] C. Douillard et al., "Iterative correction of intersymbol interference: Turbo-equalization," Eur. Trans. Telecommun., vol. 6, no. 5, pp. 507–511, May 1995.
[53] R. Zhang and A. Rudnicky, "Word level confidence annotation using combinations of features," in Proc. 7th Eur. Conf. Speech Commun. Technol., Aalborg, Denmark, Sep. 2001, pp. 2105–2108.
[54] A. C. Reid, T. A. Gulliver, and D. P. Taylor, "Convergence and errors in turbo-decoding," IEEE Trans. Commun., vol. 49, no. 12, pp. 2045–2051, Dec. 2001.
[55] ETSI STQ Aspects: Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms, ETSI ES 202 050, Oct. 2002.
[56] M. R. Schädler, B. T. Meyer, and B. Kollmeier, "Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition," J. Acoust. Soc. Amer., vol. 131, no. 5, pp. 4134–4151, May 2012.
[57] B. Hoffmeister, T. Klein, R. Schlüter, and H. Ney, "Frame based system combination and a comparison with weighted ROVER and CNC," in Proc. INTERSPEECH, Pittsburgh, PA, USA, Sep. 2006, pp. 537–540.
[58] K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Theoretical analysis of diversity in an ensemble of automatic speech recognition systems," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 3, pp. 711–726, Mar. 2014.
[59] K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Empirical link between hypothesis diversity and fusion performance in an ensemble of automatic speech recognition systems," in Proc. INTERSPEECH, Lyon, France, Aug. 2013, pp. 3082–3086.
[60] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2421–2424, Nov. 2006.
[61] H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions," in Proc. ISCA Workshop Automat. Speech Recog. (ASR), Paris, France, Sep. 2000, pp. 1–8.
[62] ITU, ITU-T Recommendation P.56, "Objective measurement of active speech level," Dec. 2011.
[63] T. G. Kolda, R. M. Lewis, and V. Torczon, "A generating set direct search augmented Lagrangian algorithm for optimization with a combination of general and linear constraints," Sandia National Lab., Albuquerque, NM, USA, Tech. Rep. SAND2006-5315, Aug. 2006.
[64] J. Kliewer, S. X. Ng, and L. Hanzo, "Efficient computation of EXIT functions for nonbinary iterative decoding," IEEE Trans. Commun., vol. 54, no. 12, pp. 2133–2136, Dec. 2006.

Simon Receveur received the Dipl.-Ing. degree in electrical engineering from Technische Universität Braunschweig, Braunschweig, Germany. Following his diploma thesis, in January 2012 he started working towards his Ph.D. degree in the field of turbo automatic speech recognition at the Institute for Communications Technology, Technische Universität Braunschweig. In summer 2015, he was an Intern with the Watson Multimedia Group, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. His research interests include iterative ASR, information fusion, and speaker verification.

Robin Weiß received the M.Sc. degree in computer and communications systems engineering from Technische Universität Braunschweig, Braunschweig, Germany, in 2015. During his studies, he worked as a Student Assistant in the field of automatic speech recognition and wrote his master thesis at the Institute for Communications Technology on turbo automatic speech recognition with multiple models. Since 2015, he has been working as a self-employed Data Scientist in Berlin, Germany. His research interests include iterative ASR, machine learning, and data visualization.

Tim Fingscheidt (S'93–M'98–SM'04) received the Dipl.-Ing. degree in electrical engineering and the Ph.D. degree from RWTH Aachen University, Aachen, Germany, in 1993 and 1998, respectively. He further pursued his work on joint speech and channel coding as a Consultant in the Speech Processing Software and Technology Research Department at AT&T Labs, Florham Park, NJ, USA. In 1999, he entered the Signal Processing Department of Siemens AG (COM Mobile Devices) in Munich, Germany, and contributed to speech codec standardization in ETSI, 3GPP, and ITU-T. In 2005, he joined Siemens Corporate Technology in Munich, Germany, leading the speech technology development activities in recognition, synthesis, and speaker verification. Since 2006, he has been a Full Professor with the Institute for Communications Technology, Technische Universität Braunschweig, Braunschweig, Germany. His research interests include speech and audio signal processing, enhancement, transmission, recognition, and instrumental quality measures. From 2008 to 2010, he served as an Associate Editor for the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, and since 2011 as a Member of the IEEE Speech and Language Processing Technical Committee. He was the recipient of several awards including the Prize of the Vodafone Mobile Communications Foundation in 1999 and the 2002 prize of the Information Technology branch of the Association of German Electrical Engineers (VDE ITG), where he has been leading the Speech Acoustics Committee ITG FA4.3 since 2015.