IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 24, NO. 5, MAY 2016
The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dong Yu.
The authors are with the Institute for Communications Technology, Technische Universität Braunschweig, D-38106 Braunschweig, Germany (e-mail: s.receveur@tu-bs.de; robin.weiss@tu-bs.de; t.fingscheidt@tu-bs.de).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASLP.2016.2520364
2329-9290 © 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

… the systematic bits $x^{(s)} = x$ and the parity bits $y^{(s)} = (y_t^{(s)})_{t=1}^{T}$. The input connections of the adders are typically described by so-called generator polynomials $\mathbf{G}$ ($G_r$ = recursive polynomial $(111)$, $G_1 = (101)$). In the example of Fig. 1, the number of output bits¹ is $T^{(s)} = T / r^{(s)}$ with the so-called code rate $r^{(s)} = \frac{1}{2}$.

¹ Without regarding a so-called trellis termination.
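For illustration, the generator-polynomial description above can be sketched as a minimal rate-1/2 recursive systematic convolutional (RSC) encoder with recursive polynomial $G_r = (111)$ and forward polynomial $G_1 = (101)$. This is our own sketch of a standard RSC structure, not code from the paper; trellis termination is disregarded, as in footnote 1.

```python
# Hypothetical sketch of a rate-1/2 RSC encoder with G_r = (111)
# (recursion taps) and G_1 = (101) (parity taps); two memory cells.

def rsc_encode(x):
    """Return (systematic bits, parity bits) for input bit list x."""
    s1 = s2 = 0                      # shift-register memory cells
    parity = []
    for bit in x:
        a = bit ^ s1 ^ s2            # recursion per G_r = (111)
        p = a ^ s2                   # parity tap per G_1 = (101)
        parity.append(p)
        s1, s2 = a, s1               # shift the register
    return list(x), parity
```

With $T$ input bits, the encoder emits $2T$ output bits (systematic plus parity), i.e., code rate 1/2 as stated above.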
RECEVEUR et al.: TURBO AUTOMATIC SPEECH RECOGNITION 847
Fig. 3. A parallel turbo encoder with a random input bit sequence x governing state transitions, involving a bit interleaver $\Pi$. After encoding, a puncturing scheme is applied and the remaining bits are multiplexed. A transmission channel introduces random additive noise, providing the sequence of observations z. The demodulator provides sequences of log-likelihood ratios (LLRs) separately for both subsequent convolutional decoders (s) and (r).
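The puncturing ("bit stealing") and multiplexing stage mentioned in the caption can be sketched as follows. The alternating pattern below, which keeps every systematic bit but only every second parity bit of each stream, is a hypothetical example, not the scheme actually used in the paper.

```python
# Hypothetical puncturing/multiplexing sketch for Fig. 3: keep all
# systematic bits x, alternate between the two parity streams
# y_s and y_r, dropping the other stream's bit at each time step.

def puncture_and_mux(x, y_s, y_r):
    out = []
    for t in range(len(x)):
        out.append(x[t])             # systematic bit always kept
        if t % 2 == 0:
            out.append(y_s[t])       # even steps: parity of encoder (s)
        else:
            out.append(y_r[t])       # odd steps: parity of encoder (r)
    return out
```

With this pattern, $T$ input bits yield $2T$ coded bits overall, i.e., the punctured code rate is 1/2.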
use of the trellis diagram. Alternatively, in both worlds, the forward-backward algorithm (FBA) can be applied for decoding, providing the sequence of most probable states and reliable soft outputs (in ASR called: confidences) [8]–[10]. The reliable confidence scores based on true a posteriori probabilities are the reason why we will start our turbo ASR presentation later on based on the FBA, and will proceed to the turbo Viterbi algorithm only in a second step. Note that the FBA is often called BCJR algorithm [9] in digital communications.

A main prerequisite of the discovery and application of the turbo principle in digital communications was the availability of soft-output (channel) decoders providing not only decoded bits, but also the confidence of such a decision. While the BCJR algorithm naturally provides such confidences, a soft-output …

The input bit sequence $x = x_1^T$ is offered to encoder (r) in pseudorandomized order $x^{(r)}$ by means of interleaver $\Pi$. Since, after multiplexing, the input bits $x$ and both parity outputs $y^{(s)}$ and $y^{(r)}$ are typically of a bit rate too high, they are subject to some regular bit stealing (called puncturing), resulting, after multiplexing, in a coded bit sequence $y = y_1^T$. A transmission channel may add noise (typically on a modulation symbol level), yielding noisy non-binary observations $z = z_1^T$. Please note that these channel outputs have their analogy in ASR in the sequence of observed feature vectors $o_1^T$ (cf. Fig. 2). A demodulator computes log-likelihood ratios (LLRs) $Z = \log\frac{p(z \mid y = 1)}{p(z \mid y = 0)}$ and, after demultiplexing, provides two LLR streams $Z^{(s)}$ and $Z^{(r)}$, each suited for decoding by the respective decoder (see Fig. 4). Note that in ASR, LLRs are not
used, since the channel input is not binary, but consists of any of the states $s$ (see Fig. 2). Therefore, likelihoods $p(o \mid s = i)$ or log-likelihoods are employed (see Fig. 5).

In Fig. 4 the turbo decoder is shown, where in the first iteration ($z = 1$) decoder (s) processes LLRs $Z^{(s)}$, assuming some flat prior $Q^{(s)} = 0$ (simply meaning $\log\frac{P(x^{(s)}=1)}{P(x^{(s)}=0)} = 0$). Decisions could be taken on the basis of computed posteriors $D^{(s)} = \log\frac{P(x^{(s)}=1 \mid Z^{(s)})}{P(x^{(s)}=0 \mid Z^{(s)})}$ by simple decision for the sign. A so-called extrinsic information $E^{(s)}$ is computed and, after interleaving, it is given to the next decoder (r) as non-flat prior $Q^{(r)}$. In iteration $z = 2$, decoder (r) processes it along with its own LLRs (intrinsic information) $Z^{(r)}$. This process of an alternating call of the decoders can now be repeated until convergence of the decoding results $\hat{x}$, which can be obtained any time from any of the decoders by deciding for the sign of the posteriors $D$, with, e.g., $D^{(r)} = \log\frac{P(x^{(r)}=1 \mid Z^{(s)}, Z^{(r)})}{P(x^{(r)}=0 \mid Z^{(s)}, Z^{(r)})}$. Note that if $z$ is even, a deinterleaver $\Pi^{-1}$ is required (see Fig. 4). For further details on turbo decoding, the interested reader is referred to [2], [3] and particularly to [15].

C. Iterative Decoding and Information Fusion in ASR

In ASR literature, iterative decoding has often been investigated, seeking to improve recognition robustness in adverse acoustic conditions. Considering the feature level, research on iterative decoding focused mainly on denoising and dereverberation of speech features with feedback of a previous recognition pass into the feature extraction. While Faubel et al. used the best hypothesis as feedback [16], Yan et al. employed a complete word graph to clean up features [17]. Aiming at feature vector enhancement by means of an iterative linearizing approximation and compensation, Deng et al. [18] took significant profit from an iterative procedure. In terms of a model-based approach, Windmann and Haeb-Umbach computed the HMM state posteriors for feedback to a model-based speech feature enhancement [19]. Moreover, at decision level, iterative approaches achieve improvements by combining ASR and machine translation techniques [20].

Another approach to improving ASR performance in adverse acoustic conditions is based on information fusion. Here, further information sources such as additional acoustic channels [21], [22], modalities [23]–[28], or models [29], [30] are exploited by the speech recognizer. Inherently, the success of such methods is closely linked to the level of information fusion within the recognition process and the employed fusion method. In feature level fusion systems, commonly the various input features are combined into a single representation, e.g., by ordinary concatenation, where the classifier learns the statistics of the joint observations [25]–[27]. In contrast, in decision level fusion systems, separate classifier output hypotheses (and confidence scores) are combined to achieve a joint decision (cf. ROVER [31], confusion networks [32]). While being very flexible in terms of amount and choice of the individual classifiers, this fusion approach also enables a control of the relative influence of each classifier output by a voting scheme, e.g., according to its confidence. However, in decision level fusion, the separate classification processes do not take profit from each other.

This interaction during classification is a primary element of classifier level fusion systems, which gather the input information sources at a slightly higher level than feature fusion systems, often in the form of so-called likelihood streams. In particular, in classifier level fusion systems, learning and classifying of the input streams is employed independently, but under a temporal dependence assumption between the observation likelihood streams during classification [33]. Employing for instance so-called coupled HMMs (CHMMs) [25], [34], this interaction also might involve a proper weighting scheme of different information sources, e.g., according to their reliability [35]. Often being optimized beforehand on training data, these weights are subsequently applied during recognition while computing the joint observation likelihoods [34], [36], [37]; recently, also real-time weight updates have been proposed [38].

For a multi-modal ASR application, Shivappa et al. proposed, to our knowledge for the first time in ASR, an iterative fusion of two parallel audio-visual likelihood streams on the classifier level [39]–[41]. Later on, they adopted their parallel recognition scheme to speaker localization and tracking [42], [43]. Their approach comes with the advantage of two separately trained HMMs for each modality instead of a joint one (as in feature-level fusion). However, during recognition the iterative decoding is controlled by a heuristic rate parameter modeling and re-estimation of the observation likelihood distributions. Moreover, within the feedback loop the modified fed-back a posteriori probabilities still contain intrinsic information, contradicting the principles of turbo decoding [44]–[47].

Originating from Shivappa's altered FBA approach, we showed in our previous work [48]–[51] that the unmodified FBA is already suitable for iterative recognition. This can be achieved just by a redefinition of the likelihood term. Moreover, it was shown that a definition of an extrinsic information (for information exchange between ASR decoders) is possible, with the intrinsic information being taken out exactly as required by the turbo principle.

In this paper, we will first revisit all important aspects of the turbo FBA. Then we complement our transfer of the turbo principle to the domain of automatic speech recognition by introducing also the turbo decoding Viterbi algorithm, which will allow for real-time implementations in a practical context. We then show applications in two fields: first, in audio-visual ASR as a well-known representative of multi-modal ASR. Second, we present simulation results of both the turbo FBA and the new turbo Viterbi algorithm for a single-channel unimodal ASR, where two different feature extractions are employed. We also briefly discuss likelihood stream weighting aspects of our approaches. Moreover, our proposed approach is discussed both with respect to the required complexity, as well as with respect to the influence of latency in extrinsic information computation. These aspects are of particular interest when judging the applicability of turbo ASR for large vocabulary continuous speech recognition (LVCSR).

The paper is organized as follows: For later reference, first we introduce notations and recapitulate briefly the baseline FBA and the Viterbi algorithm in Section II. Section III gives an
outline of the new turbo ASR based on the FBA and the Viterbi algorithm, while Section IV briefly sketches the employed features and all investigated information fusion ASR reference approaches. The performance of the presented approaches is evaluated in Section V in the audio-visual and the audio-only task. The paper is concluded with its most important results in Section VI.

II. NOTATIONS AND BASELINE ALGORITHMS

In order to prepare grounds for presentation of the new turbo ASR approach, this section outlines the notations used hereafter and briefly reviews the baseline algorithms, i.e., the forward-backward algorithm (FBA) [9] and the Viterbi algorithm [7].

A. Notations

Let $x_1^T = x_1, \ldots, x_T$ be a sequence of $d_o$-dimensional feature vectors with values $x_t = o_t \in \mathbb{R}^{d_o}$ for each frame index $t = 1, \ldots, T$. The continuous density hidden Markov model (HMM) parameters for processing $x_1^T$ are the vector $\pi = [\pi_1, \ldots, \pi_N]^T$ of prior probabilities $\pi_i = P(s_1 = i)$ of all states $i \in S = \{1, \ldots, N\}$, the matrix $A = \{a_{j,i}\}_{j,i \in S}$ of state transition probabilities $a_{j,i} = P(s_t = i \mid s_{t-1} = j)$, and the set $B = \{b_i(x_t)\}_{i \in S}$ of $d_o$-variate emission probability density functions (pdfs) $b_i(x_t) = p(x_t \mid s_t = i)$. Please note that we distinguish probabilities $P(\cdot)$ and pdfs (or their values) $p(\cdot)$. In order to ease the understanding of our turbo ASR approaches in Section III, we will now briefly recapitulate the well-known forward-backward algorithm and the Viterbi algorithm, which will also both serve as baseline approaches in later simulations.

B. The Forward-Backward Algorithm (FBA)

Given an observation sequence $o_1^T$, at each time instant $t = 1, \ldots, T$ a hidden state $s_t = i \in S$ is assigned with the posterior probability $\gamma_t(i) = P(s_t = i \mid o_1^T, \lambda^{(s)})$. The state-level maximum-a-posteriori (MAP) recognizer [8], [10] provides the sequence $(s^*)_1^T = s_1^*, \ldots, s_T^*$ of most likely states by

$$s_t^* = \arg\max_{i \in S} \gamma_t(i), \qquad t = 1, \ldots, T. \tag{1}$$

Applying the FBA, the state posteriors $\gamma_t(i)$ are obtained by

$$\gamma_t(i) = \frac{1}{C_t}\, \alpha_t(i)\, \beta_t(i), \qquad t = 1, \ldots, T, \tag{2}$$

$$\alpha_t(i) = b_i(o_t) \sum_{j \in S} a_{j,i}\, \alpha_{t-1}(j), \qquad t = 2, \ldots, T, \tag{3}$$

$$\beta_t(i) = \sum_{j \in S} b_j(o_{t+1})\, a_{i,j}\, \beta_{t+1}(j), \qquad t = T-1, \ldots, 1, \tag{4}$$

where $\alpha_t(i) = p(o_1^t, s_t = i)$ and $\beta_t(i) = p(o_{t+1}^T \mid s_t = i)$ for all $i \in S$ denote the forward and backward variables, respectively. These variables are initialized to $\alpha_1(i) = \pi_i\, b_i(o_1)$ and $\beta_T(i) = 1$; subsequently, computation is carried out recursively according to (3) and (4). The constant $C_t$ ensures the posterior state distribution to fulfill $\sum_{i=1}^{N} \gamma_t(i) = 1$. Note that any state-independent factor with respect to state $i$ will cancel out in (3) and (4), once (2) is applied.

C. The Viterbi Algorithm

Given an observation sequence $o_1^T$, at each time instant $t = 1, \ldots, T$ a score

$$\delta_t(i) = \max_{s_1, \ldots, s_{t-1}} p(o_1^t, s_1, \ldots, s_t = i \mid \lambda^{(s)}) \tag{5}$$

for state $s_t = i$ is computed, based on the observation sequence $o_1^t = o_1, \ldots, o_t$. Using the Viterbi algorithm [7], [8], the scores (5) are obtained recursively by

$$\delta_t(i) = \max_{j \in S}\left[\delta_{t-1}(j)\, a_{j,i}\right] b_i(o_t), \qquad t = 2, \ldots, T, \tag{6}$$

$$\psi_t(i) = \arg\max_{j \in S}\left[\delta_{t-1}(j)\, a_{j,i}\right], \qquad t = 2, \ldots, T, \tag{7}$$

whereas so-called backtracking pointers $\psi_t(i)$ indicate the optimal predecessor state $\hat{s}_{t-1}(s_t = i) = j$ for every corresponding $\delta_t(i)$. Being initialized by $\delta_1(i) = \pi_i\, b_i(o_1)$ and $\psi_1(i) = 0$, subsequently these variables are employed to deliver the most likely state sequence $(s^*)_1^T = s_1^*, \ldots, s_T^*$ by applying

$$s_T^* = \arg\max_{i \in S} \delta_T(i), \tag{8}$$

$$s_t^* = \psi_{t+1}(s_{t+1}^*), \qquad t = T-1, T-2, \ldots, 1, \tag{9}$$

in reversed chronological order. Note that any factor that is constant with respect to state $i$ has no effect on the result of the maximum decision (8) and thus can generally be omitted.

III. THE TURBO ASR APPROACH

The strength of the turbo principle as known from digital communications lies in the exchange of reliability (or so-called extrinsic) information, enabling the various (typically two) involved decoders to improve their estimates of the transmitted information instead of simply decoding it. Adopting the turbo principle to the domain of automatic speech recognition, this section outlines the turbo-decoding FBA and the new turbo Viterbi algorithm. Note that although we only present two likelihood streams and decoders, we are not at all restricted concerning the number of input streams in practice, since the turbo approach only specifies the information flow from one decoder to the next, i.e., two decoders are always involved at a time.

Besides the feature vector sequence $x_1^T = o_1^T$, let there be another observation sequence $u_1^T$ of the same length² $T$ as $o_1^T$, but from a different feature space $u_t \in \mathbb{R}^{d_u}$. Note that the two feature vector sequences $o_1^T$, $u_1^T$ may stem from two different modalities (e.g., as in audio-visual ASR), or from two different sensors of the same modality (e.g., as in multi-channel ASR), or even from the same microphone (but different feature extraction techniques are employed), or even we may have $o_t = u_t$ (but both feature vectors are later subject to two different acoustic models).

Furthermore, let there be two state-level speech recognizers concatenated in parallel as shown in Fig. 5. Each of these component recognizers (CRs) shall process one of the observation sequences $o_1^T$, $u_1^T$; accordingly, each CR employs an

² Note that we have assumed an equal length of observation sequences only for clarity of presentation. This implicitly means that there is an identical frame shift in computing both $o_t$ and $u_t$.
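The baseline FBA of Section II-B, i.e., the recursions (3), (4) and the normalized posterior (2), can be sketched as follows. This is a minimal illustration; all function and variable names are ours, and the likelihood table stands in for precomputed emission values $b_i(o_t)$.

```python
# Minimal sketch of the baseline forward-backward algorithm (FBA),
# eqs. (2)-(4): pi[i] are priors, A[j][i] transition probabilities
# j -> i, and B_like[t][i] the precomputed likelihoods b_i(o_t).

def fba_posteriors(pi, A, B_like):
    """Return gamma[t][i] = P(s_t = i | o_1^T), cf. eq. (2)."""
    T, N = len(B_like), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]      # beta_T(i) = 1
    for i in range(N):                        # alpha_1(i) = pi_i b_i(o_1)
        alpha[0][i] = pi[i] * B_like[0][i]
    for t in range(1, T):                     # forward recursion (3)
        for i in range(N):
            alpha[t][i] = B_like[t][i] * sum(
                A[j][i] * alpha[t - 1][j] for j in range(N))
    for t in range(T - 2, -1, -1):            # backward recursion (4)
        for i in range(N):
            beta[t][i] = sum(B_like[t + 1][j] * A[i][j] * beta[t + 1][j]
                             for j in range(N))
    gamma = []
    for t in range(T):                        # posterior (2), C_t normalizes
        prod = [alpha[t][i] * beta[t][i] for i in range(N)]
        C = sum(prod)
        gamma.append([p / C for p in prod])
    return gamma
```

Note that, as stated above, any state-independent factor cancels in the normalization by $C_t$, which is why uniform likelihood scaling leaves the posteriors unchanged.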
Fig. 5. Turbo (FBA) speech recognizer with two streams of input likelihoods $p(o \mid s = i)$ and $p(u \mid r = k)$, time index $t$ omitted, and iteration index $z = 1, 2, \ldots$, starting with the component recognizer (s). Different to Fig. 4, the posterior of component recognizer (r) is not subject to a transformation matrix $T^{(s)(r)}$, simply because the $\arg\max[\cdot]$ function of component recognizer (r) is drawn to deliver the sequence of optimal states $r^*$ (and not $s^*$).
HMM as acoustic model $\lambda = \{\pi; A; B\}$ matching the respective observations. For distinction, the two individually trained sets of HMMs shall be assigned by $\lambda^{(s)}$ and $\lambda^{(r)}$, where the superscripts (s) and (r) refer to the respective state index spaces $S = \{1, \ldots, N\}$ and $R = \{1, \ldots, M\}$ of the CRs. Throughout this work, we will adopt these superscripts for other symbols as well, wherever helpful for clarification. Without loss of generality, CR (s) shall process the observation sequence $o_1^T$ and CR (r) the observation sequence $u_1^T$, or the streams of likelihoods $p(o \mid s = i)$, $p(u \mid r = k)$, respectively, as shown in Fig. 5.

A. Turbo Forward-Backward Decoder

Given two state-level recognizers concatenated in parallel (Fig. 5), each CR receives both the respective observation sequence and some state information $\xi_t$ from the other CR. As in turbo decoding, these so-called extrinsic probabilities are related, but not exactly equal, to the state posteriors $\gamma_t$.

Without loss of generality, we consider an $M$-dimensional vector $\xi_t^{(r)} = [\xi_t^{(r)}(1), \ldots, \xi_t^{(r)}(M)]^T$ of extrinsic probabilities (obtained from the previous processing of CR (r)) to be integrated into the posterior computation of CR (s); the other direction is analogous.

Assuming conditional independence between $o_t$ and $\xi_t^{(r)}$ given state $s_t = i$, the emission function of each state $i \in S$ of the HMM $\lambda^{(s)}$ may be modified according to

$$b_i^{(s)}(o_t, \xi_t^{(r)}) = p(o_t, \xi_t^{(r)} \mid s_t = i) = p(o_t \mid s_t = i)\; p(\xi_t^{(r)} \mid s_t = i) = b_i^{(s)}(o_t)\; g_i^{(s)}(\xi_t^{(r)}). \tag{10}$$

Following again [40], the extrinsic likelihoods may be written as a marginalization result according to

$$g_i^{(s)}(\xi_t^{(r)}) = \sum_{k \in R} p(\xi_t^{(r)} \mid r_t = k)\; P(r_t = k \mid s_t = i), \tag{11}$$

assuming conditional independence between $\xi_t^{(r)}$ and $s_t$, given state $r_t = k$. With $P(r_t = k \mid s_t = i)$ being assumed stationary, Shivappa et al. [40] proposed a heuristic model regarding $p(\xi_t^{(r)} \mid r_t = k)$, whose parameters are re-estimated iteratively during recognition. In contrast to that, we provide an analytical solution by means of Bayes' rule (see App. A).

So far, we assumed equal HMM state index spaces within each CR. However, the respective state index spaces $S$, $R$ may differ in multimodal ASR systems, e.g., audio-visual speech recognition. We will consider this in the following by merely assuming a stationary known prior co-occurrence probability for all HMM states $i \in S$ and $k \in R$, in the form

$$T_{i,k}^{(s)(r)} = \frac{P(r_t = k \mid s_t = i)}{P(r_t = k)} = \frac{P(r_t = k, s_t = i)}{P(r_t = k)\; P(s_t = i)}. \tag{12}$$

As we will see, $T_{i,k}^{(s)(r)}$ represents a linear transformation of extrinsic information from state index space $R$ to $S$. Note that in a topological sense this transformation resembles the so-called (de-)interleaving carried out by the (de-)interleavers of a parallel-concatenated turbo decoder [2] (compare Figs. 4 and 5), although their motivation is essentially different.

With (32) from App. A, we obtain a solution to compute the extrinsic likelihood in (10):

$$g_i^{(s)}(\xi_t^{(r)}) \propto \sum_{k \in R} T_{i,k}^{(s)(r)}\; \xi_t^{(r)}(k) = g_i^{(s)}(\bar{\xi}_t^{(r)}). \tag{13}$$
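The transformation (13) of extrinsic information from state index space R to S is a plain matrix-vector product with the co-occurrence matrix of (12). A minimal sketch (our naming; the matrix values below are hypothetical toy numbers):

```python
# Sketch of eq. (13): map an extrinsic probability vector xi over
# state space R (size M) to unnormalized extrinsic likelihoods over
# state space S (size N), using the stationary matrix T of eq. (12).

def extrinsic_likelihood(T_mat, xi):
    """g_i proportional to sum_k T[i][k] * xi[k], for each i in S."""
    return [sum(row[k] * xi[k] for k in range(len(xi))) for row in T_mat]
```

For equal state spaces and a perfectly diagonal co-occurrence (identity matrix), the extrinsic vector passes through unchanged, which matches the intuition that the transformation then acts like a trivial (de-)interleaver.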
Fig. 7. Example state trellis diagram for $t - t' = \Delta t = 2$ and $N = 4$ emitting states. Survivor paths are drawn with solid lines.

Fig. 8. Vector-matrix notation of the turbo Viterbi algorithm for the purpose of clarity; the Hadamard product $(\odot)$ denotes the entry-wise multiplication of two vectors resulting in a vector, $\|\cdot\|_1$ being the $L_1$ norm; CR means component recognizer; all symbols without CR identifier refer to CR (s).
way as with the turbo FBA (Sec. III-A). By doing so, we now derive a Viterbi recursion (6) with a modified score:

$$\delta_t^{(s)}(i)\Big|_{\text{new}} = \max_{j \in S}\left[\delta_{t-1}^{(s)}(j)\; a_{j,i}^{(s)}\right] b_i^{(s)}(o_t)\; g_i^{(s)}(\bar{\xi}_{t'}^{(r)}), \qquad t = 2, \ldots, T,\; j \in S, \tag{19}$$

with appended observation likelihoods, and the backtracking pointers $\psi_t^{(s)}(i)$ in (7) modified to employ $\delta_{t-1}^{(s)}(j)$. Moreover, the initializations of the baseline Viterbi algorithm can still be employed (Sec. II-C). The extrinsic likelihood $g_i^{(s)}(\bar{\xi}_{t'}^{(r)})$ again follows (14). To conclude, the most likely state sequence $(s^*)_1^T$ can be obtained from (14), (19), and (7), (8), and (9) being used in analogy.

In order not to overemphasize the observation likelihoods during the decoding iterations, we again strictly follow the turbo principle and dissect the modified scores $\delta_t^{(s)}(i)$ into a priori information, intrinsic information, and extrinsic information; only the latter is passed on between the CRs as new a priori information in each iteration. Assuming that all survivor paths at time instant $t$ contain essential information, we obtain the extrinsic probabilities by (20).

Note that $\Delta t = t - t'$ specifies a necessary computational delay for the purpose of providing a Viterbi confidence output: Eq. (20) provides such a confidence output $\gamma_{t'}^{(s)}(i)$ for time instant $t' = t - \Delta t$ (cf. (19)). Without loss of generality, just for ease of presentation, we assume $\Delta t$ to be a constant here. The choice of any such delay influences the fidelity of the extrinsic information. For the purpose of clarity, Fig. 7 illustrates an example state trellis diagram for $\Delta t = 2$ and $N = 4$.

Next, the multiplicative contribution $b_i\, g_i^{(s)}$ at time $t$ has to be taken out of the final score $\delta_t$ at time $t$ (same principle as in (18) compared to (17)). Now it is useful to understand that (20) is effectively a ratio of some such modified scores being added up, divided by all such modified scores. Such a ratio is usually known in the literature as homogeneity score [53].

In order to specify which scores are among the "some", we introduce a set $H_{i,t'} \subseteq S$ of final states at time $t = t' + \Delta t$ of all survivor paths through state $i$ at time $t'$ as

$$H_{i,t'} = \begin{cases} G_{i,t'+1}, & \text{if } t' = t - 1, \\ \bigcup_{\ell \in G_{i,t'+1}} H_{\ell,t'+1}, & \text{if } t' < t - 1. \end{cases} \tag{21}$$

Here, we used $G_{i,t'+1} = \{\ell \mid \psi_{t'+1}(\ell) = i,\; \ell \in S\}$ being the set of all states at time $t' + 1$ which are connected to state $i$ at time $t'$.
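One step of the modified recursion (19) can be sketched as follows. The naming is ours; the extrinsic likelihood vector g is assumed to be precomputed, e.g., via (13)/(14), and the toy inputs in the usage below are hypothetical.

```python
# Sketch of one turbo Viterbi step, eq. (19): the baseline
# max-product recursion (6)/(7) with the extrinsic likelihood
# factor g_i appended to the emission likelihood b_i(o_t).

def turbo_viterbi_step(delta_prev, A, b_t, g_t):
    """Return (delta_t, psi_t) over all states i, cf. (19) and (7)."""
    N = len(delta_prev)
    delta_t, psi_t = [], []
    for i in range(N):
        scores = [delta_prev[j] * A[j][i] for j in range(N)]
        j_best = max(range(N), key=lambda j: scores[j])
        psi_t.append(j_best)                  # backtracking pointer (7)
        delta_t.append(scores[j_best] * b_t[i] * g_t[i])
    return delta_t, psi_t
```

Setting all entries of g to 1 recovers the baseline Viterbi step (6), which is consistent with the first iteration before any extrinsic information is available.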
Here, the state scores $\delta_t^{(s)}$ and backtracking pointers $\psi_t^{(s)}$ are denoted in vectorial notations $\boldsymbol{\delta}_t^{(s)} = [\delta_t^{(s)}(1), \ldots, \delta_t^{(s)}(N)]^T$ and $\boldsymbol{\psi}_t^{(s)} = [\psi_t^{(s)}(1), \ldots, \psi_t^{(s)}(N)]^T$. A column vector $a_i^{(s)}$ of matrix $A^{(s)}$ is given by $A^{(s)} = (a_1, \ldots, a_i, \ldots, a_N)$. Note that $\{\max[\cdot]\}_{i \in S}$ denotes a column vector with index $i$, with $\max[\cdot]$ delivering the maximum of the elements of its vectorial argument. The $(\odot)$ operator again marks the Hadamard product, an entry-wise product of vectors.

C. Employing Weights

While turbo decoders in practice may regularly converge to a common estimate of the information of interest, neither convergence to the correct solution nor even convergence to a stable solution can be guaranteed, particularly at a low signal-to-noise ratio (SNR) [54]. However, inspired by the weighting schemes commonly applied in multi-stream HMMs [35], the inclusion of the a priori information in (15), (16), or (19) and also the intrinsic likelihood may be controlled beneficially. As shown in Fig. 5, we therefore introduce four individual exponents, here denoted $\beta_o$, $\beta_u$, $\beta_s$, $\beta_r$, on the intrinsic and extrinsic likelihoods, respectively, controlling the dominance of either observation or extrinsic information for each CR over iteration instant $z$. Those four exponents can also be called weights on the respective logarithmic entities, or in brief: likelihood weights. For further details considering the employed weights, please refer to Appendix B.

IV. EVALUATED ASR REFERENCE APPROACHES

We compared our proposed turbo ASR approaches to one representative of each of the three levels of information fusion, respectively: feature level, classifier level, and decision level fusion [33]. In both an audio-visual and an audio-only speech recognition task, the following feature representations were examined: For visual speech representation, we extracted shape-based features of order 11 for each speaker at the visual frontend (cf. [50]), respectively. As acoustic features, we employed 13 MFCC coefficients according to the ETSI Advanced Front-End (AFE) Recommendation [55], plus 1st- and 2nd-order derivatives and an additional log-energy parameter. For a second acoustic decoder, Gabor features of order 311 were extracted [56]. Gabor features are particularly interesting in information fusion, since spectro-temporal features are reported to contain complementary information to MFCCs [56]. Moreover, they are very strong in high SNR and relatively weak in low SNR, which intentionally poses a challenge on information fusion systems in such conditions. In general, all feature representations were obtained by applying a 25 ms window with 10 ms shift.

a) Feature Concatenation (CONCAT): In feature level fusion, the incoming feature vectors are believed to be directly related and synchronous [25], [33]. On this basis, the classifier learns and recognizes the statistics of a joint feature representation, which is usually achieved by an ordinary concatenation of the incoming feature vectors. Employing common feature concatenation as reference system, one of the respective baseline algorithms (Sec. II-B and II-C) is applied, processing the $(d_o + d_u) = d$-dimensional sequence of concatenated feature vectors $y_1^T = y_1, \ldots, y_T$ with values $y_t = [o_t^T, u_t^T]^T \in \mathbb{R}^d$, with $(\cdot)^T$ being the transpose.

b) Coupled HMM (CHMM): In classifier level fusion systems, training is employed independently, while recognition is performed under a temporal dependence assumption between the observation likelihood streams [33]. We compared our turbo ASR approaches to the widely known coupled HMM (CHMM) approach [25], [34], which serves as classifier level fusion reference. As commonly practiced in CHMMs, the coupled stationary state transition probability

$$A^{(s)(r)} = \{a_{j,i}\, a_{\ell,k}\}_{j,i \in S,\; \ell,k \in R}, \tag{22}$$

as well as the coupled emission

$$b_{i,k}^{(s)(r)}(o_t, u_t) = p(o_t \mid s_t = i)^{\beta_o}\; p(u_t \mid r_t = k)^{\beta_u}, \qquad (i, k) \in S \times R, \tag{23}$$

can be gathered from the two marginal HMMs $\lambda^{(s)}$ and $\lambda^{(r)}$, respectively. The two parameters $\beta_o$ and $\beta_u$ in (23) are optimized dependent on the SNR, as shown later.

c) Iterative Reference (-S): Considering an iterative decoding reference, we implemented the audio-visual ASR approach presented by Shivappa et al. [39]–[41]. In compliance with their proposed procedure, we employed a heuristic model for the likelihood $p(\xi_t^{(r)} \mid r_t = k)$ in [39, eq. (4)], letting

$$p(\xi_t^{(r)} \mid r_t = k) = f(1 - \xi_{k,t}^{(r)};\, \vartheta) \prod_{\ell \neq k} f(\xi_{\ell,t}^{(r)};\, \vartheta). \tag{24}$$

However, in order to improve fairness of comparison to our turbo ASR approaches partly employing a weighting scheme (III-C), we further optimized the exponential distribution $f(\cdot\,; \vartheta)$ used in (24) by introducing an additional SNR-dependent scaling factor, here denoted $\kappa_{\mathrm{SNR}} > 0$, with

$$f(\xi; \vartheta) = \begin{cases} \dfrac{\kappa_{\mathrm{SNR}}}{\vartheta^{(z)}}\; e^{-\kappa_{\mathrm{SNR}}\, \xi / \vartheta^{(z)}}, & \xi \geq 0, \\ 0, & \xi < 0. \end{cases} \tag{25}$$

According to [39], we computed and updated the rate parameter $\vartheta^{(z)}$ as the estimated variance of the likelihood values $p(\xi_t^{(r)} \mid r_t = k)$ during recognition at each iteration. The same was done for $p(\xi_t^{(s)} \mid s_t = i)$. The parameter $\kappa_{\mathrm{SNR}}$ is optimized dependent on the SNR, as shown later.

d) Weighted ROVER: In decision level fusion systems, the final classifier outputs obtained from separate classifiers are combined to achieve a joint recognition hypothesis. Hence, the individual recognition processes are completely independent of each other. Considering the combination of $N = 2$ word sequences of length $R^{(s)}$ and $R^{(r)}$, we employed a weighted version of the well-known recognition output voting error reduction (ROVER) approach as decision level fusion reference [31], [57]. After aligning the two individual classifier outputs at word-level by dynamic programming (resulting in a
we applied a pattern search algorithm [63] maximizing an accuracy-based figure of merit of both CRs, given by

$$\mathrm{ACC_{FoM}} = \mathrm{ACC}_{z=8}^{(s)} + \mathrm{ACC}_{z=8}^{(r)} - (1 - \varepsilon)\left|\mathrm{ACC}_{z=8}^{(s)} - \mathrm{ACC}_{z=8}^{(r)}\right|, \tag{29}$$

where $\mathrm{ACC}_{z=8}^{(\mathrm{CR})}$ denotes the obtained accuracy of either CR (s) or CR (r) after the (arbitrarily chosen) 8th iteration. As can be seen, both accuracies shall be high, while the accuracy difference shall be low. While $\varepsilon$ was set to 0.45 for all optimizations, the actual optimization procedure was carried out in two steps: first, the task-dependent extrinsic likelihood weight parameters (subscripts $s$, $r$; cf. App. B) were optimized on a multi-condition parameter training subset of only one of the parameter training speakers⁵ disturbed with white noise (in total 600 files, pooled from the respective speech files of the parameter training subset being interfered at an SNR of 0 dB, 15 dB, 30 dB), governing the task-dependent a priori information influence of each CR. Subsequently, these extrinsic likelihood weights were kept constant, and only the intrinsic likelihood weights $\beta_o$, $\beta_u$ (cf. App. B) were adapted on the 400 speech files of the parameter training subset according to the SNR, adjusting the influence of the intrinsic information in a fully SNR-dependent manner. However, since the video data is not affected by acoustic noise, in the audio-visual task also the video emission weight $\beta_u$ was kept constant for all SNRs.

For the CHMM reference method (cf. Sec. IV-b), we employed the very same pattern search algorithm [63] on the abovementioned 400 files of the parameter training subset, optimizing the two control variables $\beta_o$, $\beta_u$ in (23) by maximizing the accuracy. As CHMMs imply an elementary single-step classifier level fusion, these two SNR-dependent weights were found after simply applying a baseline FBA (cf. Sec. II-B) or Viterbi algorithm (cf. Sec. II-C). For the audio-visual iterative reference approach (cf. Sec. IV-c), the iterative rate parameter scaling $\kappa_{\mathrm{SNR}}$ (25) was optimized separately on the very same 400 speech files of the parameter training subset by employing the same pattern search algorithm [63] to maximize (29). Note that $\varepsilon = 0.45$ was also chosen as optimal value here. For the weighted ROVER (cf. Sec. IV-d, [57]), we introduced a confidence weight, here denoted $\rho^{(r)}$, set to 1 for CR (r) operating on video or Gabor features, and, for fairness in later comparisons, an SNR-individually optimized $\rho^{(s)}$ for CR (s) operating on MFCCs (27), being also optimized by means of a simple grid search on the 400 parameter training files maximizing accuracy.

Back to the turbo ASR approaches, we inferred stationarity of $T_{i,k}^{(s)(r)}$, and estimated it by means of a baseline FBA (cf. Sec. II-B) computing the state posteriors $\gamma^{(s)}(i)$, $\gamma^{(r)}(k)$ on the respective training data instances and subsequent estimation of a joint probability distribution

$$P(r_t = k, s_t = i) = \frac{1}{C}\; \gamma^{(s)}(i)\; \gamma^{(r)}(k), \qquad (i, k) \in S \times R, \tag{30}$$

whereas the probabilities $P(r_t = k)$ and $P(s_t = i)$ in (12) are obtained by marginalization of (30).

For each SNR, we carried out $z_{\max} = 8$ turbo iterations and computed the output posteriors, or scores, respectively, of each CR. Except where otherwise stated, in all conducted turbo Viterbi experiments presented hereafter, $\Delta t$ was set to 100 frames.

B. Results and Discussion

Tables I–XIII and Figs. 9–11 illustrate the results of our recognition experiments in white Gaussian noise, as well as train station, car, and babble noise taken from the AURORA-2 database [61]. In the figures, the dotted lines with triangular markers show the single-channel baselines (suffix -B) for MFCC and video (or Gabor, respectively), using either an FBA or Viterbi baseline algorithm (cf. Sec. II-B and II-C). Further dotted lines depict the feature concatenation reference (CONCAT; cf. Sec. IV-a), whereas the dashed lines plot the CHMM reference (cf. Sec. IV-b). Moreover, the ROVER reference (cf. Sec. IV-d) is indicated by dashed lines with separate markers. The remaining curves indicate the recognition results of Shivappa's iterative reference (dashed, with suffix -S; cf. Sec. IV-c, [39]) and the herein presented turbo recognition approaches (solid, with suffix -T; cf. Secs. III-A, III-B): one curve was obtained by starting with the MFCC CR in the first iteration and then examining the output of both CRs in an alternating fashion. Analogously, the other curve was generated by starting with the video (or Gabor, respectively) CR.

a) Audio-Visual Task (FBA): Applying a baseline FBA (cf. Sec. II-B) to an audio-visual speech recognition task in white Gaussian noise, the following single-modality accuracies were achieved (Tab. I): 53.5% on the video-only test corpus, while the audio-only recognition results vary from 33.4% at 0 dB SNR to 94.1% in undisturbed (i.e., clean) conditions. The MFCC baseline (MFCC-B) is the best among all reference schemes for SNR > 15 dB. In comparison, the audio-visual CHMM approach yields accuracies between 54.0% and 94.1%, providing the best audio-visual reference algorithm on average over all SNR conditions. The audio-visual feature concatenation approach (CONCAT) yields accuracies of 48.7% up to 89.1%, being the best reference method at about 5 dB SNR. Nevertheless, the susceptibility of feature level fusion (as in CONCAT) to strongly differing performance of the modalities due to the joint feature representation becomes visible at 0 dB SNR and at high SNRs. The audio-visual joint recognition hypothesis of decision level fusion by ROVER in this task yields, except for one condition (5 dB), a performance in between the MFCC and the video baselines, mostly closer to the better one. This might be due to some actually incorrect words with high confidence, caused by the approximation of word confidence scores by multiplication of state posteriors (27), and the setting of the ROVER alignment module (Sec. IV-d) to include all occurring output words into the aligned word net, whereby all arising insertions are also considered for voting (26). Shivappa's iterative audio-visual reference takes some vis-

⁵ Note that preliminary investigations showed that using limited data of only one parameter training speaker is fully sufficient for the optimization of these parameters.
four turbo ASR parameters. ible profit from the incorporated parametric model (reinforcing
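The joint-distribution estimate (30) from the two recognizers' FBA state posteriors can be sketched in a few lines. This is our own illustrative NumPy sketch: the frame-wise accumulation over the training data and all variable names are our assumptions, not code or notation from the paper.

```python
import numpy as np

def estimate_joint_state_distribution(gamma_s, gamma_r):
    """Estimate P(r_t = k, s_t = i) as in (30) from frame-synchronous
    FBA state posteriors of the two classifiers (CRs).

    gamma_s: shape (T, N), posteriors of CR (s) over states i in S
    gamma_r: shape (T, M), posteriors of CR (r) over states k in R
    Returns an (N, M) matrix whose entries sum to 1.
    """
    # Accumulate the products gamma^(s)(i) * gamma^(r)(k) over all frames ...
    joint = gamma_s.T @ gamma_r                      # shape (N, M)
    # ... and divide by the normalization constant C so that the result
    # is a joint probability distribution over (i, k) in S x R.
    return joint / joint.sum()

# Toy usage with random, row-normalized posteriors:
rng = np.random.default_rng(0)
g_s = rng.random((100, 4)); g_s /= g_s.sum(axis=1, keepdims=True)
g_r = rng.random((100, 3)); g_r /= g_r.sum(axis=1, keepdims=True)
T_sr = estimate_joint_state_distribution(g_s, g_r)
```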
856 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 24, NO. 5, MAY 2016
[TABLE II] Audio-visual recognition results in word accuracy (% Acc) vs. SNR (dB) in train station noise. All approaches are based on FBA recognition.
[TABLE IV] Audio-visual recognition results in word accuracy (% Acc) vs. SNR (dB) in babble noise. All approaches are based on FBA recognition.
[TABLE VI] Audio-only recognition results in word accuracy (% Acc) vs. SNR (dB) in train station noise, operating on MFCC and Gabor features. All approaches are based on FBA recognition.
[TABLE VIII] Audio-only recognition results in word accuracy (% Acc) vs. SNR (dB) in babble noise, operating on MFCC and Gabor features. All approaches are based on FBA recognition.
[TABLE IX] Audio-only recognition results in word accuracy (% Acc) vs. SNR (dB) in white Gaussian noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.
[TABLE X] Audio-only recognition results in word accuracy (% Acc) vs. SNR (dB) in train station noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.
[TABLE XI] Audio-only recognition results in word accuracy (% Acc) vs. SNR (dB) in car noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.
[TABLE XII] Audio-only recognition results in word accuracy (% Acc) vs. SNR (dB) in babble noise, operating on MFCC and Gabor features. All approaches are based on Viterbi recognition.
[TABLE XIII] Audio-only recognition results in word accuracy (% Acc) vs. SNR (dB) for different confidence computational delays Δt. Turbo Viterbi approaches operating on MFCC and Gabor features in white Gaussian noise.

… likelihoods only yields accuracies of at best 85.1% (Gabor) in this 20 dB SNR condition.

Considering again the AURORA-2 noise environments (Tabs. VI–VIII), the recognition results reflect the previous findings with white Gaussian noise, except that in babble noise, averaged over all SNR, CONCAT and CHMM perform slightly better than ROVER. Again, MFCC-T turbo ASR outperforms all reference approaches in all noises and SNR conditions. Moreover, both turbo ASR approaches exceed the performance of any of the reference approaches on average over all SNR conditions in any of the investigated noise types; averaged over all four noise types and SNRs, MFCC-T is ahead of the overall best reference approach ROVER by an absolute 3.8%, which is a relative reduction in WER of 18.2%.

c) Audio-Only Task (Viterbi): When employing the baseline Viterbi algorithm (cf. Sec. II-C) in an audio-only speech recognition task in white Gaussian noise, the following single-model accuracies were achieved (Tab. IX): the MFCC baseline results vary from 38.7% at 0 dB SNR to 90.6% in clean conditions, while the Gabor baseline yields only 1.6% at 0 dB SNR but up to 95.9% in clean conditions. As with the FBA audio-only results, the MFCC baseline again is the best reference approach for low SNR. Again, the feature concatenation approach very much follows the noise-sensitive Gabor baseline recognition results, serving as a sound reference at high SNRs (> 15 dB). Still, again the weighted ROVER provides a strong recognition performance in between the MFCC and the Gabor baselines, close to the better one. We observe that ROVER offers the best audio-only reference on average over all SNR conditions.

As with the FBA-based turbo schemes, both audio-only Viterbi turbo approaches perform better on average over all SNR conditions than any of the reference approaches, with MFCC-T being the best among these. Only in the clean condition do the Gabor baseline (Gabor-B) as well as the weighted ROVER reference yield slightly better word accuracies. In this condition, the somewhat weaker performance of the turbo Viterbi schemes might be owed to the use of not globally optimal weights on the extrinsic likelihoods, impeding the feedback between the two individual CRs^8.

^8 Note that the undisturbed (i.e., clean) condition is not part of the multi-condition setup for extrinsic likelihood weights optimization (cf. Sec. V-A).

Looking at Fig. 11 (SNR = 20 dB), given almost equal recognition accuracies of the MFCC (85.4%) and the Gabor baseline (85.1%), the feature concatenation approach slightly improves recognition performance (85.8%). The turbo Viterbi schemes, on the other hand, clearly take profit from the iterative and weighted information fusion. Both show a clear improvement of the word accuracy from 85.4% (85.1%) to 88.2% (87.4%) in the 8th iteration.

Taking a look at the AURORA-2 noises (Tabs. X–XII), to some extent we observe a similar behavior as with white Gaussian noise. Besides a local 10 dB weakness of the turbo approaches in car noise, we find a 0.1% clean-condition advantage of the Gabor baseline and the again best reference approach ROVER vs. the MFCC-T turbo approach. Also at SNR = 25 dB (train station and babble noise) we find that CONCAT performs a bit better than the turbo approaches. Upon closer analysis, this might be owed to somewhat wrong or unreliable extrinsic information, strongly impeding the interaction between the individual CRs, particularly at very high SNRs. Again, however, on average over all SNR conditions, both turbo ASR approaches exceed the performance of any of the reference approaches in each of the noise types. Moreover, averaged over all SNRs and all four noise types, the turbo approach MFCC-T is ahead of the best reference approach (ROVER) by an absolute 1.2%, which corresponds to a relative WER reduction of 6.5%.

As observed with the AURORA-2 noises, the extrinsic probabilities (Viterbi confidence information) play a crucial role in the potential performance of the turbo Viterbi; other approaches than (20) may even perform better. Note also that the turbo Viterbi performance was obtained by means of a real-time decoding approach with an adjustable computational delay of Δt frames for the purpose of providing a confidence output. Note that we do not actively influence the inherent decision latency of the Viterbi algorithm. Tab. XIII depicts a comparison of three different delays Δt, allowing a clear view on the inherent trade-off between a low confidence output latency (Δt = 10 frames) and the more reliable confidence (20) obtained at Δt = 100 frames. For all earlier turbo Viterbi experiments reported here, we employed Δt = 100 frames (cf. Sec. III-B), but even with Δt = 20 frames the performance on average over all SNR conditions would have been better than the best reference scheme ROVER (cf. Tab. IX). Note that recognition results achieved in batch file mode (i.e., Δt = T and Δt = T − t, with T being the last frame in the file) even exceeded the herein reported results for Δt = 100 frames.

Let us close with some final considerations on complexity. What are the additional requirements of the turbo ASR approach on top of some existing Viterbi-based ASR system? In broad terms, each iteration z requires an execution of the Viterbi algorithm (19). The additional multiplication with the extrinsic likelihood term in (19) is negligible compared to the max[·] operation. Next, the computation of extrinsic information (20) is required. This does not consume more computational power than typical confidence computations, particularly since it turns out that the performance loss of omitting the two small denominators in (20) (very much like in (28)) is small, and doing so avoids bookkeeping of the former products b_i^{(s)} g_i^{(s)}. Apart from
the negligible effort of employing intrinsic and extrinsic likelihood weights (cf. Fig. 5), there is indeed effort needed in computing the multiplication (14) with matrix T^(s)(r), or T^(r)(s), respectively. While the matrix itself requires N · M words of memory (the number of states N, M can easily be about 10,000 in LVCSR systems), the execution of z_max iterations (z_max = 8 in our simulations) requires (z_max − 1) · N · M multiply-accumulate operations per frame. Since this may still be considered a computational obstacle towards the use of turbo ASR in practical LVCSR applications, the role and structure and even necessity of the state transition matrix T is of major interest for further investigations.

VI. CONCLUSIONS

In this paper, we transferred the famous turbo principle from digital communications to the domain of ASR, providing an elegant solution to classifier-level information fusion. First, we reviewed our turbo decoding forward-backward algorithm (FBA), discussing differences to other prior art. Then we presented the new turbo Viterbi algorithm for ASR, showing that actually no severe modification of the Viterbi algorithm is required, providing a real-time capable solution for turbo ASR in practice. We showed simulation results both in a multi-modal (audio-visual) ASR task, and in a single-channel unimodal ASR task (audio-only with two different feature extractions). The experimental results prove the significant benefit of turbo ASR approaches over both iterative and conventional methods for information fusion on different levels, illustrated by outperforming even the best reference system on average by a relative WER reduction of 22.4% and 18.2%, respectively.

ACKNOWLEDGMENT

The authors would like to thank the unknown reviewers for their numerous helpful comments on an earlier draft of this article, and also Peter Transfeld for valuable discussions on likelihood stream weighting aspects.

APPENDIX A
COMPUTING THE FBA EXTRINSIC LIKELIHOOD

Applying Bayes' rule to (11), the likelihood p(ξ_t^{(r)} | r_t = k) may be dissected further:

    p(ξ_t^{(r)} | r_t = k) = P(r_t = k | ξ_t^{(r)}) · p(ξ_t^{(r)}) / P(r_t = k)
                           = P(r_t = k | ξ_t^{(r)}(1), …, ξ_t^{(r)}(M)) · p(ξ_t^{(r)}) / P(r_t = k)
                           = ξ_t^{(r)}(k) · p(ξ_t^{(r)}) / P(r_t = k).   (31)

Note that ξ_t^{(r)} in (31) represents a vector of extrinsic probabilities ξ_t^{(r)}(k) = P(r_t = k | …) of all states r_t = k ∈ R, given the entirety of exploited information "…" of the current and preceding iterations, which is basically the sequences {o_1^T, u_1^T} with the actual intrinsic and a priori information at time instant t in (17) being taken out. Given the evidence [64] that the output a posteriori probabilities (APPs) of a decoder already provide a sufficient statistic of the received sequences, i.e., of all exploited information, the probability of state r_t = k given the exploited information can accordingly be replaced by the (APP-related) extrinsic probability ξ_t^{(r)}(k) of the respective state k (31). With p(ξ_t^{(r)}) being omitted in (31) due to its state-independence, we derive a simple formulation of the extrinsic likelihood in (10):

    g_i^{(s)}(ξ_t^{(r)}) = Σ_{k∈R} [ P(r_t = k | s_t = i) / P(r_t = k) ] · ξ_t^{(r)}(k).   (32)

APPENDIX B
ON THE USE OF WEIGHTS IN TURBO ASR

Eqs. (17), (19) reveal both the presence of the intrinsic information b_i^{(s)}(o_t) and the respective fed back a priori information g_i^{(s)}(ξ_t^{(r)}). They complement each other and, ideally, the a priori information serves to sharpen the decoder's observation likelihoods, enabling it to converge to a correct estimate.

As illustrated in Fig. 5, first we introduce two exponents 0 ≤ β_o, β_u ≤ 1 for the observation likelihoods (multiplicative weights for the log-likelihoods), respectively, to settle the influence of the intrinsic information. As in a multi-stream HMM, these two intrinsic weights also serve to compensate for a constant bias in the reliability of a respective observation likelihood [25], e.g., depending on the signal-to-noise ratio (SNR) [34], [36]. Initially ensuring non-iterative reference algorithm behavior, the two intrinsic weights are set to unity in the first two iterations z ∈ {1, 2} and, from the third iteration on, they are separately set to an attenuated fixed value.

Second, two weights on the extrinsic log-likelihoods are employed, determining the influence of the a priori information by adjusting its likelihood peakedness. To obtain an increasing influence of the a priori information with ongoing iterations, we employ two extrinsic weights λ_s, λ_r that grow dynamically according to a logistic function

    λ(z) = 1 / ( 1 + (1/λ(2) − 1) · e^{−κ(z−2)} ),   z = 2, 3, …,   (33)

with λ(z) ∈ {λ_s(z), λ_r(z)}. Here, λ(2) ∈ {λ_s(2), λ_r(2)} and κ ∈ {κ_s, κ_r} mark the initial extrinsic weight and the logistic proportionality constant, respectively. Hence, beginning from a given initial value, the extrinsic weights approach unity as the number of iterations z increases. Note that the exact definition of the extrinsic weight growing (33) from some start value λ(2) towards 1 is not performance-critical, as long as it is monotonically increasing.

REFERENCES

[1] R. P. Lippmann, "Speech recognition by machines and humans," Speech Commun., vol. 22, no. 1, pp. 1-15, Jul. 1997.
[2] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1064-1070.
[3] C. Berrou, R. Pyndiah, P. Adde, C. Douillard, and R. Le Bidan, "An overview of turbo codes and their applications," in Proc. IEEE Eur. Conf. Wireless Technol., Paris, France, Oct. 2005, pp. 1-9.
[4] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379-423, Jul. 1948.
[5] S. Lin and D. J. Costello, Jr., Error Control Coding. Englewood Cliffs, NJ, USA: Prentice-Hall, 1983.
[6] R. Johannesson and K. S. Zigangirov, Fundamentals of Convolutional Coding. Hoboken, NJ, USA: Wiley/IEEE Press, 1999.
[7] G. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, no. 3, pp. 268-278, Mar. 1973.
[8] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[9] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284-287, Mar. 1974.
[10] L. Bahl, F. Jelinek, and R. Mercer, "A maximum likelihood approach to continuous speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-5, no. 2, pp. 179-190, Mar. 1983.
[11] J. Hagenauer and P. Hoeher, "A Viterbi algorithm with soft-decision outputs and its applications," in Proc. GLOBECOM, Dallas, TX, USA, Nov. 1989, pp. 1680-1686.
[12] J. Huber and A. Rüppel, "Zuverlässigkeitsschätzung für die Ausgangssymbole von Trellis-Decodern" (Reliability estimation for the output symbols of trellis decoders), AEÜ, vol. 44, no. 1, pp. 8-21, Jan. 1990 (in German).
[13] H. Jiang, "Confidence measures for speech recognition: A survey," Speech Commun., vol. 45, pp. 455-470, 2005.
[14] F. Wessel, R. Schlüter, K. Macherey, and H. Ney, "Confidence measures for large vocabulary continuous speech recognition," IEEE Trans. Speech Audio Process., vol. 9, no. 3, pp. 288-298, Mar. 2001.
[15] J. Hagenauer, "The turbo principle: Tutorial introduction and state of the art," in Proc. Int. Symp. Turbo Codes Related Topics, Brest, France, Sep. 1997, pp. 1-11.
[16] F. Faubel and M. Wölfel, "Coupling particle filters with automatic speech recognition for speech feature enhancement," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Pittsburgh, PA, USA, Sep. 2006, pp. 37-40.
[17] Z.-J. Yan, F. Soong, and R.-H. Wang, "Word graph based feature enhancement for noisy speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Honolulu, HI, USA, Apr. 2007, vol. 4, pp. IV-373-IV-376.
[18] L. Deng, J. Droppo, and A. Acero, "Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 11, no. 6, pp. 568-580, Nov. 2003.
[19] S. Windmann and R. Haeb-Umbach, "Approaches to iterative speech feature enhancement and recognition," IEEE Trans. Audio Speech Lang. Process., vol. 17, no. 5, pp. 974-984, Jul. 2009.
[20] M. Paulik, S. Stüker, C. Fügen, T. Schultz, T. Schaaf, and A. Waibel, "Speech translation enhanced automatic speech recognition," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Cancún, Mexico, Nov. 2005, pp. 121-126.
[21] H. Bourlard and S. Dupont, "A new ASR approach based on independent processing and recombination of partial frequency bands," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 426-429.
[22] H. Hermansky, S. Tibrewala, and M. Pavel, "Towards ASR on partially corrupted speech," in Proc. Int. Conf. Spoken Lang. Process. (ICSLP), Philadelphia, PA, USA, Oct. 1996, pp. 462-465.
[23] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Amer., vol. 26, no. 2, pp. 212-215, Mar. 1954.
[24] D. G. Stork, M. E. Hennecke, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. New York, NY, USA: Springer, 1996.
[25] C. Neti et al., "Audio-visual speech recognition," Center Lang. Speech Process., Johns Hopkins Univ., Baltimore, MD, USA, Tech. Rep. EPFL-Report-82633, IDIAP, 2000.
[26] G. Potamianos, C. Neti, G. Iyengar, and E. Helmuth, "Large-vocabulary audio-visual speech recognition by machines and humans," in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 1027-1030.
[27] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: An overview," in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds. Cambridge, MA, USA: MIT Press, 2004, pp. 356-396.
[28] J. Kratt, F. Metze, R. Stiefelhagen, and A. Waibel, "Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit," in Proc. DAGM-Symp., Tübingen, Germany, Aug. 2004, pp. 488-495.
[29] U. Jain et al., "Recognition of continuous broadcast news with multiple unknown speakers and environments," in Proc. ARPA Speech Recog. Workshop, Harriman, NY, USA, Feb. 1996, pp. 61-66.
[30] J. Ming, P. Hanna, D. Stewart, M. Owens, and F. J. Smith, "Improving speech recognition performance by using multi-model approaches," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Phoenix, AZ, USA, Mar. 1999, pp. 161-164.
[31] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. Workshop Automat. Speech Recog. Understand. (ASRU), Santa Barbara, CA, USA, Dec. 1997, pp. 347-352.
[32] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: Word error minimization and other applications of confusion networks," Comput. Speech Lang., vol. 14, no. 4, pp. 373-400, Oct. 2000.
[33] S. Lucey, T. Chen, S. Sridharan, and V. Chandran, "Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition," IEEE Trans. Multimedia, vol. 7, no. 3, pp. 495-506, Jun. 2005.
[34] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP J. Appl. Signal Process., vol. 11, pp. 1-15, 2002.
[35] A. Garg, G. Potamianos, C. Neti, and T. S. Huang, "Frame-dependent multi-stream reliability indicators for audio-visual speech recognition," in Proc. Int. Conf. Multimedia Expo (ICME), Baltimore, MD, USA, Jul. 2003, pp. 605-608.
[36] D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, "Audiovisual speech recognition with missing or unreliable data," in Proc. Auditory Visual Speech Process. (AVSP), Norwich, U.K., Sep. 2009, pp. 117-122.
[37] J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Salt Lake City, UT, USA, May 2001, pp. 169-172.
[38] A. Abdelaziz, S. Zeiler, and D. Kolossa, "Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition," IEEE Trans. Audio Speech Lang. Process., vol. 23, no. 5, pp. 863-876, Mar. 2015.
[39] S. Shivappa, B. Rao, and M. Trivedi, "An iterative decoding algorithm for fusion of multimodal information," EURASIP J. Adv. Signal Process., vol. 2008, pp. 1-10, 2008.
[40] S. Shivappa, B. Rao, and M. Trivedi, "Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Las Vegas, NV, USA, Apr. 2008, pp. 2241-2244.
[41] S. Shivappa, M. Trivedi, and B. Rao, "Audiovisual information fusion in human-computer interfaces and intelligent environments: A survey," Proc. IEEE, vol. 98, no. 10, pp. 1692-1715, Oct. 2010.
[42] S. Shivappa, M. Trivedi, and B. Rao, "Person tracking with audio-visual cues using the iterative decoding framework," in Proc. IEEE 5th Int. Conf. Adv. Video Signal Based Surveillance (AVSS), Santa Fe, NM, USA, Sep. 2008, pp. 260-267.
[43] S. Shivappa, B. Rao, and M. Trivedi, "Audiovisual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation," IEEE J. Sel. Topics Signal Process., vol. 4, no. 5, pp. 882-894, Oct. 2010.
[44] D. Divsalar and F. Pollara, "Turbo codes for deep-space communications," Jet Propul. Lab., Pasadena, CA, USA, Telecommun. Data Acquis. Progress Rep. 42-120, Feb. 1995.
[45] R. G. Gallager, "Low-density parity-check codes," IRE Trans. Inf. Theory, vol. 8, no. 1, pp. 21-28, Jan. 1962.
[46] J. Lodge, R. Young, P. Hoeher, and J. Hagenauer, "Separable MAP filters for the decoding of product and concatenated codes," in Proc. IEEE Int. Conf. Commun. (ICC), Geneva, Switzerland, May 1993, pp. 1740-1745.
[47] S. ten Brink, "Convergence behaviour of iteratively decoded parallel concatenated codes," IEEE Trans. Commun., vol. 49, no. 10, pp. 1727-1737, Oct. 2001.
[48] D. Scheler, S. Walz, and T. Fingscheidt, "On iterative exchange of soft state information in two-channel automatic speech recognition," in Proc. ITG-Fachtagung Sprachkommunikation, Sep. 2012, pp. 55-58.
[49] S. Receveur and T. Fingscheidt, "A turbo-decoding weighted forward-backward algorithm for multimodal speech recognition," in Proc. Int. Workshop Spoken Dialog Syst. (IWSDS), Napa Valley, CA, USA, Jan. 2014, pp. 4-15.
[50] S. Receveur and T. Fingscheidt, "A compact formulation of turbo audio-visual speech recognition," in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Florence, Italy, May 2014, pp. 5554-5558.
[51] S. Receveur, R. Weiss, and T. Fingscheidt, "Multimodal ASR by turbo decoding vs. feature concatenation: Where to perform information integration?" in Proc. 11th ITG Conf. Speech Commun., Erlangen, Germany, Sep. 2014, pp. 21-24.
[52] C. Douillard et al., "Iterative correction of intersymbol interference: Turbo-equalization," Eur. Trans. Telecommun., vol. 6, no. 5, pp. 507-511, May 1995.
[53] R. Zhang and A. Rudnicky, "Word level confidence annotation using combinations of features," in Proc. 7th Eur. Conf. Speech Commun. Technol., Aalborg, Denmark, Sep. 2001, pp. 2105-2108.
[54] A. C. Reid, T. A. Gulliver, and D. P. Taylor, "Convergence and errors in turbo-decoding," IEEE Trans. Commun., vol. 49, no. 12, pp. 2045-2051, Dec. 2001.
[55] ETSI STQ Aspects: Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms, ETSI ES 202 050, Oct. 2002.
[56] M. R. Schädler, B. T. Meyer, and B. Kollmeier, "Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition," J. Acoust. Soc. Amer., vol. 131, no. 5, pp. 4134-4151, May 2012.
[57] B. Hoffmeister, T. Klein, R. Schlüter, and H. Ney, "Frame based system combination and a comparison with weighted ROVER and CNC," in Proc. INTERSPEECH, Pittsburgh, PA, USA, Sep. 2006, pp. 537-540.
[58] K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Theoretical analysis of diversity in an ensemble of automatic speech recognition systems," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 3, pp. 711-726, Mar. 2014.
[59] K. Audhkhasi, A. M. Zavou, P. G. Georgiou, and S. S. Narayanan, "Empirical link between hypothesis diversity and fusion performance in an ensemble of automatic speech recognition systems," in Proc. INTERSPEECH, Lyon, France, Aug. 2013, pp. 3082-3086.
[60] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2421-2424, Nov. 2006.
[61] H. G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions," in Proc. ISCA Workshop Automat. Speech Recog. (ASR), Paris, France, Sep. 2000, pp. 1-8.
[62] ITU, "Objective measurement of active speech level," ITU-T Recommendation P.56, Dec. 2011.
[63] T. G. Kolda, R. M. Lewis, and V. Torczon, "A generating set direct search augmented Lagrangian algorithm for optimization with a combination of general and linear constraints," Sandia National Lab., Albuquerque, NM, USA, Tech. Rep. SAND2006-5315, Aug. 2006.
[64] J. Kliewer, S. X. Ng, and L. Hanzo, "Efficient computation of EXIT functions for nonbinary iterative decoding," IEEE Trans. Commun., vol. 54, no. 12, pp. 2133-2136, Dec. 2006.

Robin Weiß received the M.Sc. degree in computer and communications systems engineering from Technische Universität Braunschweig, Braunschweig, Germany, in 2015. During his studies, he worked as a Student Assistant in the field of automatic speech recognition and wrote his master's thesis at the Institute for Communications Technology on turbo automatic speech recognition with multiple models. Since 2015, he has been working as a self-employed Data Scientist in Berlin, Germany. His research interests include iterative ASR, machine learning, and data visualization.

Tim Fingscheidt (S'93-M'98-SM'04) received the Dipl.-Ing. degree in electrical engineering and the Ph.D. degree from RWTH Aachen University, Aachen, Germany, in 1993 and 1998, respectively. He further pursued his work on joint speech and channel coding as a Consultant in the Speech Processing Software and Technology Research Department at AT&T Labs, Florham Park, NJ, USA. In 1999, he entered the Signal Processing Department of Siemens AG (COM Mobile Devices) in Munich, Germany, and contributed to speech codec standardization in ETSI, 3GPP, and ITU-T. In 2005, he joined Siemens Corporate Technology in Munich, Germany, leading the speech technology development activities in recognition, synthesis, and speaker verification. Since 2006, he has been a Full Professor with the Institute for Communications Technology, Technische Universität Braunschweig, Braunschweig, Germany. His research interests include speech and audio signal processing, enhancement, transmission, recognition, and instrumental quality measures. From 2008 to 2010, he served as an Associate Editor for the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, and since 2011 as a Member of the IEEE Speech and Language Processing Technical Committee. He was the recipient of several awards, including the Prize of the Vodafone Mobile Communications Foundation in 1999 and the 2002 prize of the Information Technology branch of the Association of German Electrical Engineers (VDE ITG), where he has been leading the Speech Acoustics Committee ITG FA4.3 since 2015.