
2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics

October 16-19, 2005, New Paltz, NY

SEPARATION AND ROBUST RECOGNITION OF NOISY, CONVOLUTIVE SPEECH
MIXTURES USING TIME-FREQUENCY MASKING AND MISSING DATA TECHNIQUES

Dorothea Kolossa, Aleksander Klimas, Reinhold Orglmeister
Electronics and Medical Signal Processing, TU Berlin
Einsteinufer 7, 10587 Berlin, Germany
d.kolossa@ee.tu-berlin.de
ABSTRACT


Time-frequency masking has emerged as a powerful technique for source separation of noisy and convolved speech mixtures. It has also been applied successfully to noisy speech recognition, see e.g. [1, 2]. But while significant SNR gains are possible with adequate masking functions, speech recognition performance suffers from the nonlinear operations involved, so that the greatly improved SNR often contrasts with only slight improvements in the recognition rate. To address this problem, marginalization techniques have been used for speech recognition [3, 4], but they rely on speech recognition and source separation being carried out in the same domain. However, source separation and denoising are often carried out in the short-time Fourier transform (STFT) domain, whereas the most useful speech recognition features are e.g. mel-frequency cepstral coefficients (MFCCs), LPC-cepstral coefficients and VQ features. In these cases, marginalization techniques are not directly applicable. Here, another approach is suggested, which estimates sufficient statistics for speech features in the preprocessing (e.g. STFT) domain, propagates these statistics through the transforms from the spectrum to e.g. the MFCCs of a speech recognition system, and uses the estimated statistics for missing data speech recognition. With this approach, significant gains can be achieved in speech recognition rates, and in this context, time-frequency masking yields recognition rate improvements of more than 35% when compared to TF-masking based source separation.

1. INTRODUCTION

This paper describes a novel approach to integrating preprocessing and speech recognition. In the preprocessing stage, information about the speech and noise signal statistics is acquired in many algorithms, e.g. in the Ephraim-Malah filter and spectral subtraction as well as in many multi-channel algorithms. When preprocessing is used for speech recognizers, however, the statistics are customarily used solely to compute a speech estimate, and only this information is passed on to the recognition engine. All information about confidences or variances of the features is discarded. This is, among other reasons, due to the different domains of computation of preprocessing and recognition. Whereas most denoising and source separation algorithms are designed to work on the spectrum or another time-frequency representation, speech recognizers perform better e.g. on mel-scaled cepstral speech features. Here, we suggest a new approach for dealing with this situation, which can be divided into three parts:

- Speech enhancement or source separation is carried out using statistical speech models. This results in an estimate of speech features with associated variances in the required time-frequency representation.

- The speech features and variances are transformed to the chosen speech feature domain of the recognizer, which gives the recognizer additional information in the form of variances for each feature.

- Speech recognition is carried out using the features and their variances, which are matched with the speech model of the recognizer HMM.

While this approach is also applicable to denoising, here we describe the task of speech separation. The separation and recognition performance were tested in two scenarios: on data recorded in a reverberant lab room as well as on in-car speech data. The subsequent sections present the approach in more detail in the following order: First, the preprocessing method, which consists of convolutive ICA followed by time-frequency masking, is briefly presented. The next section describes the approaches used for transforming features and variances to the domain of the speech recognizer; there, two approaches are compared, the first of which is analytic via integration whereas the second utilizes the unscented transform. Third, the modifications necessary in the speech recognizer are described. Finally, Section 5 shows the experiments that were carried out and the results thus obtained, and in Section 6, conclusions are drawn.

2. SOURCE SEPARATION

Time-frequency (TF) masking has been used successfully to separate simultaneous speaker signals, e.g. by [5]. As shown in [6], the applicability of this technique is due to the sparsity of speech signals in an appropriately chosen time-frequency representation. Thus, for a mixture of speech signals, a time-frequency mask can be found which cancels out the interfering signals and retains only the desired speaker signal, as long as the number of speakers is sufficiently small. Appropriate time-frequency masks can be found by different approaches. For example, a histogram of phase and amplitude differences is used to group the time-frequency points into speaker-specific segments in [5], and Roman and Wang use target-cancelling adaptive filters (see [7]). Here, ICA with subsequent time-frequency masking was used for source separation. For the ICA stage, a frequency domain implementation of the JADE algorithm (see [8]) is followed by beampattern-based permutation correction. The TF mask is obtained by comparing the magnitudes of the ICA outputs as detailed in [9]. At this point, in addition to the output of the ICA stage, an estimated variance is passed, which is set to the interfering signal energy when a signal is masked and to zero otherwise.
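To make this masking rule concrete, the following Python sketch masks one ICA output by magnitude comparison and returns the accompanying variance estimate. The array names, the dominance threshold and the exact choice of what counts as "interfering energy" are illustrative assumptions, not taken verbatim from [9].

import numpy as np

def tf_mask_with_variance(ica_out_desired, ica_out_interferer, threshold_db=0.0):
    # Complex STFT arrays (frequency x frames) of the two ICA outputs.
    # Keep a time-frequency point where the desired output dominates.
    mag_d = np.abs(ica_out_desired)
    mag_i = np.abs(ica_out_interferer)
    keep = 20.0 * (np.log10(mag_d + 1e-12) - np.log10(mag_i + 1e-12)) > threshold_db
    s_hat = np.where(keep, ica_out_desired, 0.0)
    # Variance estimate: interfering signal energy where the point is masked
    # out, zero where it is retained (an illustrative realization of the
    # rule described above).
    sigma2 = np.where(keep, 0.0, mag_i ** 2)
    return s_hat, sigma2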

This work was supported by DaimlerChrysler.




[Figure 1: Overview of feature transformations from spectrum to MFCC with delta and acceleration features. The block diagram shows the microphone signals X_i passing through preprocessing (ICA), magnitude computation, mel filter bank, logarithm, DCT, and delta/acceleration stages, with a mean estimate S and a variance estimate σ propagated at each stage into the final feature vector.]

3. FEATURE TRANSFORMATION
The output of the preprocessing stage, in this case of the ICA stage, consists of the estimated speech features Ŝ(ω, τ) and an associated variance estimate σ²(ω, τ). This section describes how this data is processed to obtain the static, delta and acceleration features in the mel cepstrum domain, S_cep, ΔS_cep and ΔΔS_cep, together with associated variance estimates. However, the methods used here are not limited to this specific set of recognition features but are applicable to a wide set of linear or nonlinear feature transforms. Since the transformation between the complex speech spectrum and the mel cepstrum consists of linear as well as nonlinear transformation stages, and since analytic calculations are straightforward for the linear and computationally intensive for the nonlinear transforms, the feature transformation was carried out stage by stage as shown in Figure 1. As for the linear stages of transformation, the analytic solution is easily computed: When a vector-valued Gaussian random variable m is transformed linearly to n = Tm, the mean and the covariance of the transformed variable n are obtained via

\mu_n = T \mu_m    and                                                        (1)
\Sigma_n = T \Sigma_m T^T.                                                    (2)
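A minimal sketch of this linear propagation, using a cosine-transform matrix as the linear transform T; the mel and cepstral dimensions are illustrative assumptions, not values taken from the paper.

import numpy as np

def propagate_linear(T, mu_m, cov_m):
    # Equations (1) and (2): mu_n = T mu_m, Sigma_n = T Sigma_m T^T
    return T @ mu_m, T @ cov_m @ T.T

# Illustrative usage: DCT from 23 (assumed) mel channels to 13 cepstra.
n_mel, n_cep = 23, 13
dct_matrix = np.cos(np.pi * np.outer(np.arange(n_cep), np.arange(n_mel) + 0.5) / n_mel)
mu_mel = np.random.rand(n_mel)            # stand-in log-mel mean vector
cov_mel = np.diag(np.random.rand(n_mel))  # stand-in diagonal covariance
mu_cep, cov_cep = propagate_linear(dct_matrix, mu_mel, cov_mel)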

Thus, means and variances can be computed in a straightforward way for the linear transformation stages, which are shown in shaded blocks in Figure 1. Hence, only the nonlinear stages need to be considered in more detail, namely the computation of the absolute value and of the logarithm. Two approaches have been tested for this purpose. One is via prior analytical integration over the probability distribution, which is the computationally least intensive approach but is only applicable when analytical expressions are available for the chosen speech recognizer features. Secondly, as an approximate technique, the unscented transform has been used, which can yield means and covariances after arbitrary nonlinear transformations.

3.1. Analytical Integration

When a random variable f_1 is transformed nonlinearly by f_2 = T(f_1), the resulting probability distribution is found via the cumulative distribution of f_2, using

P(f_2 < N) = \int_{f_1 : T(f_1) < N} p_{f_1}(f_1)\, df_1.                     (3)

Then, the derivative of the cumulative distribution yields the desired probability density. But since, for the proposed method, only the first two moments of the output distribution are desired, computing the entire pdf is not necessary, and the output statistics are estimated directly via

\mu_{f_2} = \int T(f_1)\, p_{f_1}(f_1)\, df_1    and                          (4)

\sigma^2_{f_2} = \int \bigl(T(f_1) - \mu_{f_2}\bigr)^2 p_{f_1}(f_1)\, df_1.   (5)

In principle, it is possible to use any kind of statistical model for describing features between transformation stages; log-normal distributions or GMMs can be treated within the same structure. Here, normal distributions parameterized by the estimated mean \mu_{f_1} and covariance \sigma^2_{f_1} were used, and the feature value probability was approximated by p_{f_1}(f_1) \approx N(f_1; \mu_{f_1}, \sigma^2_{f_1}). With this assumption, Equations 4 and 5 were explicitly expressed for all nonlinear functions in the chain of transformations. For the absolute value transformation, this resulted in an analytical expression for the mean and an approximation for the variance:

\mu_{f_2} \approx \sqrt{\mathrm{Re}(\mu_{f_1})^2 + \mathrm{Im}(\mu_{f_1})^2 + \sigma^2_{f_1}}    and    (6)

\sigma^2_{f_2} \approx \sigma^2_{f_1}\Bigl(1 - \frac{\sigma^2_{f_1}}{2\mu_{f_2}^2}\Bigr).               (7)

The effect of the logarithm was computed according to [10] via

\mu_{f_2} \approx \log(\mu_{f_1}) - 0.5\,\log\Bigl(\frac{\sigma^2_{f_1}}{\mu_{f_1}^2} + 1\Bigr)    and    (8)

\sigma^2_{f_2} \approx \log\Bigl(\frac{\sigma^2_{f_1}}{\mu_{f_1}^2} + 1\Bigr).                           (9)

However, the generality and simplicity of the proposed technique would suffer if analytical computations were required for each new set of features, i.e. for each newly used nonlinear transform. Therefore, as an alternative to carrying out analytical integrations, the unscented transform was investigated.
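As an illustration of the analytic propagation of Section 3.1, the following sketch applies the moment-matching rules per coefficient, assuming diagonal (per-bin) variances; the absolute-value expressions follow the form of Equations (6)-(7) and should be read as approximations.

import numpy as np

def abs_moments(mu_complex, var):
    # Mean and variance of the magnitude, in the form of Eqs. (6)-(7).
    mu_abs = np.sqrt(mu_complex.real ** 2 + mu_complex.imag ** 2 + var)
    var_abs = var * (1.0 - var / (2.0 * mu_abs ** 2 + 1e-12))
    return mu_abs, var_abs

def log_moments(mu, var):
    # Log-normal moment matching as in Eqs. (8)-(9), cf. Gales [10].
    ratio = var / (mu ** 2 + 1e-12) + 1.0
    return np.log(mu + 1e-12) - 0.5 * np.log(ratio), np.log(ratio)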

3.2. Unscented Transform


When a nonlinear transformation of a random variable occurs, the statistics of the output variable are related to the input statistics via Equation 3. However, when only the first two moments of the output distribution are needed, they may also be approximated by generating a set of points (called sigma points) with the same mean and covariance as the distribution of interest and by subsequently measuring the mean and covariance of the transformed set. The algorithm, first described in [11], consists of three steps:

- Given the n-dimensional distribution of preprocessing features, a set of so-called sigma points is calculated. This set is selected in such a way as to capture the second order statistics of the random variable x(t). For this purpose, the set Δ of the positive and negative columns of the scaled covariance matrix square root, ±√(n Σ_xx), is first determined. This set has zero mean and covariance Σ_xx. Then, the mean of x, x̄, is added, resulting in the set X of sigma points, X = Δ + x̄.

- The sigma points are propagated through the nonlinearity to form a set of transformed points Y = g(X).

- The second order statistics of g(x) are then approximated by the mean ȳ and covariance Σ_yy of the transformed set Y.
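A minimal sketch of these three steps, using the symmetric set of 2n sigma points built from the positive and negative columns of the matrix square root and equal weights; the absence of additional scaling parameters is an assumption of this sketch.

import numpy as np

def unscented_transform(mu, cov, g):
    # Step 1: sigma points from the columns of sqrt(n * cov), shifted by mu.
    n = mu.shape[0]
    root = np.linalg.cholesky(n * cov)
    sigma = np.hstack([mu[:, None] + root, mu[:, None] - root])  # shape (n, 2n)
    # Step 2: propagate every sigma point through the nonlinearity g.
    transformed = np.apply_along_axis(g, 0, sigma)
    # Step 3: mean and covariance of the transformed set.
    mu_y = transformed.mean(axis=1)
    centered = transformed - mu_y[:, None]
    cov_y = centered @ centered.T / (2 * n)
    return mu_y, cov_y

# e.g. propagating (positive) mel-domain statistics through the elementwise log:
# mu_log, cov_log = unscented_transform(mu_mel, cov_mel, np.log)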
4. SPEECH RECOGNITION
The above method is intended for speech recognition systems based on statistical speech models, where features in any appropriate domain, e.g. based on spectral, log-spectral or cepstral features, may be used. Here, speech recognition is carried out via a phoneme-based Gaussian mixture HMM, using MFCCs as well as their first and second derivatives. This setup was chosen for two reasons: first, because of the robustness and sparsity of the parameters, and secondly, to show that the proposed approach is capable of handling recognition features which are related to the source separation domain (here, the complex spectral domain) only by a highly nonlinear transform.

4.1. Maximum Likelihood Decoding for Data with Variances

In HMM speech recognition, the probability of a given vector of speech features is evaluated at each frame for all HMM states. For this purpose, the model states are equipped with output probability distributions denoted by b_q, where b_q(o) gives the probability that observation vector o will occur at time t when the Markov model is known to be in state q at that time, so:

b_q(o) = p(\mathrm{obs}(t) = o \mid q(t) = q).                                (10)

For the recognition of given, fixed observation vectors o, the probability distribution b_q can be evaluated for the available vector o. This is the customary computation of output probabilities, denoted by p_{o|q}(o|q). With additional information from the preprocessing stage, however, rather than only the observation o, its entire pdf p_{o|x}(o|x) is known. Thus, a new approach for calculating observation likelihoods is suggested, which makes use of all available information: the states' output probability distributions as well as the observation probability distributions obtained from the preprocessing stage. It is termed maximum likelihood decoding.

4.1.1. Maximum Likelihood Decoding

To evaluate the likelihood of an HMM state, the likelihood of the current observation p_{o|q}(o|q) given the HMM state q is combined with the likelihood of the observation given the data available from the preprocessing stages, p_{o|x}(o|x). In order to obtain the desired probability distribution p(o|x, q), Bayes's law can be applied as follows:

p(o|x,q) = \frac{p(o,x,q)}{p(x,q)} = \frac{p(o|q)\, p(x|o,q)}{\int_o p(x|o,q)\, p(o|q)\, do}.         (11)

All statistical dependencies between the microphone signals x and the HMM state q are assumed to be captured in the feature vector o, therefore p(x|o,q) = p(x|o) and

p(o|x,q) = \frac{p(o|q)\, p(x|o)}{\int_o p(x|o)\, p(o|q)\, do} = \frac{p(o|q)\, p(o|x)\, p(x)}{p(o)\int_o p(x|o)\, p(o|q)\, do}.    (12)

For optimization, terms independent of the feature vector o can be considered invariant scale factors. Defining a likelihood function \tilde{p} via

\tilde{p}(o|x,q) = \frac{p(o|q)\, p(o|x)}{p(o)} \propto p(o|x,q)              (13)

allows the problem to be simplified to

\hat{o} = \arg\max_o \tilde{p}(o|x,q) = \arg\max_o \frac{p(o|q)\, p(o|x)}{p(o)}.    (14)

Using a uniform prior for p(o), the term to be maximized is

\hat{o} = \arg\max_o p(o|q)\, p(o|x),                                         (15)

which is the product of the generative model and the recognition model. Then, for a Gaussian distribution, the optimization problem is

\hat{o} = \arg\max_o \exp\Bigl(-\tfrac{1}{2}\bigl[(o - \mu_x)^T \Sigma_x^{-1}(o - \mu_x) + (o - \mu_q)^T \Sigma_q^{-1}(o - \mu_q)\bigr]\Bigr),    (16)

which leads to the following maximum likelihood estimate \hat{o}:

\hat{o} = \bigl(\Sigma_q^{-1} + \Sigma_x^{-1}\bigr)^{-1}\bigl(\Sigma_q^{-1}\mu_q + \Sigma_x^{-1}\mu_x\bigr),                  (17)

with subscript x denoting parameters estimated in the preprocessing stage. The resulting estimated feature vector \hat{o} can be used for speech recognition in the same way as Equation 10 is applied for recognition when the features are considered given and fixed.
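A minimal sketch of this estimate for the case of diagonal covariances (a common HMM modeling assumption, not stated explicitly in the paper): Equation (17) then reduces to a per-dimension precision-weighted average of the preprocessing estimate and the state mean.

import numpy as np

def ml_feature_estimate(mu_x, var_x, mu_q, var_q):
    # Equation (17) with diagonal covariances: precision-weighted combination
    # of the preprocessing estimate (mu_x, var_x) and the HMM state model
    # (mu_q, var_q); a large var_x pulls the estimate towards the state mean.
    prec_x = 1.0 / np.maximum(var_x, 1e-12)
    prec_q = 1.0 / np.maximum(var_q, 1e-12)
    return (prec_q * mu_q + prec_x * mu_x) / (prec_q + prec_x)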

5. EXPERIMENTS AND RESULTS

5.1. Datasets

Two setups were used to generate test data. First, recordings were made in an office room with dimensions of about 10 m × 15 m × 3.5 m. Here, the reverberation time was measured at 300 ms. Speech signals from the TIDigits database were played back and recorded in two setups of loudspeakers, with the angles of incidence relative to endfire set to (45°, 115°) for configuration A and to (10°, 115°) for configuration B.

In the second dataset, recordings of the TIDigits database were made inside a Mercedes S 320 at standstill and at 100 km/h. The data was reproduced using artificial heads and recorded with a microphone array mounted in the center of the ceiling and two reference microphones. For evaluation, two recordings were used, one of a male and a female speaker and one of two male speakers. The reverberation time was between 60 and 150 ms, depending on the position of the artificial head relative to the microphone.


5.2. Results

For evaluation, the SNR and the recognition correctness and accuracy were measured for all scenarios. To measure recognition performance, the number of reference labels (N), substitutions (S), insertions (I) and deletions (D) is counted. Then, two criteria can be calculated. The correctness is the percentage of correctly hypothesized words,

\%\mathrm{Correct} = \frac{N - D - S}{N}.                                     (18)

Correctness has one major disadvantage for judging ICA performance, especially on the task of speaker/speaker separation: since it ignores insertion errors, it will not penalize clear audibility of the interfering speaker. Therefore, a second important criterion is the recognition accuracy, defined by

\%\mathrm{Accuracy} = \frac{N - D - S - I}{N}.                                (19)
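For reference, the two measures computed directly from the error counts; the counts in the commented example are purely illustrative.

def correctness_and_accuracy(n_ref, subs, dels, ins):
    # Equations (18) and (19), expressed in percent.
    correct = 100.0 * (n_ref - dels - subs) / n_ref
    accuracy = 100.0 * (n_ref - dels - subs - ins) / n_ref
    return correct, accuracy

# e.g. correctness_and_accuracy(1000, 50, 30, 40) -> (92.0, 88.0)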

At first, the signal-to-noise ratios and the recognition performance were measured for both the noisy and the preprocessed data. The results for the recognizer were obtained with a speaker-independent recognizer trained on the TIDigits database and adapted to the speaker and room. For all scenarios, the recognition rate for clean data was 100%. Results for the noisy and processed recordings are shown in Table 1.
Table 1: Average SNR, Correctness (%C) and Accuracy (%A) without and with Time-Frequency Masking for four Configurations.

                        Conf. A    Conf. B    Car 0 km/h   Car 100 km/h
  noisy       SNR       0 dB       0 dB       0 dB         0 dB @ -9.6 dB SNRI
              %C        56.1%      49.8%      50.6%        21.5%
              %A        22.1%      20.7%      37.1%        17.9%
  TF-Masked   SNR       5.5 dB     7.1 dB     15.4 dB      12.3 dB
              %C        59.3%      50.8%      48.5%        20.6%
              %A        46.1%      41.2%      40.6%        19.0%

As can be seen, in spite of the significantly improved SNR, the correctness is barely improved by ICA, and accuracy gains are only notable in some cases. This is likely due to the nonlinear distortions caused by time-frequency masking. When variance information is used to aid in decoding, the accuracy as well as the correctness improves greatly, as seen in Table 2.

Table 2: Recognition Rates of TF-Masked Speech after Maximum Likelihood Decoding.

                          Conf. A          Conf. B          Car 0 km/h       Car 100 km/h
  Analytic Integration    88.5%C  53.8%A   89.0%C  57.9%A   80.9%C  53.6%A   66.2%C  43.9%A
  Unscented Transform     87.2%C  50.4%A   86.1%C  53.4%A   80.1%C  51.2%A   68.4%C  45.9%A

6. CONCLUSIONS

Time-frequency masking can be used to separate speakers from noise as well as from interfering speech signals. But despite significant SNR gains, speech recognizer performance in many cases improves only slightly. Here, information about the applied mask has been passed to the speech recognizer in the form of feature variances, which, in combination with missing data speech recognition, significantly improves recognition rates. In order to be able to carry out speech processing and speech recognition independently in domains which may be related to each other by almost arbitrary transforms, while still passing variance values from the preprocessing to the recognition stage, it is suggested to transform the features together with their variances from the processing to the recognition domain. For this purpose, analytical integration and online computation via the unscented transform have been compared. While analytical integration gives the best results, the unscented transform shows a similar performance and is applicable without the need for prior analytic integration.

7. REFERENCES

[1] H. Glotin, F. Berthommier, and E. Tessier, "A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition," in Eurospeech, September 1999.
[2] G. Shi and P. Aarabi, "Robust digit recognition using phase-dependent time-frequency masking," in Proceedings ICME '03 Int. Conf. Multimedia and Expo, vol. 3, 2003, pp. 629-632.
[3] A. Vizinho, P. Green, M. Cooke, and L. Josifovski, "Missing data theory, spectral subtraction and signal-to-noise estimation for robust ASR: An integrated study," in Eurospeech, September 1999.
[4] M. Cooke, P. Green, L. Josifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, pp. 267-285, 2001.
[5] S. Rickard, R. Balan, and J. Rosca, "Real-time time-frequency based blind source separation," in ICA 2001, San Diego, California, 2001, pp. 651-656.
[6] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Processing, vol. 52, no. 7, pp. 1830-1847, July 2004.
[7] N. Roman and D. Wang, "Binaural sound segregation for multisource reverberant environments," in Proceedings ICASSP 2004, vol. 2, 2004, pp. 373-376.
[8] J.-F. Cardoso, "High order contrasts for independent component analysis," Neural Computation, vol. 11, pp. 157-192, 1999.
[9] D. Kolossa and R. Orglmeister, "Nonlinear postprocessing for blind speech separation," in ICA, ser. Lecture Notes in Computer Science, vol. 3195. Springer, 2004, pp. 832-839.
[10] M. Gales, Model-Based Techniques for Noise Robust Speech Recognition. PhD thesis, Cambridge University, 1996.
[11] S. Julier and J. Uhlmann, "A general method for approximating nonlinear transformations of probability distributions," Technical Report, University of Oxford, UK, 1996.
