1. INTRODUCTION
Speech enhancement or source separation is carried out using statistical speech models. This results in an estimate of speech features with associated variances in the required time-frequency representation. While this approach is also applicable to denoising, here we describe the task of speech separation. The separation and recognition performance were tested in two scenarios, on data recorded in a reverberant lab room as well as on in-car speech data. In the subsequent sections, the approach is presented in more detail in the following order: First, the preprocessing method, which consists of convolutive ICA followed by time-frequency masking, is briefly presented. The next section describes the approaches used for transforming features and variances to the domain of the speech recognizer; there, two approaches are compared, the first being analytic via integration, whereas the second utilizes the unscented transform. Third, the modifications necessary in the speech recognizer are described. Finally, Section 5 shows the experiments that were carried out and the results thus obtained, and in Section 6, conclusions are drawn.
2. SOURCE SEPARATION
Time-frequency (TF) masking has been used successfully to separate simultaneous speaker signals, e.g. by [5]. As shown in [6], the applicability of this technique is due to the sparsity of speech signals in an appropriately chosen time-frequency representation. Thus, for a mixture of speech signals, a time-frequency mask can be found which cancels out the interfering signals and retains only the desired speaker signal, as long as the number of speakers is sufficiently small. Appropriate time-frequency masks can be found by different approaches. For example, a histogram of phase and amplitude differences is used to group the time-frequency points into speaker-specific segments in [5], and Roman and Wang use target-cancelling adaptive filters (see [7]). Here, ICA with subsequent time-frequency masking was used for source separation. For the ICA stage, a frequency-domain implementation of the JADE algorithm (see [8]) is followed by beampattern-based permutation correction. The TF mask is obtained by comparing the magnitudes of the ICA outputs, as detailed in [9]. At this point, in addition to the output of the ICA stage, an estimated variance is passed on, which is then propagated through the subsequent feature transformation stages.
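As a rough illustration of this masking step, the mask can be derived by comparing the magnitudes of the two ICA outputs at every time-frequency point. The following sketch is illustrative only; the 6 dB threshold and the array layout are assumptions, not the exact procedure of [9]:

```python
import numpy as np

def tf_mask_separation(ica_out1, ica_out2, threshold_db=6.0):
    """Binary TF masking after ICA.

    ica_out1, ica_out2: complex STFT arrays (frequencies x frames) of the
    two ICA output channels. A TF point is kept for the first source where
    its magnitude exceeds that of the second by at least threshold_db.
    Returns the masked STFT of the desired speaker and the mask itself.
    """
    eps = 1e-12  # avoid log of zero in silent TF bins
    ratio_db = 20.0 * np.log10((np.abs(ica_out1) + eps) /
                               (np.abs(ica_out2) + eps))
    mask = (ratio_db > threshold_db).astype(float)
    return mask * ica_out1, mask
```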
[Figure 1: Overview of feature transformations from the spectrum to MFCCs with delta and acceleration features. The block diagram shows the microphone signals X_i entering the preprocessing (ICA) stage; its output S(ω, τ) and variance σ²(ω, τ) are propagated through the magnitude, mel filterbank, log and DCT stages, followed by delta and acceleration computation, to form the feature vector with associated variances.]
3. FEATURE TRANSFORMATION
The output of the preprocessing stage, in this case the ICA stage, consists of the estimated speech features S(ω, τ) and an associated variance estimate σ²(ω, τ). This section describes how this data is processed to obtain the static, delta, and acceleration features in the mel cepstrum domain, S_cep, ΔS_cep and ΔΔS_cep, together with associated variance estimates. However, the methods used here are not limited to this specific set of recognition features but are applicable to a wide set of linear or nonlinear feature transforms. Since the transformation between the complex speech spectrum and the mel cepstrum consists of linear as well as nonlinear transformation stages, and since analytic calculations are straightforward for the linear but computationally intensive for the nonlinear transforms, the feature transformation was carried out stage by stage, as shown in Figure 1.
Two methods were investigated for this purpose. One is prior analytical integration over the probability distribution, which is the computationally least intensive approach but is only applicable when analytical expressions are available for the chosen speech recognizer features. Secondly, as an approximate technique, the unscented transform has been used, which can yield means and covariances after arbitrary nonlinear transformations.

In general, when a random variable f₁ with density p_{f₁}(f₁) is passed through a transformation T(·), the distribution of the output f₂ = T(f₁) is given by

P(f₂ < N) = ∫_{f₁: T(f₁) < N} p_{f₁}(f₁) df₁.    (3)

As for the linear stages of transformation, the analytic solution is easily computed: when a vector-valued Gaussian random variable m is transformed linearly to n = Tm, the mean and the covariance of the transformed variable n are obtained via

μ_n = T μ_m    (4)

and

Σ_nn = T Σ_mm Tᵀ.    (5)

For the magnitude stage, f₂ = |f₁| with complex Gaussian f₁, the transformed mean and variance are approximated by

μ_{f₂} = √( Re(μ_{f₁})² + Im(μ_{f₁})² + σ²_{f₁} )    (6)

and

σ²_{f₂} = σ²_{f₁} ( 1 − σ²_{f₁} / (2 μ²_{f₂}) ).    (7)

For the logarithm stage, f₂ = log(f₁), assuming a lognormally distributed input, moment matching yields

μ_{f₂} = log(μ_{f₁}) − ½ log( σ²_{f₁}/μ²_{f₁} + 1 )    (8)

and

σ²_{f₂} = log( σ²_{f₁}/μ²_{f₁} + 1 ).    (9)
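The stage-wise analytic propagation of Equations (4) to (9) can be summarized in a short sketch. This is a simplified illustration under stated assumptions only: diagonal covariances are used, and the matrix T stands for any linear stage such as the mel filterbank or the DCT.

```python
import numpy as np

def propagate_linear(mu, var, T):
    """Eqs. (4), (5) for a linear stage n = T m with diagonal covariance."""
    mu_n = T @ mu
    var_n = (T ** 2) @ var  # diagonal of T Sigma T^T when Sigma is diagonal
    return mu_n, var_n

def propagate_magnitude(mu_complex, var):
    """Eqs. (6), (7): moments of |f1| for a complex Gaussian f1."""
    mu_abs = np.sqrt(mu_complex.real ** 2 + mu_complex.imag ** 2 + var)
    var_abs = var * (1.0 - var / (2.0 * mu_abs ** 2))
    return mu_abs, var_abs

def propagate_log(mu, var):
    """Eqs. (8), (9): moments of log(f1) under a lognormal assumption.

    Requires mu > 0, which holds for magnitude-domain features.
    """
    var_log = np.log(var / mu ** 2 + 1.0)
    mu_log = np.log(mu) - 0.5 * var_log
    return mu_log, var_log
```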
However, the generality and simplicity of the proposed technique would suffer if analytical computations were required for each new set of features, i.e. for each newly used nonlinear transform. Therefore, as an alternative to carrying out analytical integrations, the unscented transform was investigated.
In principle, the transformed distribution could be computed via Equation (3). However, when only the first two moments of the output distribution are needed, they may also be approximated by generating a set of points (called sigma points) with the same mean and covariance as the distribution of interest, and by subsequently measuring the mean and covariance of the transformed set. The algorithm, first described in [11], consists of three steps:

1. Given the n-dimensional distribution of the features from the preprocessing stage, a set of so-called sigma points is calculated. This set is selected in such a way as to capture the second-order statistics of the random variable x(t). For this purpose, the set χ of all columns of ±√(n Σ_xx) is first determined. This set has zero mean and covariance Σ_xx. Then, the mean of x, μ_x, is added, resulting in the set X of sigma points, X = χ + μ_x.

2. The sigma points are propagated through the nonlinearity to form a set of transformed points Y = g(X).

3. The second-order statistics of g(x) are then approximated by the mean μ_y and covariance Σ_yy of the transformed set Y.
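A minimal sketch of these three steps, assuming a generic nonlinearity g and the symmetric, unscaled set of 2n sigma points described above:

```python
import numpy as np

def unscented_transform(mu_x, cov_xx, g):
    """Approximate mean and covariance of g(x) for x ~ N(mu_x, cov_xx)."""
    n = mu_x.shape[0]
    # 1. Sigma points: all columns of +/- sqrt(n * cov_xx), plus the mean.
    root = np.linalg.cholesky(n * cov_xx)  # one valid matrix square root
    chi = np.concatenate([root.T, -root.T], axis=0)  # 2n points, zero mean
    sigma_points = chi + mu_x
    # 2. Propagate each sigma point through the nonlinearity.
    Y = np.array([g(p) for p in sigma_points])
    # 3. Measure mean and covariance of the transformed set.
    mu_y = Y.mean(axis=0)
    cov_yy = (Y - mu_y).T @ (Y - mu_y) / (2 * n)
    return mu_y, cov_yy
```

For the transformation stages of Figure 1, g would be the corresponding nonlinearity, e.g. the elementwise magnitude or logarithm.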
4. SPEECH RECOGNITION
The above method is intended for speech recognition systems based on statistical speech models, where features in any appropriate domain, e.g. based on spectral, log-spectral or cepstral features, may be used. Here, speech recognition is carried out via a phoneme-based Gaussian mixture HMM, using MFCCs as well as their first and second derivatives. This setup was chosen for two reasons: first, because of the robustness and sparsity of the parameters, and secondly, to show that the proposed approach is capable of handling recognition features which are related to the source separation domain (here, the complex spectral domain) only by a highly nonlinear transform.
For the recognition of given, fixed observation vectors o, the output probability distribution of an HMM state q can be evaluated for the available vector o,

b_q(o) = p_{o|q}(o|q).    (10)

This is the customary computation of output probabilities. With additional information from the preprocessing stage, however, rather than only the observation o, its entire pdf p_{o|x}(o|x) is known. Thus, a new approach for calculating observation likelihoods is suggested, which makes use of all available information, the states' output probability distributions as well as the observation probability distributions obtained from the preprocessing stage, and is termed maximum likelihood decoding.
4.1.1. Maximum Likelihood Decoding
To evaluate the likelihood of an HMM state, the likelihood of the current observation p_{o|q}(o|q) given the HMM state q is combined with the likelihood of the observation given the data available from the preprocessing stages, p_{o|x}(o|x). In order to obtain the desired posterior of the observation, Bayes' rule gives

p(o|x, q) = p(o|q) p(x|o) / ∫ p(x|o) p(o|q) do = p(o|q) p(o|x) p(x) / ( p(o) ∫ p(x|o) p(o|q) do ).    (15)

With Gaussian models for both the state output distribution, N(o; μ_q, Σ_q), and the observation distribution from preprocessing, N(o; μ_x, Σ_x), the most likely observation is

ô = arg max_o exp( −½ [ (o − μ_x)ᵀ Σ_x⁻¹ (o − μ_x) + (o − μ_q)ᵀ Σ_q⁻¹ (o − μ_q) ] ),    (16)

which is maximized by

ô = ( Σ_x⁻¹ + Σ_q⁻¹ )⁻¹ ( Σ_x⁻¹ μ_x + Σ_q⁻¹ μ_q ),    (17)

with subscript x denoting parameters estimated in the preprocessing stage. The resulting estimated feature vector ô can be used for speech recognition in the same way as Equation (10) is applied for recognition when the features are considered given and fixed.
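Equation (17) can be evaluated directly; the sketch below assumes diagonal covariances, as is common for HMM output densities, and its interface is illustrative rather than the implementation used here:

```python
import numpy as np

def ml_decode_observation(mu_x, var_x, mu_q, var_q):
    """Mode of the product of N(mu_x, var_x) and N(mu_q, var_q), Eq. (17).

    mu_x, var_x: feature mean/variance from the preprocessing stage;
    mu_q, var_q: mean/variance of the HMM state (mixture component).
    With diagonal covariances all operations are elementwise.
    """
    precision = 1.0 / var_x + 1.0 / var_q
    o_hat = (mu_x / var_x + mu_q / var_q) / precision
    return o_hat
```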
5. EXPERIMENTS
5.1. Datasets
Two setups were used to generate test data. First, recordings were made in an office room with dimensions of about 10 m × 15 m × 3.5 m. Here, the reverberation time was measured at 300 ms. Speech signals from the TIDigits database were played back and recorded in two setups of loudspeakers, with the angles of incidence relative to endfire set to (45°, 115°) for configuration A and to (10°, 115°) for configuration B.

For the second dataset, recordings of the TIDigits database were made inside a Mercedes S 320, at standstill and at 100 km/h. The data was reproduced using artificial heads and recorded with a microphone array mounted in the center of the ceiling and two reference microphones. For evaluation, two recordings were used, one of a male and a female speaker and one of two male speakers. The reverberation time was between 60 and 150 ms, depending on the position of the artificial head relative to the microphone.
5.2. Results
For evaluation, the SNR and the recognition correctness and accuracy were measured for all scenarios.

To measure recognition performance, the number of reference labels (N), substitutions (S), insertions (I) and deletions (D) is counted. Then, two criteria can be calculated. The correctness is the percentage of correctly hypothesized words,

%Correct = (N − D − S) / N.    (18)

Correctness has one major disadvantage for judging ICA performance, especially on the task of speaker/speaker separation: since it ignores insertion errors, it will not penalize clear audibility of the interfering speaker. Therefore, a second important criterion is the recognition accuracy, defined by

%Accuracy = (N − D − S − I) / N.    (19)
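Both criteria follow directly from the error counts, as in the following trivial helper mirroring Equations (18) and (19):

```python
def correctness_and_accuracy(N, S, I, D):
    """Word correctness and accuracy in percent, Eqs. (18) and (19)."""
    correct = 100.0 * (N - D - S) / N
    accuracy = 100.0 * (N - D - S - I) / N
    return correct, accuracy
```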
First, the signal-to-noise ratios and the recognition performance were measured for both the noisy and the preprocessed data. The recognition results were obtained with a speaker-independent recognizer trained on the TIDigits database and adapted to the speaker and room. For all scenarios, the recognition rate for clean data was 100%. Results for the noisy and processed recordings are shown in Table 1.
Table 1: Average SNR, correctness (%C) and accuracy (%A) without and with time-frequency masking for four configurations.

                           noisy                          TF-masked
                  SNR               %C      %A      SNR       %C      %A
Conf. A           0 dB              56.1%   22.1%   5.5 dB    59.3%   46.1%
Conf. B           0 dB              49.8%   20.7%   7.1 dB    50.8%   41.2%
Car 0 km/h        0 dB              50.6%   37.1%   15.4 dB   48.5%   40.6%
Car 100 km/h      0 dB @ −9.6 dB SNRI  21.5%   17.9%   12.3 dB   20.6%   19.0%
Subsequently, the two feature transformation methods, analytic integration and the unscented transform, were compared in terms of recognition performance. The results are shown in Table 2.

Table 2: Correctness (%C) and accuracy (%A) for feature transformation via analytic integration and via the unscented transform.

                        Config. B          Car 0 km/h         Car 100 km/h
Analytic Integration    89.0%C  57.9%A     80.9%C  53.6%A     66.2%C  43.9%A
Unscented Transform     86.1%C  53.4%A     80.1%C  51.2%A     68.4%C  45.9%A
6. CONCLUSIONS

7. REFERENCES

[1] H. Glotin, F. Berthommier, and E. Tessier, "A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition," in Proc. Eurospeech, September 1999.

[2] G. Shi and P. Arabi, "Robust digit recognition using phase-dependent time-frequency masking," in Proc. ICME '03 Int. Conf. Multimedia and Expo, vol. 3, 2003, pp. 629-632.