Publication page: https://www.researchgate.net/publication/221489930
Authors (3, including): Weifeng Li (Tsinghua University), Kazuya Takeda (Nagoya University)
All content following this page was uploaded by Kazuya Takeda on 09 October 2014.
1. Introduction
Noise robustness is one of the most challenging and important problems in automatic speech recognition (ASR) today. State-of-the-art speech recognition systems are known to perform reasonably well when a close-talking microphone is worn near the mouth of the speaker. In applications where a close-talking microphone is neither desirable nor practical, a distant microphone is required. However, as the distance between the speaker and the microphone increases, the recorded signal becomes more susceptible to distortion from background noise, which severely degrades the performance of ASR systems. This problem can be greatly alleviated by using multiple microphones to capture the speech signals. Among the various multi-microphone approaches, microphone arrays are commonly used for data collection and speech enhancement. In real car environments, traditional beamforming is difficult to apply because neither the speaker position nor the noise location is fixed [1]. Recently, adaptive beamforming approaches, such as the Generalized Sidelobe Canceller (GSC) [2], have become attractive for their dynamic parameter adjustment. However, a persistent problem with microphone arrays has been their poor low-frequency directivity for practical array dimensions [3]. Also, problems such as
2. Regression Methods
Let $\mathbf{x}_m(t)$ denote the feature vector (e.g., log-spectrum or cepstrum) for the $m$th distant microphone at frame $t$, and let $x_m^{(k)}(t)$ denote its $k$th element. Let $\mathbf{s}(t)$ denote the feature vector collected from the close-talking microphone, and let $s^{(k)}(t)$ denote its $k$th element for frame $t$. Let $\hat{\mathbf{s}}(t)$ denote the estimated feature vector obtained from the feature vectors of five distant microphones. Each element of $\hat{\mathbf{s}}(t)$ is approximated independently.
(1)
(2)
where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers for the $i$th support vector $\mathbf{x}_i$. The Lagrange multipliers ($\alpha_i$, $\alpha_i^*$) and the support vectors are found by solving the dual optimization problem in support vector learning [6]. $K(\cdot,\cdot)$ denotes the kernel function, for which we use the Gaussian kernel in the experiments.
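Following the standard support vector regression expansion in [6], the estimate for an input vector $\mathbf{x}$ can be written as $f(\mathbf{x}) = \sum_i (\alpha_i - \alpha_i^*) K(\mathbf{x}_i, \mathbf{x}) + b$. The sketch below evaluates that expansion with a Gaussian kernel. The kernel parameterization `gamma`, the bias `b`, and the toy support vectors and coefficients are assumptions made for illustration, not values from the experiments.

```python
import math


def gaussian_kernel(u, v, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||u - v||^2); gamma is an assumed parameterization."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i^* for support vector i.
    """
    return bias + sum(
        coef * gaussian_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Hypothetical toy model: each support vector holds one feature element
# observed at five distant microphones.
support_vectors = [[1.0, 1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0]]
dual_coefs = [0.5, -0.25]
estimate = svr_predict(
    [1.0, 1.0, 1.0, 1.0, 1.0], support_vectors, dual_coefs, bias=0.1, gamma=0.5
)
```

Only the support vectors contribute to the sum, so the cost of one prediction grows with their number rather than with the size of the training set.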
For multi-layer perceptron regression (MR), a network with one hidden layer of 8 neurons is used. The $k$th element of the feature vector is estimated by
(3)
where $\tanh(\cdot)$ is the hyperbolic tangent activation function. The parameters are found by minimizing the mean squared error over the training examples using a gradient descent algorithm [7].
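A minimal sketch of such a network for a single feature element, assuming the common one-hidden-layer form $\hat{y} = b + \sum_{p=1}^{8} w_p \tanh(b_p + \sum_{m} w_{p,m} x_m)$ with the five microphone values as inputs. The per-example update schedule, the weight initialization, the learning rate, and the toy data are assumptions for illustration and are not taken from the experiments.

```python
import math
import random


def init_mlp(n_inputs, n_hidden=8, seed=0):
    """Small random weights for a one-hidden-layer network with a scalar output."""
    rng = random.Random(seed)
    return {
        "w_hid": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                  for _ in range(n_hidden)],
        "b_hid": [0.0] * n_hidden,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": 0.0,
    }


def forward(params, x):
    """Return (hidden activations, output) for one input vector."""
    hidden = [
        math.tanh(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(params["w_hid"], params["b_hid"])
    ]
    output = params["b_out"] + sum(w * h for w, h in zip(params["w_out"], hidden))
    return hidden, output


def mse(params, inputs, targets):
    """Mean squared error over a set of training examples."""
    return sum(
        (forward(params, x)[1] - y) ** 2 for x, y in zip(inputs, targets)
    ) / len(inputs)


def train(params, inputs, targets, learning_rate=0.01, n_iterations=200):
    """Gradient descent on the squared error, updating after each example (assumed)."""
    for _ in range(n_iterations):
        for x, y in zip(inputs, targets):
            hidden, output = forward(params, x)
            err = output - y  # derivative of 0.5 * err^2 with respect to the output
            for p, h in enumerate(hidden):
                # Back-propagate through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
                grad_pre = err * params["w_out"][p] * (1.0 - h * h)
                params["w_out"][p] -= learning_rate * err * h
                params["b_hid"][p] -= learning_rate * grad_pre
                for m, xi in enumerate(x):
                    params["w_hid"][p][m] -= learning_rate * grad_pre * xi
            params["b_out"] -= learning_rate * err
    return params


# Hypothetical toy problem: learn the mean of five inputs.
rng = random.Random(1)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(40)]
targets = [sum(x) / len(x) for x in inputs]
params = init_mlp(n_inputs=5)
error_before = mse(params, inputs, targets)
train(params, inputs, targets)
error_after = mse(params, inputs, targets)
```

Because each element of the feature vector is approximated independently, one such network would be trained per element.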
[Table: driving conditions. Driving environments: idling, city, expressway. In-car states: normal, CD player on, air-conditioner on low level, air-conditioner on high level, window (near the driver) open.]
[Figure: block diagram. Labels: MDM, CTK, F. A., regression models, estimated feature vectors, feature transform, HMMs, recognition.]
[Figure: recognition performance. Vertical axis: correct [%] (70 to 100). Conditions: CTK-CTK, CTK-LRB, CTK-LRC, LRB-LRB, LRC-LRC, DST-LRB, DST-LRC, DST-DST.]
3.3. Nonlinear regression for in-car speech recognition compared to adaptive beamforming
Next, we perform condition-dependent nonlinear regression to obtain the estimated test data. The SVM regression method (Equation (2)) and the MLP regression method (Equation (3)) are considered. The three SVM parameters are set to (10, 5, 0.5). The learning rate and the number of iterations are set to 0.001 and 1000, respectively. The regression performance is evaluated by the signal-to-deviation ratio (SDR), which is defined as
(4)
where $\mathbf{s}(t)$ is the reference feature vector from the close-talking microphone for frame $t$, and $T$ denotes the number of frames in one utterance. The SDR is averaged over all utterances. Figure 4 shows the SDR values obtained using the three regression methods.
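As an illustration of the evaluation, the sketch below computes an SDR in dB per utterance and averages it over utterances. It assumes the usual ratio form, $10 \log_{10}$ of the summed squared norm of the reference vectors over the summed squared norm of the deviation between reference and estimate; the function names and the toy data are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """SDR in dB for one utterance.

    Each argument is a list of per-frame feature vectors of equal shape.
    Assumed form: 10 * log10(sum ||s(t)||^2 / sum ||s(t) - s_hat(t)||^2).
    A perfect estimate gives zero deviation and is not handled here.
    """
    signal = sum(v * v for frame in reference for v in frame)
    deviation = sum(
        (r - e) ** 2
        for ref_frame, est_frame in zip(reference, estimate)
        for r, e in zip(ref_frame, est_frame)
    )
    return 10.0 * math.log10(signal / deviation)


def mean_sdr_db(utterances):
    """Average the per-utterance SDR over (reference, estimate) pairs."""
    return sum(sdr_db(ref, est) for ref, est in utterances) / len(utterances)


# Hypothetical two-frame utterance with three feature elements per frame.
reference = [[3.0, 4.0, 0.0], [0.0, 0.0, 5.0]]
estimate = [[3.0, 4.0, 1.0], [0.0, 0.0, 5.0]]
value = sdr_db(reference, estimate)
```

A higher SDR means the estimated feature vectors lie closer to those of the close-talking microphone.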
The adaptive microphone array approach is attractive for speech enhancement and speech recognition (e.g., [8]). For comparison, we apply the Generalized Sidelobe Canceller (GSC) [2] to our in-car speech recognition task. Four linearly spaced microphones (9 to 12 in Figure 1) are used. The architecture of the GSC used is shown in Figure 5. The three FIR filters are adapted sample-by
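The GSC structure of [2] can be sketched for four channels as follows: a fixed beamformer produces ybf(n), a blocking matrix produces three reference signals u1(n) to u3(n), and three adaptive FIR filters form ya(n), which is subtracted from the delayed ybf(n) to give the output yo(n). Several choices here are assumptions rather than details from the text: the channels are taken to be already time-aligned toward the speaker, the fixed beamformer is a plain average, the blocking matrix takes adjacent-channel differences, and the filters are updated with normalized LMS. The parameters `n_taps`, `delay`, `mu`, and `eps` are hypothetical.

```python
import math


def gsc(channels, n_taps=8, delay=4, mu=0.1, eps=1e-8):
    """Griffiths-Jim style GSC for time-aligned channels; returns the samples yo(n)."""
    n_mics = len(channels)
    n_samples = len(channels[0])
    # Fixed beamformer: average of the (already steered) channels.
    y_bf = [sum(ch[n] for ch in channels) / n_mics for n in range(n_samples)]
    # Blocking matrix: adjacent-channel differences cancel the aligned target,
    # leaving n_mics - 1 noise reference signals.
    refs = [
        [channels[j][n] - channels[j + 1][n] for n in range(n_samples)]
        for j in range(n_mics - 1)
    ]
    weights = [[0.0] * n_taps for _ in refs]
    output = []
    for n in range(n_samples):
        # Tap-delay lines u_j(n), u_j(n-1), ... with zeros before the start.
        taps = [
            [ref[n - k] if n - k >= 0 else 0.0 for k in range(n_taps)]
            for ref in refs
        ]
        y_a = sum(w * u for ws, us in zip(weights, taps) for w, u in zip(ws, us))
        y_o = (y_bf[n - delay] if n >= delay else 0.0) - y_a
        # Sample-by-sample normalized LMS update (assumed adaptation rule).
        power = eps + sum(u * u for us in taps for u in us)
        for ws, us in zip(weights, taps):
            for k in range(n_taps):
                ws[k] += mu * y_o * us[k] / power
        output.append(y_o)
    return output


# With identical, noise-free channels the references are zero, so the output
# is the target delayed by `delay` samples.
target = [math.sin(0.3 * n) for n in range(50)]
clean_output = gsc([target, target, target, target], delay=4)
```

The delay on the fixed-beamformer path lets the FIR filters act on reference samples both before and after the sample being cleaned.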
[Figure 4: SDR [dB] (axis 18 to 23) for LRB, SRB, MRB, and DST.]
[Figure 5: GSC architecture. Labels: inputs x1(n) to x4(n); weights w1 to w4; delay; ybf(n); blocking matrix; u1(n) to u3(n); FIR 1 to FIR 3; ya(n); output yo(n).]
[Figure: recognition performance. Vertical axis: correct [%] (70 to 100). Conditions: DST-LRB, DST-SRB, DST-MRB, ABF-ABF, DST-DST.]
Figure 6: Averaged word recognition performance using the regression approaches and the adaptive beamforming approach.
4. Summary
In this work, we have proposed nonlinear regression of the log-spectra for in-car speech recognition using multiple distant microphones. The results of our studies have shown that the proposed method obtains a good approximation to the speech of a close-talking microphone. Its effectiveness is also demonstrated by the improvement in word recognition accuracy in 15 driving conditions. Other speech enhancement methods may be combined with the proposed method to further improve recognition accuracy in noisy environments.
5. References
[1] Y. Kaneda and J. Ohga, Adaptive microphone-array system for noise reduction, IEEE Trans. on ASSP, vol. 34, no. 6, pp. 1391-1400, 1986.
[2] L. J. Griffiths and C. W. Jim, An Alternative Approach to Linearly Constrained Adaptive Beamforming, IEEE Trans. on Antennas and Propagation, vol. AP-30, no. 1, pp. 27-34, Jan. 1982.
[3] I. McCowan, D. Moore, and S. Sridharan, Near-field adaptive beamformer with application to robust speech recognition, Digital Signal Processing: A Review, vol. 12, no. 1, pp. 87-106, Jan. 2002.
[4] W. Herbordt, H. Buchner, and W. Kellermann, An acoustic human-machine front-end for multimedia applications, European Journal on Applied Signal Processing, vol. 2003, no. 1, pp. 1-11, Jan. 2003.
[5] T. Shinde, K. Takeda, and F. Itakura, Multiple Regression of Log-Spectra for In-Car Speech Recognition, Proc. of ICSLP, pp. 797-800, 2002.
[6] A. J. Smola, P. J. Bartlett, and B. Schölkopf, A tutorial on support vector regression, NeuroCOLT2 Technical Report NC2-TR-1998-030, 1998.
[7] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
[8] X. Zhang and J. H. L. Hansen, A Constrained Switched Adaptive Beamforming for Speech Enhancement & Recognition in Real Car Environments, IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 733-745, Nov. 2003.