Noisy speech enhancement using discrete cosine transform
Ing Yann Soon a,*, Soo Ngee Koh a, Chai Kiat Yeo b
a School of Electrical and Electronic Engineering, Nanyang Technological University, Block S2, Nanyang Avenue, Singapore 639798, Singapore
b School of Applied Science, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore
Received 19 June 1996; received in revised form 3 November 1997; accepted 30 March 1998
Abstract
This paper illustrates the advantages of using the Discrete Cosine Transform (DCT) as compared to the standard Discrete Fourier Transform (DFT) for the purpose of removing noise embedded in a speech signal. The derivation of the Minimum Mean Square Error (MMSE) filter based on the statistical modelling of the DCT coefficients is shown. Also shown is the derivation of an over-attenuation factor based on the fact that speech energy is not always present in the noisy signal at all times or in all coefficients. This over-attenuation factor is useful in suppressing any musical residual noise which may be present. The proposed methods are evaluated against the noise reduction filter proposed by Y. Ephraim and D. Malah (1984), using both Gaussian distributed white noise as well as recorded fan noise, with favourable results. © 1998 Elsevier Science B.V. All rights reserved.
Résumé (translated from the French)
This article illustrates the advantages of using the Discrete Cosine Transform (DCT) over the standard Discrete Fourier Transform (DFT) for denoising noisy speech. We show how to derive an MMSE filter from the statistical modelling of the DCT coefficients. We also show how to derive an over-attenuation factor based on the fact that, in noisy signals, speech energy is not always present at every instant or in every coefficient. This over-attenuation factor is useful for suppressing any musical residual noise. The proposed methods were evaluated favourably against the noise reduction filter proposed by Ephraim and Malah (1984), using both Gaussian white noise and recorded fan noise. © 1998 Elsevier Science B.V. All rights reserved.
Keywords: Speech enhancement; MMSE amplitude estimation; Noise removal; Discrete cosine transform (DCT)
of non-corrected phase on the speech is discussed in some detail in [12]. In [12], it was noted that if the phase is out by more than $\pi/8$, the speech becomes rough. If the phase is replaced by random noise uniformly distributed between $-\pi$ and $\pi$, a rough and completely unvoiced speech is obtained. On the other hand, if the phase is replaced by zero, the reconstructed speech sounds completely voiced and monotonous. Therefore it is not correct to view the phase as totally unimportant, and especially for high levels of additive noise, the reconstructed speech quality will be affected. For the DCT, the coefficients are real and can be considered to have a binary phase value. The phase depends only on the sign of the coefficient. This provides a better noise margin: unless the added noise changes the sign of the coefficient, the phase is unchanged. Therefore if strong speech energy is present in a particular coefficient, it is unlikely that the phase will be corrupted. If the noise energy is much higher than the speech energy in a particular coefficient, resulting in an erroneous phase, the coefficient will be highly attenuated, thus minimizing the effect of the erroneous phase. It is therefore likely that the DCT would perform better than the DFT.

To determine the upper bound of short-time amplitude estimation using the DFT as compared to using the DCT, another experiment is carried out. In this experiment, Gaussian distributed white noise is added to clean speech, which is then divided into 50% overlapping frames. The DFT is performed on the frames and the magnitude of the noisy frequency component is replaced by the minimum of the magnitudes of the clean frequency component and the noisy frequency component. The reason behind this is that all transform based noise suppression methods are basically attenuation schemes. If the presence of noise results in a frequency component having a lower magnitude, it is unlikely that the noise suppression filter can increase the magnitude. The best estimate possible would be the noisy magnitude. However, if noise causes the amplitude of the frequency component to be higher, an ideal filter should lower the amplitude to the original speech amplitude. On the other hand, the noisy phase components are left unchanged. The ideally filtered magnitude is then combined with the noisy phase components to obtain the ideal filtered frequency component. The optimal reconstructed speech signal is then reconstructed using the
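The upper-bound experiment described here can be sketched for a single frame as follows. This is only an illustrative sketch under stated assumptions: the signal is a synthetic two-tone stand-in for speech, the function names (`dct_matrix`, `ideal_dft_frame`, `ideal_dct_frame`, `snr_db`) are hypothetical, an orthonormal DCT-II is assumed for the DCT, and the 50% overlapping framing and reconstruction steps are omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C: C @ x is the DCT, C.T @ X the inverse."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def ideal_dft_frame(clean, noisy):
    """DFT upper bound: min(|X|, |Y|) as magnitude, combined with the noisy phase."""
    X, Y = np.fft.rfft(clean), np.fft.rfft(noisy)
    mag = np.minimum(np.abs(X), np.abs(Y))
    return np.fft.irfft(mag * np.exp(1j * np.angle(Y)), n=len(noisy))

def ideal_dct_frame(clean, noisy, c):
    """DCT upper bound: min(|X|, |Y|) as magnitude, combined with the noisy sign."""
    X, Y = c @ clean, c @ noisy
    mag = np.minimum(np.abs(X), np.abs(Y))
    return c.T @ (mag * np.sign(Y))

def snr_db(ref, est):
    """Plain SNR in dB of an estimate against a reference signal."""
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

# Synthetic frame (assumption): two sinusoids plus white Gaussian noise.
rng = np.random.default_rng(0)
n = 256
t = np.arange(n)
clean = np.sin(2 * np.pi * 5.3 * t / n) + 0.5 * np.sin(2 * np.pi * 17.7 * t / n)
noisy = clean + 0.5 * rng.standard_normal(n)
c = dct_matrix(n)

snr_noisy = snr_db(clean, noisy)
snr_dft = snr_db(clean, ideal_dft_frame(clean, noisy))
snr_dct = snr_db(clean, ideal_dct_frame(clean, noisy, c))
print(f"noisy: {snr_noisy:.2f} dB, ideal DFT: {snr_dft:.2f} dB, ideal DCT: {snr_dct:.2f} dB")
```

Because the ideal magnitude never exceeds the noisy magnitude and the phase (or sign) is left untouched, neither bound can do worse than the unprocessed frame; how much each improves on it depends on the signal and the noise level.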
I.Y. Soon et al. / Speech Communication 24 (1998) 249–257
$$y(t) = x(t) + n(t). \tag{4}$$

Also, let the transformed signals of the clean speech, noisy speech and noise be denoted by $X(k)$, $Y(k)$ and $N(k)$, respectively, where $k$ denotes the position of the coefficient in the transform domain. With the assumption that the DCT transformed coefficients are statistically independent, the Minimum Mean Square Error (MMSE) estimated amplitude $\hat{X}(k)$ can be obtained from $Y(k)$ as follows:

$$\hat{X}(k) = E\{X(k) \mid Y(k)\}, \tag{5}$$

where $E\{\cdot\}$ denotes the expectation operator. Eq. (5) can be rewritten, using Bayes' theorem, as

$$\hat{X}(k) = \frac{\int_{-\infty}^{\infty} a_k \, p\{Y(k) \mid a_k\} \, p\{a_k\} \, da_k}{\int_{-\infty}^{\infty} p\{Y(k) \mid a_k\} \, p\{a_k\} \, da_k}, \tag{6}$$

where $p\{\cdot\}$ denotes the probability density function (PDF), and $a_k$ is a dummy variable representing all possible values of $X(k)$.

Under the Gaussian distribution assumptions, $p\{Y(k) \mid a_k\}$ and $p\{a_k\}$ are given by the following equations:

$$p\{Y(k) \mid a_k\} = \frac{1}{\sqrt{2\pi\lambda_n(k)}} \exp\left\{ -\frac{(Y(k) - a_k)^2}{2\lambda_n(k)} \right\}, \tag{7}$$

$$p\{a_k\} = \frac{1}{\sqrt{2\pi\lambda_x(k)}} \exp\left\{ -\frac{a_k^2}{2\lambda_x(k)} \right\}, \tag{8}$$

where $\lambda_x(k) = E\{|X(k)|^2\}$ and $\lambda_n(k) = E\{|N(k)|^2\}$. Substituting Eqs. (7) and (8) into Eq. (6), $\hat{X}(k)$ can be easily shown to be given by

$$\hat{X}(k) = \frac{\xi(k)}{\xi(k) + 1} \, Y(k), \tag{9}$$

where

$$\xi(k) = \frac{\lambda_x(k)}{\lambda_n(k)}. \tag{10}$$

$\xi(k)$ is known as the a priori SNR by some authors. The above derivation shows that the Wiener filter is the MMSE amplitude estimator for the real transform case.

To use the above formula, both $\lambda_n$ and $\lambda_x$ have to be known. The value of $\lambda_n$ will be assumed to be known in this paper. Methods for estimating $\lambda_n$ are covered in some detail in [4,5]. However, an accurate estimation of $\lambda_x$ is more difficult to achieve. This paper uses the approach known as Decision Directed Estimation developed by Ephraim and Malah [3] to estimate $\lambda_x$. The superiority of this estimator is covered in some detail in [15]. The estimate $\hat{\lambda}_x$ for $\lambda_x$ is given by the following equation:

$$\hat{\lambda}_x(k) = \alpha \, \hat{\lambda}_x(k)_p + (1 - \alpha) \max\{Y(k)^2 - \lambda_n(k), 0\}, \tag{11}$$

where $\max\{\cdot\}$ is the maximum function used to ensure that a non-negative value is obtained as an estimate. $\hat{\lambda}_x(k)_p$ is the estimated value of $\lambda_x$ in the previous frame, while $\alpha$ is a constant which can be adjusted to achieve the best result.

The value of $\alpha$ is set to 0.98 in the computer simulations of the filters. Smaller values of $\alpha$ (e.g. 0.8) are found to result in a higher level of musical tone in the residual noise. On the other hand, if $\alpha$ is set to 1, severe distortions in the speech signals are heard. This observation agrees with that in [3]. The effect of varying $\alpha$ is discussed in detail in [16], which states that the value of $\alpha$ has to be greater than 0.9 in order to counter the musical noise effect and 0.98 is considered a reasonable value for $\alpha$. The same value of $\alpha$ is used in [15].

4. Further reduction of residual noise using uncertainty of signal presence

The derivation of the MMSE filter in Section 3 is based on the assumption that speech signal energy is always present in the sampled speech data. However, it should be noted that even in the presence of speech, the signal energy is unlikely to be significant for all the transform coefficients. Insignificant signal energy can be treated as absence of speech. Furthermore, actual speech data also consist of periods of silence. Both [3] and [5] have taken note of this and modified their filters accordingly. It was emphasized in [5] that most noise filtering algorithms are inadequate when speech is absent, hence additional attenuation should be applied during periods in which speech is absent. The residual noise of commonly used
spectral subtraction, power subtraction and other algorithms tends to be musical in nature, and is considered to be very annoying to some users [6]. Using further attenuation as suggested by this section helps to reduce the residual noise.

One means of doing so is to scale the filter output down by the conditional probability of speech being present given the received spectral amplitude, $Y(k)$. Let the input be represented by two states, $H_0$ and $H_1$, where

$H_0$: speech absent;
$H_1$: speech present.

The modified filter output, $A(k)$, taking into account the conditional probability of speech presence given $Y(k)$, is then given by

$$A(k) = P(H_1 \mid Y(k)) \, \hat{X}(k). \tag{12}$$

The approach is logical, since when the value of $P(H_1 \mid Y(k))$ approaches one, the filter reverts to the original filter, while when $P(H_1 \mid Y(k))$ approaches zero, the filter produces a zero output. Using Bayes' theorem, the conditional probability is given as follows:

$$P(H_1 \mid Y(k)) = \frac{P(H_1) \, P(Y(k) \mid H_1)}{P(H_1) \, P(Y(k) \mid H_1) + P(H_0) \, P(Y(k) \mid H_0)}. \tag{13}$$

The conditional probabilities $P(Y(k) \mid H_1)$ and $P(Y(k) \mid H_0)$ can be obtained from the Gaussian statistical model.

$$P(Y(k) \mid H_0) = \frac{1}{\sqrt{2\pi\lambda_n(k)}} \exp\left( -\frac{Y(k)^2}{2\lambda_n(k)} \right), \tag{14}$$

$$P(Y(k) \mid H_1) = \frac{1}{\sqrt{2\pi(\lambda_x(k) + \lambda_n(k))}} \exp\left( -\frac{Y(k)^2}{2(\lambda_n(k) + \lambda_x(k))} \right). \tag{15}$$

If an assumption is made that the probabilities of $H_0$ and $H_1$ are approximately equal, Eq. (13) can be simplified to the following:

$$P(H_1 \mid Y(k)) = \frac{1}{1 + \sqrt{1 + \xi(k)} \, \exp\left( -\dfrac{\xi(k) \, Y(k)^2}{2(\lambda_n(k) + \lambda_x(k))} \right)}. \tag{16}$$

The conditional probability derived above is used to further attenuate the estimated amplitude.

5. Results and discussions

A total of eight sets of speech data taken from the TIMIT database are used in our evaluation. Four of the sentences are spoken by female speakers while the remaining sentences are by male speakers. The duration of the sentences ranges from 5 to 10 s. The speech data used are sampled at 8 kHz and quantized to 16 bits.

The proposed enhancement algorithms are tested on the speech data corrupted by two different types of additive noise. The first type of noise is the widely used Gaussian white noise. It was reported that this type of noise is more difficult to remove than any other noise source. Attempts at removing white noise usually produce an irritating residual noise. The second type of noise added to the speech is recorded fan noise.

The noisy speech data are then divided into frames, each of which consists of 256 samples with an overlap of 192 samples with the neighbouring frame. Hanning windowing is then performed on each frame before it is enhanced individually. The final enhanced speech is reconstructed from the enhanced frames using the weighted overlap and add technique [11]. The overall block diagram of the filter is shown in Fig. 3.

The segmental signal to noise ratio (SEGSNR) is used as an objective test method in the evaluation of the speech enhancement schemes. SEGSNR is chosen over other objective measures such as Signal to Noise Ratio (SNR) and MSE, because it seems to have better agreement with listening tests. The SNR and MSE measures are normally dominated by the higher energy voiced portion. On the other hand, the use of SEGSNR, which is the average of all the SNR in non-overlapping segments, gives the unvoiced portion its proper weighting.

$$\mathrm{SEGSNR} = \frac{1}{n} \sum_{n} 10 \log_{10} \frac{\sigma^2_{f_x}}{\sigma^2_{f_{x - \hat{x}}}}, \tag{17}$$
where $n$ is the total number of non-silence frames. The classification of the frames is performed manually using the clean speech.

Both the DCT based speech enhancement filters described in Sections 3 and 4 are implemented and their results are compared with those of the Ephraim and Malah filter (EMF) [3]. The DCT based Wiener filter of Section 3 will hereafter be known as DCTF, while that of Section 4 will be known as DCTF2. The results are given in Tables 1 and 2.

A study of the results shows that DCTF outperforms EMF in the objective test for all except one particular speech sample with low added noise power. The improvement in SEGSNR is significant, especially for input SNR smaller than 0 dB. The superior performance is also more noticeable for actual recorded fan noise than for white Gaussian noise. This is to be expected since the superior energy compaction property applies to both noise and speech. If the noise is also concentrated in fewer coefficients, it is also easier to suppress.
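The per-frame processing behind the two filters compared here can be sketched as follows. This is a minimal sketch rather than the authors' implementation: it assumes NumPy, a noise variance that is already known for each coefficient, and DCT coefficients that have already been computed for the frame. The function names (`dctf_frame`, `speech_presence_prob`, `dctf2_frame`) and the numerical inputs are hypothetical. DCTF applies the Wiener gain driven by the decision-directed variance estimate (Eqs. (9)-(11)), and DCTF2 additionally scales the output by the speech presence probability under equal priors (Eqs. (12) and (16)).

```python
import numpy as np

def dctf_frame(Y, lam_n, lam_x_prev, alpha=0.98):
    """Wiener estimate for one frame of noisy DCT coefficients Y.

    lam_n      : noise variance per coefficient (assumed known)
    lam_x_prev : speech variance estimate carried over from the previous frame
    Returns the estimate X_hat and the updated speech variance estimate.
    """
    # Decision-directed estimate of the speech variance, floored at zero.
    lam_x = alpha * lam_x_prev + (1.0 - alpha) * np.maximum(Y ** 2 - lam_n, 0.0)
    xi = lam_x / lam_n                  # a priori SNR
    X_hat = xi / (xi + 1.0) * Y         # Wiener gain applied to the noisy coefficient
    return X_hat, lam_x

def speech_presence_prob(Y, lam_n, lam_x):
    """P(H1 | Y) for zero-mean Gaussian models, assuming P(H0) = P(H1)."""
    xi = lam_x / lam_n
    ratio = np.sqrt(1.0 + xi) * np.exp(-xi * Y ** 2 / (2.0 * (lam_n + lam_x)))
    return 1.0 / (1.0 + ratio)

def dctf2_frame(Y, lam_n, lam_x_prev, alpha=0.98):
    """Wiener estimate further attenuated by the speech presence probability."""
    X_hat, lam_x = dctf_frame(Y, lam_n, lam_x_prev, alpha)
    return speech_presence_prob(Y, lam_n, lam_x) * X_hat, lam_x

# Hypothetical inputs: three coefficients with unit noise variance.
Y = np.array([0.2, 1.5, 6.0])
lam_n = np.ones(3)
lam_x_prev = np.array([0.0, 1.0, 30.0])

X_hat, lam_x = dctf_frame(Y, lam_n, lam_x_prev)
A, _ = dctf2_frame(Y, lam_n, lam_x_prev)
print("DCTF :", X_hat)
print("DCTF2:", A)
```

Since the speech presence probability lies strictly between 0 and 1, the DCTF2 output is never larger in magnitude than the DCTF output; the extra attenuation is strongest for coefficients with a small magnitude relative to the estimated variances.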
Table 1
Segmental SNR tests for white noise corrupted speech (segmental SNR in dB)

Speech   Unprocessed   EMF       DCTF     DCTF2
fd          6.271      11.927    11.817   11.270
fb          2.897       8.614     9.102    8.739
fa          1.772       7.708     8.077    7.52
fc          1.182       8.06      9.407    9.483
mb          3.744       7.78      7.824    7.357
ma         -2.735       3.28      4.032    3.849
mc         -5.02        1.874     2.438    2.241
md        -10.166      -0.0686    1.929    2.087
Table 2
Segmental SNR tests for fan noise corrupted speech (segmental SNR in dB)

Speech   Unprocessed   EMF       DCTF     DCTF2
fa         -1.045      11.342    13.688   13.318
fd         -2.782      10.337    12.930    8.739
mc        -13.76       -0.5062    3.373    3.592
md        -15.309      -2.737     1.941    2.449
fc        -15.631      -1.057     5.103    6.188
mb        -19.176      -4.681     0.867    1.728
fb        -20.08       -5.101    -1.566    2.749
ma        -21.994      -6.99     -0.04     0.95
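A segmental SNR of the kind reported in Tables 1 and 2 can be computed along the following lines. This is a hedged sketch only: it assumes NumPy, time-aligned clean and enhanced signals, a segment length of 256 samples (the segment length used for the measure is not stated here), and a boolean mask `active` standing in for the manual non-silence classification. The function name `segsnr_db` and the test signals are hypothetical.

```python
import numpy as np

def segsnr_db(clean, enhanced, frame_len=256, active=None, eps=1e-12):
    """Average of per-segment SNRs (dB) over non-overlapping segments.

    active : optional boolean sequence, one entry per segment; segments
             marked False (e.g. silence) are left out of the average.
    eps    : small constant guarding against division by zero.
    """
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        if active is not None and not active[i]:
            continue
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        snrs.append(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))
    return float(np.mean(snrs))

# Hypothetical check: a constant signal with a constant error of 0.1
# has a signal-to-error energy ratio of 100 in every segment.
clean = np.ones(512)
enhanced = clean + 0.1
value = segsnr_db(clean, enhanced)
print(f"SEGSNR = {value:.2f} dB")
```

Averaging the per-segment values in dB, rather than pooling the energies first, is what keeps low-energy unvoiced segments from being swamped by high-energy voiced ones.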
256 I.Y. Soon et al. / Speech Communication 24 (1998) 249±257
The results based on SEGSNR show that both the DCTF and DCTF2 outperform EMF, especially for the case of fan noise. However, for cleaner speech, DCTF is better, while for very noisy speech, the additional attenuation provided by DCTF2 gives a better result. EMF does not produce any musical tones in the residual noise, which is approximately white. However, the remaining background noise is still significant. The output from the DCTF has significantly less residual noise than that produced by the EMF. This is especially the case for high noise situations. However, the residual noise using the DCTF sounds slightly less uniform. DCTF2 introduces slightly more distortion for high SNR speech, but as the noise power increases, it becomes superior to the DCTF. Generally, speech processed using DCTF2 sounds much cleaner for higher degrees of white noise, as compared to DCTF or EMF. However, differences between DCTF and DCTF2 are less noticeable for the added fan noise.

Fig. 4 illustrates the original clean speech segment from the file fb. Fig. 5 shows the same speech segment with fan noise added. Figs. 6–8 show the results of enhancement using the different algorithms (EMF, DCTF, DCTF2). The y axes of Figs. 4–8 show the amplitude of the signal, while the x axes show time in seconds. The audio files are made available for listening (see http://www.elsevier.nl/locate/specom). The audio files have been converted to 8 bit mulaw sun format from the original 16 bit raw form.