Speech Enhancement: Chunjian Li Aalborg University, Denmark

SPEECH ENHANCEMENT
Chunjian Li
Aalborg University, Denmark
3/22/2006 Lecture notes for Speech Communications

Introduction
Applications:
- Improving quality and intelligibility (hearing
aid, cockpit comm., video conferencing ...)
- Source coding (mobile phone, video
conferencing, IP phone ...)
- Pre-processor for other speech processing
applications (speech recognition, speaker
verification ...)

Introduction
Classification 1
- Single channel
- Multi-channel
* with acoustic barrier (Adaptive Noise Cancelling)
* without acoustic barrier (Beamforming, ICA)
Classification 2
- Spectral domain methods (Power Spectral Subtraction,
Amplitude Spectral Subtraction, Autocorrelation Subtraction, Non-
causal IIR Wiener Filtering)
- Time domain methods (Kalman Filter)
- Adaptive noise cancelling
- Adaptive comb filtering

Single Channel Speech
Enhancement
Stochastic Model
- Noise process: broadband, stationary (or
short-time stationary), uncorrelated to
speech, additive.
- Speech process: short-time stationary.
- Need short-time processing
$y(n; m) = s(n; m) + d(n; m)$

Single Channel Speech
Enhancement
Important relation in the Power Spectrum
domain:
$\Phi_y(\omega; m) = \Phi_s(\omega; m) + \Phi_d(\omega; m)$
This is true only when the noise is uncorrelated with the speech signal.
* To be concise, the index m is dropped in the following discussion.

Power Spectral Subtraction
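$\hat{\Phi}_s(\omega) = \Phi_y(\omega) - \hat{\Phi}_d(\omega)$
$|\hat{S}_s(\omega)| = \sqrt{N \hat{\Phi}_s(\omega)}$
$\hat{S}_s(\omega) = |\hat{S}_s(\omega)| e^{j \varphi_y(\omega)}$   (1)
* Power Spectral Subtraction methods use the noisy phase spectrum $\varphi_y(\omega)$ to
synthesize the enhanced signal.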
Generalized Spectral
Subtraction and its variants
Generalization
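Eq. (1) can be written as:
$\hat{S}_s(\omega) = \left[ |S_y(\omega)|^a - |\hat{S}_d(\omega)|^a \right]^{1/a} e^{j \varphi_y(\omega)}$   (2)
When $a = 1$, Eq. (2) is called Amplitude Spectral Subtraction
(Boll, 1979).
Variant: Correlation subtraction
$\hat{r}_s(\eta; m) = r_y(\eta; m) - \hat{r}_d(\eta; m)$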
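
A minimal sketch of generalized spectral subtraction on one frame of DFT coefficients, assuming the noise magnitude spectrum has already been estimated. The function name `spectral_subtract`, its arguments, and the exponent name `a` are illustrative choices, not taken from the slides. Negative differences are half-wave rectified to zero, and the noisy phase is re-used.

```python
import numpy as np

def spectral_subtract(noisy_spectrum, noise_magnitude, a=2.0):
    """Generalized spectral subtraction for one frame (hypothetical helper).

    a = 2 gives power spectral subtraction, a = 1 amplitude spectral subtraction.
    """
    noisy_spectrum = np.asarray(noisy_spectrum, dtype=complex)
    noise_magnitude = np.asarray(noise_magnitude, dtype=float)
    noisy_magnitude = np.abs(noisy_spectrum)
    noisy_phase = np.angle(noisy_spectrum)
    # Subtract in the |.|^a domain, then half-wave rectify so the result is non-negative.
    difference = noisy_magnitude ** a - noise_magnitude ** a
    enhanced_magnitude = np.maximum(difference, 0.0) ** (1.0 / a)
    # The enhanced spectrum keeps the phase of the noisy observation.
    return enhanced_magnitude * np.exp(1j * noisy_phase)
```

In a complete system this would be applied frame by frame inside a short-time analysis/synthesis loop, which is omitted here.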

Comments on Spectral
Subtraction methods
Low complexity
Severe musical noise
Usually need further enhancement
- Smoothing in time and frequency; Rectification;
[Audio samples: noisy speech; Amplitude Spectral Subtraction; Power Spectral Subtraction]

Comments on Spectral
Subtraction methods
Oversuppressing and smoothing can
reduce residual noise but result in
distortion to the speech spectrum.
[Audio samples: oversuppressing ASS; oversuppressing PSS; smoothing in time]

Wiener Filtering
The non-causal infinite impulse response
Wiener filter (hereafter the non-causal IIR
Wiener filter) is regarded as a spectral
domain method, although the filtering
problem is originally posed in the time domain.
A non-causal IIR Wiener filter with AR
modeling of speech can be employed in an
iterative manner, such that signal
estimation and parameter estimation are
each based on the other.
Noncausal Wiener Filter
A linear Minimum Mean Squared Error filter:
$\hat{s}(n) = \sum_{q=-\infty}^{\infty} h(q) y(n-q) = h(n) * y(n)$
Orthogonality principle:
$E\{ [ s(n) - \sum_{q} h(q) y(n-q) ] y(n-k) \} = 0$
$\Rightarrow R_{ys}(k) = \sum_{q} h(q) R_{yy}(k-q)$   (Wiener-Hopf equation)

Noncausal Wiener Filter
Orthogonality principle (frequency domain):
$\Phi_{ys}(\omega) = H(\omega) \Phi_{yy}(\omega)$
Transfer function:
$H(\omega) = \Phi_{ys}(\omega) / \Phi_{yy}(\omega)$
MSE of the Wiener filter:
$E[(s(n) - \hat{s}(n))^2] = \sigma_s^2 - \sum_{q} h(q) R_{ys}(q)$
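
As a small illustration of the transfer function: if the noise is additive and uncorrelated with the speech (the stochastic model assumed earlier), the cross spectrum equals the speech power spectrum and the observation spectrum is the sum of the speech and noise power spectra, so the gain becomes speech PSD / (speech PSD + noise PSD). The sketch below assumes both PSDs are already available per frequency bin; the function names are hypothetical.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Non-causal Wiener gain per frequency bin, assuming uncorrelated additive noise."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return speech_psd / (speech_psd + noise_psd)

def sqrt_wiener_gain(speech_psd, noise_psd):
    """Square-root Wiener gain: the square root of the Wiener gain."""
    return np.sqrt(wiener_gain(speech_psd, noise_psd))
```

Both gains are real and lie between 0 and 1, so they scale the amplitude of each bin and leave its phase untouched.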

Comments on Noncausal WF
Requires estimates of the power
spectra of speech and noise.
Performance depends very much on
the estimates of the speech and noise
spectra.
The WF over-suppresses the speech
spectrum, which results in a muffling effect.
The WF does not process the phase spectrum.
Muffling effect caused by over-suppression
[Figure legend: blue = original; black = Wiener filter; green = square-root Wiener filter]

Roughness caused by phase noise
Because the phase spectrum is not processed, phase
coherence is lost in voiced speech. The
effect is called roughness or reverberance.
[Audio samples: clean; muffling; roughness; muffling and roughness]

Iterative Wiener Filtering
A parametric method using an all-pole
model
A sequential MAP estimator of both
speech waveform and LP coefficients.
[Lim, Oppenheim 1978]

All-pole modeling of speech
- The speech amplitude spectrum can be well modeled
by an all-pole transfer function (the vocal tract)
excited by white noise or a pulse train (the glottal
pulses). The coefficients of the all-pole model can
be found by Linear Prediction analysis and are therefore
called LP coefficients; the excitation is called the
residual.
- The LP model is minimum phase, which is
generally not the true phase of the vocal tract.
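
One common way to obtain the LP coefficients is the autocorrelation method, sketched below for a single frame. This is an illustrative choice, since the slides do not say which LP analysis variant is used. The function name and the sign convention (predictor s_hat(n) = sum_k a[k-1] s(n-k)) are assumptions made for this example.

```python
import numpy as np

def lp_coefficients(frame, order):
    """Autocorrelation-method LP analysis of one frame (hypothetical helper).

    Returns (a, residual_energy), where a holds the predictor coefficients
    and residual_energy is the energy of the prediction residual.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Unnormalized autocorrelation r(0), ..., r(order).
    r = np.array([np.dot(frame[: n - lag], frame[lag:]) for lag in range(order + 1)])
    # Toeplitz normal equations: R a = [r(1), ..., r(order)], with R[i, j] = r(|i - j|).
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], r[1:])
    residual_energy = r[0] - np.dot(a, r[1:])
    return a, residual_energy
```

The normal equations are solved directly here for brevity; the Levinson-Durbin recursion is the usual faster alternative for the same system.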
The algorithm:
1. Estimate the LP coefficients from the noisy
observation samples. Estimate the noise
spectrum during non-speech activity.
2. Estimate the signal using the non-causal IIR WF
given the current estimate of the LP coefficients and the current
estimate of the noise spectrum.
3. Re-estimate the LP coefficients given the current
estimate of the waveform.
4. Iterate until the convergence criterion is satisfied.
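
The steps above can be sketched for a single frame as follows. This is a simplified illustration and not the exact algorithm of Lim and Oppenheim. It makes several assumptions: the Wiener gain is applied to the DFT of the frame, the noise PSD is known and given in the same units as the squared DFT magnitude, and a fixed iteration count stands in for the convergence criterion. All names are hypothetical.

```python
import numpy as np

def lp_model(frame, order):
    """Predictor coefficients and residual energy (autocorrelation method)."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - lag], frame[lag:]) for lag in range(order + 1)])
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], r[1:])
    return a, max(r[0] - np.dot(a, r[1:]), 0.0)

def iterative_wiener(noisy_frame, noise_psd, order=10, n_iter=3):
    """Iterative Wiener filtering of one frame.

    noise_psd: assumed-known E[|D_k|^2] for each rfft bin of the frame.
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    n = len(noisy_frame)
    noisy_spectrum = np.fft.rfft(noisy_frame)
    estimate = noisy_frame.copy()              # step 1: start from the noisy samples
    for _ in range(n_iter):                    # step 4: fixed count as a heuristic stop
        a, gain2 = lp_model(estimate, order)   # steps 1 and 3: LP analysis
        inverse_filter = np.fft.rfft(np.concatenate(([1.0], -a)), n)
        speech_psd = gain2 / np.abs(inverse_filter) ** 2   # all-pole model spectrum
        gain = speech_psd / (speech_psd + noise_psd)       # step 2: Wiener gain
        estimate = np.fft.irfft(gain * noisy_spectrum, n)
    return estimate
```

Because the gain never exceeds 1, the output energy cannot exceed the input energy. The sketch does nothing about the convergence and sharp-formant issues that affect this method.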

Comments:
- Convergence is not guaranteed; a heuristic stopping
criterion is needed
- Results in unrealistically sharp formants and pole
jittering
- Suffers from musical noise
- Needs some kind of smoothing
[Audio samples: 10 dB noisy sample; iterative WF; iterative WF with smoothing]
Further enhancement to IWF
Constrained IWF [Hansen,Clements 1987]
Apply spectral constraint inter-frame and intra-frame
using LSP transformation.
Pole-zero modeling [Flanagan 1972]
Replace WF with Kalman filtering [Gibson
1991]
Vector quantization method [Gibson 1988]
Use HMM [Ephraim 1988]

Phase issues
The majority of noise reduction methods
process only the amplitude spectrum, while the
noisy phase spectrum is left unprocessed.
The reasons are:
- Human ears are less sensitive to phase
than to the amplitude spectrum.
- Masking of amplitude to phase (6dB/0.6rad
threshold).
For low-SNR (<6dB) sources, the noisy
phase causes roughness/reverberance.
MMSE approaches to speech
enhancement
Wiener filtering; MMSE amplitude spectrum
estimator; MMSE log-amplitude spectrum
estimator; Non-Gaussian prior MMSE
approaches.
They are the dominant technique because they
perform better than the Spectral
Subtraction methods.
They need a priori information about the speech and noise
spectra.
MMSE amplitude spectrum
estimator (Ephraim-Malah
filter)
Ephraim-Malah, 1984
The basis of the noise reduction
function of the MELPe coding standard
Consists of two parts: the Decision-Directed
method for estimating the a priori
speech spectrum, and the MMSE
Short-Time Spectral Amplitude (STSA)
estimator
MMSE STSA estimator
Assumptions:
- Stationary additive Gaussian noise with known PSD.
- An estimate of the speech spectrum is available.
- Spectral components (DFT coefficients) are statistically independent
and each follows Gaussian distribution (the DFT amplitude follows
Rayleigh distribution).
- The DFT phase follows uniform distribution and is independent of the
amplitude.
The signal model: $y(t) = x(t) + d(t)$
Let $Y_k = R_k \exp(j\vartheta_k)$, $X_k = A_k \exp(j\alpha_k)$, and $D_k$ denote the kth spectral
component of the noisy observation y(t), the signal x(t), and the
noise d(t), respectively.
MMSE STSA estimator
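With the following PDFs:
$p(Y_k | A_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left\{ -\frac{1}{\lambda_d(k)} |Y_k - A_k e^{j\alpha_k}|^2 \right\}$,
$p(A_k, \alpha_k) = \frac{A_k}{\pi \lambda_x(k)} \exp\left\{ -\frac{A_k^2}{\lambda_x(k)} \right\}$
and Bayes' rule, the estimator $\hat{A}_k$ can be shown to be:
$\hat{A}_k = E[A_k | Y_k] = \frac{\sqrt{\pi}}{2} \frac{\sqrt{v_k}}{\gamma_k} \exp\left(-\frac{v_k}{2}\right) \left[ (1 + v_k) I_0\left(\frac{v_k}{2}\right) + v_k I_1\left(\frac{v_k}{2}\right) \right] R_k$
where $I_0(\cdot)$ and $I_1(\cdot)$ denote the modified Bessel functions of zero
and first order, and $v_k$ is defined by:
$v_k = \frac{\xi_k}{1 + \xi_k} \gamma_k$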

MMSE STSA estimator
where $\xi_k$ and $\gamma_k$ are defined by:
$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}$, $\gamma_k = \frac{R_k^2}{\lambda_d(k)}$
where $\lambda_x(k) = E[|X_k|^2]$ and $\lambda_d(k) = E[|D_k|^2]$.
$\xi_k$ and $\gamma_k$ are interpreted as the a priori and a posteriori signal-to-noise
ratio, respectively.
$\xi_k$ is estimated by the Decision-Directed method.
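
The MMSE STSA gain (the factor multiplying R_k in the estimator) can be evaluated numerically as in the sketch below. To stay NumPy-only, the exponentially scaled Bessel functions are computed from the integral representation I_n(x) = (1/pi) * integral from 0 to pi of exp(x cos t) cos(n t) dt with a midpoint rule. That is a convenience chosen for this example; a special-function library would be the usual choice. Function names are hypothetical.

```python
import numpy as np

def scaled_bessel_i(order, x, n_points=2000):
    """exp(-x) * I_order(x) for x >= 0, by midpoint integration over [0, pi]."""
    x = np.asarray(x, dtype=float)
    t = (np.arange(n_points) + 0.5) * np.pi / n_points
    integrand = np.exp(x[..., None] * (np.cos(t) - 1.0)) * np.cos(order * t)
    return integrand.mean(axis=-1)

def mmse_stsa_gain(xi, gamma):
    """Gain G with A_hat = G * R, for a priori SNR xi and a posteriori SNR gamma."""
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi / (1.0 + xi) * gamma
    # exp(-v/2) is folded into the scaled Bessel functions to avoid overflow.
    bessel_term = (1.0 + v) * scaled_bessel_i(0, v / 2.0) + v * scaled_bessel_i(1, v / 2.0)
    return np.sqrt(np.pi) / 2.0 * np.sqrt(v) / gamma * bessel_term
```

At high a priori SNR the gain approaches the Wiener gain xi / (1 + xi). At low a priori SNR it is larger than the Wiener gain.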

Decision-Directed method
An estimate of the a priori SNR.
A combination of Power Spectrum
Subtraction, half-wave rectification and
inter-frame smoothing.
$\hat{\xi}_k(n) = \alpha \frac{\hat{A}_k^2(n-1)}{\lambda_d(k, n-1)} + (1 - \alpha) \max[\gamma_k(n) - 1, 0]$, $0 \le \alpha < 1$
$\alpha$ is usually chosen to be 0.98 in order to
get the best smoothing performance. The
higher $\alpha$ is, the less musical noise, but
the more distortion to the speech.
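
A minimal sketch of the decision-directed update for one frequency bin. It assumes the previous frame's amplitude estimate and noise variance are available. The function and argument names are hypothetical.

```python
def decision_directed_xi(prev_amplitude, prev_noise_psd, gamma, alpha=0.98):
    """A priori SNR estimate for the current frame of one frequency bin.

    prev_amplitude: amplitude estimate from the previous frame
    prev_noise_psd: noise variance of the previous frame
    gamma:          a posteriori SNR of the current frame
    """
    # Smoothed term: SNR implied by the previous frame's enhanced amplitude.
    previous_term = prev_amplitude ** 2 / prev_noise_psd
    # Instantaneous term: power subtraction with half-wave rectification.
    instantaneous_term = max(gamma - 1.0, 0.0)
    return alpha * previous_term + (1.0 - alpha) * instantaneous_term
```

For example, with a previous amplitude of 2, unit noise variance, and gamma = 3, the estimate is 0.98 * 4 + 0.02 * 2 = 3.96.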
Comments on the MMSE
STSA estimator
[Figure: comparison of the suppression gains of the Wiener filter and the MMSE STSA estimator]
- The instantaneous SNR can be
interpreted as the a priori SNR
estimated without smoothing.
- WF gains do not vary with the
instantaneous SNR; they vary only with
the a priori SNR, whereas the
MMSE STSA gains vary with both the
instantaneous SNR and the a priori SNR.
- When the a priori SNR is high, the
MMSE STSA estimator has gain
curves very close to the WF. When
the a priori SNR is low, the MMSE
STSA shows higher gain, which is
strongly affected by the
instantaneous SNR.
Comments on the MMSE STSA estimator
[Figure: a comparison of the suppression gains of PSS, WF and the MMSE STSA estimator, plotted against the estimated a priori SNR. Left panel: power subtraction (solid line) and Wiener filter (dashed line). Right panel: the MMSE STSA estimator; Rpost denotes the a priori SNR estimated without smoothing (the instantaneous SNR).]
Comments on the MMSE STSA estimator
The gain curve transitions smoothly between the power
subtraction curve and the Wiener curve. This transition is
controlled by the unsmoothed estimate of the a priori SNR (Rpost).
The larger Rpost, the stronger the attenuation.
This counter-intuitive behavior flattens the spurious
spectral peaks caused by the noise in the low-SNR part of the
spectrum, whereas the WF tends to sharpen the spurious peaks in
the low-SNR part of the spectrum.
The phase of the noisy speech is used as the phase of the
enhanced speech, because of the assumption of uniformly
distributed phase. An independent MMSE estimate of the
phasor has non-unity modulus and thus cannot be combined with
the MMSE STSA.
Suffers less musical noise than the WF.

MMSE Log-Spectral Amplitude
Estimator
A modification to the MMSE STSA based on the fact that a distortion
measure based on the mean-square error of the log-spectra is more
suitable for speech processing.
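Minimize the distortion measure $E[(\log A_k - \log \hat{A}_k)^2]$
The MMSE LSA estimator can be shown to be:
$\hat{A}_k = \exp(E[\ln A_k | Y_k]) = \frac{\xi_k}{1 + \xi_k} \exp\left( \frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t} dt \right) R_k$
where $v_k = \frac{\xi_k}{1 + \xi_k} \gamma_k$, and $\xi_k$ and $\gamma_k$ are the a priori SNR and the a
posteriori SNR as defined before.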
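
The MMSE LSA gain can be evaluated numerically as sketched below. The integral in the exponent is the exponential integral E1(v). To stay NumPy-only it is computed here with the substitution t = exp(s), which turns it into the integral of exp(-exp(s)) from ln v upward, truncated at an assumed upper limit `s_max` beyond which the integrand is negligible. A special-function library would be the usual choice. Names and defaults are hypothetical.

```python
import numpy as np

def exp_integral_e1(v, n_points=4000, s_max=4.0):
    """E1(v) = integral_v^inf exp(-t)/t dt for v > 0 (midpoint rule in s = ln t)."""
    log_v = np.asarray(np.log(np.asarray(v, dtype=float)))
    # Integration interval [ln v, s_max]; empty (width 0) when v >= exp(s_max).
    width = np.asarray(np.maximum(s_max - log_v, 0.0))
    u = (np.arange(n_points) + 0.5) / n_points       # midpoints on (0, 1)
    s = log_v[..., None] + width[..., None] * u
    return width * np.exp(-np.exp(s)).mean(axis=-1)

def mmse_lsa_gain(xi, gamma):
    """Gain G with A_hat = G * R, for a priori SNR xi and a posteriori SNR gamma."""
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * np.exp(0.5 * exp_integral_e1(v))
```

For large v the exponential term tends to 1, so the gain reduces to the Wiener gain xi / (1 + xi).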

MMSE Log-Spectral Amplitude
Estimator
[Figure: comparison of the suppression gains of MMSE STSA and MMSE LSA]
- The gain curves of MMSE LSA are
always lower than those of MMSE
STSA, resulting in lower residual
noise.
- When the a priori SNR is high, the
gain curve of MMSE LSA is very flat,
similar to that of the Wiener filter.
When the a priori SNR is low, the
gain curve of the MMSE LSA varies
w.r.t. the instantaneous SNR as the
MMSE STSA does.
[Audio samples: noisy sample (0 dB); Decision-Directed Wiener filter; MMSE LSA]
MMSE estimator with non-
Gaussian prior
How well does Gaussian model fit the real probability distribution of DFT
coefficients?
[Figures: histogram of speech DFT amplitude; histogram of noise (recorded from a market place) DFT amplitude]
* The histograms are taken from one hour of speech.

MMSE estimator with non-Gaussian prior
The probability density function of the DFT coefficients
of speech can be better modeled by super-Gaussian
functions (e.g. Gamma or Laplace) than by the
Gaussian function [Rainer Martin 2002, 2003].
An even more exact probability density function is
one tailored to fit the shape of the histogram of the DFT
coefficients [Lotter, Vary 2003].
Using these density functions in place of the Gaussian
density function (for speech or noise processes) in the
MMSE estimator can result in better noise reduction.
The non-Gaussian prior MMSE estimator is nonlinear and
non-zero-phase.

MMSE estimator with non-Gaussian prior
Compared with WF:
- Better output SNR (Gaussian/Gamma)
- Less musical noise (Laplace/Gamma)
- Less distortion to the speech

MMSE joint estimator for
amplitude and phase spectra
[C.-J. Li, S. V. Andersen, Inter-Frequency Dependency in MMSE Speech Enhancement, NORSIG04]

Why MMSE joint estimator?
Phase is found to be important for noise
reduction of low-SNR sources, whereas an
independent optimum amplitude estimator and an
optimum phase estimator do not coexist.
Finite frame length and temporal power localization
introduce correlation between spectral components.
This correlation can be exploited to improve the
estimate of low-SNR frequency bins.
Time localization can be modeled with the joint
MMSE estimator, but cannot be modeled by the
frequency domain Wiener filter. Time localization
indicates how much the phase is linearly related.

Formulation
Signal model: $y = F S + v$
where F is the inverse Fourier matrix, S is the vector of Fourier coefficients,
and v is a white Gaussian noise vector.
The MMSE estimator of S can be shown to be:
$\hat{S} = E(S | y) = C_S F^H (F C_S F^H + C_v)^{-1} y$
$C_S$ being the spectral covariance matrix of the signal
and $C_v$ the covariance matrix of the noise (both need to be estimated).
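
A minimal NumPy sketch of the estimator formula, assuming the two covariance matrices are already available. Here F is built with NumPy's inverse-DFT convention (including the 1/N factor), which is an assumption of this example. The function name is hypothetical.

```python
import numpy as np

def joint_mmse_spectrum(y, C_S, C_v):
    """Linear MMSE estimate of the Fourier coefficient vector S from y = F S + v.

    C_S: covariance matrix of S (N x N, may be full)
    C_v: covariance matrix of the time-domain noise v (N x N)
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Inverse DFT matrix in NumPy's convention, so that F @ S == np.fft.ifft(S).
    F = np.fft.ifft(np.eye(n), axis=0)
    F_h = F.conj().T
    C_y = F @ C_S @ F_h + C_v            # covariance of the observation y
    return C_S @ F_h @ np.linalg.solve(C_y, y)
```

When C_S is diagonal and the noise is white, this reduces to a per-bin Wiener gain applied to the DFT of y. A full C_S lets the estimate of one bin draw on the others.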

Estimating covariance matrix
Let 1/A(z) denote the transfer function of the all-pole model of speech,
r denote the LPC residual, and H denote the Toeplitz analysis matrix
consisting of the coefficients of A(z), such that:
$r = H s$
The covariance matrix of r can be written as a diagonal matrix with
the squares of the elements of r as its diagonal elements. Since $s = H^{-1} r$
and $S = F^{-1} s$, the covariance matrices
of s and S can be written respectively as:
$C_s = H^{-1} C_r H^{-H}$
$C_S = F^{-1} C_s F^{-H}$
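
The construction above can be sketched as follows, assuming s = F S with F the inverse DFT matrix in NumPy's convention, so that the inverse of F is the forward DFT matrix. The argument `a_poly` is assumed to hold the coefficients of A(z) as [1, a_1, ..., a_p], with r(n) = sum_k a_poly[k] s(n - k). All names are hypothetical.

```python
import numpy as np

def spectral_covariance(a_poly, residual):
    """Return (C_s, C_S) built from the coefficients of A(z) and the LPC residual."""
    a_poly = np.asarray(a_poly, dtype=float)
    residual = np.asarray(residual, dtype=float)
    n = len(residual)
    # Lower-triangular Toeplitz analysis matrix: H[i, j] = a_poly[i - j].
    offsets = np.subtract.outer(np.arange(n), np.arange(n))
    valid = (offsets >= 0) & (offsets < len(a_poly))
    H = np.where(valid, a_poly[np.clip(offsets, 0, len(a_poly) - 1)], 0.0)
    # Diagonal residual covariance with the squared residual samples.
    C_r = np.diag(residual ** 2)
    H_inv = np.linalg.inv(H)
    C_s = H_inv @ C_r @ H_inv.T          # H is real, so H^{-H} is the transpose
    # Forward DFT matrix, the inverse of NumPy's inverse-DFT matrix.
    F_inv = np.fft.fft(np.eye(n), axis=0)
    C_S = F_inv @ C_s @ F_inv.conj().T
    return C_s, C_S
```

Because the residual power varies along the diagonal of C_r, the resulting C_S is in general a full matrix, not a diagonal one.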

Relation between joint estimator
and other MMSE estimators
In the joint estimator, the spectral
covariance matrix $C_S$ is assumed to be a
full matrix, while the Wiener filter and the
MMSE LSA estimator assume it is a
diagonal matrix.
This allows the joint estimator to exploit the
correlation between frequency
components, which is ignored by the
frequency domain MMSE estimators.
Correlation of frequency
components
[Figure: the covariance matrix of the frequency components]

Preliminary results
The TFE-MMSE estimator preserves the signal spectrum better than the Wiener filter.
Results
[Audio samples: TFE-MMSE estimator and TFE-Kalman filtering, compared to WF and the noisy signal (10 dB)]

Problems
Show that the non-causal IIR Wiener filter
gives an estimate of the signal power
spectrum that is under-biased, while the
Square-root Wiener filter gives an unbiased
estimate of the spectrum.
Discussion: Is estimating phase important in
speech enhancement? Can phase affect
magnitude?
