Вы находитесь на странице: 1из 11

IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO.

5, JULY 2001 513

A Real-Time Implementation of a Stereophonic


Acoustic Echo Canceler
Peter Eneroth, Steven L. Gay, Tomas Gänsler, Member, IEEE, and Jacob Benesty, Member, IEEE

Abstract—Teleconferencing systems employ acoustic echo only cause increased calculation complexity, but also a new fun-
cancelers to reduce echoes that result from the coupling between damental problem of the solution, as we will see.
loudspeaker and microphone. To enhance the sound realism, Four mono AECs, straightforwardly implemented in the stereo
two-channel audio is necessary. However, stereophonic acoustic
echo cancellation (SAEC) is more difficult to solve because of the case, not only would have to track changing echo paths in the re-
necessity to uniquely identify two acoustic paths, which becomes ceiving room but also in the transmission room! For example,
problematic since the two excitation signals are highly correlated. the canceler has to reconverge if one talker stops talking and an-
In this paper, a wideband stereophonic acoustic echo canceler is other starts talking at a different location in the transmission room.
presented. The fundamental difficulty of stereophonic acoustic There is no adaptive algorithm that can track such a change suf-
echo cancellation is described and an echo canceler based on a fast
recursive least squares (FRLS) algorithm in a subband structure, ficiently fast and this scheme therefore results in poor echo sup-
with equidistant frequency bands, is proposed. The structure has pression. Thus, a generalization of the mono AEC in the stereo
been used in a real-time implementation, with which experiments case does not result in satisfactory performance.
have been performed. In this paper, simulation results of this The theory explaining the problem of SAEC was first described
implementation on real life recordings, with 8 kHz bandwidth, are in an early paper [2] and later in [3] and [4]. The fundamental
studied. The results clearly verify that the theoretic fundamental
problem of SAEC also applies in real-life situations. They also problem is that the two channels usually carry linearly related sig-
show that more sophisticated adaptive algorithms are needed in nals which in turn make the normal equations to be solved by the
the lower frequency regions than in the higher regions. adaptive algorithm singular. This implies that there is no unique
Index Terms—Real-time implementation, recursive least squares solution to the equations but an infinite number of solutions and
(RLS), stereophonnic acoustic echo cancellation, subband filtering. it can be shown that all (but the physically true) solutions depend
on the transmission room. In [4], it is also shown that the only so-
lution to the nonuniqueness problem is to reduce the correlation
I. INTRODUCTION between the stereo signals from the transmission room and an ef-
ficient low-complexity method for this purpose was also given.
I N conferencing systems, such as teleconferencing and
desktop conferencing, acoustic echo cancelers (AECs) are
needed to reduce the echo that results from the acoustic cou-
Lately, attention has been focused on the investigation of other
methods that decrease the cross-correlation between the channels
pling between the loudspeaker and the microphone. The AEC in order to get well behaved estimates of the echo paths [5]–[8].
identifies the echo path and simultaneously reduces the echo The main problem is how to reduce the correlation sufficiently
by means of adaptive filtering. If the conferencing system has without affecting the stereo perception and the sound quality.
dual audio channels in each direction, the classical monophonic Even though these methods may improve the SAECRs’ ability
AECs will not provide sufficient echo suppression, and more to find the true solution, the normal equations to be solved are
sophisticated stereophonic acoustic echo cancelers (SAECRs) still ill-conditioned. The standard normalized least mean square
are needed. In this paper, we will show the fundamental (NLMS) adaptive algorithm is known to converge slowly in
problem of stereophonic acoustic echo cancellation (SAEC), these situations. More sophisticated algorithms such as the affine
possible solutions, and propose a structure that has proven to projection algorithm (APA) or the recursive least squares (RLS)
perform well in a real-time implementation. that are less affected by a high condition number, are preferred
In a stereophonic conferencing system, spatial audio infor- in SAECRs. The combination of four adaptive filters per AEC
mation is also transmitted. Not only will the listener get a more and sophisticated adaptive algorithms results in high calculation
realistic sound, but the listener will also be able to aurally lo- complexity, imposing the need for a subband structure.
calize the speaker at the other end. Studies have shown that In the following, we will explain the fundamental problem
this improves perception, especially when speech from several with SAECRs and possible solutions. We will propose a
speakers overlap [1]. However, there are now four acoustic echo high-performance SAECR structure, that has been verified in
paths to identify, two to each microphone (Fig. 1). This will not a real-time implementation. Finally we will show and discuss
Manuscript received June 18, 1999; revised December 8, 2000. The associate
results from real-life recordings using the proposed structure.
editor coordinating the review of this manuscript and approving it for publica-
tion was Dr. Michael S. Brandstein. II. PROBLEMS AND SOLUTIONS OF STEREOPHONIC ACOUSTIC
P. Eneroth is with the Department of Applied Electronics, Lund University,
ECHO CANCELLATION
22100 Lund, Sweden (e-mail: peter.eneroth@tde.lth.se).
S. L. Gay, T. Gänsler, and J. Benesty are with the Bell Laboratories, Lucent In stereophonic acoustic echo cancellation, there are four in-
Technologies, Murray Hill, NJ 07974-0636 USA (e-mail: slg@bell-labs.com;
gaensler@bell-labs.com; jbenesty@bell-labs.com). dependent transmission paths between the two microphones and
Publisher Item Identifier S 1063-6676(01)04981-1. the two loudspeakers (Fig. 1). The impulse responses of all four
1063–6676/01$10.00 © 2001 IEEE
514 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001

where is the estimated cross-correlation vector and


is the estimated correlation matrix

(8)

The challenging problem with stereophonic acoustic echo can-


cellation lies in the condition number of this matrix. If we define
the misalignment as and
, it can be shown that
[4]

Fig. 1. Schematic diagram of stereophonic acoustic echo cancellation. is singular


is ill-conditioned
echo paths need to be estimated by the echo canceler. Usually
the two transmission room signals, and , originate (9)
from the same source, and are therefore highly correlated. Be-
cause of this, it is difficult to estimate the impulse responses, where the two latter statements in (9) require not to be
and . The circumstances under which conver- singular. Equation (9) is valid in the situation where no noise is
gence to the true echo paths of a SAECR is achieved has been added to the microphone signal (see Fig. 1).
thoroughly analyzed in [4]. A problem formulation and a sum- As shown in [4] and (9), the tails of the impulse responses both
mary of methods to reduce the correlation between the channels in the transmission and receiving rooms play a key role. Thanks to
are given in the following. the impulse response tails in the transmission room, we can obtain
a unique solution to the normal equation. However, because of the
A. Problem Formulation impulse response tails in the receiving room, we have potentially
Assume that the transmission room microphone signals are large misalignment. We assume of course that and
given by (Fig. 1) , since this is the realistic case to be dealt with. Theoretically,
and are infinitely long, but the normal reverberation time in
an office room is approximately 0.3 s.
(1) There are two ways to decrease the misalignment. The first
way is to use longer adaptive filters, but commensurately, the
where is the source signal in the transmission room and adaptive algorithm becomes very slow in terms of convergence
, are the transmission room echo paths of length speed and is more expensive to implement in terms of memory,
. The symbol “*” denotes convolution. For simplicity, we will arithmetic complexity, etc. Moreover, the solution is not robust
only study one return path from the receiving room to the trans- in the sense that it is ill-conditioned and still sensitive to trans-
mission room, but similar remarks will be valid for the other mission room changes. A second way, the practical approach,
path. The residual echo for this channel, , after the EC is is to decorrelate partially (or in totality) the two input signals.
The difficulty is to decorrelate the signals without decreasing
(2) the signal quality.
(3)
B. Decorrelation Methods
(4)
(5) The most straight-forward method to reduce the correlation
between two channels is probably to add independent random
Here, are the true responses of length of the re- noise to each channel, . This was described in
[3], but it is also pointed out that in order to sufficiently reduce
ceiving room and are the estimated responses
the correlation, the noise-level had to be greater than the level
of length . The symbol denotes the Hermitian transposition
of the maximum nonperceptible noise.
operator. Minimization of the weighted least squares criterion
To reduce the perceived distortion it would be preferable if the
decorrelating signal is similar to the original signal. But the core
(6) problem in SAEC is that the two channels are linearly related,
i.e., adding a signal that is linearly related to the original signal
will not reduce the correlation between the two channels. In [4],
results in solving the system of linear equations [9] it is suggested that a nonlinearly processed source signal should
be added to the source signal itself. It was found that adding a
half-wave rectified signal to the original signal performed well
(7)
in addition to having a simple and low-complexity structure.
ENEROTH et al.: REAL-TIME IMPLEMENTATION OF A STEREOPHONIC ACOUSTIC ECHO CANCELER 515

This can be expressed as

(10)

(11)

where determines the amount of added distortion. It has been


found that between 0.3 and 0.5 decreases the channel correla-
tion significantly and that the distortion is hardly audible in an
office environment [10]. The stereo perception is not affected.
In systems where the transmission path between the trans-
mission and the receiving rooms includes an audio codec, cer- Fig. 2. Stereophonic subband acoustic echo canceler. In total four analysis
tain coders will decorrelate the channels. In [5], the influence filterbanks decompose the transmission room and receiving room signals. In
that a perceptual audio coder, MPEG layer III [11], has on a each subband, the two-path two-channel FRLS algorithm cancels the major part
of the echo. The residual echo is reduced even further by the suppressor before
SAECR is analyzed. It is shown that the coder can decorrelate reconstructing the fullband error signal, e(n).
the signal because nonperceivable quantization noise is added
to the source signal. How efficiently the coder decorrelates the
signal depends on what features are used for compression. For also serves as a double-talk detector. Due to the high calcula-
example, advanced stereo coders usually operate in a joint stereo tion complexity, the adaptive filters are performed in a subband
mode, where two correlated channels are coded jointly. This structure, illustrated in Fig. 2.
can actually increase the correlation between the channels, and
should not be used if the coder also has to serve as decorrelator. A. Fast Recursive Least-Squares Adaptive Algorithm
Traditional perceptual audio coders achieve better audio A complete analysis of the FRLS algorithm is beyond the
quality for a given compression ratio than speech coders scope of this article. However, a general analysis of the RLS al-
when the source signal contains music, whereas speech coders gorithm can be found in [9] and stabilized two-channel versions
perform better on pure speech signals. In an attempt to combine are described in [13] and [15]. Nevertheless, the definition of
the best of the two worlds, coders, that softly switch mode the specific version of the two-channel FRLS used in the imple-
depending on the source signal, have been proposed. The mentation is given below.
decorrelating properties of one such coder, the MTPC coder From Fig. 1 the following quantities must be defined (see
[12], is also shown in the examples. (12)–(14) at the bottom of the page) where
length of the adaptive filters;
III. PROPOSED STRUCTURE FOR THE STEREOPHONIC coefficient number 1 of channel 1;
ACOUSTIC ECHO CANCELER coefficient number 1 of channel 2, etc.
In this section, all vital parts of the real-time implemented Note that the channels of the filter and state vector are
SAECR are presented. First of all, a decorrelator is needed to interleaved in this algorithm. The complete two-channel FRLS
reduce the correlation between the two transmission room sig- adaptive filter is given in Table I, where the following quantities
nals, and . In the system we have chosen to use the are used:
half-wave rectifier, presented in the previous section.
Even after the decorrelator, finding the correct echo paths is Forward and backward prediction
still not a well-conditioned problem. Therefore, the two-channel coefficient matrices;
RLS algorithm, which has shown great promise in SAEC ap- Forward and backward prediction
plications [13], was chosen for the adaptive filters. This algo-
error energy matrices
rithm has very fast convergence rate, even for signals with a
large eigenvalue spread of the correlation matrix. The two main Forward and backward prediction
disadvantages with the RLS algorithm are the high calculation error vectors;
complexity and stability problems with nonstationary signals, Kalman gain vector;
such as speech. The stability is improved by monitoring the
Inverse conversion factor;
state of the RLS algorithm, and to reinitialize parameters when
they become unstable. A two-path structure [14] is used to fur- Stabilization parameter;
ther reduce the effects of an unstable algorithm. This structure Forgetting factor.

(12)
(13)
(14)
516 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001

TABLE I
STEREOPHONIC FRLS ALGORITHM FOR COMPLEX ARITHMETIC. THE
CONJUGATE AND THE HERMITIAN TRANSPOSITION OPERATOR ARE DENOTED
AND , RESPECTIVELY. VARIABLES ARE DEFINED IN SECTION III.A. THE
PREDICTION PART IS COMPLETE, WHEREAS THE FILTERING PART ONLY
DESCRIBES ONE RETURN PATH

Fig. 3. Two-path adaptive filtering.

is used only to estimate the impulse response . Then, it has


to be determined if this new estimate is better than a previous
estimate, denoted . If the new estimate is better, is updated.
The output signal from the echo canceler, , is calculated in
the two-path structure using . It should be noted that in the im-
plementation, the two-path structure is adapted in the subbands,
and that the decision made in one subband is independent of the
states in all other subbands.
The most crucial condition to be met for a filter update is
that the short-time residual echo energy in the adaptive filter,
, is less than the residual echo energy in the two-path
structure, , multiplied with a fixed value

(15)

where denotes the channel number.

C. Subband Filterbank Design


The main reason for using a subband scheme is the reduc-
tion of calculation complexity, but other positive effects include
increased stability of the adaptive filter, because fewer taps are
adapted in each subband, and a structure that allows for efficient
implementations on parallel systems. The condition number of
the correlation matrix in each subband is also reduced, resulting
in increased convergence rate for the LMS class of adaptive fil-
ters. The two biggest disadvantages are the transmission path
In this version, stability is improved with the stability parameter delay that is introduced (exemplified in Table II) and possible
. But for operation on nonstationary signals, like speech-sig- aliasing due to downsampling. In [16] it has been shown that if
nals, further enhancements are needed. First of all, by moni- critical downsampling is used, i.e., if we have the same down-
toring , it is possible to detect if the algorithm is about to be- sampling ratio as number of subbands , aliasing will signif-
come unstable. If this is the case, the parameters in the pre- icantly decrease the performance of the adaptive filters. There-
diction part are reset to their start values, while the adaptive fore noncritical downsampling, i.e., , in conjunction with
filter estimate, , can be left unchanged. A suitable initial value filters that have good stopband attenuation were chosen. General
for and is whereas the energy estimates, discussions of filterbanks can be found in numerous books and
and could be initialized with a recursive estimate articles [17]–[20], but since the emphasis has been on critical
of the speech energy. During the time between restart and until downsampled filterbanks, an efficient structure with noncritical
the algorithm has reconverged, echo cancellation may be poor. downsampling is shown in the Appendix. Methods on how to
A two-path structure, presented next, improves performance in design prototype filters are also discussed.
these situations.
D. Computational Complexity
B. Two-Path Adaptive Filter In this section, we calculate the number of real valued mul-
In situations of large disturbances, for example double-talk, tiplications and additions the adaptive filter and the filterbanks
or if the adaptive filter becomes unstable, the filter may diverge need per fullband sample period. The Fourier transform in the
from a good estimate. If the estimate diverges, it would be better filterbank and most parts of the adaptive filter (FRLS) are per-
to use an earlier filter estimate until the adaptive filter has recon- formed with complex arithmetic. In this analysis, multiplication
verged. This is the purpose of the two-path adaptive filter struc- between two complex numbers is counted as four real multipli-
ture [14], illustrated in Fig. 3. In this structure, the adaptive filter cations and two real additions.
ENEROTH et al.: REAL-TIME IMPLEMENTATION OF A STEREOPHONIC ACOUSTIC ECHO CANCELER 517

TABLE II and the number of additions .


CALCULATION COMPLEXITY COMPARISON GIVEN AS NUMBER OF 10 REAL In Table II, complexity examples with different number of sub-
VALUED MULTIPLICATIONS PER SECOND AT 16 kHz SAMPLE PERIOD.
CORRESPONDING FULLBAND IMPULSE RESPONSE LENGTH L = 3168, bands are given. Three system configurations are considered:
DOWNSAMPLING RATE r = (3=4)M , NUMBER OF NON-CAUSAL TAPS PER Two-channel FRLS in all subbands, NLMS in all subbands, and
SUBBAND C = 5 [21]. IN THE HYBRID NLMS/FRLS SYSTEM, FRLS IS
USED IN THE LOWER HALF OF THE SUBBANDS, AND NLMS IN THE UPPER HALF
finally, FRLS in half of the subbands, and NLMS in the other
half.

IV. SIMULATIONS
In order to validate the system in a controlled manner and
verify the necessity to decorrelate the stereo signals, simula-
tions have been performed on data recorded in HuMaNet room
B [23]. First, a recording representing the transmission room
was performed. In this recording, the excitation signal was a
high-quality speech signal recorded in an anechoic chamber.
Two loudspeakers were used, representing two speakers at dif-
ferent positions, in order to create signals with spatial changes in
the transmission room. Also, signals representing the receiving
room were recorded, using the transmission room signal above
as excitation signal. The signals were recorded at a sampling
rate of 16 kHz and the average SNR (echo-to-noise ratio) was
approximately 40 dB, i.e., a very low background noise level.
The number of multiplications needed for the real-valued In the simulations in this section, the echo canceler had the fol-
two channel FRLS [13] is , and the number of additions is lowing settings:
, where is the length of the adaptive filter. This includes
number of subbands: ;
calculation of the residual signals for both channels. The
downsampling rate: ;
subband signals are complex valued, and the complex version
number of estimated impulse response taps in each sub-
of the two channel FRLS is given in Table I. This algorithm
band: ;
needs real-valued multiplications and real
number of corresponding fullband impulse response taps:
additions, where denotes the length of the adaptive filter
;
in the subbands. Echo cancellation in the subband structure
number of possible fullband noncausal taps [21]: 300.
described in previous sections, needs adaptive filters,
As shown in Section II-A, the echo canceler problem has an
Appendix A, but due to downsampling they are updated
infinite number of solutions when the two input signals, and
times the fullband rate. The length of the adaptive filters are
in Fig. 1, are linearly related. The magnitude coherence func-
, where compensates for the noncausal
tion
taps [21]. The total number of real multiplications per fullband
sample is and the number of additions
. (16)
The analysis, (26), and synthesis, (35), filterbanks include
two parts, polyphase filtering and fast Fourier transformation. is a measure of how correlated the two signals are [4], where
The polyphase filtering uses multiplications and additions shows that the two signals are completely linearly
per filterbank, where is the length of the prototype filter. related to each other. That is, the SAECR will have difficul-
Since the input signal of the Fourier transform in the analysis ties to converge to the correct solution in those frequency re-
filterbank is real-valued, and the output signal from the Fourier gions where the magnitude coherence function is close to one.
transform in the synthesis filterbank is also real-valued, only two In Fig. 4, the magnitude coherence function is shown for speech
Fourier transforms are needed for the four analysis filterbanks signals recorded as described above. The power spectral esti-
and one Fourier transform for the two synthesis filterbanks. In mates, , were calculated using the Welch method [24] with
order to separate the one complex valued Fourier transform into a Hanning window of 8196 samples on the signals recorded at
two real-valued transforms, extra additions are needed 16 kHz. In the estimations, 30 s long signals were used. Fig. 4(a)
[22]. If the Fourier transforms are implemented with a Radix 2 shows the magnitude coherence between the left and right chan-
structure, they each need real multi- nels for an unprocessed transmission room speech recording.
plications and additions [22]. Also the The correlation between the channels can be reduced by pre-
filterbanks need to be updated at times the fullband rate. The processing the signals. In Fig. 4(b), the magnitude coherence
total number of multiplications of the four analysis filterbanks for signals preprocessed with half-wave rectifiers, Section II-B,
is and the total number of with is shown. The signals can also be decorrelated
additions is . In the synthesis fil- by the use of a coder and a decoder [5]. In Fig. 4(c), the mag-
terbank, extra additions are needed to update the state vector nitude coherence function is shown for a signal that has been
in each filterbank, (35). The total number of multiplications for coded/decoded by an MPEG Layer III coder [11] and finally in
the two synthesis filterbanks is Fig. 4(d), the result using an MTPC coder [12] is shown. Both
518 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001

Fig. 4. Magnitude coherence between right and left transmission room signal
in the echo canceler. Average signal to noise ratio is approximately 40 dB.
(a) Unprocessed transmission room signals, (b) signals preprocessed with the
half-wave rectifier, = 0:5, (c) signal coded/decoded with an MPEG layer
III audio coder [11], coded at 32 kbit/s per channel, and (d) MTPC coder [12],
same conditions as Fig. 4(c).

coders coded the left and right channels separately at 32 kbit/s Fig. 5. Transmission room signals used as excitation signal in Figs. 6–8. The
speaker changes position at 5.1 s. (a) Left channel and (b) right channel.
per channel.
One way to show the effectiveness of the decorrelation is to
study how the performance of the echo canceler decreases after
a position change of the transmission room speaker. As a perfor-
mance index the normalized mean square error1 (MSE) energy
of the residual is used. The MSE is given by

(17)

(18)

where denotes the receiving room background noise signal


and LPF denotes a lowpass filter; in this case it has a single real
pole at 0.999. is analogously calculated. In all examples,
except the double-talk example, the background noise signal
Fig. 6. Residual echo after the echo canceler without nonlinear residual echo
is unknown, and cannot be subtracted according to (18). This suppression. The clip shows the convergence on a speech signal with 16 kHz
will somewhat increase the MSE. The excitation signals to be sampling rate, and a change of position of the transmission room speaker at
used in the examples are shown in Fig. 5. The left channel 5.1 s. The SNR was approximately 40 dB in both the transmission room and the
receiving room. (a) MSE [see (17)] of the residual unprocessed signal and (b)
is shown above the right channel. In this signal, the speaker MSE of signal processed with the half-wave rectifier, = 0:5.
moves from a position close to the left microphone, to a po-
sition closer to the right microphone at 5.1 s. Shown in Fig. 6(a)
is the MSE resulting from echo cancellation of a signal that has
not been processed with the half-wave rectifier. Especially no-
tice the severe increase of MSE after the instantaneous trans-
mission room speaker position change at 5.1 s. In Fig. 6(b), the
mean square error for the same excitation signal but processed
with the half-wave rectifier, , is shown. Under these
conditions, the SAECR converges toward the true solution, and
is therefore less sensitive to echo path changes in the transmis-
sion room. Because of this, the MSE is almost unaffected by
the transmission room speaker position change at 5.1 s. Simu-
lations have shown similar behavior for the coded/decoded sig-
nals used in Fig. 4(c) and (d) as for signals processed with the
half-wave rectifier . This suggests that the magnitude
coherence function is an effective measure of how the condition Fig. 7. Subband MSE [see (17)] at two time instances, solid line at 5.1 s and
of the correlation matrix (8) affects the performance of dashed line at 5.4 s, where the latter is directly after a speaker position change
in the transmission room. (a) Unprocessed signal, the same conditions as in
the SAECR. Fig. 6(a) and (b) signal processed with the half-wave rectifier, = 0:5, as in
In Fig. 7, the MSE as a function of subband number is shown Fig. 6(b).
for two time instances, before (solid line) and directly after
(dashed line) the transmission room speaker changes positions. in the lower subbands. This corresponds to the region where the
In Fig. 7(a), where an unprocessed signal was used as input for channels are highly correlated, compare Fig. 4(a). In Fig. 7(b), a
the echo canceler, it is shown that the increase in MSE is severest signal processed with the half-wave rectifier, , was used.
1Since we normalize with the power of the echo. We can regard this as the Since the channels are less correlated in this case, there are only
inverse of the echo return loss enhancement (ERLE). minor differences in the MSE before and after the transmission
ENEROTH et al.: REAL-TIME IMPLEMENTATION OF A STEREOPHONIC ACOUSTIC ECHO CANCELER 519

Fig. 8. Subband convergence comparison between the two-channel FRLS


and NLMS algorithms. The figures show the MSE performance for one typical
lower and one typical higher subband. Solid line FRLS, dashed line NLMS.
(a) Subband 7 (center frequency at 1.75 kHz) and (b) subband 18 (center Fig. 10. Double-talk situation, only the left channel is shown. (a) Echo-signal
frequency at 4.5 kHz). without double-talk, (b) double-talk signal, and (c) MSE performance: (dashed
line) SAECR without the two-path structure and (solid line) SAECR with the
two-path structure.

was however not used in the simulations. The suppressor con-


sists of three parts. The first is a short-time transmission room
energy based suppressor, which increases suppression as the
speech energy in the transmission room increases. The second
Fig. 9. Fullband convergence comparison between the two-channel FRLS
suppressor, an echo path gain based suppressor [25], can be
and NLMS algorithms. The figure shows the MSE performance for different viewed as a mild form of center clipping in that if the residual
SAECR setups. Solid line: FRLS is used in all subbands. Dashed line: NLMS echo is very strong, it is left unaffected, but when it is below a
is used in all subbands. Dotted line: FRLS in the lower 16 subband, NLMS in
the upper 17 subbands.
threshold, it is attenuated by an amount roughly proportional to
the residual echo signal. Finally, comfort noise is added to the
residual echo signal. Without comfort noise, the listener may be
room speaker changes positions. Fig. 7 also indicates that the annoyed by the rapid changes of suppression by the two sup-
channel decorrelation is more important in the lower frequency pressors.
regions than in the higher ones, in practical situations. The subband structure enhances the system in several ways,
Other simulations have shown that when a background noise including reducing the calculation complexity as has been
source, in our case fan noise from a personal computer, was shown in this paper. Another important advantage is the ability
emitted in the transmission room, the correlation between the to run the adaptive algorithms on parallel processing units.
channels was reduced. Convergence of the adaptive filters was In the real-time system, the analysis and synthesis filterbanks
improved, especially in the higher frequency regions. Though are processed on one DSP, whereas the adaptive filters are
channel decorrelation is still needed in normal office environ- distributed over several DSPs. To be more specific, the adaptive
ments, especially in the lower frequency regions. filter, in Fig. 3, is distributed, whereas the filtering, in
In the previous examples, the two-channel FRLS algorithm Fig. 3, is performed on the DSP which also executes the filter-
is used in all subbands. In order to reduce the calculation com- banks [26]. This way, no extra signal path delay is introduced
plexity, it is possible to switch to an NLMS algorithm in the by the parallel structure. As the inherent transmission signal
upper subbands without significantly reducing the performance delay is a clear disadvantage of subband structures, this is of
of the echo canceler. In Fig. 8, the MSE performance of the importance. Examples of the delay introduced by the filterbank
FRLS and the NLMS algorithm are shown for one typical lower are given in Table II.
and one typical higher subband. The impressive performance
gain of the FRLS algorithm only applies to the lower subbands.
The performance of a system with FRLS in the lower and NLMS V. SUMMARY
in the higher subbands is shown in Fig. 9.
Finally, a double-talk situation is shown. In contrast to the With the real-time implementation of the stereophonic
previous figures, the receiving room is simulated, using 4096 acoustic echo canceler presented in this paper, we have been
taps long room impulse responses. This in order to be able to able to confirm that the use of two channels significantly
remove the double-talk signal before calculating the MSE, (17). enhances the ability to aurally separate the speakers in video
The signal was processed with the half-wave rectifier, , conferencing systems. Therefore a listener in the receiving
and the two-channel FRLS algorithm was used in all subbands. room has an improved ability to distinguish one speaker in
The result is shown in Fig. 10. the transmission room, when other speakers, also situated in
The real-time system also includes a device to suppress the the transmission room but at other positions, are talking at the
residual echo after the adaptive filter (see Fig. 2). This device same time.
520 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001

It has also been confirmed that decorrelation is crucial for sta- as


bility of the system, both in real-time experiments and in off-line
simulations on real-life recorded signals presented in the pre-
vious section. The studies also confirm that without a decorre- .. (23)
.
lator, it is unlikely that the echo canceler converges to the cor-
rect solution. This is especially the case in the lower subbands,
and in situations when the transmission room background noise Due to the downsampling, new input samples are needed for
is low. Finally it is shown that the two-channel FRLS adaptive each new subband sample, denoted a frame. Now the output
algorithm is superior to the NLMS algorithm in the lower sub- sample in subband , frame can be expressed as
bands, but the performance gain is less in the upper subbands. , where the fullband input vector is defined as
The RLS algorithm is notorious for its stability problems.
However, with the stabilization enhancements presented in this (24)
paper, and a proper initialization, it has been possible to use
the algorithm in a controlled manner, and it has shown a very
fast convergence rate. Even if tracking/reconverging is some- By exchanging the modulation vector for the modulation
what slower than initial convergence (see Fig. 10). Double-talk matrix
situations can severely decrease the performance of the adaptive
filter. In the system, the two-path structure handles these situa- (25)
tions. It also takes care of problems when the FRLS is restarted.
Restarts are necessary for stabilization of the FRLS, e.g., in a a vector containing all subband samples at frame may be cal-
simulation where 24 h of real-life data was processed, each sub- culated as
band was restarted on average every 18 s.
Finally, the authors would like to comment simulation data (26)
used in this paper. Simulations have been conducted under sev-
eral different situations. It was chosen to use data with fairly The efficient DFT polyphase filterbank structure can be seen
high SNR (40 dB) in the paper. This since background noise in (26). Since is a real sparse matrix, can be calculated
will decorrelate the channels, and thereby reduce the “stereo- with real multiplications. Secondly, the modulation matrix
phonic” problem. That is, with lower SNR the MSE for a con- divided by the number of subbands, , is also known as
verged system will increase, but the increase of MSE due to the inverse DFT matrix. Therefore, the calculation complexity
transmission room speaker change [Fig. 6(a)] will be less ob- of (26) can be reduced by using efficient inverse fast Fourier
vious. transforms (IFFT). Let us return to (25) once again. Due to the
symmetry in the modulation vector (21), the following relation
APPENDIX I between rows can be seen in (25)
ANALYSIS FILTERBANK
The signal for subband , is calculated by bandpass (27)
filtering and downsampling of the fullband input signal
where denotes the conjugate operator. Consequently,
as long as the input signal and the
elements of the prototype filter matrix are real-valued.
(19) Therefore, it is only necessary to calculate the first
subbands, and also only necessary to apply the adaptive filters,
where the subband filter is denoted with length and the echo canceler, in the lowest subbands. It should
the downsampling factor is . The subband filters, , are be noted that complex valued adaptive filters are needed.
modulated versions of the low-pass prototype filter . If
denotes the number of subbands in the filterbank, the individual APPENDIX II
subband filters can be expressed with the modulation, SYNTHESIS FILTERBANK
, where The synthesis filterbank reconstructs the fullband signal,
, by a summation of the interpolated and filtered subband
signals . For reasons that will be seen later in the section,
(20) it is advantageous to use a state description of the filterbank
filters. The input signal to the filterbank, , is upsampled
(21)
times. Then, copies of the interpolated signal are each
(22) multiplied with the corresponding synthesis filterbank filter-tap
weight, . Each sample factor is added to a state variable
The prototype filter matrix is of size , and de- , in the state vector . For every fullband sample
notes a diagonal matrix with elements from the prototype filter interval, the state vector is shifted one position. The fullband
ENEROTH et al.: REAL-TIME IMPLEMENTATION OF A STEREOPHONIC ACOUSTIC ECHO CANCELER 521

signal can then be calculated as the sum of the upper state Since aliasing in the filterbank will drastically decrease the
variables, . Let us now define the state performance of the SAECR, aliasing suppression is an impor-
vector as one upper and one lower vector tant property of the prototype filter. Hence, alias suppressing
subbands in combination with noncritical down-sampling,
allow us to neglect aliasing cancellation in the filterbank. The
(28)
only requirement for the filterbank, neglecting small amounts
(29) of aliasing, is to ensure that the analysis–synthesis system is a
pure delay,

(30)
(36)
Due to the interpolation, every subband sample is fol-
lowed by zeros. For these zeros, the state vectors are only where is the prototype filter length, is the number of sub-
updated with shifts. The output signal can therefore be cal- bands, is the downsampling factor,
culated for samples, i.e., one frame at a time is the analysis filter, and is the
synthesis filter. In order to guarantee linear phase, we let
. If we define the noncausal filter
, which only differ from the analysis-syn-
(31) thesis system, , by a delay, an equivalent requirement
to (36) would be
Similar to the analysis filters, the synthesis filterbank filters
are modulated versions of the low-pass prototype filter (37)
. Therefore, the filter vector
Now, we can formulate the filter design problem as a minimiza-
(32)
tion problem. First we need to make sure that the passband is
sufficiently flat, by minimizing
is defined as . Here, is a sparse prototype matrix
of size
(38)
(33)
The region where two adjacent bands overlap also needs to be
where is defined as in (23) and is defined in (21). flat, i.e., we minimize
As previously mentioned, the state vectors can be updated on
a frame basis. The state vectors are shifted positions and the (39)
input subband sample, multiplied with , is added

Finally, we need to minimize aliasing by enforcing good stop-


(34) band attenuation

When reconstructing the output signal , it is enough to know (40)


the sum of the state vectors, as is shown in (31). Therefore, the
state vectors do not need to be updated individually, instead it is The total minimization problem can now be expressed as
sufficient to update the sum of the state vectors, as
(41)
(35)
subject to the constraints
where is defined in (25) and (42)
. Fi-
nally, (35) can be calculated with a computationally efficient (43)
IFFT algorithm, and the output for frame is then the upper
where are tradeoff parameters and
elements of the state vector.
. Quadratic programming [27] has been
found to be a fast and stable method to solve this minimization
APPENDIX III problem. The filters used in the simulations are designed with
PROTOTYPE FILTER FOR NONCRITICAL DOWNSAMPLING
this method, and are shown in Fig. 11. In [28], a method for
In this Appendix, we will formulate a filterbank design calculating from a given is presented. An alternative
method as a minimization problem. method to design filters that satisfies (36) can be found in [29].
522 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 5, JULY 2001

[16] A. Gilloire and M. Vetterli, “Adaptive filtering in subbands with critical


sampling: Analysis, experiments, and application to acoustic echo can-
cellation,” IEEE Trans. Signal Processing, vol. 40, pp. 1862–1875, Aug.
1992.
[17] M. Vetterli and J. Kovac̆ević, Wavelets and Subband
Coding. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[18] G. Strang and T. Nguyen, Wavelet and Filter Banks. Cambridge, U.K.:
Wellesley-Cambridge, 1996.
[19] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood
Cliffs, NJ: Prentice-Hall, 1993.
[20] N. J. Fliege, Multirate Digital Signal Processing. New York: Wiley,
Fig. 11. Filterbank for subband m = 0 and m = 1. Aliasing for subband 0 1994.
occurs around f = 1=(2r ). [21] W. Kellermann, “Analysis and design of multirate systems for cancella-
tion of acoustical echoes,” in Proc. Int. Conf. Acoustics Speech Signal
Processing, 1988, pp. 2570–2573.
[22] H. Sorensen, D. Jones, M. Heideman, and S. Burrus, “Real-values fast
This method has more stringent requirements, forcing and Fourier transform algorithms,” IEEE Trans. Acoust., Speech, Signal Pro-
to be equal to zero, but also needs a larger filter length in cessing, vol. ASSP-35, pp. 849–863, June 1987.
order to achieve good stopband attenuation. [23] D. A. Berkley and J. L. Flanagan, “HuMaNet: An experimental human-
machine communications network based on ISDN wideband audio,”
AT&T Tech. J., vol. 69, pp. 87–99, Sept./Oct. 1990.
[24] P. D. Welch, “The use of fast Fourier transform for the estimation of
ACKNOWLEDGMENT power spectra: A method based on time averaging over short, modified
periodograms,” IEEE Trans. Audio Electroacoust., vol. 15, pp. 70–73,
The authors would like to thank Dr. G. W. Elko without whose June 1967.
enthusiastic engagement this project would not have been nearly [25] E. J. Diethorn, private communication, 1998.
[26] P. Eneroth, S. L. Gay, T. Gänsler, and J. Benesty, “An implementation
as successful. The authors also highly value the dialogue with of a stereophonic acoustic echo canceler on a general purpose DSP,” in
Dr. D. Morgan during the project. Proc. ICSPAT, 1999.
[27] T. F. Coleman and Y. Li, “A reflective Newton method for minimizing
a quadratic function subject to bounds on some of the variables,” SIAM
J. Optim., vol. 6, no. 4, pp. 1040–1058, 1996.
REFERENCES [28] R. Boite and H. Leich, “A new procedure for the design of high order
[1] J. Benesty, D. R. Morgan, J. Hall, and M. M. Sondhi, “Synthesized stereo minimum phase FIR digital or CCD filters,” Signal Process., pp.
combined with acoustic echo cancellation for desktop conferencing,” 101–108, 1981.
Bell Labs Tech. J., vol. 3, pp. 148–158, July–Sept. 1998. [29] G. Wackersreuther, “On the design of filters for ideal QMF and
[2] M. M. Sondhi and D. R. Morgan, “Acoustic echo cancellation for stereo- polyphase filter banks,” AEÜ, vol. 39, no. 2, pp. 123–130, 1985.
phonic teleconferencing,” in IEEE ASSP Workshop Applications Signal
Processing Audio Acoustics, 1991.
[3] M. M. Sondhi, D. R. Morgan, and J. L. Hall, “Stereophonic acoustic echo
cancellation—An overview of the fundamental problem,” IEEE Signal
Processing Lett., vol. 2, pp. 148–151, Aug. 1995. Peter Eneroth was born in Kristianstad, Sweden, in
[4] J. Benesty, D. R. Morgan, and M. M. Sondhi, “A better understanding 1969. He received the M.S. degree in electrical en-
and an improved solution to the specific problems of stereophonic gineering and the Ph.D. degree in signal processing
acoustic echo cancellation,” IEEE Trans. Speech Audio Processing, vol. from Lund University, Lund, Sweden, in 1995 and
6, pp. 156–165, Mar. 1998. 2001, respectively.
[5] T. Gänsler and P. Eneroth, “Influence of audio coding on stereophonic During the Fall of 1998 and 1999, he was with
acoustic echo cancellation,” in Proc. IEEE Int. Conf. Acoustics Speech Bell Labs, Lucent Technologies, Murray Hill, NJ,
Signal Processing, 1998, pp. 3649–3652. as a Consultant. Since March 2001, he has been
[6] A. Gilloire and V. Turbin, “Using auditory properties to improve the with Telia Research, Sweden. His research interests
behavior of stereophonic acoustic echo cancellers,” in Proc. IEEE Int. include adaptive filtering, subband signal processing,
Conf. Acoustics Speech Signal Processing, 1998, pp. 3681–3684. echo cancellation, and real-time implemetations.
[7] S. Shimauchi, Y. Haneda, S. Makino, and Y. Kaneda, “New configura-
tion for a stereo echo canceller with nonlinear pre-processing,” Proc.
IEEE Int. Conf. Acoustics Speech Signal Processing, pp. 3685–3688,
1998.
[8] M. Ali, “Stereophonic echo cancellation system using time-varying
all-pass filtering for signal decorrelation,” Proc. IEEE Int. Conf. Steven L. Gay was born in Kansas City, MO, in
Acoustics Speech Signal Processing, pp. 3689–3692, 1998. 1958. He received the B.S. degree in electrical en-
[9] S. Haykin, Adaptive Filter Theory. Englewood Cliffs, NJ: Prentice- gineering from the University of Missouri, Rolla, in
Hall, 1996. 1979, the M.S. degree in electrical engineering from
[10] D. R. Morgan, J. L. Hall, and J. Benesty, “Investigation of several types the California Institute of Technology, Pasadena,
of nonlinearities for use in stereo acoustic echo cancellation,” IEEE in 1981 under the Bell Labs one-year-on-campus
Trans. Speech Audio Processing, submitted for publication. (OYOC) program, and the Ph.D. degree in electrical
[11] B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Intro- engineering from Rutgers, The State University of
duction to MPEG-2. London, U.K.: Chapman & Hall, 1997, ch. 4, pp. New Jersey, New Brunswick, in 1994.
55–79. He joined Bell Labs, Whippany, NJ, in 1980. After
[12] S. A. Ramprashad, “A multimode transform predictive coder (MTPC) working on speech coding at ITT-Defense Commu-
for speech and audio,” in Proc. IEEE Speech Coding Workshop, June nications Division from 1982 to 1984, he returned to Bell Labs to work in speech
1999. coding, echo cancellation, and other digital signal processing research and ap-
[13] J. Benesty, F. Amand, A. Gilloire, and Y. Grenier, “Adaptive filtering plications in various development departments. In 1991, he joined the Acoustics
algorithms for stereophonic acoustic echo cancellation,” in Proc. IEEE and Audio Research Department, Bell Labs, Murray Hill, NJ. His main research
Int. Conf. Acoustics Speech Signal Processing, 1995, pp. 3099–3102. topics include network and acoustic echo control, efficient adaptive filtering al-
[14] K. Ochiai, T. Araseki, and T. Ogihara, “Echo cancellation with two path gorithms, and real-time implementations. He co-edited the book Acoustic Signal
models,” IEEE Trans. Commun., vol. COM-25, pp. 589–595, June 1977. Processing for Telecommunication.
[15] M. G. Bellanger, Adaptive Digital Filters and Signal Analysis. New Dr. Gay co-chaired the 1999 International Workshop on Acoustic Echo and
York: Marcel Dekker, 1987. Noise Control and continues to serve on its standing committee.
ENEROTH et al.: REAL-TIME IMPLEMENTATION OF A STEREOPHONIC ACOUSTIC ECHO CANCELER 523

Tomas Gänsler (S’91–M’96) was born in Sweden Jacob Benesty (M’98) was born in Marrakesh,
in 1966. He received the M.S. degree in electrical en- Morocco, in 1963. He received the M.S. degree in
gineering and the Ph.D. degree in signal processing microwaves from Pierre and Marie Curie University,
from Lund University, Lund, Sweden, in 1990 and Paris, France, in 1987, and the Ph.D. degree in
1996, respectively. control and signal processing from Orsay University,
From 1997 to September 1999, he was an Orsay, France, in April 1991. During his Ph.D.
Assistant Professor at Lund University. During studies, he worked on adaptive filters and fast
1998 he was with Bell Labs, Lucent Technologies, algorithms at the Centre National d’Etudes des
Murray Hill, NJ, as a Consultant; in October 1999, Telecomunications (CNET), Paris, France.
he became Member of Technical Staff. His research From January 1994 to July 1995, he was with
interests include robust estimation, adaptive filtering, Telecom Paris working on multichannel adaptive fil-
mono/multichannel echo cancellation and subband signal processing. ters and acoustic echo cancellation. He joined Bell Labs, Lucent Technologies
(formerly AT&T), Murray Hill, NJ, in October 1995, first as a Consultant and
then as Member of Technical Staff. Since then, he has been working on stereo-
phonic acoustic echo cancellation, adaptive filters, source localization, robust
network echo cancellation, and blind deconvolution. He is co-editor/co-author
of the book Acoustic Signal Processing for Telecommunication.
Dr. Benesty was the co-chair of the 1999 International Workshop on Acoustic
Echo and Noise Control.

Вам также может понравиться