Вы находитесь на странице: 1из 6

Fast and adaptive blind audio source separation using

recursive Levenberg-Marquardt synchrosqueezing


Dominique Fourer, Geoffroy Peeters

To cite this version:


Dominique Fourer, Geoffroy Peeters. Fast and adaptive blind audio source separation using recursive
Levenberg-Marquardt synchrosqueezing. ICASSP 2018, Apr 2018, Calgary, Canada. <hal-01799529>

HAL Id: hal-01799529


https://hal.archives-ouvertes.fr/hal-01799529
Submitted on 24 May 2018

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est


archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents
entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non,
lished or not. The documents may come from émanant des établissements d’enseignement et de
teaching and research institutions in France or recherche français ou étrangers, des laboratoires
abroad, or from public or private research centers. publics ou privés.
FAST AND ADAPTIVE BLIND AUDIO SOURCE SEPARATION USING RECURSIVE
LEVENBERG-MARQUARDT SYNCHROSQUEEZING

Dominique Fourer Geoffroy Peeters

UMR STMS (IRCAM - CNRS - UPMC)


dominique@fourer.fr, geoffroy.peeters@ircam.fr

This paper revisits the Degenerate Unmixing Estimation bined with the Levenberg-Marquardt algorithm [12] to be-
Technique (DUET) for blind audio separation of an arbitrary come adaptive thanks to a damping parameter. These methods
number of sources given two mixtures through a recursively have shown their interest for disentangling multicomponent
computed and adaptive time-frequency representation. Re- signals or to obtain physically interpretable components [13].
cently, synchrosqueezing was introduced as a promising Thus, we propose to revisit a well known and efficient
signal disentangling method which allows to compute re- algorithm for blind source separation called Degenerate Un-
versible and sharpen time-frequency representations. Thus, mixing Estimation Technique (DUET) [4, 14], which operates
it can be used to reduce overlaps between the sources in the in the time-frequency plane using a disjoint orthogonality as-
time-frequency plane and to improve the sources’ sparsity sumption between the sources. This approach can estimate
which is often exploited by source separation techniques. an arbitrary number of sources from a stereophonic mixtures
Furthermore, synchrosqueezing can also be extended using in a blind configuration and has shown its efficiency on real-
the Levenberg-Marquardt algorithm to allow a user to adjust world audio signals. Due to its simplicity and its robustness,
the energy concentration of a time-frequency representation the DUET algorithm is suitable for an investigation of the TFR
which can be efficiently implemented without the FFT al- role in a source separation algorithm in realistic blind audio
gorithm. Hence, we show that our approach can improve source separation scenarios. We also show that our proposed
the quality of the source separation process while remaining approaches can be implemented in terms of recursive filtering
suitable for real-time applications. to allow real-time applications without using the Fast Fourier
Transform (FFT) algorithm.
Index Terms— blind source separation, time-frequency
The paper is organized as follows. The Levenberg-
analysis, synchrosqueezing, Levenberg-Marquardt algorithm.
Marquardt algorithm applied to the recursive synchrosqueezed
Short-Time Fourier Transforms (STFT) is introduced in Sec-
1. INTRODUCTION tion 2. A blind source separation method which combines
Source separation is the task which aims at estimating the DUET and synchrosqueezing is then proposed in Section 3
source components present in a mixture [1]. It could allow and evaluated by numerical experiments in Section 4. Finally,
a user to freely manipulate each isolated instrument in a poly- results and future works are discussed in Section 5.
phonic audio mixture and could find many other practical ap-
plications (e.g. karaoke, remixing, denoising, etc.). The blind 2. THE RECURSIVE LEVENBERG-MARQUARDT
degenerate case (where the number of sources is greater than SYNCHROSQUEEZING TRANSFORM
the number of observed mixtures) remains the most challeng-
ing. State-of-the-art methods should use strong assumptions Let X h (t, ω) denote for any time t and any angular frequency
over the sources, such as sparsity [2], harmonicity [3], or dis- ω, the STFT of a signal x(t) using a differentiable analysis
joint orthogonality [4] which can only be revealed thanks to a window h(t), defined as:
proper signal representation. Nowadays, these signal proper- Z
ties are also exploited by promising recent methods based on X h (t, ω) = x(u)h(t − u)∗ e−jωu du (1)
R
deep neural networks [5] or Kernel Additive Modeling (KAM) Z
[6] which have shown their ability to capture source-specific = e−jωt x(t − u) h(u)∗ ejωu du (2)
features from a time-frequency representation (TFR). R | {z }
g(u,ω)
The synchrosqueezing transform [7, 8] was introduced as

a sharpening method which contrarily to the reassignment z being the complex conjugate of z and j 2 = −1. Since
[9], provides reversible TFRs. This method was recently ex- |X h (t, ω)|2 provides a TFR called spectrogram, a signal re-
tended to frequency modulated signals [10, 11] and was com- construction can be provided using [12]:
Z
This research has received funding from the European Union’s Horizon 1 dω
2020 research and innovation program under grant agreement no 688122.
x(t − t0 ) = X h (t, ω) ejω(t−t0 ) (3)
h(t0 )∗ R 2π
for any time delay t0 ≥ 0 verifying h(t0 ) 6= 0. In [12, 15], we 2.2. Levenberg-Marquardt synchrosqueezing
showed that Eq. (2) can be viewed as a convolution product
of x with a filter g(t, ω) = h(t) ejωt , which can be efficiently Reflection on synchrosqueezing was continued, and by anal-
implemented in terms of a recursive filtering process if we use ogy, the Levenberg-Marquardt root finding algorithm has
k−1 t been used to compute new reassignment operators to adjust
a specific analysis window hk (t) = T kt(k−1)! e− T U (t), with the energy localization in the time-frequency plane through a
k ≥ 1 the filter order, T the time spread of the window and damping parameter µ [16]. This parameter could be locally
U (t) the Heaviside step function. Computation details and a matched to the signal content by a voice activity detector [17]
Matlab implementation of this approach is freely available as or by a noise only/signal+noise binary detector [17, 18]. The
a part of ASTRES toolbox in [13]. new operators are computed as:
   
2.1. Synchrosqueezed STFT t̂µ (t, ω) t −1 h
= − ∇t Rxh (t, ω) + µI2 Rx (t, ω)
ω̂µ (t, ω) ω
Synchrosqueezing is a post-processing technique which
(7)
enhances the time-frequency localization of a transform  
while allowing signal reconstruction. For the STFT, the syn- t − t̂x (t, ω)
with Rxh (t, ω) = (8)
chrosqueezing transform is based on the synthesis formula ω − ω̂x (t, ω)
given by Eq. (3) which leads to the following definition [12]:
 h h

∇t Rxh (t, ω) = ∂R ∂t (t, ω)
x ∂Rx
∂ω (t, ω)
(9)
Z
0
SX h (t,ω) = X h (t, ω 0 ) ejω (t−t0 ) δ(ω − ω̂x (t,ω0 )) dω 0 (4)
R
I2 being the 2 × 2 identity matrix and t̂x being the time reas-
signment operator computed as:
where δ(t) denotes the Dirac distribution. This transform
provides a sharpened TFR computed as |SX h (t, ω)|2 (cf. il-
 Th 
X (t, ω)
lustration in Fig. 1), when an efficient local Instantaneous t̂x (t, ω) = t − Re , with T h(t) = t h(t).
X h (t, ω)
Frequency Estimator (IFE) is used for ω̂x [11]. Usually, the (10)
frequency reassignment operator is used such as [9]: Thus, the Levenberg-Marquardt synchrosqueezing trans-
 Dh  form can be computed by replacing ω̂ in Eq. (4) by ω̂µ . This
X (t, ω) dh leads to new adaptive and reversible TFRs which can also be
ω̂x (t, ω) = ω + Im , with Dh(t) = (t).
X h (t, ω) dt efficiently computed through recursive filtering [12, 15].
(5)
An enhanced second-order IFE could also be used to compute 3. BLIND SOURCE SEPARATION
the vertical synchrosqueezed STFT as in [11]. Thus, the origi-
Now, let’s consider a two-channel mixture made of I ≥ 2
nal signal x can be recovered from the synchrosqueezed STFT
sources si . Each active source in the first channel x1 , is atten-
using the following reconstruction formula [12]:
uated by a factor ai and delayed by a duration τi (expressed
1
Z
dω in seconds), in the second channel x2 . Thus, the resulting
x̂(t − t0 ) = ∗
SX h (t, ω) (6) mixture can be modeled as:
h(t0 ) R 2π
XI
where the integration interval can profitably be restricted to x1 (t) = si (t)
the vicinity of the signal ridge for components extraction. i=1
I
X
x2 (t) = ai si (t − τi ) (11)
i=1

which can be expressed in the time-frequency domain as:


 S1h (t, ω)
 
 h  
X1 (t,ω) 1 ... 1 ..
= .
 
X2h (t,ω) a1 e−jωτ1 . . . aI e−jωτI  .
h
SI (t, ω)
(12)
(a) spectrogram |X h (t, ω)|2 (b) synchrosqueezing |SX h (t, ω)|2 Hence, our proposal consists in replacing the computed STFTs
Si by the desired TFR (i.e. a synchrosqueezed version) in the
Fig. 1. Comparison of the TFRs provided by the recursively whole proposed source separation algorithm.
computed spectrogram (a) and the squared modulus of the
3.1. Mixing parameters estimation
recursive synchrosqueezed STFT (b). The analysed signal is a
mixture made of 4 audio sources from the Bach10 dataset. DUET algorithm [4, 14] can recover both the mixing param-
eters and estimates of the original sources by assuming that
only one source is active at each time-frequency coordinate 4. NUMERICAL EXPERIMENTS
(t, ω). This allows us to write the following expressions when 4.1. Audio dataset
a source i is active at a given (t, ω) coordinate:
 h    Our experiments use the Bach10 dataset1 which contains 10
X1 (t,ω) 1
= Sih (t, ω). (13) musical excerpts of classical music for which the isolated
X2h (t,ω) ai e−jωτi
tracks are availables. Each musical piece is made of 4 sources
Thus, the mixing parameters can be estimated when X1 (t, ω) 6= (string instruments) resampled at Fs = 8000 Hz and trun-
0 (resp. X2 (t, ω) 6= 0) such as: cated at the 5 first seconds. The simulated stereophonic mix-
h
X (t, ω)
tures are computed using random mixing parameters ai ∈
âi (t, ω) = 2h (14) {0.4, 0.6, 0.8, 1} and τi ∈ [− F2s ; + F2s ] ensuring that all of
X1 (t, ω) the original sources have distinct parameters.
 h 
1 X2 (t, ω)
τ̂i (t, ω) = − arg , ∀ω 6= 0. (15) 4.2. TFR and W-disjoint orthogonality
ω X1h (t, ω)
For the sake of enhancing the robustness, DUET algorithm Two sources s1 , s2 are said window-disjoint orthogonal if
computes a smoothed 2D histogram using the symmetric at- their windowed Fourier transforms (or STFTs) verify [19]:
tenuation instead of â, computed as:
S1h (t, ω)S2h (t, ω) = 0 ∀(t, ω). (22)
1
α̂i (t, ω) = âi (t, ω) − (16)
âi (t, ω) We now propose to extend this definition to any TFR since
The signal energy is then distributed according to the esti- the source separation quality of DUET depends on how the
mates α̂ and τ̂ over the corresponding axes with a resolution sources can overlap in the time-frequency plane. In [14], the
∆α and ∆τ . Hence, the histogram is computed as: author proposes to measure the W-disjoint orthogonality of a
ZZ source i in a mixture, for a given separation mask Mi using:
H(α, τ ) = |X1h (t, ω)X2h (t, ω)|2 dtdω (17) Di (Mi ) =
(t,ω)∈I(α,τ )
ZZ ZZ
with: |Mi (t,ω)Sih (t,ω)|2 dtdω − |Mi (t,ω)Yi (t,ω)|2 dtdω
I(α, τ ) = {(t, ω) : |α − α̂(t,ω)| < ∆α , |τ − τ̂ (t,ω)| < ∆τ } R2 ZZ R 2

The parameters associated to each source can be deduced |Sih (t,ω)|2 dtdω
from the prominent detected peaks in H(α, τ ) for which the R2
mixing parameter âi can be recovered from α̂i using: X (23)
p where Yi (t, ω) = Sjh (t, ω) denotes the sum of all the
α̂i + α̂i2 + 4 ∀j6=i
âi = (18)
2 other sources present in the analyzed mixture.
which can be deduced after inverting Eq. (16).
using DUET separation mask (best orthogonality: µ=0.060)
3.2. Sources estimation 0.95

Each time-frequency coordinate associated to the histogram


averaged W−disjoint orthogonality

given by Eq. (17) can be associated to the prominent source 0.9


using its corresponding mixing parameters such as:
!
âk e−jωτ̂k X1h (t, ω) − X2h (t, ω)
J(t, ω) = arg min
k 1 + â2k 0.85

(19)
which allows the computation of the binary separation mask
Mi of each source computed as: 0.8

( STFT (Hann)
1 if J(t, ω) = i recur. STFT
Mi (t, ω) = . (20) recur. synch. STFT
0 otherwise 0.75
recur. L−M synch. STFT
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
µ
Finally, the TFR of each source is simply recovered by:
X1h (t, ω) + âk e+jωτ̂k X2h (t, ω) Fig. 3. Approximated W-disjoint orthogonality using differ-
Ŝi (t, ω) = Mi (21)
1 + â2k ent TFRs, as a function of µ. Results are averaged over 10
for which the waveform is reconstructed using the corre- mixtures made of 4 sources from the Bach10 dataset.
sponding synthesis formula (i.e. Eq. (3) when STFT is used 1 http://music.cs.northwestern.edu/data/Bach10.

or Eq. (6) for the synchrosqueezed STFT). html


Source separation quality (µ=0.060) Source separation quality (µ=10.000)
STFT (Hann) STFT (Hann)
45 recur. STFT 45 recur. STFT
recur. synch. STFT recur. synch. STFT
40 recur. L−M synch. STFT 40 recur. L−M synch. STFT

35 35

30 30

25 25
[dB]

[dB]
20 20

15 15

10 10

5 5

0 0

RQF SDR SIR SAR RQF SDR SIR SAR

(a) optimal value (µ = 0.06) (b) arbitrary high value (µ = 10)

Fig. 2. Comparison of the source separation results provided by the proposed methods applied on the Bach10 dataset. (a) and
(b) use different values of the damping parameter µ used by the recursive Levenberg-Marquardt synchrosqueezed STFT (other
TFRs are not affected by µ).

A high value for Di corresponds to a better expected to be known and identical The source separation quality is
source separation quality provided by DUET. Thus, a compar- measured in terms of ReconstructionQuality Factor (RQF)
|x[n]|2
P
ison of the averaged W-disjoint orthogonality measured using computed as [12]: RQF = 10 log10 P |x[n]−x̂[n]|2 and
n

Eq. (3) for each TFR computed on the Bach10 dataset, is dis- n
in terms of Signal-to-Interference Ratio (SIR), Signal-to-
played in Fig. 3. Here, Di of all the sources is also averaged
Distortion Ratio (SDR) and Signal-to-Artifact Ratio (SAR)
between all the pieces of the dataset. Each TFR is computed
which are commonly used measures computed through BSS
using M = 1024 discrete frequency bins. A classical STFT
Eval2 [20]. According to the results displayed in Fig. 2, best
computed using FFT with a Hann window of length 128 ms
separation results (in particular with a higher SIR) are reached
(1024 samples at Fs = 8000 Hz) with an 12 -overlap between
using the recursive Levenberg-Marquardt synchrosqueezed
adjacent frames is used as a baseline method. For the re-
STFT using µ = 0.06. This result confirm the expectation
cursively computed TFRs, an Infinite Impulse Response (IIR)
provided by the W-disjoint orthogonality in Fig. 3. As also
filter of order k = 5 using a window time spread equal to
expected, the recursive synchrosqueezed STFT outperforms
L = T Fs = 100 is used. For the reconstruction a delay equal
the STFT. However, the recursive version of the STFT ob-
to n0 = (k − 1)L samples, which corresponds to the maxi-
tains slightly better results than the STFT computed using the
mum of the causal window, is considered (cf. [12] for details).
FFT, despite a lower averaged value of Di . For the com-
Our results show that the synchrosqueezed STFT provides a
parison, Fig. 2(b) displays results for µ = 10 and shows
better W-disjoint orthogonality than the STFT (recursive or
that Levenberg-Marquardt synchrosqueezing obtains results
classical). When combined with the Levenberg-Marquardt
comparable to those provided by STFT when µ is too high.
algorithm, a maximal Di is reached with µ = 0.06. As theo-
retically investigated in [16, 12], a lower value of µ improves
5. CONCLUSION
the time-frequency energy localization and converges to the
recursive synchrosqueezed STFT results and higher values In this paper, we have proposed new extensions of the DUET
for µ converge to the recursive STFT results due to a poorer source separation algorithm using a recursive implementation
time-frequency localization. Interestingly, the best results are of the Levenberg-Marquardt synchrosqueezed STFT. We have
provided by a trade-off with a value of µ which is not too shown that synchrosqueezing allows to compute a sharpen
small. The validity of this choice is also confirmed by the and reversible TFR which can improve the disjoint orthogo-
source separation results presented in the next section. nality between the sources. This results in a significant im-
provement of the source separation results in comparison with
4.3. Blind source separation results classical TFRs (∆SIR≈+5dB in average). Future works will
Now, we compare the source separation results provided by consist in investigating new source separation methods based
the proposed methods using the same configuration as in on TFR masking using the synchrosqueezing technique.
Section 4.2, which are applied to the Bach10 dataset. For 2 BSS Eval: http://bass-db.gforge.inria.fr/bss_
each compared method, the mixing parameters are assumed eval/
6. REFERENCES [12] D. Fourer, F. Auger, and P. Flandrin, “Recursive ver-
sions of the Levenberg-Marquardt reassigned spectro-
[1] P. Comon and C. Jutten, Handbook of Blind Source Sep- gram and of the synchrosqueezed STFT,” in Proc. IEEE
aration: Independent component analysis and applica- ICASSP, Shanghai, China, May 2016, pp. 4880–4884.
tions, Academic press, 2010.
[13] D. Fourer, J. Harmouche, J. Schmitt, T. Oberlin,
[2] C. Chenot, J. Bobin, and J. Rapin, “Robust sparse blind S. Meignen, F. Auger, and P. Flandrin, “The ASTRES
source separation,” IEEE Signal Process. Lett., vol. 22, toolbox for mode extraction of non-stationary multi-
no. 11, pp. 2172–2176, Nov. 2015. component signals,” in Proc. EUSIPCO, Kos island,
Greece, Aug. 2017, pp. 1170–1174.
[3] Z. Duan, Y. Zhang, C. Zhang, and Z. Shi, “Unsupervised
[14] S. Rickard, Blind Speech Separation, chapter The
single-channel music source separation by average har-
DUET blind source separation algorithm, pp. 217–241,
monic structure modeling,” IEEE/ACM Trans. on Au-
Springer, 2007.
dio, Speech, and Language Processing, vol. 16, no. 4,
pp. 766–778, May 2008. [15] D. Fourer and F. Auger, “Recursive versions of the
Levenberg-Marquardt reassigned scalogram and of the
[4] A. Jourjine, S. Rickard, and O. Yilmaz, “Blind separa- synchrosqueezed wavelet transform,” in Proc. IEEE
tion of disjoint orthogonal signals: Demixing N sources DSP, London, UK, Aug. 2017, pp. 1–5.
from 2 mixtures,” in Proc. IEEE ICASSP, Istanbul,
Turkey, June 2000, vol. 5, pp. 2985–2988. [16] F. Auger, E. Chassande-Mottin, and P. Flandrin, “Mak-
ing reassignment adjustable: the Levenberg-Marquardt
[5] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multi- approach,” in Proc. IEEE ICASSP, Kyoto, Japan, Mar.
channel audio source separation with deep neural net- 2012, pp. 3889–3892.
works,” IEEE/ACM Trans. on Audio, Speech, and Lan-
guage Processing, vol. 24, no. 10, pp. 1652–1664, June [17] Q.-H. Jo, J.-H. Chang, J.W. Shin, and N.S. Kim, “Statis-
2016. tical model-based voice activity detection using support
vector machine,” IET Signal Processing, vol. 3, no. 3,
[6] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and pp. 205–210, May 2009.
L. Daudet, “Kernel additive models for source separa-
[18] J. Huillery, F. Millioz, and N. Martin, “On the de-
tion,” IEEE Trans. Signal Process., vol. 62, no. 16, pp.
scription of spectrogram probabilities with a chi-squared
4298–4310, June 2014.
law,” IEEE Trans. Signal Process., vol. 56, no. 6, pp.
[7] I. Daubechies and S. Maes, “A nonlinear squeez- 2249–2258, June 2008.
ing of the continuous wavelet transform,” Wavelets in [19] S. Rickard and O. Yilmaz, “On the approximate
Medecine and Bio., pp. 527–546, 1996. W-disjoint orthogonality of speech,” in Proc. IEEE
ICASSP, Orlando, FL, USA, May 2002, vol. 1, pp. 529–
[8] F. Auger, P. Flandrin, Y.T. Lin, S. McLaughlin,
532.
S. Meignen, T. Oberlin, and H.T. Wu, “TF reassignment
and synchrosqueezing: An overview,” IEEE Signal Pro- [20] E. Vincent, R. Gribonval, and C. Févotte, “Perfor-
cess. Mag., vol. 30, no. 6, pp. 32–41, Nov. 2013. mance measurement in blind audio source separation,”
IEEE/ACM Trans. on Audio, Speech, and Language
[9] F. Auger and P. Flandrin, “Improving the readability of Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.
time-frequency and time-scale representations by the re-
assignment method,” IEEE Trans. Signal Process., vol.
43, no. 5, pp. 1068–1089, May 1995.

[10] T. Oberlin, S. Meignen, and V. Perrier, “Second-order


synchrosqueezing transform or invertible reassignment?
Towards ideal time-frequency representations,” IEEE
Trans. Signal Process., vol. 63, no. 5, pp. 1335–1344,
Mar. 2015.

[11] D. Fourer, F. Auger, K. Czarnecki, Meignen, and Flan-


drin, “Chirp rate and instantaneous frequency estima-
tion: Application to recursive vertical synchrosqueez-
ing,” IEEE Signal Process. Lett., vol. 24, no. 11, pp.
1724–1728, Nov. 2017.

Вам также может понравиться