Abstract—In audio engineering, pitch shifting is a term that describes the process of changing the pitch of an audio signal (which is related to the logarithm of the frequency); but, as opposed to resampling, pitch shifting retains the duration of the original signal. This process is very useful, for example, in correcting slightly out-of-tune singers or instruments. The dual problem is changing the signal duration without affecting its pitch; this is called time stretching. One of the most well-known techniques for performing pitch shifting is called Phase Vocoding, but this method presents some artifacts. This project aims to overcome some of the deficiencies of the Phase Vocoder by using a wavelet-based approach.

Index Terms—Pitch shift, vocoder, phase vocoder, wavelet, Morlet

I. INTRODUCTION

Pitch shifting describes the process of changing the pitch of an audio signal (which is related to the logarithm of the frequency); but, as opposed to resampling, pitch shifting retains the duration of the original signal. This process is very useful, for example, in correcting slightly out-of-tune singers or instruments. The dual problem is changing the signal duration without affecting its pitch; this is called time stretching. One of the most well-known techniques for performing pitch shifting is called Phase Vocoding. This is a frequency-based technique, whose main idea is to chop the input signal into very short segments, perform a frequency-domain analysis, change the frequency content of each segment, and convert back to the time domain. Basically, this is summarized as follows:

1) Chop the input signal into very short segments.
2) Perform a frequency-domain analysis of each segment.
3) Change the frequency content of the segment.
4) Convert back to the time domain.

signal). On the other hand, stationary signals do not occur in practice; consider the signal in figure 1, containing two frequencies occurring at different times, and its Fourier Transform magnitude spectrum in figure 2.

Fig. 1. A non-stationary signal
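The chop/analyze/modify/invert outline above can be sketched in a few lines of Python. The following is a deliberately crude illustration (integer bin relocation on scipy's STFT, with none of the Phase Vocoder's phase bookkeeping); it is not the method developed in this project, and the tone frequency and window parameters are arbitrary choices for the demo:

```python
import numpy as np
from scipy import signal

def crude_stft_pitch_shift(x, fs, c, nperseg=1024):
    """Move every STFT bin k to bin round(c*k), then invert.

    This is only the skeleton of the four steps above; a real Phase
    Vocoder also corrects the phases, which is what tames its artifacts.
    """
    _, _, Z = signal.stft(x, fs=fs, nperseg=nperseg)   # 1) chop + 2) analyze
    Z_shift = np.zeros_like(Z)
    for k in range(Z.shape[0]):                        # 3) modify the content
        k2 = int(round(c * k))
        if k2 < Z.shape[0]:
            Z_shift[k2] += Z[k]
    _, y = signal.istft(Z_shift, fs=fs, nperseg=nperseg)  # 4) back to time
    return y

fs = 8000
t = np.arange(fs) / fs                  # one second of signal
x = np.sin(2 * np.pi * 440 * t)         # steady 440 Hz tone
y = crude_stft_pitch_shift(x, fs, 2.0)  # one octave up

def dominant_hz(sig):
    spectrum = np.abs(np.fft.rfft(sig))
    return np.fft.rfftfreq(len(sig), 1 / fs)[np.argmax(spectrum)]

print(dominant_hz(x), dominant_hz(y))   # input peak near 440 Hz, output near 880 Hz
print(len(x), len(y))                   # durations (almost) match
```

On a steady tone this already lands the energy an octave up while keeping the clip length; on real material the uncorrected phases produce exactly the warbling artifacts discussed later.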
Fig. 3. A non-stationary signal's spectrogram

bass drum or snare drum). This is due to the fact that, despite the STFT being "localized" both in time and frequency, it is not localized enough. Consider the effect of computing the spectrogram with a longer and a shorter window width in figures 4 and 5.

Fig. 4. A non-stationary signal with unresolved time components

Fig. 5. A non-stationary signal with unresolved frequency components

We see that in figure 4 the precise location of the beginnings of the tones is uncertain. Likewise, the precise frequency in figure 5 is not obvious. This is because the frequency resolution is inversely proportional to the analysis window length. This is known as the Heisenberg uncertainty principle.

In summary, the STFT provides constant-bandwidth analysis, i.e. high resolution in the high frequencies and low resolution in the low frequencies.

To overcome these problems, the wavelet transform was devised. Just as the STFT, the wavelet transform is a two-dimensional representation of a one-dimensional signal. The parameters are the scale and translation (analogous to frequency¹ and time). The main idea behind the use of wavelets is to decompose a signal into a (possibly infinite) sum of functions which are both time- and frequency-localized. This is very different from the FT or STFT, where the basis functions were sinusoids (which are frequency-localized but not time-localized, since they extend from −∞ to +∞).

\hat{\Psi}_\sigma(\omega) = c_\sigma\, \pi^{-\frac{1}{4}} \left( (\sigma - \omega)\, e^{\sigma\omega} + \omega \right) e^{-\frac{1}{2}\left(\sigma^2 + \omega^2\right)} \quad (2)

The "central frequency" ω_Ψ is the position of the global maximum of \hat{\Psi}_\sigma(\omega), which, in this case, is given by the solution of the equation:

\left(\omega_\Psi - \sigma\right)^2 - 1 = \left(\omega_\Psi^2 - 1\right) e^{-\sigma\omega_\Psi}

The parameter σ in the Morlet wavelet allows a trade-off between time and frequency resolutions. Conventionally, the restriction σ > 5 is used to avoid problems with the Morlet wavelet at low σ (high temporal resolution).

¹ Actually, 1/frequency.
² Jean Morlet was a French geophysicist who did pioneering work in the field of wavelet analysis in collaboration with Alex Grossmann. Morlet invented the term "wavelet" (ondelette) to describe equations similar to those that had been around since the 1930s.
³ This criterion states that ∫_ℝ Ψ_σ(t) dt = 0. It is fundamental for the existence of an inverse wavelet transform, as well as for the applicability of Parseval's identity.

A WAVELET-BASED PITCH-SHIFTING METHOD 3

For signals containing only slowly varying frequency and amplitude modulations (audio, for example), it is not necessary to use small values of σ. In this case, κ_σ becomes very small (e.g. σ > 5 ⇒ κ_σ < 10⁻⁵) and is, therefore, often neglected. Under the restriction σ > 5, the frequency of the Morlet wavelet is conventionally taken to be ω_Ψ ≃ σ.

C. Discrete vs. Continuous Wavelet Transform

and

\int_{-\infty}^{\infty} |\psi(t)|^2\, dt < \infty

components and formants should be transposed accordingly. We analyze the effects of formant shifting in a later section.

As we have already stated, there are many algorithms nowadays, both public-domain and commercial, which carry out pitch shifting with varying degrees of success. To mention a few: the Phase Vocoder is a very well-known technique, which is also very well suited for real-time processing, since it can be implemented quite efficiently.
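The claims made about the Morlet parameters (κ_σ < 10⁻⁵ for σ > 5, the zero-mean criterion of footnote 3, and finite energy) can be checked numerically. The sketch below assumes the standard σ-parameterized complex Morlet wavelet ψ_σ(t) = c_σ π^{−1/4} e^{−t²/2} (e^{jσt} − κ_σ); this defining formula is an assumption on our part, since it is not reproduced in this excerpt:

```python
import numpy as np

sigma = 5.0
kappa = np.exp(-0.5 * sigma**2)                                    # correction term kappa_sigma
c_sig = (1 + np.exp(-sigma**2) - 2 * np.exp(-0.75 * sigma**2))**-0.5  # normalization c_sigma

# Sample the wavelet densely on a support wide enough for the Gaussian tails.
t, dt = np.linspace(-20.0, 20.0, 400001, retstep=True)
psi = c_sig * np.pi**-0.25 * np.exp(-0.5 * t**2) * (np.exp(1j * sigma * t) - kappa)

mean = psi.sum() * dt                   # zero-mean criterion of footnote 3
energy = (np.abs(psi)**2).sum() * dt    # finite-energy condition

print(kappa)       # ~3.7e-6, i.e. < 1e-5 already at sigma = 5
print(abs(mean))   # ~0: the wavelet integrates to zero
print(energy)      # ~1: unit energy, so the energy integral is finite
```

Note that the κ_σ term is precisely what makes the integral vanish exactly; dropping it for σ > 5, as the text suggests, leaves a residual mean on the order of κ_σ, which is inaudible in practice.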
vocoder uses too fine a resolution in the low frequencies and too coarse a resolution in the high frequencies. This results in lost timing, as well as the addition of beating or warbling effects. The term smearing is used to describe the undesirable side effects of Phase Vocoder processing. This is one of the motivations for using the CWT.

The CWT, by definition, provides a grid that is not a discrete one as in the STFT, but a continuous one, as the name implies. The grid is therefore a variable-scale one. The scale is given by the dilation of the wavelet function. The time-frequency representation of the signal can have resolutions tailored to the areas of interest. By using the CWT, one can create a grid which has finer resolutions for increasing frequencies, which is better suited to music. Unlike the Phase Vocoder's, the analysis grid created by the CWT is not a time-frequency grid in the strict sense. Since the CWT performs multiple projections onto scaled wavelet functions, the output actually represents the similarity (i.e. the correlation) between the input signal and the wavelet function at each scale. A frequency is often assigned to each scale value, so that one can achieve an approximate representation of the frequency content. This frequency is the dominant spectral component of the wavelet function. Depending on the wavelet used, this frequency can be a very good or a very poor estimate of the frequency content; viz., if the wavelet has a very spread-out response, then this "central" frequency is not very useful, just as happened in the spectrograms with the STFT. This is a major reason why the Complex Morlet wavelet is used for audio signals. This wavelet has a very narrowband frequency spectrum, which offers a good representation of the frequency content.

III. METHODOLOGY

A. Reproducing Kernel Hilbert Subspace

A reproducing kernel Hilbert space is a function space in which pointwise evaluation is a continuous linear functional. Equivalently, these are spaces that can be defined by reproducing kernels. The subject was originally and simultaneously developed by Nachman Aronszajn (1907-1980) and Stefan Bergman (1895-1977) in 1950.

Let X be an arbitrary set and H a Hilbert space of complex-valued functions on X. H is a reproducing kernel Hilbert space iff the linear map f ↦ f(x) from H to the complex numbers is continuous for any x in X. By the Riesz representation theorem, this implies that for a given x there exists an element K_x of H with the property that:

f(x) = \langle K_x, f \rangle \quad \forall f \in H \quad (*)

The function K(x, y) \overset{\mathrm{def}}{=} K_x(y) is called a reproducing kernel for the Hilbert space. In fact, K is uniquely determined by condition (*).

For example, when X is finite and H consists of all complex-valued functions on X, an element of H can be represented as an array of complex numbers. If the usual inner product is used, then K_x is the function whose value is 1 at x and 0 everywhere else. In other contexts, (*) amounts to saying f(x) = \int_X K(x, y)\, f(y)\, dy.

The bottom line is that the RKHS condition means there exists some unique signal whose transform is the signal we modified. Put differently, if we begin with a signal, transform it to a different domain through a linear functional such as the FT or the CWT, and modify this transformed signal in some way, we have no guarantee that the modified transformed signal corresponds to any signal in the original domain! The condition that guarantees the existence of a time-domain inverse is the RKHS condition, which depends on the kind of modification we perform in the dual domain.

B. Mathematical Foundations of the Pitch Shifting Algorithm

As we have already stated, the classical approach to pitch shifting is to first compute the STFT time-frequency representation of a signal. Afterwards, the grid spacing and coefficient phases are scaled to create a synthesis grid. The inverse is then computed to reconstruct the signal. It is summarized as follows:

1) Compute the STFT representation of the signal.
2) Convert the coefficients into polar form.
3) Unwrap the phase and divide by the scaling factor c.
4) Reconstruct the signal using the new synthesis scale.

The crucial part here is the division of the phase by the scaling factor c. Consider the following property (scaling property) of the Fourier Transform:

f(at) \iff \frac{1}{|a|}\, F\!\left(\frac{\omega}{a}\right)

We want to exploit this property in our approach to scale the pitch of an audio signal. Any complex number z can be written in polar notation as z = |z|\, e^{j\angle z}. In particular,

F(\omega) = |F(\omega)|\, e^{j\angle F(\omega)}

Therefore, F(\omega/a) = |F(\omega/a)| \exp\left(j\angle F(\omega/a)\right).

Recall the definition of group delay:

\tau(\omega) = -\frac{d\angle F(\omega)}{d\omega}

We then get

\tau\!\left(\frac{\omega}{a}\right) = -a\, \frac{d\angle F(\omega/a)}{d\omega} = a\, \tau'(\omega)

where

\tau'(\omega) = -\frac{d\angle F(\omega/a)}{d\omega}

Hence

F\!\left(\frac{\omega}{a}\right) = \left|F\!\left(\frac{\omega}{a}\right)\right| \exp\left(-j \int \tau\!\left(\frac{\omega}{a}\right) d\omega\right) = \left|F\!\left(\frac{\omega}{a}\right)\right| \exp\left(-j\, a \int \tau'(\omega)\, d\omega\right)

so that

f(at) \iff \frac{1}{|a|} \left|F\!\left(\frac{\omega}{a}\right)\right| \exp\left(-j\, a \int \tau'(\omega)\, d\omega\right)

f(at) \iff \frac{1}{|a|}\, |F(\Omega)| \exp\left(j\, a\, \angle F(\Omega)\right), \quad \Omega = \frac{\omega}{a}

The magnitude is scaled by a constant factor, so we won't take this into account, since it corresponds only to a difference in volume. The phase portion of the FT is related to the group delay of the signal's frequency content. To maintain the same time duration when changing the frequencies, the slope of the phase needs to be scaled to either slow down or speed up the group delay, to compensate for the time-scaling side effect. The hardest part in terms of computation is the transformation of the input signal into the 2D time-frequency representation. The quality of this transform will directly affect the output pitch shift. It was stated earlier how the CWT results in a similar 2D signal representation. More care has to be taken when modifying coefficients of the CWT, however, as the reconstruction conditions are much more complex than those of the STFT Phase Vocoder. Any modifications in the dual domain must be made while satisfying the Reproducing Kernel Hilbert Subspace (RKHS) property. Changing the phase components of the Complex Morlet CWT maintains this property. Thus the algorithm used in the Phase Vocoder implementation can be used with the Complex Morlet CWT.

6 FILTER BANKS AND WAVELETS EEN698

C. The Pitch Shifting Algorithm

The pseudocode for the entire CWT-based pitch-shifting algorithm is as follows:

function pitch_shift(in c: real, in x: real[], out y: real[])
    coefficients = cwt(x, scales);
    magnitude = abs(coefficients);
    phase = unwrap(angle(coefficients));
    coefs_shifted = magnitude * exp(j * phase * c);
    scales_shifted = scales / c;
    y = icwt(coefs_shifted, scales_shifted);

For the Morlet wavelet (σ = 5), the CWT expands to:

F(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} f(t)\, \exp\left(-j5\, \frac{t-b}{a}\right) \exp\left(-\frac{1}{2}\left(\frac{t-b}{a}\right)^2\right) dt
        = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} f(t)\, \exp\left(j5\, \frac{b-t}{a}\right) \exp\left(-\frac{1}{2}\left(\frac{b-t}{a}\right)^2\right) dt
        = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} f(t)\, \psi\!\left(\frac{b-t}{a}\right) dt

We rewrite this as

F(a, b) = \frac{1}{\sqrt{|a|}}\, f(b) \otimes \psi\!\left(\frac{b}{a}\right) \quad (6)

where ⊗ denotes convolution.

The final result of Equation 6 can be quickly found in the frequency domain by taking the inverse Fourier transform of:

\tilde{F}(a, \omega) = \sqrt{|a|}\, \tilde{f}(\omega)\, \tilde{\psi}(a\omega)

E. Computation of the Inverse CWT

The inverse CWT is defined by equation 4. We see that it involves a double integral, which can be performed numerically through some integration technique such as the trapezoidal rule, quadrature integration, etc. The trapezoidal rule approximates the area under a curve by fitting a trapezoid between every pair of adjacent sample points. There is also an approximate reconstruction formula, given in [2], which simplifies to a single integration over the dilation scales:

f(t) \approx \frac{1}{K_\psi} \int_{-\infty}^{\infty} \frac{F(a, t)}{a^{3/2}}\, da \quad (7)

IV. RESULTS

A. Algorithm Testing

The algorithm was implemented in MATLAB, and several tests were carried out, which we describe below.

Example 1: We pitch-shift a middle-C tone (256 Hz) by factors of 2 and 1/2 (an octave up and down, respectively). The spectrograms of the original and the two shifted signals are shown in figures 8, 9 and 10, respectively, and the FT of the relevant frequency region is shown in figure 11.

Fig. 8. Spectrogram of a middle-C tone

We see that the algorithm performs well: the frequencies are shifted correctly up and down, and the duration of the clip is preserved.

Example 2: In the next test case, we wanted to see how well the algorithm preserves the temporal characteristics of a signal. To do this, a middle-C tone burst was used as the test signal and shifted an octave higher.
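As a complement to the pseudocode in Section C, the "frequency assigned to each scale" of a Morlet CWT can be made concrete with a small numerical sketch. This is not the MATLAB implementation used in the tests above: the sampling rate, tone frequency and scale grid are arbitrary demo choices, and the normalization constants c_σ and κ_σ are dropped (κ_σ is negligible at σ = 5):

```python
import numpy as np

# A pure tone's CWT energy should peak at the scale whose assigned
# frequency matches the tone: f ~ sigma / (2*pi*a) for the Morlet wavelet.
fs, f0, sigma = 1000.0, 50.0, 5.0
t = np.arange(0, 2.0, 1.0 / fs)
x = np.sin(2 * np.pi * f0 * t)

def morlet_cwt_row(x, fs, a):
    """One row of F(a, b): correlate x with the scale-a Morlet wavelet."""
    n = int(5 * a * fs)                       # truncate support at +-5a seconds
    tau = np.arange(-n, n + 1) / fs
    psi = np.pi ** -0.25 * np.exp(1j * sigma * tau / a - 0.5 * (tau / a) ** 2)
    return np.convolve(x, np.conj(psi)[::-1], mode='same') / np.sqrt(a)

scales = np.geomspace(0.005, 0.05, 41)        # geometric scale grid, in seconds
energy = [np.sum(np.abs(morlet_cwt_row(x, fs, a)) ** 2) for a in scales]

best = scales[int(np.argmax(energy))]
expected = sigma / (2 * np.pi * f0)           # ~0.0159 s
print(best, expected)
```

The scale with the most energy comes out within a few percent of σ/(2πf₀), which is exactly the scale-to-frequency assignment described in the text, and is what makes a geometric scale grid behave like a constant-Q (finer-at-high-frequency) analysis.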
Fig. 9. Spectrogram of a higher C tone

Fig. 11. Fourier Transform of the three C tones

The signal is shown in figures 12, 13 and 14, as well as the spectrograms of the original and shifted signals. We notice there is some smearing of the frequency content; but, considering the abrupt frequency changes, the smearing is not too large, and there is negligible distortion upon playback.

Example 3: Harmonic relations. In this test we verify that a shifted C-major scale preserves the harmonic relations between successive notes. Note that if we were to shift frequencies (as opposed to pitch), we would be shifting the frequency content by an additive (or subtractive, for that matter) value. Instead, we are changing the frequency content by a multiplicative factor, thus changing the pitch by a constant, additive term.

The spectrograms of the original and shifted signals are shown in figures 15 and 16, respectively. By listening to the resultant audio files and from inspection of the spectrograms, we see that the spacing (exponential spacing, that is) between tones is definitely maintained. However, some attenuation can be noticed near the end of the signal.

These results show that our algorithm certainly works. A final test was performed involving speech and music signals. For both of these classes of signals, the algorithm performs fairly well. In music signals, there is a very small distortion artifact in between notes, which corresponds to a discontinuity in the time-frequency grid. Voice signals are very susceptible to distortion when pitch shifting. Smearing can result in a loss of speech coherency, which is almost unnoticeable for instrumentals but very noticeable for voice. Despite this, even in the extreme cases of the 12-semitone shifts, much of what the speaker is saying is still clear. The major distortion that can be noticed is the change of timbre as the pitch is altered. The voice of the singer transforms from beast-like to chipmunk-like as the pitch is changed. Unfortunately, this is one thing the algorithm does not account for.

1) Formants: The algorithm shifts each and every frequency. This can cause the unique characteristics of a signal to change, namely the formants. These are created by a particular person's vocal tract and are what makes a certain voice distinguishable from someone else's. They correspond
(\Delta t)^2 = \frac{\int_{-\infty}^{\infty} t^2\, |f(t)|^2\, dt}{E}

(\Delta \omega)^2 = \frac{\int_{-\infty}^{\infty} \omega^2\, |F(\omega)|^2\, d\omega}{2\pi E}

Claim 1: If \lim_{t\to\pm\infty} t\, |f(t)|^2 = 0, then \Delta t\, \Delta\omega \ge \frac{1}{2}.

We recall the Cauchy-Schwarz inequality for the L² Hilbert space. For any L² functions z(x) and w(x) defined on the interval [a, b],

\left| \int_a^b z(x)\, w(x)\, dx \right|^2 \le \int_a^b |z(x)|^2\, dx \int_a^b |w(x)|^2\, dx \quad (8)
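Claim 1 is the Heisenberg uncertainty principle invoked earlier. The standard argument, sketched here for completeness under the assumption that f is real-valued with energy E = ∫|f(t)|² dt, applies inequality (8) with z(t) = t f(t) and w(t) = f′(t):

```latex
% Integration by parts, using the hypothesis t|f(t)|^2 -> 0 as t -> +-inf:
\int_{-\infty}^{\infty} t\, f(t)\, f'(t)\, dt
  = \left[ \tfrac{1}{2}\, t\, f(t)^2 \right]_{-\infty}^{\infty}
    - \tfrac{1}{2} \int_{-\infty}^{\infty} f(t)^2\, dt
  = -\tfrac{E}{2}
% Parseval's identity turns the w-term into the frequency spread:
\int_{-\infty}^{\infty} |f'(t)|^2\, dt
  = \frac{1}{2\pi} \int_{-\infty}^{\infty} \omega^2\, |F(\omega)|^2\, d\omega
  = (\Delta\omega)^2\, E
% Substituting both into (8):
\frac{E^2}{4} \;\le\; (\Delta t)^2 E \cdot (\Delta\omega)^2 E
\quad\Longrightarrow\quad
\Delta t\, \Delta\omega \ge \tfrac{1}{2}
```

Equality holds exactly for Gaussian f, which is why the Gaussian-windowed Morlet wavelet sits at the optimum of this trade-off.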
REFERENCES
[1] De Gersem P., De Moor B., Moonen M., “Applications of the continuous
wavelet transform in the processing of musical signals”, in Proc. of the
13th International Conference on Digital Signal Processing (DSP97),
Santorini, Greece, Jul. 1997, pp. 563-566
[2] De Gersem P., De Moor B., Moonen M., “Applications of wavelets
in audio and music”, in Record of the KVIV Study-day on Wavelet
analysis: a new tool in signal and image processing, Antwerp, Belgium,
Dec. 1996, 14 p.
[3] S. Goldenstein, J. Gomes, "Time Warping of Audio Signals," in Proc. Computer Graphics International 1999 (CGI'99), 1999, p. 52
[4] S.K. Mitra, "Digital Signal Processing", 2nd Edition, McGraw-Hill, 2001.
[5] A. Oppenheim, R. Schafer. Discrete-time Signal Processing. Prentice-
Hall Signal Processing Series. Prentice-Hall, Upper Saddle River, NJ,
1999
[6] U. Zölzer, "DAFX: Digital Audio Effects," West Sussex, England: Wiley, 2002, pp. 201-282.
[7] R. Kronland-Martinet, "The wavelet transform for analysis, synthesis and processing of speech and music sounds," Computer Music Journal, vol. 12, Winter 1988, pp. 11-20
[8] J.P. Antoine, Two-dimensional wavelets and image processing, Institut
de Physique Théorique, Université Catholique de Louvain.
[9] J.R. Beltrán, F. Beltrán, Additive Synthesis-Based on the continuous
wavelet transform: a Sinusoidal plus Transient Model, Proc. of the
6th Int. Conference on Digital Audio Effects (DAFx-03), London, UK,
September 8-11, 2003
[10] A.M. Reza, Spire Lab, UWM, From Fourier Transform to Wavelet
Transform,White paper, 1999.
[11] J.L. Flanagan, R.M Golden, Phase Vocoder, Bell Syst. Tech. J., vol. 45,
pp. 1493-1509, 1966
[12] F. Hammer, Time-scale Modification using the Phase Vocoder: An
approach based on deterministic/stochastic component separation in
frequency domain, Diploma Thesis, Institute for Electronic Music and
Acoustics (IEM), Graz University of Music and Dramatic Arts, Graz,
Austria.
[13] P. Bastien, Pitch shifting and voice transformation techniques, TC-
Helicon.
[14] Wikipedia http://www.wikipedia.com