Cepstrum Pitch Determination

Received 24 August 1966
9.3, 9.8, 9.9

A. MICHAEL NOLL
Bell Telephone Laboratories, Murray Hill, New Jersey 07971
The cepstrum, defined as the power spectrum of the logarithm of the power spectrum, has a strong peak corresponding to the pitch period of the voiced-speech segment being analyzed. Cepstra were calculated on a digital computer and were automatically plotted on microfilm. Algorithms were developed heuristically for picking those peaks corresponding to voiced-speech segments and the vocal pitch periods. This information was then used to derive the excitation for a computer-simulated channel vocoder. The pitch quality of the vocoded speech was judged by experienced listeners in informal comparison tests to be indistinguishable from the original speech.
INTRODUCTION
Voiced-speech sounds result from the resonant action of the vocal tract on the periodic puffs of air admitted through the vocal cords. For pitch-period determination, the time periodicity of the source signal must be obtained from the observed speech signal. Also, voiced-unvoiced decisions require accurate determination of the presence or absence of such periodic puffs in the source signal. This deceptively simple problem has been the object of considerable research over the past few decades. Aside from its obvious use in analysis of speech sounds from a pure research standpoint, an accurate pitch detector must also perform adequately as an integral part of most speech-bandwidth compression schemes. The design of an accurate pitch detector that works satisfactorily with band-limited, noisy speech signals remains one of the challenging areas of speech processing research.

In a previous paper, a new method for obtaining the fundamental frequency or pitch of human speech was described.¹ Since the logarithm of the amplitude spectrum of a periodic time signal with rich harmonic structure is itself "periodic" in frequency, the new method consisted of spectrum analyzing this log amplitude spectrum. Adopting some new terminology proposed by Tukey, the method was called "cepstrum" pitch detection, where the term cepstrum refers to the spectrum of the log-amplitude spectrum. Computer programs were written to perform short-time cepstrum analyses of speech, and the resultant pitch information was used to obtain the excitation for computer-simulated vocoders. The synthesized speech was quite encouraging as demonstrated by tapes played at the sixty-seventh meeting of the Acoustical Society.¹,²

The early computer programs written to simulate the cepstrum analyzer have since undergone a number of changes towards simplicity and efficiency. The results of analyses of speech cepstra were used to design an automatic method for determining the pitch periods from the cepstral peaks. This automatic peak picker, though not previously described, was used to obtain the excitation signals for the computer-simulated vocoders. Some interesting and unexpected pitch fluctuations and pitch doubling have been discovered during the observations of speech cepstra required to develop the algorithms for the cepstral peak picker. These topics and new approaches to explaining and justifying cepstrum pitch determination were not reported in the previous papers; and now is also a good time to present the historical background leading to the concept of short-time cepstrum analysis for vocal-pitch detection. This paper treats all these topics and concludes with descriptions of some possible hardware implementations of cepstrum analyzers.

¹ A. M. Noll, "Short-Time Spectrum and 'Cepstrum' Techniques for Vocal-Pitch Detection," J. Acoust. Soc. Am. 36, 296-302 (1964).
² A. M. Noll and M. R. Schroeder, "Short-Time 'Cepstrum' Pitch Detection," J. Acoust. Soc. Am. 36, 1030 (1964).

The Journal of the Acoustical Society of America 293

I. HISTORICAL BACKGROUND

In the fall of 1959, Bogert (of Bell Telephone Laboratories) noticed banding in spectrograms of seismic signals. He realized that this banding was caused by "periodic" ripples in the spectra and that this was characteristic of the spectra of any signal consisting of itself plus an echo. The frequency spacing of these ripples equals the reciprocal of the difference in time of arrivals of the two waves. Tukey (of both Princeton University and Bell Telephone Laboratories) suggested that this frequency difference might be obtained by first taking the logarithm of the spectrum, thereby making the ripples nearly cosinusoidal. A spectrum analysis of the log spectrum then could be performed to determine the "frequency" of the ripple. In early 1960, Bogert programmed Tukey's suggestion on a computer and proceeded to analyze numerous earthquakes and explosions. Tukey, noticing similarities between time series analysis and log-spectrum series analysis, introduced a new set of paraphrased terms. The spectrum of the log spectrum was called the "cepstrum," and the frequency of the spectral ripples was referred to as "quefrency." Bogert, Tukey, and Healy published their ideas in an article with perhaps one of the weirdest titles ever encountered in the scientific literature: "The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking."³ In the article, they very clearly expressed a pessimistic view for achieving adequate classification of seismic events by cepstral techniques. In fact, no definitive indication of focal depth was found.
Their article was issued as an internal Bell Laboratories memorandum before publication in Rosenblatt's book. Schroeder read the memorandum and realized that voiced speech spectra also have ripples, and hence cepstrum analysis might be suitable for vocal-pitch determination. In June 1962, Schroeder suggested cepstrum-pitch determination as an area worthy of further study. At that time, he and Atal had just completed a paper on methods for performing short-time spectrum analyses.⁴ Thus, the atmosphere was perfect for the concept of short-time cepstrum analysis that then developed.

Seismic signals consist of a single event, and therefore only one cepstrum is obtained. Speech, however, changes with time, and a single cepstrum of a long speech signal would be meaningless. Hence, a series of cepstra for short segments of the speech signal is required--a short-time cepstrum. A scheme was devised for performing such short-time cepstral analyses utilizing delay lines and multipliers as shown in Fig. 1. A computer program was written with a special-purpose block-diagram language to simulate this method, and the short-time spectra and cepstra were automatically plotted by the computer on microfilm.⁵
The cepstra for voiced speech intervals had strong peaks corresponding to the pitch period. The conclusion was quite definite: Although a single cepstrum analysis of seismic events was not promising for seismic classification, short-time cepstrum analysis of speech performed excellently as a new means for vocal-pitch determination.

In recent papers, the heuristics of cepstrum analysis for extracting echoed signals from noise have been developed by Bogert and Ossanna.⁶ Also, a more general formalism of separating convolved signals and its relation with cepstrum analysis has been treated by Oppenheim.⁷
FIG. 1. Block diagram of sampled-data device for performing short-time cepstrum analysis.
³ B. P. Bogert, M. J. R. Healy, and J. W. Tukey, in Proceedings of the Symposium on Time Series Analysis, M. Rosenblatt, Ed. (John Wiley & Sons, Inc., New York, 1963), Chap. 15, pp. 209-243.
⁴ M. R. Schroeder and B. S. Atal, "Generalized Short-Time Power Spectra and Autocorrelation Functions," J. Acoust. Soc. Am. 34, 1679-1683 (1962).
⁵ J. L. Kelly, Jr., Carol Lochbaum, and V. A. Vyssotsky, "A Block Diagram Compiler," Bell System Tech. J. 40, 669-677 (1961).
⁶ B. P. Bogert and J. F. Ossanna, "The Heuristics of a Stationary Complex Echoed Gaussian Signal in Stationary Gaussian Noise," IEEE Trans. Information Theory IT-12, No. 3, 343 (1966).
⁷ A. V. Oppenheim, "Nonlinear Filtering of Convolved Signals," Mass. Inst. Technol. Res. Lab. Electron. Quart. Progr. Rept. No. 80, 168-175 (January 1966).
294 Volume 41 Number 2 1967
II. CEPSTRUM-PITCH DETERMINATION
In its most basic form, the system for producing voiced speech sounds consists only of the vocal source and the vocal tract as shown in Fig. 2. The source signal $s(t)$ is the periodic puffs of air admitted through the vocal cords. The effect of the vocal tract is completely specified by its impulse response $h(t)$ such that the output speech signal $f(t)$ equals the convolution of $s(t)$ and $h(t)$. Alternatively, if $S(\omega)$ is the spectrum of the vocal source and $H(\omega)$ is the transfer function or spectrum of the vocal tract, then the spectrum of the speech signal equals the product of $S(\omega)$ and $H(\omega)$. Expressed algebraically,

$$f(t) = s(t) * h(t), \tag{1}$$

$$F(\omega) = S(\omega)\,H(\omega), \tag{2}$$

with

$$F(\omega) = \mathcal{F}[f(t)], \tag{3}$$

$$S(\omega) = \mathcal{F}[s(t)], \tag{4}$$

$$H(\omega) = \mathcal{F}[h(t)], \tag{5}$$

where $*$ denotes convolution, $\mathcal{F}$ denotes Fourier transformation, and the Fourier transforms of $s(t)$ and $h(t)$ are assumed to exist.

FIG. 2. Basic system for the production of voiced speech sounds. $h(t)$ is the impulse response of the vocal tract.

The source signal and, therefore, the speech signal, are quasiperiodic for voiced-speech sounds. If the period is $T$ seconds, then the power spectrum $|F(\omega)|^2$ of the speech signal consists of harmonics spaced $T^{-1}$ Hz. Thus, the power spectrum of a voiced speech signal is "periodic" along the frequency axis with "period" equal to the reciprocal of the period of the time signal being analyzed. The obvious way to measure this "period" in the power spectrum is to take the Fourier transform of the spectrum, which will have a peak corresponding to the "period." This spectrum of the power spectrum is more commonly known as the autocorrelation function of the original time signal. Mathematically, the autocorrelation function $r(\tau)$ is defined as

$$r(\tau) = \mathcal{F}\left[|F(\omega)|^2\right]. \tag{6}$$

The speech power spectrum equals the product of the spectra of the vocal source and the vocal tract. But the Fourier transform of a product equals the convolution of the Fourier transforms of the two multiplicands. Thus,

$$r(\tau) = \mathcal{F}\left[|S(\omega)|^2 \cdot |H(\omega)|^2\right] \tag{7}$$

$$\phantom{r(\tau)} = \mathcal{F}\left[|S(\omega)|^2\right] * \mathcal{F}\left[|H(\omega)|^2\right] \tag{8}$$

$$\phantom{r(\tau)} = r_s(\tau) * r_h(\tau), \tag{9}$$

where $r_s(\tau)$ and $r_h(\tau)$ are the autocorrelation functions of $s(t)$ and $h(t)$, respectively. The effects of the vocal source and vocal tract are therefore convolved with each other in the autocorrelation functions. This results in broad peaks and in some cases multiple peaks in the autocorrelation function; thus, an autocorrelation approach to pitch determination is, in general, unsatisfactory.⁸

⁸ M. R. Schroeder, "Vocoders: Analysis and Synthesis of Speech," Proc. IEEE 54, No. 5, 720-734 (1966).

The solution is to devise a new function in which the effects of the vocal source and vocal tract are nearly independent or easily identifiable and separable. The Fourier transform of the logarithm of the power spectrum is such a new function and, indeed, separates the effects of the vocal source and tract. The reason for this is that the logarithm of a product equals the sum of the logarithms of the multiplicands:

$$\log|F(\omega)|^2 = \log\left[|S(\omega)|^2 \cdot |H(\omega)|^2\right] \tag{10}$$

$$\phantom{\log|F(\omega)|^2} = \log|S(\omega)|^2 + \log|H(\omega)|^2. \tag{11}$$

The Fourier transform of the logarithm power spectrum preserves the additive property and is

$$\mathcal{F}\left[\log|F(\omega)|^2\right] = \mathcal{F}\left[\log|S(\omega)|^2\right] + \mathcal{F}\left[\log|H(\omega)|^2\right]. \tag{12}$$

The source and tract effects are now additive rather than convolved as in the autocorrelation. The importance of this can be intuitively explained with the assistance of Fig. 3. The effect of the vocal tract is to produce a "low-frequency" ripple in the logarithm spectrum, while the periodicity of the vocal source manifests itself as a "high-frequency" ripple in the logarithm spectrum. Therefore, the spectrum of the logarithm power spectrum has a sharp peak corresponding to the high-frequency source ripples in the logarithm spectrum and a broader peak corresponding to the low-frequency formant structure in the logarithm spectrum. The peak corresponding to the source periodicity can be made more pronounced by squaring the second spectrum. This function, the square of the Fourier transform of the logarithm power spectrum, is called the "cepstrum," borrowing Tukey's terminology.

FIG. 3. Logarithm power spectrum (top) of a voiced speech segment showing a spectral periodicity resulting from the pitch periodicity of the speech. The power spectrum of the logarithm spectrum, or cepstrum (bottom), therefore has a sharp peak corresponding to this spectral periodicity. [Axes: frequency (Hz), top; quefrency (seconds), bottom.]

To prevent confusion between the usual frequency components of a time function and the "frequency" ripples in the logarithm spectrum, Tukey has used the paraphrased word quefrency in describing the "frequency" of the spectral ripples. Quefrencies have the units of cycles per hertz or, simply, seconds. Adopting this terminology, the cepstrum consists of a peak occurring at a high quefrency equal to the pitch period in seconds and low-quefrency information corresponding to the formant structure in the logarithm spectrum.

Thus far, no mention has been made about the time length of the signal under analysis. As mentioned before, for seismic signals, a single cepstrum analysis is performed for the whole seismic event. But speech parameters--and, in particular, pitch--change with time; therefore a series of cepstra for short time segments of the signal is required. This is accomplished by multiplying the time signal by a function that is zero outside some finite time interval. The function performs something like a window through which the time signal is viewed, and its effects are discussed later in more detail. As shown in Fig. 4, the time-limited signal is spectrum analyzed once to obtain the log spectrum and then again to produce the cepstrum. A new portion of the time signal then enters the window and is similarly analyzed to produce another cepstrum. This process, when performed repetitively, results in a series of short-time cepstra. The time window, if desired, could also look at overlapping portions of the signal.

FIG. 4. Basic operations required for obtaining the short-time cepstrum of a speech signal. The hamming time window of length $T_w$ sec moves in jumps of $T_j$ sec. [Panels: speech signal with hamming time window; kth short-time spectrum; kth log power spectrum; kth short-time cepstrum.]

The resultant cepstra are automatically examined to determine the maximum peaks corresponding to voiced speech intervals and the quefrency of these peaks. This information is used to decide if the speech segment is voiced or unvoiced and, if voiced, to determine the pitch period.
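The separation of source and tract described in this section can be tried on a synthetic signal. The following is only a minimal sketch in Python with NumPy, not the program used in the paper: the pulse-train source, the single damped resonance standing in for the vocal tract, the FFT length, and the small floor added before the logarithm are all assumptions chosen for illustration.

```python
import numpy as np

def cepstrum(frame, n_fft=4096, floor=1e-12):
    """Square of the transform of the log power spectrum of one windowed frame."""
    spectrum = np.fft.rfft(frame, n_fft)
    # The floor keeps the logarithm finite where the power spectrum is zero.
    log_power = np.log(np.abs(spectrum) ** 2 + floor)
    # The log power spectrum is real and even, so the inverse real FFT
    # acts as a cosine transform; squaring sharpens the pitch peak.
    return np.fft.irfft(log_power, n_fft) ** 2

fs = 10_000                       # samples per second (10^-4 sec spacing)
pitch_period = 0.008              # assumed pitch period of 8 msec
n = int(0.040 * fs)               # one 40-msec analysis frame

# Toy source s(t): unit pulses every pitch period.
s = np.zeros(n)
s[::int(round(pitch_period * fs))] = 1.0

# Toy tract h(t): one damped resonance near 700 Hz (illustrative values).
t = np.arange(200) / fs
h = np.exp(-600.0 * t) * np.sin(2 * np.pi * 700.0 * t)

speech = np.convolve(s, h)[:n]    # f(t) = s(t) * h(t)
frame = speech * np.hamming(n)    # time-limit with a hamming window

c = cepstrum(frame)
quefrency = np.arange(len(c)) / fs
search = (quefrency >= 0.001) & (quefrency <= 0.015)
peak_q = quefrency[search][np.argmax(c[search])]
print(f"largest cepstral peak between 1 and 15 msec: {peak_q * 1e3:.2f} msec")
```

In this toy case the largest peak in the 1-15-msec range should fall close to the assumed pitch period, while the contribution of the resonance stays at low quefrencies.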
Both the effects of the time window and a mathematical justification for the spectral ripples were neglected in the preceding discussion and are now taken up. The time-limited signal to be analyzed is

$$g(t) = [s(t) * h(t)] \cdot w(t) \tag{13}$$

from Eq. 1, where $w(t)$ is the time window, defined to be zero for $|t| > T_w$. But, the periodic source signal $s(t)$ can be represented as the superposition of an infinite series of identical signals $s_0(t)$ repeated every $T$ seconds:

$$s(t) = \sum_{n=-\infty}^{\infty} s_0(t - nT) \tag{14}$$

$$\phantom{s(t)} = s_0(t) * \sum_{n=-\infty}^{\infty} \delta(t - nT). \tag{15}$$

Substitution into Eq. 13 gives

$$g(t) = \left\{\left[s_0(t) * \sum_{n=-\infty}^{\infty} \delta(t - nT)\right] * h(t)\right\} \cdot w(t). \tag{16}$$

The Fourier transform or complex spectrum $G(\omega)$ of $g(t)$ is

$$G(\omega) = \left[S_0(\omega) \sum_{n=-\infty}^{\infty} \delta\!\left(\omega - n\frac{2\pi}{T}\right) H(\omega)\right] * W(\omega) \tag{17}$$

$$\phantom{G(\omega)} = \left[\sum_{n=-\infty}^{\infty} H(\omega)\,S_0(\omega)\,\delta\!\left(\omega - n\frac{2\pi}{T}\right)\right] * W(\omega) \tag{18}$$

$$\phantom{G(\omega)} = \sum_{n=-\infty}^{\infty} H\!\left(n\frac{2\pi}{T}\right) S_0\!\left(n\frac{2\pi}{T}\right) W\!\left(\omega - n\frac{2\pi}{T}\right), \tag{19}$$

where $S_0(\omega)$, $H(\omega)$, and $W(\omega)$ are the Fourier transforms of $s_0(t)$, $h(t)$, and $w(t)$, respectively.

The results of the preceding show that if the original speech signal $s(t) * h(t)$ is not time-limited, then the complex spectrum consists of an infinite series of impulses spaced $T^{-1}$ Hz and with amplitude $H(n2\pi/T)\,S_0(n2\pi/T)$. If the non-time-limited signal is band-limited, the complex spectrum would be frequency limited or zero for $|\omega| > \omega_{\max}$. The effect of time limiting the speech signal with a multiplicative time window $w(t)$ is a convolution of the corresponding spectral window $W(\omega)$ with the spectral impulses of the non-time-limited complex spectrum. Thus, the impulses are broadened and assume the shape of $W(\omega)$. The complex spectrum is now no longer frequency-limited, since $W(\omega)$ is the transform of a time-limited function and, therefore, cannot be zero over any finite frequency interval. Hence, the complex spectrum is not strictly frequency-limited, but can be described as being approximately frequency-limited if $W(\omega)$ has very small side lobes. Also, the main lobe of $W(\omega)$ determines the spectral resolution, and therefore a $W(\omega)$ with low-amplitude side lobes and a narrow main lobe is required. Although these requirements are mutually exclusive, a good compromise is the hamming time window,

$$w(t) = \begin{cases} 0.54 + 0.46\cos(\pi t/T_w), & |t| \le T_w, \\ 0, & |t| > T_w. \end{cases} \tag{20}$$

The hamming spectral window has a maximum side lobe 44 dB below its peak response.⁹

⁹ R. B. Blackman and J. W. Tukey, The Measurement of Power Spectra (Dover Publications, Inc., New York, 1959).

III. NUMERICAL COMPUTATION OF CEPSTRA

The Fourier transform $F(\omega)$ of some function of time $f(t)$ is defined as

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-j\omega t}\,dt. \tag{21}$$

If $f(t)$ is time limited by some multiplicative time window $w(t)$ such that $w(t) = 0$ for $|t| > T_w$ and if complex exponentiation is separated into real and imaginary parts, Eq. 21 becomes

$$F(\omega) = \int_{-T_w}^{T_w} w(t) f(t) \cos(\omega t)\,dt - j\int_{-T_w}^{T_w} w(t) f(t) \sin(\omega t)\,dt. \tag{22}$$

Furthermore, since $F(\omega)$ has a time-limited transform, namely, $w(-t)f(-t)$, then by Nyquist's sampling theorem applied to the frequency domain, $\omega$ can be represented as $\omega = m\Delta\omega$, where $\Delta\omega \le 2\pi/(2T_w)$. Also, since $f(t)$ is band-limited to 0 to $\omega_c/(2\pi)$ Hz, $t$ can be represented as $t = l\Delta t$, where $\Delta t = 2\pi/(2\omega_c)$. Thus, the integrations in Eq. 22 can be replaced by summations,
FIG. 5. Short-time logarithm spectra (left) and short-time cepstra (right) for a male talker (L.G.) recorded with a condenser microphone. The 40-msec-long hamming time window moved in jumps of 10 msec. [Axes: frequency (kHz) and quefrency (msec); vertical scale marker 40 decibels.]

FIG. 6. Short-time logarithm spectra (left) and short-time cepstra (right) for a male talker (F.L.C.) recorded from a 500-type telephone set with carbon microphone. [Axes: frequency (kHz) and quefrency (msec); vertical scale marker 40 decibels.]
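so that $F(\omega)$ becomes

$$F(m\Delta\omega) = \Delta t \sum_{l=-L}^{L} w(l\Delta t)\,f(l\Delta t)\cos(lm\Delta t\Delta\omega) - j\Delta t \sum_{l=-L}^{L} w(l\Delta t)\,f(l\Delta t)\sin(lm\Delta t\Delta\omega), \tag{23}$$

where $L = T_w/\Delta t$.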
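Equation 23 can be evaluated directly as written. The sketch below is an illustrative Python version that uses the hamming window of Eq. 20; the function names, the 25-Hz frequency spacing, and the 500-Hz test tone are assumptions made for the example and do not come from the paper.

```python
import math

def hamming(t, tw):
    """Eq. 20: 0.54 + 0.46 cos(pi t / Tw) for |t| <= Tw, zero elsewhere."""
    return 0.54 + 0.46 * math.cos(math.pi * t / tw) if abs(t) <= tw else 0.0

def short_time_spectrum(f, dt, tw, d_omega, m_max):
    """Eq. 23 by direct summation.

    f        -- callable giving the signal sample for index l (-L..L)
    dt       -- sampling interval in seconds
    tw       -- half-length of the time window in seconds
    d_omega  -- frequency spacing in rad/sec
    m_max    -- highest frequency index to compute
    """
    big_l = round(tw / dt)
    out = []
    for m in range(m_max + 1):
        re = im = 0.0
        for l in range(-big_l, big_l + 1):
            x = hamming(l * dt, tw) * f(l)
            arg = l * m * dt * d_omega
            re += x * math.cos(arg)   # real part: cosine sum
            im -= x * math.sin(arg)   # imaginary part: minus the sine sum
        out.append(complex(dt * re, dt * im))
    return out

dt, tw = 1e-4, 0.015                   # 10^-4 sec sampling, -15 to +15 msec window
d_omega = 2 * math.pi * 25.0           # 25-Hz spacing, within the 2*pi/(2*Tw) limit
tone = lambda l: math.cos(2 * math.pi * 500.0 * l * dt)   # hypothetical 500-Hz test tone

F = short_time_spectrum(tone, dt, tw, d_omega, m_max=40)
peak_bin = max(range(len(F)), key=lambda m: abs(F[m]))
print("peak at", peak_bin * 25.0, "Hz")
```

The double loop mirrors the 2L+1 multiplications per frequency point implied by the equation, which is why the direct form is slow compared with a fast Fourier transform.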
FIG. 7. Short-time logarithm spectra (left) and short-time cepstra (right) for a female talker (S.S.) recorded with a condenser microphone. [Axes: frequency (kHz) and quefrency (msec); vertical scale marker 40 decibels.]

FIG. 8. Short-time logarithm spectra (left) and short-time cepstra (right) of "(scr)eaming," spoken by a female talker (S.S.) and recorded with a condenser microphone. A doubling in pitch period occurs at the end of the utterance.

Equation 23 led to the concept of a delay line for storing $2L+1$ samples of the input signal (sample and hold circuits at the taps of the delay line) so that the signal being analyzed remains constant during the analysis (window multipliers, function generators for cosine and sine, and adders as shown in Fig. 1). The real and imaginary parts of the spectrum produced by this sampled-data spectrum analyzer are squared and added to generate the power spectrum. The logarithm of the power spectrum is used as the input to a similar power-spectrum analyzer whose output is the cepstrum.

This sampled-data analyzer was simulated on an IBM-7094 digital computer by using the BLODI compiler. The input speech to the computer was band-limited to 4 kHz, sampled every $10^{-4}$ sec, and digitized; the time window extended from -15 to +15 msec. The results reported in the previous paper were obtained with this computer simulation, which consumed nearly 2 h of computer time to analyze only 2 sec of speech.
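As a quick check on the simulation parameters quoted above, the arithmetic below works out the size of the delay line implied by Eq. 23, the sampling interval allowed by the 4-kHz band limit, and the ratio of computer time to speech time. It uses only figures given in the text; treating "nearly 2 h" as exactly 2 h is an assumption made for illustration.

```latex
\begin{align*}
L &= \frac{T_w}{\Delta t} = \frac{15\times10^{-3}\ \text{sec}}{10^{-4}\ \text{sec}} = 150,
  & 2L+1 &= 301\ \text{samples},\\
\Delta t_{\max} &= \frac{2\pi}{2\omega_c} = \frac{1}{2\,(4000\ \text{Hz})}
  = 1.25\times10^{-4}\ \text{sec} \;\ge\; 10^{-4}\ \text{sec},\\
\frac{\text{computer time}}{\text{speech time}} &\approx \frac{2\times3600\ \text{sec}}{2\ \text{sec}} = 3600 .
\end{align*}
```

On these figures the chosen sampling interval satisfies the band-limit condition with some margin, and the simulation ran several thousand times slower than real time.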
The program was extremely unwieldy and changes in any parameters were difficult. Obviously, some streamlining of the program was required if further progress in cepstrum-pitch detection were to be accomplished.

Only a single spectrum in a series of spectra is defined by Eq. 22. If the time window moves in jumps of $T_j$ sec, then the kth short-time spectrum $F_k(m)$ is defined as

$$F_k(m) = \sum_{l=-L}^{L} w(l)\,f[(k-1)K + l]\cos(lm\Delta t\Delta\omega) - j\sum_{l=-L}^{L} w(l)\,f[(k-1)K + l]\sin(lm\Delta t\Delta\omega), \tag{24}$$

where $L = T_w/\Delta t$, $K = T_j/\Delta t$, and $m = 0, 1, \ldots, \omega_c/\Delta\omega$. The kth short-time power spectrum is the magnitude squared of the kth short-time spectrum:

$$|F_k(m)|^2 = \left\{\sum_{l=-L}^{L} w(l)\,f[(k-1)K + l]\cos(lm\Delta t\Delta\omega)\right\}^2 + \left\{\sum_{l=-L}^{L} w(l)\,f[(k-1)K + l]\sin(lm\Delta t\Delta\omega)\right\}^2. \tag{25}$$

Although the complex spectrum may be sampled at $\Delta\omega \le 2\pi/(2T_w)$, the power spectrum should be sampled at $\Delta\omega \le 2\pi/(4T_w)$. This is because the Fourier transform of the power spectrum is the autocorrelation function, which for a signal time limited to $\pm T_w$ sec is itself time-limited to $\pm 2T_w$ sec. By Nyquist's sampling theorem, the power spectrum therefore must be sampled at $\Delta\omega \le 2\pi/(4T_w)$. Strictly speaking, if the Fourier transform of the power spectrum is time-limited, then the Fourier transform of the logarithm power spectrum is generally not time-limited. But since experience shows that the aliasing is negligible, the log power spectrum is sampled at the same interval as the power spectrum. Since the computer is used in taking logarithms, the logarithm of zero is forced to be noninfinite.

The cepstrum $C(\tau)$ is now formally defined as the power spectrum of the logarithm power spectrum. Since the log power spectrum is an even function, this definition is equivalent to the square of the cosine transform of the log power spectrum, or

$$C(\tau) = \left[\int_{0}^{\infty} \log|F(\omega)|^2 \cos(\omega\tau)\,d\omega\right]^2. \tag{26}$$

For $C(\tau)$ to be sampled, the Fourier transform of $C(\tau)$ must be band-limited. However, $C(\tau)$ is the product of two cosine transforms, and therefore the Fourier transform of $C(\tau)$ is the convolution of the Fourier transforms of the individual cosine transforms. But, since the cosine transform of $\log|F(\omega)|^2$ is also an even function, the Fourier transform of the cosine transform of $\log|F(\omega)|^2$ simply gives $\log|F(\omega)|^2$. Thus, the Fourier transform of $C(\tau)$ equals the convolution of $\log|F(\omega)|^2$ with itself. Since $\log|F(\omega)|^2$ is very small for $|\omega| > \omega_c$, the convolution is very nearly limited to the interval $|\omega| \le 2\omega_c$. Nyquist's theorem can therefore be applied, and the cepstrum can be sampled so that $\tau = n\Delta\tau$ with $\Delta\tau \le 2\pi/(4\omega_c)$. Thus, the kth short-time cepstrum $C_k(n)$ can be calculated as

$$C_k(n) = \left[\sum_{m=0}^{M} \log|F_k(m)|^2 \cos(mn\Delta\tau\Delta\omega)\right]^2, \tag{27}$$

where $\Delta\omega \le 2\pi/(4T_w)$, $M = \omega_c/\Delta\omega$, and $n = 0, 1, \ldots, N$ with $N$ some arbitrary upper limit on the desired quefrencies in the cepstrum.

The numerical operations indicated by Eqs. 25 and 27 were programmed in the FORTRAN language. To conserve execution time, all sine and cosine operations were performed as table lookups from calculated sine and cosine tables. Also, the computation of the sine and cosine transforms utilized even and odd symmetry in the input signal to reduce further the number of calculations. Nevertheless, the program was still very lengthy and required about 0.8 h to compute the cepstra for about 2 sec of speech. Recently, an algorithm has been developed by Cooley and Tukey for performing fast numerical Fourier transformations.¹⁰ This algorithm has been incorporated into the cepstrum program and has resulted in a program about eight times faster than the previous one.

A very important factor in the computer calculation of short-time cepstra has been facilities for the automatic plotting of the spectra and cepstra. These facilities consist of a cathode-ray tube and camera, both under the direct control of the digital computer.

¹⁰ J. W. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. of Computation 19, 297-301 (1965).

IV. EXAMPLES OF SPEECH SPECTRA AND CEPSTRA

The computer technique described in the preceding portions of this paper was used to analyze a few selected sentences and words. The speech was low-pass filtered to 4 kHz and sampled every $10^{-4}$ sec. The hamming time window was 40 msec long and moved in jumps of 10 msec. The spectral components were calculated at frequency intervals of 12.5 Hz up to a maximum frequency of 4 kHz; the cepstral components were calculated at intervals of 0.0625 msec up to a maximum quefrency of 15 msec. The results of the calculations were automatically plotted on microfilm by the computer with corresponding spectra and cepstra shown adjacent to each other. Time progresses downwards in jumps of 10 msec.

Figure 5 shows the spectra and cepstra of a male talker (L.G.) recorded with a condenser microphone; Fig. 6 is for a different male talker (F.L.C.) recorded
FIG. 9. Speech waveform of the "ing" portion of "(scr)eaming," showing the approximate location of the switch to double pitch period indicated by the cepstra. [Time axis marked at 0, 0.1, 0.2, and 0.3 sec; 10-msec scale bar; marker labeled "approximate beginning of double pitch period indication in cepstra."]
from a 500-type telephone set with carbon transmitter; Fig. 7 is female speech (S.S.) recorded from a condenser microphone. In all three examples, the voiced-speech intervals are clearly indicated by the sharp peaks in the cepstra. The cepstral peaks in Fig. 5 for the voiced-speech intervals of Curves 11-15 are particularly interesting since they consist of a major peak with two smaller peaks on either side. This occurred because the pitch was changing rapidly such that each 40-msec analysis interval contained different pitch periods. Actually, the 40-msec hamming window looks mostly at only the center 20 msec since the tails of the window are strongly weighted down in amplitude. Thus, very little smoothing is actually present, and the largest cepstral peak corresponds to the dominant pitch period mostly within the 20-msec center interval.

Figure 8 shows the spectra and cepstra of the utterance (scr)eaming spoken by a female talker (S.S.) into a condenser microphone. At about the 12th cepstrum, a second "rahmonic" appears and gradually grows in amplitude until, at about the 17th cepstrum, its amplitude exceeds the fundamental peak at about 5.2 msec. The fundamental peak then disappears, leaving only the cepstral peak at 10.4 msec. This would imply a doubling of pitch period at the end of the "... ing" sound, and, indeed, speech synthesized with the doubled excitation sounds natural and compares better with the original than excitation that does not double in period at the "... ing" portion. The spectra corresponding to this transition show the alternate harmonics gradually growing in amplitude until they fill in the gaps between the harmonics corresponding to the lower pitch period. The actual speech waveform is shown in Fig. 9, and the point of transition is indicated. Although the doubling is discernible towards the end of the signal, the cepstrum gives an indication of doubling earlier than would be determined by visual inspection of the waveform.

The spectra and cepstra of the word chase spoken by a female talker (B.M.) and recorded with a condenser microphone are shown in Fig. 10. The 12th, 13th, and 14th cepstra have small second rahmonics at about 8.8 msec that are smaller in amplitude than the fundamental cepstral peak at about 4.4 msec. However, the 19th through 21st cepstra have second rahmonics with amplitudes exceeding the fundamental. This type of doubling of pitch period imbedded in voiced speech sounds wrong when used as excitation for a vocoder and is therefore considered undesirable. The spectra for the double pitch consist of harmonics corresponding to the 4.4-msec pitch period with interlaced harmonics that fade in and out across the spectrum. This type of spectrum is caused by minute jitter in the pitch-pulse timing.¹¹ If the vocal source signal $s(t)$ is assumed to consist of air puffs at $0,\ T+\epsilon,\ 2T,\ 3T+\epsilon,\ \ldots$, then

$$s(t) = \sum_{n=-\infty}^{\infty} \left[s_0(t - 2nT) + s_0(t - 2nT - T - \epsilon)\right] = s_0(t) * \sum_{n=-\infty}^{\infty} \left[\delta(t - 2nT) + \delta(t - 2nT - T - \epsilon)\right]. \tag{28}$$

The Fourier transform of the summation portion corresponding to the jittered pulses is

$$\left[1 + e^{-j\omega(T+\epsilon)}\right] \sum_{n=-\infty}^{\infty} \delta\!\left(\omega - n\frac{2\pi}{2T}\right). \tag{29}$$

But,

$$\left|1 + e^{-j\omega(T+\epsilon)}\right|^2 = 2\left[1 + \cos\omega(T+\epsilon)\right], \tag{30}$$

so that the power spectrum consists of impulses every $1/2T$ Hz with an amplitude fluctuation of $[1 + \cos\omega(T+\epsilon)]$. If there is no jitter, then $\epsilon = 0$; and, since $[1 + \cos\omega T] = 0$ for $\omega = (\pi/T)n$ (where $n = 1, 3, 5, \ldots$), the odd harmonics disappear, thereby leaving impulses every $1/T$ Hz. However, if $\epsilon$ is not zero, the spectrum starts with impulses spaced $1/T$ Hz, but gradually impulses appear at $1/2T$-Hz intervals and then periodically fade in and out across the spectrum. The jitter can be calculated from the frequency in the spectrum at which the amplitudes of the impulses are first equal, since at this frequency the Nth cosine wave with period $1/(T+\epsilon)$ Hz has a maximum situated exactly between two adjacent impulses. For the spoken word chase, this occurred at 3 kHz corresponding to an $\epsilon \approx 0.08$ msec, which is smaller than the accuracy of one previous measurement of pitch perturbations.¹²

The spectra and cepstra shown in Figs. 11-13 are for a male speaker (R.C.L.) recorded with a condenser microphone. These speech utterances were chosen by O. Fujimura in his investigations at Bell Telephone Laboratories of speech sounds. The first set of spectra and cepstra show the explosion in the word obey (occurring at the sixth line of Fig. 11) as exemplified by a completely ripple-free spectrum. Figure 13 shows the spectra and cepstra for the voiced fricative portion of the word razor at the sixth through ninth lines.

FIG. 10. Short-time logarithm spectra (left) and short-time cepstra (right) of "chase," spoken by a female talker (B.M.) and recorded with a condenser microphone. The 19th through 21st cepstra have second rahmonics that exceed the fundamental and that would result in an undesired indication of pitch-period doubling.

FIG. 11. Short-time logarithm spectra (left) and short-time cepstra (right) of "(o)be(y)," spoken by a male talker (R.C.L.) and recorded with a condenser microphone. The explosion occurs at the sixth spectrum and cepstrum.

¹¹ B. Gold and J. Tierney, "Pitch-Induced Spectral Distortion in Channel Vocoders," J. Acoust. Soc. Am. 35, 730-731 (1963).
¹² P. Lieberman, "Perturbations in Vocal Pitch," J. Acoust. Soc. Am. 33, 597-603 (1961).

V. AUTOMATIC TRACKING OF CEPSTRAL PEAKS

The cepstral peaks corresponding to voiced speech intervals can easily be picked visually. However, these
FIG. 12. Short-time logarithm spectra (left) and short-time cepstra (right) of "(b)abbl(ed)," spoken by a male talker (R.C.L.) and recorded with a condenser microphone. The explosion occurs at the sixth spectrum and cepstrum.

FIG. 13. Short-time logarithm spectra (left) and short-time cepstra (right) of "(r)azor," spoken by a male talker (R.C.L.) and recorded with a condenser microphone. The voiced fricative occurs at the sixth through ninth spectra and cepstra.
peaks must be picked automatically if cepstrum techniques are to be used in a pitch detection scheme. This section of the paper describes the heuristic development of an algorithm for picking the cepstral peak that best describes the pitch of the speech for that time interval. The criterion of "best" was evaluated by using the pitch data as excitation of a computer-simulated vocoder and then comparing the vocoded speech with the original speech.

The examples of cepstra indicate that the cepstral peaks are clearly defined and are quite sharp. Hence, the peak-picking scheme is to determine the maximum value in the cepstrum exceeding some specified threshold. Since pitch periods of less than 1 msec are not usually encountered, the interval searched for the peak in the cepstrum is 1-15 msec.

Since the cepstral peaks decrease in amplitude with increasing quefrency, a linear multiplicative weighting was applied over the 1-15-msec range. The weighting was 1 at 1 msec and 5 at 15 msec. The Fourier transform of the power spectrum of the time window equals the convolution of the time window with itself,

$$\mathcal{F}\left[|W(\omega)|^2\right] = \mathcal{F}\left[W(\omega)\,W(-\omega)\right] = w(t) * w(-t). \tag{31}$$

Thus, the higher-quefrency components in the power spectrum decrease as the time window convolved with itself. Although the mathematics becomes unwieldy for an exact solution, it is reasonable to expect the higher-quefrency components in the logarithm of the power spectrum to decrease similarly, thereby explaining the need of weighting the higher quefrencies of the cepstrum. The linear weighting with range of 1-5 was chosen empirically by using periodic pulse trains with varying periods as input to the cepstrum program.

The cepstral peaks at the end of a voiced-speech segment usually decrease in amplitude and would fall below the peak threshold. The solution is to decrease the threshold by some factor (2) over a quefrency range of ±1 msec of the immediately preceding pitch period when tracking the pitch in a series of voiced-speech segments. The threshold reverts to its normal value over the whole cepstrum range after the end of the series of voiced segments.

There is also the possibility that an isolated cepstral peak might exceed the threshold, thereby resulting in a false indication of a voiced speech segment. In fact, some isolated flaps of the vocal cords have been observed as the cause of such an isolated cepstral peak. In any event, such peaks should not be considered as
A.
M.
NOLL
INVESTIGATE
PITCH
DOUBLING
YES
PICK MAXIMUM
PEAK IN INTERVAL OF + 0.5 MS OF
[QUEFRENC OF PEAK]
r
OFMAXIMUM PEAK
!
t
YES/
+-1.0MSOFQUEFRENCY
OFPEAKOFNTH /
ININTERVAL_WITHIN NO "
1/2 INITIAL1 INITIAL VALUE lTO THRESHOLD ITOTHRESHOLD SETVALUE[ CEPSTRUM I / IsET
[SECOND L P' iR_AHMON, T? E C G;ToIuMBALNE_o H

As TCH[A/5AS TCHPEAK/
PEAK X
i WEIGHTING
OF CEPSTRUM PICK MAXIMUM PEAK OF WEIGHTED CEPSTRUM
LINEAR
voicedspeech interval.
AND QUEFRENCY OF)
FIG. 14.Flowchartof the algorithm used to decide the Nth cepstrum if represents a
STORE AMPLITUDE
MAXIMUM PEAK
YES
NO
YES
L 'T INITIAL 1,2 v,ul
INVESTIGATE
JTC'" VALUE Ni'I',L

INVF q
ES':. YES C
STAFf OR
QUEFRENCY OF PITCH PEAK AT NTH CEPSTRUM SET EQUAL TO
AVERAGE PITCH QUEFRENCY OF N-lST AND N+lST CEPSTRA
!
YES CE
TRA, KIN6
TRACKINGPITCH / I PEAK
PITCH/
AT NTH/
VOICED x,
/UNVOICE
AT NTH
AT / NTH AT
304
Volume41
Number2
1967
CEPSTRUM
PITCH
DETERMINATION
PITCH PERIOD
voiced, and this is accomplished by disregarding any cepstral peaks exceeding the threshold if the immediately preceding cepstrum and immediately following cepstrum indicate unvoiced speech. This means that the immediately following cepstrum must be peak searched before a decision can be made about the present cepstrum. Hence, a delay of one cepstrum must be introduced to eliminate this requirement of knowledge about the future. Before deciding about the "present" cepstrum, however, knowledge about the preceding and following cepstrum is also required for the algorithm used to eliminate another problem, namely, pitch doubling.

An example of legitimate pitch doubling occurred at the end of the word screaming, as shown in Fig. 8. However, the second rahmonic of a cepstral peak sometimes exceeds the fundamental, and the second rahmonic should not be chosen as representing the pitch period. Thus, the peak-picking algorithm should eliminate false pitch doubling caused by a second rahmonic but should also allow legitimate pitch doubling. For legitimate doubling, there is no cepstral peak at a one-half quefrency, but for erroneous doubling, there is such a peak at one-half quefrency since this is the fundamental. The algorithm capitalizes upon this observation by looking for a cepstral peak exceeding the threshold in an interval of ±0.5 msec of one-half the quefrency of the double-pitch peak. If such a peak is found, then it is assumed that it represents the fundamental, and the double-pitch indication is wrong. The threshold is reduced by a factor of 2 if the maximum peak in the ±0.5-msec interval falls within ±1.0 msec of the immediately preceding pitch period. Pitch doubling has occurred whenever the cepstral peak exceeding the threshold is at a quefrency of ≥1.6 times the immediately preceding pitch period.

A flow chart of the peak-picking algorithm is shown in Fig. 14. The algorithm determines whether the cepstral peak of the Nth cepstrum represents a voiced speech segment. Information about the (N-1)th cepstrum is stored, and the (N+1)th cepstrum is peak picked before deciding about the Nth cepstrum. The (N+1)th cepstrum is read in, linear weighting is applied, and the maximum peak is picked. If the preceding two cepstra represented voiced-speech segments, then pitch tracking is in effect, and the threshold is reduced to one-half its initial value if the quefrency of the peak is within ±1.0 msec of the quefrency of the pitch peak of the Nth cepstrum. The previously determined peak in the (N+1)th cepstrum is now compared with the threshold. Pitch doubling is investigated whether the peak exceeds or does not exceed the threshold. Both cases are checked since the peak might represent pitch doubling and yet not exceed the initial value of the threshold. But, the fundamental peak could still exceed one-half the initial value of the threshold. If the maximum peak exceeds the threshold, it is tentatively chosen as a pitch peak representing a voiced-speech segment at the (N+1)th cepstrum. The information about the (N+1)th cepstrum and (N-1)th cepstrum is then used to decide if the Nth cepstral peak represents an isolated voiced segment or an isolated absence of voicing in a series of voiced-speech segments. The final result is an indication of whether the Nth cepstrum represents a voiced or an unvoiced speech segment. If the segment is voiced, the pitch period is also given.

A computer program was written to perform the operations required by the algorithm. The voicing and pitch-period information were both printed on paper and written on magnetic tapes for later processing.

FIG. 15. Method for deriving pitch pulses from pitch period data supplied by cepstral peak picker.

VI. VOCODER EXCITATION

The final judge of any vocal-pitch detection scheme is its ability to perform satisfactorily in determining the excitation for a vocoder. Vocoder excitation in the form of pitch pulses during voicing and white noise during nonvoicing thus had to be derived from the results of the cepstral peak picking.

FIG. 16. Block diagram of 13-spectrum-channel vocoder with excitation derived from a cepstrum pitch detector.

The cepstral peak picker produced two outputs on digital magnetic tape. The first tape contained voicing information as two dc levels corresponding to a voiced or unvoiced speech interval. The levels were constant for the 10 msec corresponding to the speech time jumps. The second tape contained the pitch period as dc level signals that also were constant for 10 msec. These two tapes formed the input to the excitation generator. The voicing and pitch-period signals are first each smoothed by a pair of 33-Hz low-pass filters. The pitch pulses are derived from the smoothed pitch signal as shown in Fig. 15 by running a counter up until it equals the smoothed pitch signal. An impulse is then emitted, and the counter is reset to zero before again starting its count. If the smoothed pitch signal is measured in tenths of a millisecond and the counter counts in tenths of a millisecond, then the timing between the emitted
impulses equals the pitch period. The smoothed voicing signal is used to control a double-throw switch for choosing either pitch pulses or white noise as a final excitation output. This technique was devised and simulated on the computer by M. M. Sondhi using the BLODI programming language. The output of the program was still another digital magnetic tape, which was then used as the excitation input to a 13-spectrum-channel vocoder designed by Golden.¹³ The vocoder was also simulated on the computer using the BLODI programming language. The spectrum channel information was derived from a computer-simulated vocoder analyzer and, together with the excitation, formed the input to the computer-simulated synthesizer. The whole operation from speech signal to simulated vocoder output is shown in Fig. 16.

The digital computer generated numerous visual outputs on microfilm including the short-time spectra and cepstra, the voicing and pitch-period variations, the original speech signal, and the vocoded speech signal. These visual outputs were extremely valuable in devising the final versions of all the different portions of the chain making up the complete pitch-detection scheme.

The complete scheme, including the vocoder, was used to modify and improve all portions of the chain by comparing the vocoded speech with the original speech. In particular, the pitch-period doubling at the end of the word screaming was determined to be aurally correct by such a comparison of original with vocoded speech.

The synthetic speech from the computer-simulated cepstrum-excitation channel vocoder was compared both with the original speech and with the synthetic speech from a computer-simulated voice-excited vocoder and the same computer-simulated channel vocoder, but with the full-band speech as excitation. Although only a few sentences spoken by four talkers were used in these informal paired-comparison tests, the pitch quality of the channel vocoder with cepstrum pitch detection was judged to be excellent by experienced vocoder critics. This optimism was sufficient to initiate construction of a real-time cepstrum pitch detector.¹⁴

¹³ R. M. Golden, "Digital Computer Simulation of a Sampled-Data Voice-Excited Vocoder," J. Acoust. Soc. Am. 35, 1358-1366 (1963).
¹⁴ J. M. Kelly and R. N. Kennedy, "An Experimental Cepstrum Pitch Detector for Use in a 2400-bit/sec Channel Vocoder," presented at the 72nd meeting of the Acoustical Society of America (Nov. 1966), Paper 1H3.
¹⁵ H. J. Bickel and R. I. Bernstein, U.S. Patent No. 3,013,209.
¹⁶ H. J. Bickel, "Spectrum Analysis with Delay-Line Filter," IRE WESCON Conv. Rec. 1959 (Part 8), 59-67 (1959).
¹⁷ M. R. Weiss, R. P. Vogel, and C. M. Harris, "Implementation of a Pitch Extractor of the Double-Spectrum-Analysis Type," J. Acoust. Soc. Am. 40, 657-662 (1966).
¹⁸ J. S. Gill, "A Versatile Method for Short-Term Spectrum Analysis in 'Real-Time,'" Nature 189, No. 4759, 117-119 (14 Jan. 1961).

VII. IMPLEMENTATION OF CEPSTRUM ANALYZERS

In its most basic form, cepstrum-pitch detection requires two spectrum analyses with logic circuitry for picking the cepstral peak corresponding to the pitch period of a voiced-speech segment. Thus, a means for performing two spectrum analyses in real time is required for a hardware implementation of a cepstrum pitch detector. The requirements of real-time operation and good frequency resolution in the spectrum analyzers are somewhat difficult to satisfy and have therefore resulted in the correct opinion that a hardware cepstrum analyzer would be difficult to construct.

However, techniques are available for performing real-time spectrum analyses that could be adapted to cepstrum analysis. One such method performs the spectrum analysis by a circulating delay line with a time-variable phase shifter operating upon a heterodyned version of the time signal. This method, described by Bickel and Bernstein,¹⁵ has been successfully used by Weiss, Vogel, and Harris¹⁶,¹⁷ in an implementation of a cepstrum analyzer. Still another method, similar to a spectrum analyzer described by Gill,¹⁸ uses a heterodyne filter operating on a time-swept version of the input signal. Kelly and Kennedy have utilized this method in yet another successful implementation also including logic circuitry to track the cepstral peak.¹⁴ They have also derived vocoder excitation from their cepstra and have produced excellent-quality vocoded speech utilizing a complete hardware system of cepstrum analyzer and vocoder.

Both methods utilize analog-circuit techniques during all or part of the spectrum analysis. Digital techniques, however, have progressed to the state where a completely digital implementation should be possible. The Cooley-Tukey algorithm greatly reduces the number of multiplications and additions, and might be of practical use in such a completely digital cepstrum analyzer.

Another promising method utilizes the spectrum-analyzing properties of a lens.¹⁹,²⁰ A lens forms at its focal plane an image that is the Fourier transform of the image at the object plane. Since this is a spatial Fourier transform, the signal must be frozen in time with light intensity made proportional to signal amplitude. A coherent light source is required to illuminate the spatial representation of the signal, and there are some questions concerning the most efficient way to convert the time signal into such a spatial signal. But, the technique seems particularly promising (since parallel processing is very convenient), so that thousands of signals could be analyzed almost simultaneously.
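As a rough illustration of the completely digital analyzer suggested above, the following Python sketch computes one short-time cepstrum with NumPy's FFT routines and applies the linear weighting and threshold rules of the peak picker described earlier. It is only a sketch under stated assumptions: the 10-kHz sampling rate, the FFT length, the threshold value, and the function names `short_time_cepstrum` and `pick_peak` are hypothetical choices made for the example and are not taken from the original computer program.

```python
import numpy as np

def short_time_cepstrum(frame, nfft=4096):
    """Squared cosine transform of the log power spectrum of one windowed frame."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, nfft)) ** 2
    log_power = np.log(power + 1e-12)  # small floor avoids log(0)
    # The log power spectrum is real and even, so the inverse real FFT
    # acts as a cosine transform; squaring gives the cepstrum.
    return np.fft.irfft(log_power, nfft) ** 2

def pick_peak(cepstrum, fs, threshold, prev_period_ms=None,
              lo_ms=1.0, hi_ms=15.0):
    """Return (period_ms, weighted_amplitude, voiced) for one cepstrum."""
    quef_ms = np.arange(len(cepstrum)) * 1000.0 / fs
    idx = np.flatnonzero((quef_ms >= lo_ms) & (quef_ms <= hi_ms))
    # Linear weighting: 1 at 1 msec rising to 5 at 15 msec.
    weight = 1.0 + 4.0 * (quef_ms[idx] - lo_ms) / (hi_ms - lo_ms)
    weighted = cepstrum[idx] * weight
    k = int(np.argmax(weighted))
    period_ms = float(quef_ms[idx][k])
    amplitude = float(weighted[k])
    # While tracking, halve the threshold within +/-1 msec of the previous pitch.
    if prev_period_ms is not None and abs(period_ms - prev_period_ms) <= 1.0:
        threshold = threshold / 2.0
    return period_ms, amplitude, bool(amplitude > threshold)

# Example input: a pulse train with an 8-msec period in a 40-msec frame.
fs = 10_000                 # assumed sampling rate, Hz
frame = np.zeros(400)       # 40 msec at 10 kHz
frame[::80] = 1.0           # one pulse every 8 msec
cep = short_time_cepstrum(frame)
# threshold=0.0 is a placeholder; the paper does not state a numerical value.
period_ms, amplitude, voiced = pick_peak(cep, fs, threshold=0.0)
print(f"pitch period: {period_ms:.1f} msec, voiced: {voiced}")
```

With a synthetic pulse train as input, in the spirit of the pulse trains used to choose the weighting, the weighted maximum is expected to fall at a quefrency of about 8 msec. Halving the threshold only near the preceding pitch period lets a weakening peak at the end of a voiced segment still be accepted.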
¹⁹ L. J. Cutrona, E. N. Leith, C. J. Palermo, and L. J. Porcello, "Optical Data Processing and Filtering Systems," IRE Trans. Information Theory IT-6, 386-400 (1960).
²⁰ B. Julesz, A. M. Noll, and M. R. Schroeder, "Optical Cepstrum Analysis" (unpublished memorandum).

VIII. PSEUDO-AUTOCOVARIANCE OR CEPSTRUM?

In their article in Rosenblatt's book, Bogert et al.³ define the cepstrum as "autocovariance and Fourier transformation ... [of] the log spectrum of the original process." Since the Fourier transform of the autocovariance of some function is identical with the power spectrum of the same function, the cepstrum should be equivalent to the power spectrum of the log power spectrum of the original process. Furthermore, since the log power spectrum is an even function of frequency, the cepstrum should equal the square of the cosine transform of the log power spectrum.

Later in the article, Bogert et al. define a pseudo-autocovariance as "the Fourier transform of [the] log ... power spectrum." The "pseudo" prefix is logically used since the Fourier transform of the nonlogged power spectrum is the usual autocovariance. Thus, the cepstrum should equal the square of the pseudo-autocovariance. But, in their definition of the cepstrum, Bogert et al. had meant to assume that the log spectrum existed for all positive frequencies (private communication). As a result, their cepstrum equals the sum of the squares of the sine transform and the cosine transform of the log power spectrum. Stated mathematically, their definition of the cepstrum is

$$C_{\text{Bogert}}(\tau) = \left\{\mathcal{F}_{\sin}\!\left[\log|F(\omega)|^{2}\right]\right\}^{2} + \left\{\mathcal{F}_{\cos}\!\left[\log|F(\omega)|^{2}\right]\right\}^{2}, \tag{32}$$

where $\mathcal{F}_{\sin}$ and $\mathcal{F}_{\cos}$ denote Fourier sine transformation and Fourier cosine transformation, respectively; $F(\omega)$ is the complex Fourier transform of the original process; and $F(\omega)=0$ for $\omega<0$. The pseudo-autocovariance is

$$R_{c}(\tau) = \mathcal{F}_{\cos}\!\left[\log|F(\omega)|^{2}\right], \tag{33}$$

and its square is identical with the definition of the cepstrum used in this paper. A pseudoquadrature autocovariance can be defined as

$$R_{s}(\tau) = \mathcal{F}_{\sin}\!\left[\log|F(\omega)|^{2}\right], \tag{34}$$

so that

$$C_{\text{Bogert}}(\tau) = R_{c}^{2}(\tau) + R_{s}^{2}(\tau). \tag{35}$$

Two different definitions of the cepstrum can certainly lead to some confusion, but in this paper the cepstrum has consistently been defined as the square of the cosine transform of the log power spectrum.

The digital computer was programmed to calculate the following short-time functions: the square of the cosine transform of the one-sided log power spectrum (pseudo-autocovariance squared), the square of the sine transform of the one-sided log power spectrum (pseudoquadrature autocovariance squared), and the sum of the squares of the cosine and sine transforms of the one-sided log power spectrum (Bogert's cepstrum). The input signal was a male talker recorded from a 500-type telephone handset with additive white noise (signal-to-noise ratio approximately 12 dB). The three short-time functions with the corresponding log power spectrum are shown in Fig. 17 for a voiced speech segment. The pseudoquadrature autocovariance is very noisy, so that Bogert's cepstrum does not have peaks as sharp as the pseudo-autocovariance alone. Clearly, in retrospect, these results are good justification for using only the pseudo-autocovariance for speech pitch detection.

IX. CONCLUSION

Some of the advantages claimed for cepstrum pitch detection and confirmed by computer simulation are, first, that the fundamental frequency component need not be present in the time signal, since the spectral ripples or fine structure caused by the harmonics give rise to the cepstral peak. For this reason, cepstrum pitch detection is particularly well suited to such bandpass-filtered signals as telephone speech. Since only the power spectrum is used, phase is completely ignored. Additive white noise is not too degrading if it does not destroy the spectral ripples. Actually, a clearly defined cepstral peak has been obtained for speech signals with a 6-dB signal-to-noise ratio over the 40-msec analysis interval.
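The claim that the fundamental need not be present can be checked numerically. The sketch below is an illustration, not the original simulation: it builds a 40-msec signal from harmonics 3 through 26 of a 125-Hz fundamental, so that nothing is present at 125 Hz or 250 Hz, and then searches the weighted cepstrum over 1-15 msec. The 10-kHz sampling rate, the FFT length, and the equal harmonic amplitudes are assumptions chosen for the example.

```python
import numpy as np

fs = 10_000                     # assumed sampling rate, Hz
f0 = 125.0                      # fundamental of the synthetic signal (8-msec period)
t = np.arange(400) / fs         # one 40-msec analysis interval

# Harmonics 3..26 span 375-3250 Hz; the fundamental itself is absent.
frame = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(3, 27))

nfft = 4096
spectrum = np.fft.rfft(frame * np.hamming(frame.size), nfft)
log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)   # floor avoids log(0)
cepstrum = np.fft.irfft(log_power, nfft) ** 2       # squared cosine transform

quef_ms = np.arange(nfft) * 1000.0 / fs
search = np.flatnonzero((quef_ms >= 1.0) & (quef_ms <= 15.0))
weight = 1.0 + 4.0 * (quef_ms[search] - 1.0) / 14.0  # 1 at 1 msec, 5 at 15 msec
peak_ms = float(quef_ms[search][np.argmax(cepstrum[search] * weight)])
print(f"cepstral peak at {peak_ms:.1f} msec")
```

The 125-Hz spacing of the remaining harmonics still produces a ripple in the log power spectrum, so the peak is expected near 8 msec even though no energy exists at 125 Hz.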
FIG. 17. (a) Short-time logarithm spectra, (b) pseudo-autocovariance squared, (c) pseudoquadrature autocovariance squared, and (d) Bogert's cepstra (defined as the sum of the squares of the cosine and sine transforms of the logarithm spectra) for a male talker (F.L.C.) recorded from a 500-type telephone handset and with additive white noise (signal-to-noise ratio ≈ 12 dB). Panel (a) is plotted in decibels against frequency (kHz); panels (b)-(d) are plotted against quefrency (msec).
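The three functions of Fig. 17 can be sketched as follows. This is a hedged illustration and not the program used for the figure: the function name `one_sided_transforms`, the FFT length, and the 10-kHz sampling rate are assumptions, and a synthetic pulse train stands in for the telephone speech.

```python
import numpy as np

def one_sided_transforms(frame, nfft=4096):
    """Return (R_c^2, R_s^2, R_c^2 + R_s^2) from the one-sided log power spectrum."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), nfft)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # frequencies 0..fs/2 only
    one_sided = np.zeros(nfft)
    one_sided[: log_power.size] = log_power            # zero at negative frequencies
    z = np.fft.fft(one_sided)       # sum over m of L(m) * exp(-2j*pi*m*n/nfft)
    r_c = z.real                    # cosine transform: pseudo-autocovariance
    r_s = -z.imag                   # sine transform: pseudoquadrature autocovariance
    return r_c ** 2, r_s ** 2, r_c ** 2 + r_s ** 2

fs = 10_000                         # assumed sampling rate, Hz
frame = np.zeros(400)               # 40-msec frame
frame[::80] = 1.0                   # pulse train with an 8-msec period
cos_sq, sin_sq, bogert = one_sided_transforms(frame)

lo, hi = 10, 150                    # 1-15 msec expressed in samples at 10 kHz
peak_index = lo + int(np.argmax(cos_sq[lo:hi + 1]))
print(f"pseudo-autocovariance squared peaks at {1000.0 * peak_index / fs:.1f} msec")
```

Plotting `cos_sq`, `sin_sq`, and `bogert` against quefrency for a noisy speech frame would allow the kind of comparison made in the figure.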
Of course, cepstrum pitch detection is insensitive to narrow-band white noise, since such noise would at most obscure only a few spectral ripples.

Cepstrum pitch detection has to some extent changed our over-all concept of a vocoder. Previously, most diagrams of a channel vocoder showed considerable detail about the channel filters while the pitch detector was usually shown as a small block at the bottom, although the pitch detector itself was sometimes quite elaborate. However, the spectrum-channel information is obtained as an intermediate step during the cepstrum-analysis process. Thus, our new concept of a vocoder analyzer shows an involved diagram of a pitch detector with the spectrum-channel information obtained as a by-product! (See Fig. 18.) Perhaps this is more realistic, because it has long been recognized that accurate pitch information is the most challenging aspect of vocoder design. The spectrum-channel information has perhaps been reduced to its true relative importance.

FIG. 18. New concept of spectrum channel vocoder in which the spectrum channel information is obtained as a by-product of the cepstrum pitch detector.

But where does all this effort lead us? It seems that vocoder design is becoming conceptually more complicated with asymptotic, though not necessarily insignificant, improvements in quality. The vocoder schemes and pitch detectors are becoming increasingly exotic, as exemplified by cepstrum pitch detection. Also, such new speech transmission methods as microwave, satellites, and the promise of light communication over laser beams might someday change the present restrictions on available bandwidth. The future of vocoders for speech bandwidth compression might seem bleak. Why continue, then, with vocoder development, and, in particular, why be concerned with pitch detectors?

Special-purpose vocoders can be useful in removing certain types of speech distortion. For example, the "Donald Duck" quality of speech spoken in the helium environment used in certain underwater quarters such as Sealab can be eliminated by frequency shifting of the vocoder channel signals.²¹ The transmission of speech can be made private or secure by the use of vocoders.⁸ An accurate pitch-detection scheme would become a very important tool in speech research by fostering research in pitch fluctuations and patterns. Thus, further research and development of pitch detectors is warranted not only to produce speech bandwidth-compression vocoders but also as a fundamental tool for speech research and for special-purpose vocoders.

²¹ R. M. Golden, "Improving Naturalness and Intelligibility of Helium-Oxygen Speech, Using Vocoder Techniques," J. Acoust. Soc. Am. 40, 621-624 (1966).

As mentioned previously, cepstrum analysis performs remarkably well as a vocal-pitch detector. However, a more general conclusion has evolved from the concept of cepstrum analysis: that the spectrum itself can be regarded as a signal and can be processed by standard signal-analysis techniques. With such a viewpoint, cepstrum analysis and other signal processing of the spectrum do not seem quite so exotic.
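The analyzer concept of Fig. 18, in which the spectrum-channel information falls out of the pitch detector as a by-product, can be sketched in a few lines. This is an illustrative sketch only: the function name `analyze_frame`, the 13 equal-width bands up to 3 kHz, the FFT length, and the 10-kHz sampling rate are assumptions made for the example and do not describe the actual vocoder channel filters.

```python
import numpy as np

def analyze_frame(frame, fs, n_channels=13, band_hi_hz=3000.0, nfft=4096):
    """Return (channel_energies, pitch_period_ms) from one power spectrum."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)

    # By-product: channel energies summed from the same power spectrum.
    edges = np.linspace(0.0, band_hi_hz, n_channels + 1)  # assumed band layout
    channels = np.array([power[(freqs >= lo) & (freqs < hi)].sum()
                         for lo, hi in zip(edges[:-1], edges[1:])])

    # Pitch: weighted peak of the cepstrum over 1-15 msec.
    cepstrum = np.fft.irfft(np.log(power + 1e-12), nfft) ** 2
    quef_ms = np.arange(nfft) * 1000.0 / fs
    idx = np.flatnonzero((quef_ms >= 1.0) & (quef_ms <= 15.0))
    weight = 1.0 + 4.0 * (quef_ms[idx] - 1.0) / 14.0
    period_ms = float(quef_ms[idx][np.argmax(cepstrum[idx] * weight)])
    return channels, period_ms

fs = 10_000                 # assumed sampling rate, Hz
frame = np.zeros(400)       # 40-msec frame
frame[::80] = 1.0           # pulse train with an 8-msec period
channels, period_ms = analyze_frame(frame, fs)
print(channels.shape, f"{period_ms:.1f} msec")
```

Only one spectrum analysis of the speech frame is needed here; the channel energies and the cepstrum are both derived from it.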
The Journal of the Acoustical Society of America, Volume 41, Number 2, 1967