Xuejing Sun
Speech Acoustics Laboratory, Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL 60208, USA
ABSTRACT
The capability of producing different voice qualities is highly desirable in modern speech synthesis systems. Diphone-based synthesizers using TD-PSOLA can generate high-quality synthetic speech. However, one drawback of such systems in comparison with formant or LPC synthesizers is their inflexibility in Voice Quality Conversion (VQC). In this paper, I present a VQC method for the TD-PSOLA synthesis system. For vocal fry, the ST-signals are multiplied by a Kaiser window with alternating magnitude; for breathy voice, the ST-signals are first convolved with a one-pole filter, then combined with shaped noise signals, and finally multiplied by a Hanning window. All the windowed ST-signals are then overlap-added as in standard TD-PSOLA. A perceptual evaluation test shows that this method can generate the desired voice quality successfully. (The samples can be obtained through this WWW URL address: http://mel.soeech.nwu.edu/sunxi/voc-abstract.htm)
1. INTRODUCTION
In a speech synthesis system, the ability to synthesize different voice qualities is nontrivial. It can make the synthetic speech not only more natural but also more expressive. Primarily by controlling the glottal volume velocity waveform, Klatt and Klatt (1990) [3] were able to synthesize different voice qualities in a formant synthesizer. Successful results have also been obtained with an LPC synthesizer [1][2]. These speech synthesis techniques directly employ source-filter theory and are inherently more flexible in terms of voice quality synthesis. However, the quality of their synthetic speech has generally been less satisfactory than that of PSOLA-based systems.

Recent developments in the waveform coding technique PSOLA [5], especially TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add), have greatly improved the quality of diphone-based speech synthesis systems. In the family of PSOLA techniques, TD-PSOLA is the most computationally efficient method and is one of the most popular synthesis techniques today. LP-PSOLA (Linear Predictive PSOLA) and FD-PSOLA (Frequency Domain PSOLA), though able to produce equivalent results, require much more computational power. In terms of synthesizing different voice qualities, however, TD-PSOLA has no inherent ability other than using pre-recorded diphones with the desired voice quality. This can be a time-consuming process and requires more storage, and is therefore not commonly adopted. An ideal solution is to plug in an extra module to convert normal voice to desired voice quality when
TD-PSOLA technique. Then I describe the procedures for converting normal voice to different voice qualities. After this, I present the perceptual evaluation test results. Finally, I briefly summarize the current study and discuss potential future research.
(3)

The result of increasing the attenuation of the side lobes is an increase in the width of the main lobe, which makes the original spectrum smoother and, to a lesser extent, reduces the spectral slope. This is acceptable for vocal fry, which is characterized by a reduced slope of the glottal spectrum. Another side effect of this frequency interpolation process is formant bandwidth widening. Fortunately, this has no obvious perceptual consequence [6].

As mentioned earlier, an important perceptual characteristic of vocal fry is roughness. This can be achieved by using alternate pulse cycles. In reality, such alternate pulse cycles often consist of both amplitude alternation and period alternation. It has been found that a rough voice quality can be achieved by manipulating only one of these two aspects [8]. In the present study, therefore, only amplitude alternation is employed, which avoids pitch modification since that is not of concern here.
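The amplitude-alternation scheme described above can be sketched in Python with NumPy. This is a minimal illustration, not the author's implementation: the function name `vocal_fry_overlap_add`, the Kaiser parameter `beta`, and the attenuation factor `low_gain` are assumptions chosen for the example, and the pitch marks are assumed to be sample indices already available from a pitch-marking step.

```python
import numpy as np

def vocal_fry_overlap_add(signal, pitch_marks, beta=8.0, low_gain=0.5):
    """Overlap-add Kaiser-windowed ST-signals with alternating amplitude.

    beta and low_gain are illustrative values, not taken from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    out = np.zeros_like(signal)
    # Use interior marks only, so every ST-signal spans two pitch periods.
    for k in range(1, len(pitch_marks) - 1):
        left, center, right = pitch_marks[k - 1], pitch_marks[k], pitch_marks[k + 1]
        half = min(center - left, right - center)
        start, stop = center - half, center + half + 1
        window = np.kaiser(stop - start, beta)
        # Every other cycle is attenuated to create the amplitude alternation.
        gain = 1.0 if k % 2 else low_gain
        # Pitch marks are unchanged, so each segment is added back in place.
        out[start:stop] += gain * window * signal[start:stop]
    return out
```

Because the synthesis pitch marks equal the analysis pitch marks, the sketch changes only the per-cycle amplitude and the window shape, which matches the decision above to leave pitch untouched.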
Figure 3. Noise energy envelope shaping function.

A pitch-modulated square wave and a triangular-like envelope shaping function have been tried [2][7]. In the current study, I use a pitch-modulated cosine function (Figure 3).

Spectral tilt. This is achieved by convolving the ST-signal with a one-pole spectral shaping filter.

(4)
Figure 2. (a) Original speech segment /a/ in a male voice. (b) Synthetic speech segment (vocal fry).

It should be noted that combining pitch modification with the current scheme would give better perceptual effects, because vocal fry usually occurs at low pitch and alternate period cycles can also induce a roughness sensation. Nonetheless, for the purpose of comparison, pitch modification is excluded from the present study. Figure 2 shows a waveform segment of the original speech and of the synthesized speech, respectively.
where the pole is real, lies inside the unit circle, and is usually close to it. The filter produces a spectral slope of approximately -6 dB/octave. Add the noise to each ST-signal in turn. Multiply the "noisy" ST-signal by a Hanning window of the same length. Center the windowed ST-signals at the new pitch marks, then perform the overlap-add procedure. Note that pitch modification is not of concern in the present study, so the new pitch marks are the same as the original ones. Several parameters are adjustable, such as the noise level and the degree of spectral tilt. Their values vary with gender, age, and other individual characteristics. However, for a particular voice database in a speech synthesis system, the values can be held constant.
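The breathy-voice steps above (spectral tilt, noise addition, Hanning windowing, overlap-add) can be sketched as follows. This is a hedged illustration under stated assumptions: the recursion `y[n] = (1 - a) * x[n] + a * y[n-1]` is one common form of a one-pole filter, the coefficient name `a`, its default value 0.9, and the `(1 - a)` gain normalization are choices made for this example, and `noise` is assumed to be an already shaped noise array of the same length as the signal.

```python
import numpy as np

def one_pole(x, a=0.9):
    """Apply y[n] = (1 - a) * x[n] + a * y[n-1] (assumed filter form)."""
    y = np.empty(len(x), dtype=float)
    prev = 0.0
    for n, sample in enumerate(x):
        prev = (1.0 - a) * sample + a * prev
        y[n] = prev
    return y

def breathy_overlap_add(signal, pitch_marks, noise, a=0.9):
    """Tilt each ST-signal, add noise, Hann-window it, and overlap-add."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    out = np.zeros_like(signal)
    for k in range(1, len(pitch_marks) - 1):
        left, center, right = pitch_marks[k - 1], pitch_marks[k], pitch_marks[k + 1]
        half = min(center - left, right - center)
        start, stop = center - half, center + half + 1
        segment = one_pole(signal[start:stop], a)   # spectral tilt
        segment = segment + noise[start:stop]       # add the shaped noise
        # Window the noisy ST-signal and add it at the unchanged pitch mark.
        out[start:stop] += np.hanning(stop - start) * segment
    return out
```

Filtering each ST-signal from a zero initial state is a simplification; the start-up transient falls where the Hanning window is close to zero. The `(1 - a)` factor only keeps the low-frequency level comparable to the input and can be dropped if a different gain convention is preferred.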
2. A, which is actually related to the original gain derived from LPC at a certain ratio. The ratio is usually in the range [0.1, 0.4]. The noise is then band-pass filtered at 2-8 kHz. It has been shown that the temporal envelope of the noise signal is an important factor for the naturalness of the synthetic speech [2].
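A possible way to generate such noise is sketched below. Several details are assumptions made for illustration rather than the author's settings: the function name `shaped_noise`, the interpretation of the ratio as noise RMS relative to a supplied `lpc_gain`, the FFT-masking band-pass, and the phase of the pitch-synchronous cosine envelope (here it peaks at each pitch mark and reaches zero mid-period).

```python
import numpy as np

def shaped_noise(n_samples, pitch_marks, fs, lpc_gain, ratio=0.25, seed=0):
    """Band-limited noise with a pitch-synchronous cosine envelope."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_samples)

    # Band-pass to 2-8 kHz by zeroing FFT bins outside the band.
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectrum[(freqs < 2000.0) | (freqs > 8000.0)] = 0.0
    noise = np.fft.irfft(spectrum, n=n_samples)

    # Scale the noise RMS to ratio * lpc_gain (assumed reading of the ratio).
    noise *= ratio * lpc_gain / np.sqrt(np.mean(noise ** 2))

    # One raised-cosine cycle per pitch period (assumed envelope phase).
    envelope = np.zeros(n_samples)
    for start, stop in zip(pitch_marks[:-1], pitch_marks[1:]):
        phase = np.arange(stop - start) / (stop - start)
        envelope[start:stop] = 0.5 * (1.0 + np.cos(2.0 * np.pi * phase))
    return noise * envelope
```

The sampling rate `fs` must be at least 16 kHz for the upper band edge to be meaningful. Samples outside the first and last pitch mark receive a zero envelope in this sketch.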
Figure 4. (a) Spectrum of the original vowel /a/ in a female voice. (b) Spectrum of the converted breathy voice of the same vowel.

6. SUMMARY
In this study, a voice quality conversion algorithm within TD-PSOLA was formulated and tested. The goal is to enable a TD-PSOLA speech synthesis system to produce a desired voice quality as needed. For vocal fry, the ST-signals are modified with the Kaiser window; for breathy voice, the ST-signals are first convolved with a spectral-shaping filter, and then shaped noise signals are added. The results of the perceptual evaluation test indicate that the algorithm can effectively convert modal voice into the desired voice quality. Future research includes utilizing the present VQC method in a TD-PSOLA speech synthesis system and conducting more comprehensive perceptual evaluation experiments to test quality and naturalness. It is also hoped that the above method could be applied to voice gender conversion, as mentioned in the Introduction.

7. REFERENCES
[1] Childers, D.G. Glottal source modeling for voice conversion. Speech Communication, 16(2): 127-138, 1995.
[2] Childers, D.G., and Lee, C.K. Vocal quality factors: Analysis, synthesis, and perception. Journal of the Acoustical Society of America, 90(5): 2394-2410, 1991.
[3] Klatt, D.H., and Klatt, L.C. Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America, 87(2): 820-856, 1990.
[4] Lawlor, B., and Fagan, A.D. A Novel Efficient Algorithm for Voice Gender Conversion. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco, August 1999, Vol. 1, pages 77-80.
[5] Moulines, E., and Charpentier, F. Pitch-Synchronous Waveform Processing Techniques for Text-To-Speech Synthesis Using Diphones. Speech Communication, 9: 453-467, 1990.
[6] Moulines, E., and Laroche, J. Non-parametric techniques for pitch-scale and time-scale modification of speech. Speech Communication, 16(2): 175-205, 1995.
[7] Stylianou, I. Harmonic plus Noise Models for Speech, combined with Statistical Methods, for Speech and Speaker Modification. Ph.D. Thesis, ENST-Telecom Paris, 1996.
[8] Titze, I.R. Principles of Voice Production. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1994.
[9] Wendahl, R.W. Laryngeal analog synthesis of harsh voice quality. Folia Phoniatrica, 15: 241-250, 1963.