Voice Signal Processing For Speech Synthesis

Conference Paper · June 2006


DOI: 10.1109/AQTR.2006.254660 · Source: IEEE Xplore



Ovidiu Buza, Gavril Toderean, Alina Nica
Department of Telecommunications, Technical University of Cluj, Romania
26 – 28 G. Baritiu Str., 400027, Cluj-Napoca, Romania
email: Ovidiu.Buza@cs.utcluj.ro, Gavril.Toderean@com.utcluj.ro

Abstract - Our goal was to build a software application that can manipulate and analyse speech signals, extract the characteristic parameters needed for speech synthesis, and enhance speech quality. This article presents the speech signal parameters used in speech synthesis, the application we built and its facilities, and the main experimental results obtained.

I. INTRODUCTION

Several applications are available today for signal processing, such as MatLab and LabView, and for audio signal processing, such as CoolEditor and Goldwave [4]. They provide many facilities that are useful in signal processing, such as digital filters and the computation of characteristic coefficients, but they are either general-purpose analysis tools (CoolEditor, Goldwave) or environments that provide a built-in API for signal processing applications (MatLab, LabView). Speech-specific analysis and adaptive filtering still have to be implemented by developers in their own applications.

We have therefore built an application dedicated to vocal signal analysis and processing, named SPE (Speech Processing and Enhancement Application). SPE offers facilities for processing and analysing the speech signal, together with other operations specifically needed in speech synthesis. These facilities and our experimental results are presented in this paper.

II. SPEECH SIGNAL MAIN PARAMETERS

Although the speech signal is continuous and non-stationary, it can be assumed stationary over short time segments (about 20 ms). On these segments the main parameters of the speech signal vary only slightly and can be determined by different methods.

Speech is the output signal of the phonatory system, modelled as a linear system with the glottal signal as excitation input. The speech signal s(t) is calculated as the convolution of the impulse response of the system h(t) with the excitation signal e(t) [2]:

$$s(t) = \int_{-\infty}^{t} e(\tau)\, h(t - \tau)\, d\tau \tag{1}$$

$$s(t) = h(t) * e(t) \tag{2}$$

Translating (2) into the frequency domain, we obtain the spectrum of the speech signal:

$$S(\omega) = H(\omega)\, E(\omega) \tag{3}$$

The spectrum characterizes the signal in the frequency domain, giving information about its spectral components. The speech signal spectrum is the product of the transfer function of the phonatory system H(ω) and the spectrum of the glottal excitation E(ω). |H(ω)| is called the spectral envelope and describes how the spectral components vary with frequency.

Spectral peaks of high energy correspond to the maxima of the spectral envelope and are characteristic of each phoneme in speech. These peaks are called formants and are useful parameters in the analysis and synthesis of the speech signal.

The excitation signal is modelled in two ways, depending on the type of sound: for vowels the excitation is a uniform impulse train, and for consonants it is white noise. For vowels it is important to specify the frequency of the excitation signal, that is, the fundamental frequency. This parameter is connected with speech prosody, which conveys different ways of pronunciation and subjective emotional states.

Speech Signal Acoustical Parameters

The perception of the speech signal is influenced by three factors: volume, pitch and timbre. Volume is a measure of sound intensity and corresponds to the amplitude of the signal. Pitch is given by the fundamental frequency of the speech signal and is a measure of how a listener perceives the sound. Timbre is determined by the harmonics of the sound and corresponds to the frequency components of the signal spectrum.

III. SPEECH SIGNAL ANALYSIS

Many methods of speech signal analysis start with the Fourier transform, which gives a complete description of the signal components in the frequency domain. The expression for the Discrete Fourier Transform is [1]:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2 \pi n k / N}, \qquad k = 1, 2, \ldots, N, \tag{4}$$

where: X(k) is the value of spectral component k,
x(n) is signal sample n,
N is the number of samples.
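As a numerical illustration of equations (2) to (4), the following minimal sketch evaluates the DFT sum directly and checks that the spectrum of a convolved signal equals the product of the two spectra. It is not taken from SPE: the function names `dft` and `convolve`, the impulse-train period and the decaying impulse response are all assumptions chosen for the example. The code indexes k from 0 to N-1; because X(k) is periodic with period N, X(N) equals X(0), so this covers the same components as k = 1, ..., N.

```python
import cmath

def dft(x):
    """Direct evaluation of equation (4): X(k) = sum_n x(n) * exp(-j*2*pi*n*k/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def convolve(e, h):
    """Discrete counterpart of equation (2): linear convolution s = h * e."""
    s = [0.0] * (len(e) + len(h) - 1)
    for i, e_i in enumerate(e):
        for j, h_j in enumerate(h):
            s[i + j] += e_i * h_j
    return s

# Hypothetical excitation: uniform impulse train with a period of 8 samples.
e = [1.0 if n % 8 == 0 else 0.0 for n in range(32)]
# Hypothetical impulse response: a short decaying exponential, not a real vocal tract.
h = [0.5 ** n for n in range(6)]

s = convolve(e, h)

# Zero-pad e and h to len(s) so that the product of the DFTs corresponds
# to the linear convolution rather than a wrapped (circular) one.
L = len(s)
E = dft(e + [0.0] * (L - len(e)))
H = dft(h + [0.0] * (L - len(h)))
S = dft(s)

# Equation (3): S(k) should match H(k) * E(k) up to rounding error.
max_error = max(abs(S[k] - H[k] * E[k]) for k in range(L))
print(f"max |S(k) - H(k)E(k)| = {max_error:.2e}")
```

The direct sum costs on the order of N² operations, which is acceptable for a short illustration; a practical analysis tool would use an FFT to obtain the same values.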
The calculation is carried out in the complex domain, so each component X(k) is characterized by two real values, the amplitude A_k and the phase φ_k:

$$A_k = |X(k)|, \qquad \varphi_k = \arg X(k). \tag{5}$$

The signal spectrum leads to the detection of specific parameters of the voice signal, such as the spectral envelope and the formants.

IV. THE SPEECH PROCESSING APPLICATION

The graphical interface of the SPE application is presented in figure 1. The main application window allows simultaneous display of the waveform, the amplitudes and the phases of the signal. SPE facilities are organized in two main categories: general wave facilities, and analysis and processing facilities.

SPE works with standard wave files: it can create, open, save and edit speech recordings in WAVE format files, whose format header has the following structure [5]:

    struct WAVEFORMATEX {
        WORD  wFormatTag;
        WORD  nChannels;
        DWORD nSamplesPerSec;
        DWORD nAvgBytesPerSec;
        WORD  nBlockAlign;
        WORD  wBitsPerSample;
        WORD  cbSize;
    };

The fields of the wave structure have the following meaning:
wFormatTag is the waveform-audio format type,
nChannels specifies the number of channels,
nSamplesPerSec is the sampling rate,
nAvgBytesPerSec is the average data rate in bytes per second,
nBlockAlign is the data block alignment in bytes,
wBitsPerSample gives the bit resolution of each sample,
cbSize is the size of the extra format information.

The SPE application offers the following facilities for vocal signal analysis:
- manipulation of recorded speech files (wave files);
- speech signal visualisation;
- pitch-synchronous time window analysis;
- phase and amplitude spectrum analysis;
- automatic detection of the fundamental frequency;
- automatic detection of speech formants;
- selective frequency filtering;
- editing of signal harmonics for signal analysis and enhancement.

The main SPE facilities are available through the application toolbars. The corresponding commands are:

- wave file manipulation: NEW, OPEN and SAVE create, open and save WAVE format files;
- visualisation of the wave signal and of the amplitude and phase spectra: VIEW WAVE, VIEW FFT and VIEW PHASE display the wave, FFT amplitude spectrum and phase windows;
- wave editing: SELECT/UNSELECT selects speech signal windows or segments; CUT, COPY and PASTE edit the loaded signal;
- finding zero crossings: ZERO CROSS LEFT/RIGHT moves the left or right border of the selection zone to the neighbouring zero-crossing point;
- wave and spectrum information: VIEW FFT INFO allows interactive visualisation of the information associated with the amplitude spectrum (FFT) or the phase spectrum (PHASE). The current frequency and spectrum value are displayed interactively as the mouse moves over the spectral display area;
- changing the wave display resolution: ZOOM IN increases the display resolution of the current window (it applies to the wave, amplitude and phase windows); it reduces the window dimension in favour of analysis precision. ZOOM OUT is the complementary command: it decreases the display resolution of the current window, which increases the amount of information displayed in the window but reduces analysis precision. SAMPLE ZOOM OUT reduces the display resolution in a controlled way; it is used to enlarge the analysis window progressively by adding samples one by one. SAMPLE ZOOM IN, the complementary command, progressively shrinks the analysis window by removing samples from it, thereby improving analysis precision. ZOOM ALL displays the entire wave signal or spectrum;
- sound playing: PLAY and PLAY LOOP play the wave file or the selected wave; PAUSE and STOP interrupt playing;
- selective filtering: FILTER allows selective frequency filtering and editing of the signal spectrum for analysis and signal enhancement. This command provides interactive filtering of frequencies and graphical editing of speech signal formants and harmonics.
Figure 1. SPE Main Application Window
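To make the relationship between the WAVEFORMATEX fields concrete, here is a minimal sketch that derives them for an uncompressed PCM file using Python's standard `wave` module. It is not how SPE reads files. The helper name `pcm_format_fields` is hypothetical, the value 1 for `wFormatTag` (WAVE_FORMAT_PCM) and 0 for `cbSize` are assumptions that hold for plain PCM data, and the test file is generated in memory rather than taken from the paper's recordings.

```python
import io
import wave

def pcm_format_fields(source):
    """Return WAVEFORMATEX-style fields for a PCM WAVE file (path or file object)."""
    with wave.open(source, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        bits = 8 * w.getsampwidth()
    # For PCM, one block holds one sample for every channel.
    block_align = channels * bits // 8
    return {
        "wFormatTag": 1,                       # assumed: WAVE_FORMAT_PCM
        "nChannels": channels,
        "nSamplesPerSec": rate,
        "nAvgBytesPerSec": rate * block_align,  # data rate in bytes per second
        "nBlockAlign": block_align,
        "wBitsPerSample": bits,
        "cbSize": 0,                           # assumed: no extra format bytes for PCM
    }

# Build a tiny mono, 16-bit, 16 kHz file in memory (20 ms of silence).
buffer = io.BytesIO()
with wave.open(buffer, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 320)
buffer.seek(0)

fields = pcm_format_fields(buffer)
for name, value in fields.items():
    print(f"{name:16s} {value}")
```

For PCM data the two derived fields follow directly from the others: nBlockAlign = nChannels × wBitsPerSample / 8, and nAvgBytesPerSec = nSamplesPerSec × nBlockAlign.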
Working with the SPE application

First, the user opens a speech signal recording from a WAVE file. The speech wave is displayed in the upper part of the screen, and the global FFT spectra of amplitudes and phases in the lower part. The calculation of the spectra is limited to 16k samples, which means that the spectra do not correspond to the entire wave but only to its first 16k samples. In most cases this is quite enough for the spectral analysis of a windowed signal.

The size of the current window is displayed in the status bar. The Tw parameter indicates its dimension in time units (seconds, milliseconds and microseconds), and the Nw parameter shows its dimension in samples. The sampling frequency is also shown in the FFT display zone, as the Fs parameter. The sampling frequency is twice the width of the displayed spectral window because of the FFT mirroring effect: the spectrum values are symmetric about half the sampling frequency. The spectral window therefore displays only the first half of the spectrum values. The FFT display zone also shows the spectral resolution, that is, the frequency distance between two consecutive spectrum values.

To analyse specific segments of the speech signal, the SPE application allows windowed analysis. The user can interactively select an analysis window and automatically adjust the window borders to zero-crossing values. This gives the best precision in computing the FFT spectrum, so a Gaussian windowing method is not necessary. The rules for computing FFT spectra are the following:
(1) if no selection exists, the spectra are calculated on the currently displayed window;
(2) if the user has selected a distinct analysis zone, the spectra are calculated for the selected zone.

Depending on the analysis to be made, the user can choose to display the waveform, the amplitude spectrum or the phase spectrum. By default the application displays all three analysis windows.

The spectral analysis windows provide some useful information. By moving the mouse pointer over the amplitude or phase spectrum, the user gets the precise frequency and the corresponding spectrum value (amplitude or phase).

Once a WAVE file has been loaded and the speech signal displayed, the application displays both the amplitude and the phase spectrum of the signal. In speech analysis the amplitude spectrum is the more important one, because it gives information about the quality of speech. By analysing the amplitude spectrum, one can obtain important parameters of speech: the fundamental frequency, corresponding to the tonality; the formants, corresponding to the speech timbre; and the spectrum magnitude, corresponding to the speech volume.

The application automatically detects the formants, that is, the local maxima of the spectrum envelope that exceed a predefined threshold. The magnitude, central frequency and bandwidth of each formant are computed; these are important parameters for speech synthesis. The central frequency of the formant with the highest amplitude (in most cases the first or second formant) is assumed to give the fundamental frequency. This rule applies only to voiced segments of speech, because consonants and vocal noises have no fundamental frequency.

An example of automatic formant detection is shown in figure 2. It illustrates a voiced segment of speech in which formants are detected at 204 Hz, 414 Hz, 613 Hz and 824 Hz. In this case the fundamental frequency is given by the frequency of the first formant: 204 Hz.

Figure 2. Formants and fundamental frequency of a voiced segment of speech

As mentioned above, the SPE application allows selective frequency filtering and spectrum editing, which are very important for speech analysis and signal enhancement. The FILTER command provides interactive filtering of frequencies and graphical editing of speech signal formants and harmonics. By dragging the mouse in the FFT spectrum zone, the user can easily remove frequency bands for noise suppression, or amplify other bands to raise the signal energy in specific zones.

The user can also modify the formants or harmonics of the spectrum in order to change the sound timbre, and can immediately listen to the resulting change in the sound. Our experiments show that a good-quality voice has a rich set of harmonics. High-order harmonics in particular are decisive for the quality of speech. This is useful when creating a vocal database for speech synthesis, where some vocal segments can be enriched by adding higher harmonics.

Figure 3 shows the difference between two sounds (vowel /A/) before and after enrichment with higher frequencies. In the second case, the perception of the sound is better. Harmonics are always inserted at multiples of the first formant frequency.
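The two spectrum edits offered by the FILTER command, removing or amplifying a frequency band and adding harmonics at multiples of a base frequency, can be sketched as follows. This is an illustration under stated assumptions, not the SPE implementation: the function names `scale_band` and `add_harmonics` are hypothetical, the harmonics are added as zero-phase sinusoids in the time domain, and the test signal is synthetic (a 200 Hz tone with a 1000 Hz disturbance at a sampling frequency of 8000 Hz).

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform by direct summation."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse transform; returns the real part of each reconstructed sample."""
    N = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * n * k / N) for k in range(N)) / N).real
            for n in range(N)]

def scale_band(x, fs, f_lo, f_hi, gain):
    """Multiply every spectral component between f_lo and f_hi (Hz) by gain.

    gain = 0 removes the band; gain > 1 raises its energy. The mirrored
    half of the spectrum is scaled identically so the result stays real.
    """
    N = len(x)
    X = dft(x)
    for k in range(N):
        f = min(k, N - k) * fs / N  # frequency of bin k, folding the mirrored half
        if f_lo <= f <= f_hi:
            X[k] *= gain
    return idft(X)

def add_harmonics(x, fs, f_base, orders, amplitude):
    """Add sinusoids at the given integer multiples of f_base (zero phase assumed)."""
    return [x_n + sum(amplitude * math.sin(2 * math.pi * m * f_base * n / fs)
                      for m in orders)
            for n, x_n in enumerate(x)]

fs, N = 8000, 200
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(N)]
disturbed = [tone[n] + 0.5 * math.sin(2 * math.pi * 1000 * n / fs) for n in range(N)]

# Remove the 900-1100 Hz band: only the 200 Hz tone should remain.
cleaned = scale_band(disturbed, fs, 900, 1100, gain=0.0)
residual = max(abs(c - t) for c, t in zip(cleaned, tone))

# Enrich the tone with its 3rd and 4th harmonics (600 Hz and 800 Hz).
enriched = add_harmonics(tone, fs, 200, orders=(3, 4), amplitude=0.2)
enriched_spectrum = [abs(v) for v in dft(enriched)]
```

With these parameters the spectral resolution is fs / N = 40 Hz, so the 200, 600, 800 and 1000 Hz components fall exactly on bins 5, 15, 20 and 25. In a real recording the components rarely coincide with bin centres, and a band edit then affects the neighbouring bins as well.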
Figure 3. Vowel /A/ enriched with higher harmonics
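The threshold-based detection of spectral peaks described in this section, together with the rule that the strongest peak gives the fundamental frequency, can be sketched as below. The names `half_spectrum`, `detect_peaks` and `fundamental_frequency` are hypothetical, the threshold (10% of the largest magnitude) is an assumption, and the test signal is synthetic, with partials at 200, 400, 600 and 800 Hz that only loosely mimic the values reported for figure 2. Bandwidth estimation is omitted.

```python
import cmath
import math

def half_spectrum(x, fs):
    """Amplitude spectrum for bins 0..N/2 as (frequency_hz, magnitude) pairs.

    Only the first half is computed, because the second half mirrors it.
    """
    N = len(x)
    pairs = []
    for k in range(N // 2 + 1):
        X_k = sum(x[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
        pairs.append((k * fs / N, abs(X_k)))
    return pairs

def detect_peaks(spectrum, threshold):
    """Return the local maxima of the magnitude whose value reaches the threshold."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        freq, mag = spectrum[i]
        if mag >= threshold and mag > spectrum[i - 1][1] and mag >= spectrum[i + 1][1]:
            peaks.append((freq, mag))
    return peaks

def fundamental_frequency(peaks):
    """Rule from the text: the peak with the highest amplitude gives F0."""
    return max(peaks, key=lambda p: p[1])[0] if peaks else None

fs, N = 8000, 200
partials = {200: 1.0, 400: 0.6, 600: 0.4, 800: 0.3}  # Hz -> amplitude (synthetic)
x = [sum(a * math.sin(2 * math.pi * f * n / fs) for f, a in partials.items())
     for n in range(N)]

spectrum = half_spectrum(x, fs)
threshold = 0.1 * max(mag for _, mag in spectrum)
peaks = detect_peaks(spectrum, threshold)
f0 = fundamental_frequency(peaks)

print("spectral resolution (Hz):", fs / N)
print("peaks (Hz):", [round(f) for f, _ in peaks])
print("fundamental frequency (Hz):", f0)
```

The rule is a heuristic: it returns the fundamental only when the lowest partial is also the strongest one, which is why the text restricts it to voiced segments and notes that the strongest peak is usually the first or second.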

V. EXPERIMENTS MADE ON SPEECH SIGNAL

Using the SPE application, we carried out many experiments and spectral analyses. The goal of these experiments was to determine the specific characteristics of the speech signal that correspond to the pronunciation of various Romanian phonemes by different speakers.

The vocal signal was recorded with a unidirectional dynamic microphone (SM-500), then sampled and stored in PCM WAVE format files using a Creative Labs Soundblaster. We also analysed audio signal samples generated by a Creative Labs synthesizer.

The experiments focused on determining the characteristics of the speech signal that give a higher sound quality. We carried out the following analyses:
- spectral analyses: spectral analysis of vowels pronounced by different speakers, spectral analysis of consonants, spectral analysis of multitonal and modulated sounds;
- perceptive sound analyses: perceptive analysis of sounds emitted at different phases, and the relationship between the timbre of the sound and auditory perception.

With the SPE application we studied the main factors that are significant for good-quality speech recordings: normalization of the signal energy across speech segments, a constant rhythm of speech (given by a constant fundamental frequency for each type of sound), a rich timbre with high harmonics and, no less important, professional recording conditions.

We also carried out vowel sound analyses specific to Romanian speech synthesis, studying the characteristics of Romanian vowel sounds in different phonetic and prosodic contexts. Diagrams 1 and 2 show a comparative study of vowel durations and frequencies made on three characteristic segments: attack, median and final. These segments are illustrated in figure 4.

Figure 4. The three segments of vowel pronunciation (attack, median, final)

VI. CONCLUSIONS

In this paper we have described an original application named Signal Processing and Enhancement Application (SPE), designed for analysing and processing the speech signal. We have presented the main facilities of the application and how to work with it. We have also presented experiments carried out with SPE, of which only the first results are shown here. In future articles we will present the whole set of experiments used for speech synthesis.

REFERENCES

[1] A. Mateescu, Semnale şi sisteme - Aplicaţii în filtrarea semnalelor, Ed. Teora, 2001.
[2] E. Lupu, P. Pop, Prelucrarea numerică a semnalului vocal, vol. 1, Ed. Risoprint, 2004.
[3] T. Dutoit, "High-quality text-to-speech synthesis: an overview", http://tcts.fpms.ac.be/synthesis/introtts.html
[4] O. Buza, "Stadiul actual în analiza şi prelucrarea semnalului vocal", doctoral paper, Electronics and Telecommunications Faculty, Technical University of Cluj-Napoca, 2005, unpublished.
[5] O. Buza, "Contribuţii în analiza şi sinteza semnalului vocal", doctoral paper, Electronics and Telecommunications Faculty, Technical University of Cluj-Napoca, 2005, unpublished.

DIAGRAM 1
DURATION OF ROMANIAN VOWEL SOUNDS – STATISTICS

Vowel   Attack          Median          Final           Total
        (ms)    (%)     (ms)    (%)     (ms)    (%)     (ms)
A       42.2    25      89.6    53      37.5    22      169.3
E       45.4    23      87.7    44      66      33      199.1
I       51.6    22      122     52      60.3    26      233.9
O       41.1    27      73.2    48      38.5    25      152.8
U       34.3    27      51.6    41      40.7    32      126.6

DIAGRAM 2
FREQUENCIES OF ROMANIAN VOWEL SOUNDS – STATISTICS
Percentages are calculated relative to the median segment.

Vowel   Attack          Median          Final
        (Hz)    (%)     (Hz)    (%)     (Hz)    (%)
A       120     110     109     100     124     114
E       112     106     106     100     117     110
I       130     115     113     100     123     109
O       118     109     108     100     119     110
U       116     117     99      100     99      100
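
The percentage columns of Diagrams 1 and 2 can be reproduced from the measured values with the short sketch below. The durations and frequencies are copied from the two diagrams; the function names `duration_shares` and `relative_to_median` are hypothetical, and rounding to the nearest integer is an assumption about how the published percentages were obtained.

```python
# Attack, median and final segment durations in ms (Diagram 1).
durations_ms = {
    "A": (42.2, 89.6, 37.5),
    "E": (45.4, 87.7, 66.0),
    "I": (51.6, 122.0, 60.3),
    "O": (41.1, 73.2, 38.5),
    "U": (34.3, 51.6, 40.7),
}

# Attack, median and final segment frequencies in Hz (Diagram 2).
frequencies_hz = {
    "A": (120, 109, 124),
    "E": (112, 106, 117),
    "I": (130, 113, 123),
    "O": (118, 108, 119),
    "U": (116, 99, 99),
}

def duration_shares(segments):
    """Share of each segment in the total vowel duration, in percent, plus the total."""
    total = sum(segments)
    return [round(100 * d / total) for d in segments], round(total, 1)

def relative_to_median(segments):
    """Each segment frequency as a percentage of the median-segment frequency."""
    median = segments[1]
    return [round(100 * f / median) for f in segments]

for vowel in durations_ms:
    shares, total = duration_shares(durations_ms[vowel])
    ratios = relative_to_median(frequencies_hz[vowel])
    print(vowel, "duration shares (%):", shares, "total (ms):", total,
          "frequency vs median (%):", ratios)
```

In Diagram 1 the percentages are shares of the total vowel duration, whereas in Diagram 2 they are ratios to the median-segment frequency, which is why the median column there is always 100.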
