1.7 Speech Synthesis and Voice Recognition


D. H. F. LIU (1995) B. G. LIPTÁK (2005)

THE NATURE OF SOUND

Sound normally reaches the ear through pressure waves in the air. Vibratory motion at frequencies between 16 Hz and 20 kHz is recognized as audible sound. Sound pressure levels and velocities are quite small, as listed in Table 1.7a. Sound pressures are measured as root-mean-square (RMS) values, which are obtained by taking the square root of the arithmetic mean of the squared instantaneous values.

The RMS pressure of spherical sound waves in a free field of air can be described as:

Pr = P0/r   (dynes/cm²)

where Pr is the RMS sound pressure at a distance r from the sound source and P0 is the RMS pressure at a unit distance from the source.

INTRODUCTION

Several companies and universities have developed voice recognition systems. These systems, and others built under grants from the Defense Advanced Research Projects Agency (DARPA), couple the ability to convert speech into electronic text with the artificial intelligence needed to understand that text. In these systems the computer acts as an agent that knows what the users want and how it is to be accomplished, including the voice output. Altogether, this type of speech synthesis allows untrained manufacturing workers the uninterrupted use of their hands, which can translate into increased productivity and cost savings.

SPEECH SYNTHESIS

Figure 1.7b shows the components of a computer voice response system. A synthesis program must access from a stored vocabulary a description of the required sequence of words and must put the words together with the proper duration, intensity, and inflection for the prescribed text. This description of the connected utterance is given to a synthesizer device, which generates a signal for transmission over a voice circuit. Three different techniques can be used for generating the speech: (1) adaptive differential pulse-code modulation (ADPCM), (2) formant synthesis, and (3) text synthesis. Their typical bit rates for vocabulary storage are shown in Table 1.7c.

TABLE 1.7a
Mechanical Characteristics of Sound Waves

| | RMS Sound Pressure (dynes/cm²) | RMS Sound Particle Velocity (cm/sec) | RMS Sound Particle Motion at 1,000 Hz (cm) | Sound Pressure Level (dB re 0.0002 μbar) |
|---|---|---|---|---|
| Threshold of hearing | 0.0002 | 0.0000048 | 0.76 × 10⁻⁹ | 0 |
| | 0.002 | 0.000048 | 7.6 × 10⁻⁹ | 20 |
| Quiet room | 0.02 | 0.00048 | 76.0 × 10⁻⁹ | 40 |
| | 0.2 | 0.0048 | 760 × 10⁻⁹ | 60 |
| Normal speech at 3 ft | 2.0 | 0.048 | 7.6 × 10⁻⁶ | 80 |
| Possible hearing impairment | 20.0 | 0.48 | 76.0 × 10⁻⁶ | 100 |
| | 200 | 4.80 | 760 × 10⁻⁶ | 120 |
| Threshold of pain | 2,000 | 48.0 | 7.6 × 10⁻³ | 140 |
| Incipient mechanical damage | 20 × 10³ | 480 | 76.0 × 10⁻³ | 160 |
| | 200 × 10³ | 4,800 | 760 × 10⁻³ | 180 |
| Atmospheric pressure | 2,000 × 10³ | 48,000 | 7.6 | 200 |
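These relationships are easy to check numerically. The short Python sketch below is an illustration added here, not part of the original handbook text: it converts an RMS pressure into a sound pressure level referenced to 0.0002 dynes/cm² and applies the Pr = P0/r spreading law, using example values taken from Table 1.7a.

import math

P_REF = 0.0002  # reference RMS pressure, dynes/cm^2 (0 dB, threshold of hearing)

def spl_db(p_rms: float) -> float:
    """Sound pressure level in dB re 0.0002 dynes/cm^2."""
    return 20.0 * math.log10(p_rms / P_REF)

def pressure_at_distance(p0: float, r: float) -> float:
    """RMS pressure of a spherical wave at distance r, given the pressure p0
    at unit distance (Pr = P0 / r)."""
    return p0 / r

# Normal speech: 2.0 dynes/cm^2 is about 80 dB (see Table 1.7a).
print(round(spl_db(2.0)))                              # -> 80
# Doubling the distance halves the pressure and lowers the level by about 6 dB.
print(round(spl_db(pressure_at_distance(2.0, 2.0))))   # -> 74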


FIG. 1.7b
Block diagram of a computer voice response system: an answer-back message (English text) is processed by a synthesis program, which draws on a vocabulary store and speech-formation rules and drives a speech synthesizer to produce the speech output. (Courtesy of Bell Laboratories.)

Adaptive Differential Pulse-Code Modulation (ADPCM)

Figure 1.7d shows the schematic of a simple ADPCM decoder. The synthesis program pulls the digitally coded words, in the required sequence, from the disk where they are stored and supplies them to an ADPCM decoder, which produces the analog signal. The only control of prosody (accent and voice modulation) that can be exercised in assembling the message is the adjustment of the time intervals between successive words, and possibly of their intensities. No control of voice pitch, and no merging of vocal resonances at word boundaries, is possible.

This technique is suitable for messages requiring only a modest vocabulary size and very lax semantic constraints. Typical messages are voice readouts of numerals, such as telephone numbers, and instructions for sequential operations, such as equipment assembly or wiring.

FIG. 1.7d
Adaptive differential coding of speech. (a) Block diagram of the coder, built from a quantizer (Q), predictor (P), coder (C), logic (L), and decoder (D). (Adapted from U.S. Patent 3,772,682.) (b) Comparison of waveforms coded by 3-bit DPCM and ADPCM. (Courtesy of Bell Laboratories.)
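To make the adaptive step-size idea concrete, the following is a minimal two-bit sketch of differential coding with step adaptation. It is an illustration only, not the three-bit coder of Figure 1.7d or the circuit of the cited patent, and the adaptation multipliers and reconstruction levels are assumed values.

import math

# 2-bit ADPCM-style coder: the prediction error is quantized to one of four
# levels, and the step size grows after large codes and shrinks after small
# ones.  The decoder repeats the same updates, so it stays in step.
STEP_SCALE = {0: 0.8, 1: 0.8, 2: 1.6, 3: 1.6}   # adaptation multipliers (illustrative)
LEVELS = {0: -1.5, 1: -0.5, 2: 0.5, 3: 1.5}     # reconstruction levels, in units of step

def adpcm_encode(samples, step=0.1):
    codes, predicted = [], 0.0
    for x in samples:
        err = x - predicted
        # choose the code whose reconstruction level is nearest the prediction error
        code = min(LEVELS, key=lambda c: abs(err - LEVELS[c] * step))
        predicted += LEVELS[code] * step
        step = max(step * STEP_SCALE[code], 1e-4)
        codes.append(code)
    return codes

def adpcm_decode(codes, step=0.1):
    out, predicted = [], 0.0
    for code in codes:
        predicted += LEVELS[code] * step
        step = max(step * STEP_SCALE[code], 1e-4)
        out.append(predicted)
    return out

if __name__ == "__main__":
    signal = [math.sin(2 * math.pi * 3 * t / 50) for t in range(50)]
    restored = adpcm_decode(adpcm_encode(signal))
    print(max(abs(a - b) for a, b in zip(signal, restored)))  # peak reconstruction error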
Formant Synthesis

Formant synthesis uses a synthesis model in which a word library is analyzed and parametrically stored as the time variations of the vocal-tract resonances, or formants. Speech formants appear as dark bars on the sound spectrogram in the bottom panel of Figure 1.7e. These frequency variations, along with the voice pitch periods, can be analyzed by computer, as shown in the upper two panels of the figure.

FIG. 1.7e
Analysis of the sentence "We were away a year ago." The sound spectrogram in the bottom panel shows the time variation of the vocal-tract resonances, or formants (dark areas). Computer-derived estimates of the formant frequencies and of the voice pitch period are shown in the top two panels. (Courtesy of Bell Laboratories.)

In formant synthesis, the synthesizer device is a digital filter, as shown in Figure 1.7f. The filter's transmission resonances and antiresonances are controlled by the computer, and its input excitation is derived from programmed rules for voice pitch and for the amplitudes of voiced sounds and of noise-like unvoiced sounds. The parametric description of the speech information permits control and modification of the prosodic characteristics (pitch, duration, intensity) of the synthetic speech and produces smooth formant transitions at successive word boundaries. This technique has also been used experimentally for generating automatic-intercept messages and for producing wiring instructions by voice.

FIG. 1.7f
Block diagram of a digital formant synthesizer: a pitch-pulse generator and a random-number (noise) generator, scaled by the amplitude controls Av and An, excite a recursive digital filter whose resonances are set by pole-zero data; the result is passed through a D/A converter to produce the output. (Courtesy of Bell Laboratories.)

TABLE 1.7c
Features of the Different Coding Techniques Used in Speech Generation

| Coding Technique | Data Rate (bits/s) | Duration of Speech in 10⁶ Bits of Storage (min) |
|---|---|---|
| ADPCM | 20k | 1 |
| Formant synthesis | 500 | 30 |
| Text synthesis | 70 | 240 |
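The filter structure of Figure 1.7f can be approximated in a few lines of code. The sketch below is a simplified illustration only: it cascades two-pole resonators placed at assumed formant frequencies and bandwidths (roughly those of an /a/-like vowel) and excites them with a pulse train at an assumed pitch. This captures the essence of formant synthesis, though it is not the specific synthesizer of the figure.

import math

def resonator(samples, freq_hz, bw_hz, fs=8000):
    """Two-pole digital resonator: y[n] = x[n] + b1*y[n-1] + b2*y[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    b2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = x + b1 * y1 + b2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out

def synthesize_vowel(duration_s=0.3, f0_hz=100,
                     formants=((700, 90), (1220, 110), (2600, 160)), fs=8000):
    """Excite a cascade of formant resonators with a pulse train at the voice pitch."""
    n = int(duration_s * fs)
    period = int(fs / f0_hz)
    excitation = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    signal = excitation
    for freq, bw in formants:            # cascade the resonators
        signal = resonator(signal, freq, bw, fs)
    peak = max(abs(s) for s in signal) or 1.0
    return [s / peak for s in signal]    # normalize the amplitude

if __name__ == "__main__":
    samples = synthesize_vowel()
    print(len(samples), min(samples), max(samples))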

Text Synthesis

Text synthesis is the most forward-looking voice response technique. It provides for storing and voice-accessing voluminous amounts of information, or for converting any printed message into spoken form, and it generates speech from an information input corresponding to a typewriter rate. In one design, the vocabulary is literally a stored pronouncing dictionary: each entry contains the word, a phonetic translation of the word, word stress marks, and some rudimentary syntax information. The synthesis program includes a syntax analyzer that examines the message to be generated and determines the role of each word in the sentence. Stored rules for prosody then calculate the sound intensity, sound duration, and voice pitch for each phoneme (elementary speech sound) in its particular context.

Figure 1.7g shows a dynamic articulatory model of the human vocal tract. The calculated controls cause the vocal-tract model to execute a sequence of "motions." These motions are described as changes in the coefficients of a wave equation for sound propagation in a nonuniform tube. From the time-varying wave equation, the formants of the deforming tract are computed iteratively. These resonant frequencies, together with the calculated pitch and intensity information, are sent to the same digital formant synthesizer shown in Figure 1.7f. The system generates its synthetic speech entirely from stored information: language text can be typed into the system and the message can be synthesized online.
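As an illustration of the pronouncing-dictionary design described above, the short Python sketch below turns a typed message into per-phoneme control values (phoneme, duration, pitch). The dictionary entries, stress marks, and prosody rule are invented for the example; they only mimic the kind of information the text describes, not any actual stored dictionary or rule set.

# Hypothetical pronouncing-dictionary entries: word -> (phonemes, stressed syllable, role)
DICTIONARY = {
    "read":  (["R", "IY", "D"], 0, "verb"),
    "the":   (["DH", "AH"], 0, "article"),
    "meter": (["M", "IY", "T", "ER"], 0, "noun"),
}

def synthesize_controls(text, f0_base=110.0):
    """Translate text into per-phoneme control tuples (phoneme, duration_ms, pitch_hz).
    The prosody rule is a crude placeholder: content words get longer, higher-pitched
    phonemes than function words, and pitch declines toward the end of the sentence."""
    words = text.lower().split()
    controls = []
    for i, word in enumerate(words):
        phonemes, _stress, role = DICTIONARY[word]
        content_word = role in ("noun", "verb")
        duration = 90 if content_word else 60                 # ms per phoneme
        declination = 1.0 - 0.1 * (i / max(len(words) - 1, 1))
        pitch = f0_base * (1.15 if content_word else 1.0) * declination
        controls.extend((ph, duration, round(pitch, 1)) for ph in phonemes)
    return controls

for row in synthesize_controls("read the meter"):
    print(row)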
VOICE RECOGNITION

Figure 1.7h illustrates one general structure for a machine that acquires knowledge of expected pronunciations and compares input data with those expectations. A precompiling (training) stage involves the user speaking sample pronunciations for each allowable word, phrase, or sentence while identifying each with a vocabulary item number. Later, when an unknown utterance is spoken, it is compared with the lexicon of expected pronunciations to find the training sample that it most closely resembles.
Word Boundary Detection

Silent pauses permit easy detection of word onsets, at transitions from no signal to high energy, and of word offsets, as the energy dips below a threshold. Word boundary confusions can occur when the short silences during stop consonants (/p/, /t/, /k/) resemble intended pauses. A word such as "transportation" could appear to divide into three words: "trans," "port," and "ation." Recognizers must therefore measure the duration of each pause to distinguish short consonantal silences from longer deliberate pauses.
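A minimal endpoint detector along these lines is sketched below. The frame length, energy threshold, and minimum pause duration are assumed tuning values, not figures from the text: frames whose energy exceeds the threshold are treated as speech, and a gap ends a word only if it lasts longer than a typical stop-consonant closure.

def detect_word_boundaries(samples, frame_len=80, energy_threshold=0.01, min_pause_frames=3):
    """Return (onset_frame, offset_frame) pairs for speech segments.
    Short low-energy gaps (e.g., stop-consonant closures) are bridged; only a
    pause of at least `min_pause_frames` frames ends a word."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    energies = [sum(s * s for s in f) / max(len(f), 1) for f in frames]

    words, onset, silence_run = [], None, 0
    for i, e in enumerate(energies):
        if e >= energy_threshold:
            if onset is None:
                onset = i                 # word onset: energy rises above the threshold
            silence_run = 0
        elif onset is not None:
            silence_run += 1
            if silence_run >= min_pause_frames:   # a deliberate pause, not a stop gap
                words.append((onset, i - silence_run))
                onset, silence_run = None, 0
    if onset is not None:                 # utterance ended while a word was still open
        words.append((onset, len(energies) - 1 - silence_run))
    return words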
Feature Extraction

Acoustic data provide features that detect contrasts of linguistic significance and segments such as vowels and consonants. Figure 1.7i shows a few typical acoustic parameters used in recognizers. One can extract local peak values, such as point 1 in the top panel, as an amplitude measure. The sum of the squares of all waveform values over a time window provides a measure of energy. Counting zero crossings separates broad classes of sounds: vowel-like smooth waveforms give few crossings per unit time, while noiselike segments give many.

The time between the prominent peaks at the onsets of pitch cycles determines the pitch period T0 or, as its inverse, the rate of vibration of the vocal cords, called the fundamental frequency F0. Resonance frequencies are indicated by the number of peaks per pitch period; the third pitch period in the figure shows seven local peaks, indicating a ringing resonance at seven times the fundamental frequency. This resonance is the first formant, or vocal-tract resonance, of the /a/-like vowel and is the best cue to the vowel's identity.

The frequency content of the short speech segment of panel (a) is shown in panel (b), where the results of two methods of frequency analysis are superimposed. The jagged Fourier spectrum, with its peaks at harmonics of the fundamental frequency, is obtained with a fast Fourier transform (FFT), while the positions of the major spectral peaks, the formants of the speech, are determined with linear predictive coding (LPC). These peaks indicate the basic resonances of the speaker's vocal tract and can be tracked as a function of time, as in panel (c), indicating the nature of the vowel being articulated.
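The per-frame measurements just described amount to a few lines of arithmetic. The sketch below assumes a simple list-of-samples frame format rather than any particular recognizer's data layout, and its pitch estimate from peak spacing is deliberately crude.

def frame_features(frame, fs=8000):
    """Basic per-frame measurements described in the text: peak amplitude,
    short-time energy, zero-crossing count, and a rough pitch estimate taken
    from the spacing of prominent waveform peaks."""
    peak = max(abs(x) for x in frame)
    energy = sum(x * x for x in frame)            # sum of squares over the window
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))

    # pitch period: spacing between samples near the peak value, ignoring
    # spacings shorter than 2.5 ms (i.e., F0 above about 400 Hz is not considered)
    min_lag = int(0.0025 * fs)
    pulses = [i for i, x in enumerate(frame) if x >= 0.7 * peak]
    gaps = [b - a for a, b in zip(pulses, pulses[1:]) if b - a >= min_lag]
    f0 = fs / min(gaps) if gaps else None
    return {"peak": peak, "energy": energy, "zero_crossings": crossings, "f0_hz": f0}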


FIG. 1.7g
Speech synthesizer based upon computer models for vocal-cord vibration, sound propagation in a yielding-wall tube, and turbulent flow generation at places of constriction. The control inputs are analogous to human physiology and represent subglottal air pressure in the lungs (Ps); vocal-cord tension (Q) and area of opening at rest (AGo); the cross-sectional shape of the vocal tract, A(x); and the area of coupling to the nasal tract (N). (Courtesy of Bell Laboratories.)

FIG. 1.7h
Typical structure for a trainable speech recognizer that could recognize words, phrases, or sentences. In the training process, sample pronunciations and their utterance identities pass through utterance boundary detection, feature extraction, and pattern standardization into a lexicon of expected pronunciations (identity 1 through identity N). In the recognition process, an unknown utterance passes through boundary detection, feature extraction, and pattern normalization to a word matcher; the hypothesized words are then refined by prosodic, syntactic, semantic, and pragmatic analysis (yielding cues to linguistic structures, sentence structures, and interpretations) before the machine responds.


FIG. 1.7i
Typical acoustic parameters used in speech recognizers. (a) Time waveform showing parameters that can be extracted, including local peaks (points 1 through 7), zero crossings, and the pitch period T0 (F0 = 1/T0). (b) Frequency spectrum of the waveform of (a): the detailed spectrum derived from a fast Fourier transform (FFT) is smoothed by linear predictive coding (LPC) to yield a smooth spectrum from which the formants can be found as spectral peaks; the sharp spectral cutoff is due to telephone bandwidth filtering. (c) Smoothed LPC spectra for five successive short time segments (frames 2355 through 2359), with formants F1, F2, and F3 tracked as the spectral peaks.

Pattern Standardization and Normalization

A word spoken on two successive occasions differs in duration and amplitude. A recognizer must realign the data so that the proper portions of the utterance are compared with the corresponding portions of the template. This requires nonuniform time normalization, as suggested by the differing phonemic durations; dynamic programming is a method of trying all reasonable alignments and yields the closest match to each template. Another form of pattern normalization is speaker normalization, such as moving one speaker's formants up or down the frequency axis to match those of a "standard" speaker for which the machine has been trained.
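A minimal illustration of the amplitude and speaker normalizations mentioned above is given below; the feature layout and the first-formant scaling rule are assumptions made for the sketch, and the nonuniform time alignment itself is left to the dynamic-programming example in the next subsection.

def normalize_amplitude(frames):
    """Rescale a word's feature frames so the largest absolute value is 1.0,
    removing overall level differences between repetitions of the same word."""
    peak = max((abs(v) for frame in frames for v in frame), default=0.0) or 1.0
    return [[v / peak for v in frame] for frame in frames]

def normalize_speaker(formant_tracks, speaker_f1_avg, standard_f1_avg):
    """Shift a speaker's formant frequencies toward a 'standard' speaker by the
    ratio of average first-formant frequencies (a crude warp of the frequency axis)."""
    ratio = standard_f1_avg / speaker_f1_avg
    return [[f * ratio for f in frame] for frame in formant_tracks]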
Word Matching

The processes above yield an array of feature values versus time. During training, this array is stored as a template of the expected pronunciation. During recognition, a new array is compared with all stored arrays to determine which word is closest to it. Ambiguities in the possible wording result from errors and uncertainties in detecting the expected sound structure of an utterance; to guard against errors in word identification, recognizers often give a complete list of possible words in decreasing order of agreement with the data.
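A compact, generic version of this matching step is sketched below: a dynamic time warping distance accumulates frame-to-frame differences along the best alignment, and the lexicon is then ranked by that distance. The two-word lexicon in the demonstration is hypothetical, and the sketch is not the algorithm of any particular commercial recognizer.

def dtw_distance(template, utterance):
    """Accumulated distance of the best nonuniform time alignment between a
    stored template and an input utterance (each a list of feature vectors)."""
    INF = float("inf")
    rows, cols = len(template), len(utterance)
    cost = [[INF] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            d = sum((a - b) ** 2 for a, b in zip(template[i], utterance[j])) ** 0.5
            if i == 0 and j == 0:
                cost[i][j] = d
            else:
                cost[i][j] = d + min(
                    cost[i - 1][j] if i > 0 else INF,              # template frame stretched
                    cost[i][j - 1] if j > 0 else INF,              # input frame stretched
                    cost[i - 1][j - 1] if (i > 0 and j > 0) else INF,
                )
    return cost[-1][-1]

def rank_hypotheses(utterance, lexicon):
    """Compare the input feature array with every stored template and return the
    vocabulary items in decreasing order of agreement (increasing distance)."""
    return [word for _d, word in
            sorted((dtw_distance(template, utterance), word)
                   for word, template in lexicon.items())]

if __name__ == "__main__":
    lexicon = {                      # hypothetical two-word vocabulary of 2-D feature frames
        "yes": [[0.2, 0.9], [0.3, 0.8], [0.8, 0.2]],
        "no":  [[0.9, 0.1], [0.7, 0.3]],
    }
    print(rank_hypotheses([[0.25, 0.85], [0.75, 0.25]], lexicon))   # -> ['yes', 'no']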


High-Level Linguistic Components

Speech recognizers need to limit the consideration of alternative word sequences; that is the primary purpose of high-level linguistic components such as prosodics, syntax, semantics, and pragmatics. Prosodic information such as intonation can help distinguish questions from commands, can divide utterances into phrases, and can rule out word sequences with incorrect stress patterns. A syntactic rule can disallow ungrammatical sequences such as "plus divide" or "multiply clear plus." A semantic rule might disallow meaningless but grammatical sequences such as "zero divide by zero." A pragmatic constraint might eliminate unlikely sequences, such as the useless initial zero in "zero one nine."

It is also possible to restructure the recognizer so that errors do not propagate through its successive stages, by having all the components intercommunicate directly through a central control component that can allow syntax or semantics to affect the feature-extraction or word-matching processes, and vice versa.
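As a tiny illustration of how such constraints can prune the hypothesis lists, the sketch below applies invented rules in the spirit of the examples quoted above; it is not a real grammar or semantic model.

OPERATORS = {"plus", "minus", "multiply", "divide"}
DIGITS = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"}

def syntactically_legal(words):
    """Rule out ungrammatical sequences such as two operators in a row ("plus divide")."""
    return all(not (a in OPERATORS and b in OPERATORS) for a, b in zip(words, words[1:]))

def semantically_legal(words):
    """Rule out meaningless but grammatical sequences such as "zero divide by zero"."""
    return not any(words[i:i + 3] == ["divide", "by", "zero"] for i in range(len(words) - 2))

def pragmatically_likely(words):
    """Rule out unlikely readings such as a useless initial zero in "zero one nine"."""
    return not (len(words) >= 2 and words[0] == "zero" and words[1] in DIGITS)

def prune(hypothesized_sequences):
    """Keep only the word sequences that survive every high-level constraint."""
    return [seq for seq in hypothesized_sequences
            if syntactically_legal(seq) and semantically_legal(seq) and pragmatically_likely(seq)]

print(prune([["plus", "divide"], ["zero", "one", "nine"], ["one", "divide", "by", "two"]]))
# -> [['one', 'divide', 'by', 'two']]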
PRACTICAL IMPLEMENTATION

Figure 1.7j illustrates a typical circuit-board implementation of word recognition, using LPC coding for the spectral data representation and dynamic programming for time alignment and word matching. An analog-to-digital converter chip amplifies and digitizes the microphone handset signal. An LPC chip then performs the acoustic analysis and determines the coefficients that specify an inverse filter, which separates the smooth resonance structure from the harmonically rich impulses produced by the vocal cords.

FIG. 1.7j
Practical speech recognizer based on linear predictive coding (LPC) and dynamic programming. (a) Major components of a circuit-board speech recognizer using LPC analysis: the microphone and amplifier feed an analog-to-digital converter and an LPC analysis chip; distance measures and dynamic programming, drawing on RAM vocabulary storage of LPC coefficients (a11 ... amn), feed a decision chip. (b) Matrix of speech-sound differences between reference words and word portions of unknown inputs. Alternative alignments are allowed within the parallelogram, and the path with the least accumulated distance (heavy line) is chosen for the best alignment of reference and input words.
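The LPC analysis step itself can be sketched with the standard autocorrelation method and Levinson-Durbin recursion. This is a generic textbook formulation, offered only to show how coefficients like the a11 ... amn entries stored in the figure's vocabulary RAM might be computed; it is not the algorithm of any particular analysis chip.

def autocorrelation(frame, order):
    """Short-time autocorrelation r[0..order] of one analysis frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(order + 1)]

def lpc_coefficients(frame, order=10):
    """Linear-prediction coefficients a[1..order] of the inverse filter
    A(z) = 1 - a1*z^-1 - ... - ap*z^-p, found by Levinson-Durbin recursion."""
    r = autocorrelation(frame, order)
    if r[0] == 0:
        return [0.0] * order
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)         # remaining prediction error
    return a[1:]

# A matrix of such coefficient vectors, one per frame, is what the word matcher compares.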


A matrix of these LPC coefficients versus time represents the articulatory structure of a word. Word matching can be based on comparing such coefficients with those stored for each word of the vocabulary for which the recognizer is trained.

As shown in Figure 1.7j, the analysis frames of the input are on the horizontal axis and those of a candidate reference word from the training data are on the vertical axis. The distances between the respective frames of the input and reference words are entered into the corresponding intersection cells of the matrix, and the distances along the lowest-accumulated-distance alignment of the reference and input data are summed. If that accumulated distance is less than the distance for any other reference word in the stored vocabulary, that reference word is accepted as the identity of the input.

Dynamic programming is a method of picking the path through the successive distance increments that produces the lowest accumulated distance. As shown by the heavy line in Figure 1.7j, a couple of frames of "f"-like sounds in the reference word may be identified with a single "f"-like frame of the input, or two input frames of "y"-like sounds may be associated with one such frame in the reference.

Dynamic programming is also applicable to word sequences: it looks for the beginning and end points of each word and matches them with reference words by the best paths between such end points. The same procedure can be applied to larger units in successive steps to yield "multiple-level" dynamic programming, which is used in some commercial recognition devices.

Bibliography

Bristow, G., Electronic Speech Recognition, New York: McGraw-Hill, 1986.
Bristow, G., Electronic Speech Synthesis, New York: McGraw-Hill, 1984.
Burnett, D. C., et al., "Speech Synthesis Markup Language," http://www.w3.org/TR/speech-synthesis, 2002.
Dutoit, T., An Introduction to Text-To-Speech Synthesis, Dordrecht, The Netherlands: Kluwer Academic Publishers, 1996.
Flanagan, J. L., "Synthesis and Recognition of Speech: Teaching Computers to Listen," Bell Laboratories Record, May–June 1981, pp. 146–151.
http://www.3ibm.com/software/speech/
Levinson, S. E., and Liberman, M. Y., "Speech Recognition by Computers," Scientific American, April 1981, pp. 64–76.
Yarowsky, D., "Homograph Disambiguation in Speech Synthesis," Proceedings of the 2nd ESCA/IEEE Workshop on Speech Synthesis, New Paltz, NY, 1994.
