
1

Speech and Audio Processing


and Coding
Dr Wenwu Wang
Centre for Vision Speech and Signal Processing
Department of Electronic Engineering
w.wang@surrey.ac.uk
http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html



2
Course components
Components
Speech Processing
Wk 1-3, by Wenwu Wang
Speech Coding
Wk 4-6, by Ahmet Kondoz
Audio
Wk 7-8, by Ahmet Kondoz
Wk 9-10, by Wenwu Wang
Assessment
15% coursework + 85% exam
6 exam questions, with 2 for each of the above components;
you only need to answer 3, one from each of the three
components
1 coursework in speech processing
3
Outline of speech analysis
Introduction & applications
Speech production and modelling
Speech perception
Signal processing techniques
Autocorrelation, Fourier transform of speech, spectral
properties, convolution, periodicity estimation
Linear prediction and inverse filtering of speech
Transfer function, linear prediction, estimation of linear
prediction coefficients, LPC order, inverse filtering,
prediction gain
Cepstral deconvolution
Real cepstrum, complex cepstrum, quefrency, pitch
estimation via cepstrum, comparison of spectrum envelope
obtained by cepstrum with that obtained by LPC
4
Alternative textbooks to speech
analysis and audio perception

Digital Processing of Speech Signals, by Lawrence R. Rabiner
& Ronald W. Schafer
Signals and Systems, by Alan V. Oppenheim & Alan S. Willsky
An Introduction to the Psychology of Hearing, Brian C. J.
Moore
Acoustics and Psychoacoustics, David M. Howard, & James
Angus
5
Introduction & applications
What is speech & language?
The act of speaking; the natural exercise of the vocal organs; the
utterance of words and sentences; oral expression of thought or feeling
(Oxford English Dictionary)
The whole body of words and of methods of combination of words used
by a nation, people, or race; Words and the methods of combining them
for the expression of thought (Oxford English Dictionary)
Difference: speech is communicated orally (i.e. by mouth), while
language consists of rules for combining words together to convey
meaning, which can be communicated through non-oral mechanisms,
such as writing, hand signals, pictures, Morse code, etc.
Connections: speech is the oral communication of meaningful
information through the rules of a specific language.

6
Introduction & applications (cont.)
From the viewpoint of the speech processing/communication engineers
Speech is regarded as a sequence of small speech-like units, e.g. words or
phonemes.
Often the main objective is to enhance, encode, communicate, or
recognise these units from real speech or to synthesise them from text.
Our focus here will be on the physical attributes of speech rather than the
semantic level, as retrieval of semantic content from speech is a much
more difficult and largely unsolved problem.
A speech communication system should generally allow the effective
transmission of speech in any language. Whilst rules of language and
vocabularies differ greatly, the physical attributes of speech have much more in
common between different languages.
Transmission of semantic information between speakers assumes huge
amounts of prior information. For example, "it's raining cats and dogs" refers to
the weather rather than animals; this is obvious to a human listener but not to a computer.
We will consider speech as the oral production of sound by natural
exercising of the vocal organs, resulting in a sequence of different
phonemes or words.
7
Introduction & applications (cont.)
Applications of speech processing

Speech telecommunications and encoding
Preservation of the message content and perceptual quality of the transmitted
speech
Minimising the bandwidth required for speech transmission
Speech enhancement
Deals with the restoration of degraded speech caused by, e.g., additive noise,
reverberation, echoes, interfering speech or background sounds (cocktail party
effect).
Adaptive filtering, spectral subtraction, Wiener filtering, harmonic selection, blind
source separation, and computational auditory scene analysis
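Of these techniques, spectral subtraction is the simplest to sketch: estimate the noise magnitude spectrum from frames assumed to be noise-only, subtract it from the noisy magnitude spectrum, and resynthesise using the noisy phase. A minimal Python sketch (illustrative only; the function name, frame size, and the assumption that the first ten frames contain only noise are ours, not from the slides):

import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, fs, noise_frames=10):
    # STFT analysis of the noisy signal
    f, t, X = stft(noisy, fs=fs, nperseg=512)
    # Noise magnitude estimated from the first frames (assumed noise-only)
    noise_mag = np.mean(np.abs(X[:, :noise_frames]), axis=1, keepdims=True)
    # Half-wave rectified magnitude subtraction, keeping the noisy phase
    mag = np.maximum(np.abs(X) - noise_mag, 0.0)
    _, x_hat = istft(mag * np.exp(1j * np.angle(X)), fs=fs, nperseg=512)
    return x_hat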
A Real-World Application Scenario: the Cocktail Party Problem
[Figure: two speakers producing source signals s1(t) and s2(t), picked up by Microphone 1 and Microphone 2 as mixture signals x1(t) and x2(t).]
Underdetermined Blind Source Separation
[Audio demo: the source signals, the mixtures, and the estimated sources.]
Blind Separation of Reverberant Speech Mixtures with Different Reverberation Time
[Audio demo: speech mixtures recorded at RT60 = 30ms, 150ms, and 400ms, each processed by convolutive ICA (ConvICA), followed by the estimated ideal binary mask (IBM) and the smoothed IBM.]
Blind Separation of Real Recordings
[Audio demo: male speech with a TV on; the sensor signals and the source signals separated by convolutive ICA, convolutive ICA + IBM, and convolutive ICA + IBM + cepstral smoothing.]
12
Introduction & applications (cont.)
Applications in speech processing
Speech and speaker recognition
Speech recognition involves the automatic conversion of the message content of
the speech signal into a written format. Its performance depends highly on signal
quality, speaker identity, language, the size of the word dictionary, etc.
Speaker recognition involves the identification of the speaker based on the
statistical properties of the speech signal, which can be thought of as biometric
data.
Applications include voice interaction or dictation with computers, speaker
verification for security, automated telephony, spoken language translation,
linguistic studies, etc.
Speaker diarisation
It studies the question of who spoke when.
It is usually done through speaker segmentation (finding the speaker change point
in an audio stream) and speaker clustering (grouping together the segments based
on speaker identity or characteristics).
13
Introduction & applications (cont.)
Applications in speech processing
Speech synthesis
It refers to artificial production of speech from text.
Applications include speech-driven interfaces with computers, automated telephony,
general announcements in train stations, airports, etc.
Quality measures include naturalness (how much the output sounds like the speech
of a real person) and intelligibility (how easily the output can be understood).
Speech analysis
Refers to the analysis of the speech signal, such as its waveform and spectrum, to
reveal its detailed content.
Applications include user-feedback during language learning, speech therapy,
linguistic studies, etc.
14
Linguistics
A note on linguistics
Linguistics is the scientific study of human language. Linguistic structures are
pairings of meaning and sound. Various sub-parts of linguistic structure exist which
can be arranged hierarchically from sound to meaning (with overlap):
Phonetics: the study of the sounds of human language.
Phonology: the study of the distinctive sound units within a given language and
commonalities across languages.
Morphology: the study of the internal structure of words.
Syntax: the study of how words are organised into sentences.
Semantics: the study of the meaning of words and sentences.
Pragmatics: the study of how utterances are used in communicative acts.
Discourse analysis: the study of how sentences are organised into texts.
Phoneme
The smallest unit of speech in a language that distinguishes one word from another.
Continuous speech can be thought of as a concatenation of different phonemes.
Phonemes are split into consonants (either voiced or unvoiced) and vowels (all are
voiced) in English.
15
Speech Production
Many problems in speech processing can be aided by a basic understanding of
the speech production mechanism and anatomy of the vocal tract.
Common speech coding schemes such as linear predictive coding (LPC)
provide a fairly crude model of the speech production process.
More complex physical models of the speech production process, e.g. taking into
account the precise shape and absorption characteristics of the vocal tract,
have resulted in more natural-sounding speech synthesizers.
16
Speech Production (cont.)
Vocal organs
Vocal tract: begins at the vocal cords or glottis, ends at the lips.
Nasal tract: exists between the velum and the nostrils.
Lungs, diaphragm, and trachea are situated below the vocal cords,
serving as an excitation mechanism by directing the air from the lungs
through the trachea and the glottis.
17
Speech Production (cont.)
Sagittal section of the
vocal tract
18
Speech Production (cont.)
A sketch of the vocal tract
19
Speech Production (cont.)
Voiced and unvoiced speech
Voiced speech: the flow of air from the lungs causes a quasi-periodic
vibration of the vocal cords (or vocal folds), and the sound transmitted
along the vocal tract is unimpeded.
Unvoiced speech: the flow of air through the vocal apparatus is either
cut-off or impeded, e.g. by forming a constriction using the tongue or lips.
20
Speech Production (cont.)
Voicing process
Air forced through the glottis results in a
quasi-periodic oscillation of the vocal
cords, as shown in the left figure.
In stages 2-4, the increasing pressure in the
trachea forces the opening of the glottis.
As air travels through the glottis, the air
pressure decreases between the vocal
cords.
The decreasing pressure forces the vocal
cords to snap together again, at the lower
edge first, as in stages 6-10.
The resulting perception is called pitch,
whose frequency is around 85-155Hz for
adult male speakers and 165-255Hz for
adult female speakers.
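The quasi-periodicity of voiced speech means the pitch can be estimated from a short waveform frame, e.g. via the autocorrelation method mentioned in the course outline. A minimal sketch (the function name and search range are illustrative choices):

import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=400.0):
    # Autocorrelation of a short voiced frame (non-negative lags only)
    frame = frame - np.mean(frame)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Search for the strongest peak within a plausible pitch-period range
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag  # estimated fundamental frequency in Hz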
21
Speech Production (cont.)
Beth's First Laryngoscopy - Vocal Cords in Action
An interesting video clip from YouTube, available at:
http://www.youtube.com/watch?v=iYpDwhpILkQ

22
Speech Production (cont.)
Unvoiced speech
Unvoiced sounds are caused by a partial or total constriction at some point
in the vocal tract, such as the lips.
Air can be forced through this constriction to create turbulence or noise
(e.g. the fricative sounds \s\ and \f\), or the constriction can be opened
suddenly to create a burst of turbulence (e.g. the plosives \p\ and \t\).
There is no vibration of the vocal cords.
The excitation signal can be regarded as a random noise source as
opposed to a periodic sequence of glottal pulses in the voiced case.
23
Speech Production (cont.)
Typical voiced and unvoiced speech waveforms and their spectra
[Sources: from Rice/PROJECTS00/vocode/]
24
Speech Production (cont.)
Formants
The vocal tract (on occasion coupled with the nasal tract) can be
regarded as a resonant system that performs some spectral modification of
the excitation signal (i.e. a quasi-periodic series of glottal pulses for voiced
sounds, or turbulent air flow/noise for unvoiced sounds) before it is
released from the lips.
Modes or peaks in the frequency response of the resonant system are
known as formants, and these occur at the formant frequencies.
Anti-resonances (minima in the frequency response) can also exist, e.g.
for nasal consonants.
In speech, the most perceptually important formants are the lowest 3
formants. However, trained singers are sometimes able to place more
energy in higher formants (e.g. around 3000Hz for male operatic singers).
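Formant frequencies are commonly estimated from the resonances of an all-pole (LPC) model fitted to a voiced frame: the angles of the complex poles give approximate resonance frequencies. A rough sketch, assuming librosa is available for the LPC fit (the model order is an illustrative choice):

import numpy as np
import librosa

def rough_formants(frame, fs, order=10):
    a = librosa.lpc(frame.astype(float), order=order)   # all-pole coefficients
    roots = [r for r in np.roots(a) if np.imag(r) > 0]  # one of each conjugate pair
    freqs = sorted(np.angle(r) * fs / (2 * np.pi) for r in roots)
    return freqs[:3]  # roughly the first three formants F1, F2, F3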
25
Speech Production (cont.)
The first three formants of "ah" are shown in the above spectrum. The vertical
lines denote harmonics due to the vibration of the vocal cords (i.e. multiples of the
fundamental frequency). The vocal tract acts as a resonant system through
which the harmonics pass to generate the vowel's characteristic spectral shape.
26
Speech Production (cont.)
Vowels
Vowels are all voiced. The vowel phonemes of BBC English are shown below:
27
Speech Production (cont.)
Vowels
Vowels can be characterised by the articulatory parameters as shown in the vowel
quadrilateral in the following figure: height (close/mid/open: the vertical position of
the tongue relative to the roof of the mouth), backness (front/central/back: horizontal
tongue position relative to the back of the mouth), and roundedness (whether the
lips are rounded or not).
28
Speech Production (cont.)
Vowels
Vowels can also be characterised in terms of their average formant frequencies, which
are related to the articulatory parameters but differ more from speaker to speaker.
The first three formant frequencies are given in the following table for various vowels
(averaged over male speakers):
29
Speech Production (cont.)
Consonants
The point or place of constriction in the vocal tract, typically between the tongue and
a stationary articulator (e.g. the teeth or the roof of the mouth), gives the consonant
its characteristic sound.
Consonants can be categorised into:
Fricatives: produced by forcing air through a narrow channel at some point in
the oral pathway. They can be either voiced, e.g. \v\, \z\, or unvoiced, e.g. \f\,
\s\. Sibilants are a particular subset of fricatives, where the air is directed over
the edge of the teeth, e.g. \s\ and \z\.
Stops: produced by building up pressure behind complete constriction in the
vocal tract, and suddenly releasing the pressure, e.g. voiced \b\, \d\ and \g\, and
unvoiced \p\, \t\ and \k\. The term plosive is reserved for oral (non-nasal) stops,
such as \p\ in pit, and \d\ in dog.
Nasals: the mouth is completely constricted at some point, and the velum is
lowered so that the nasal tract is opened and sound is radiated from the nostrils,
e.g. \m\, \n\.
Affricates can be modelled as a concatenation of a stop and a fricative, e.g.
\dzh\.
Approximants: the vocal tract is narrowed, but leaves enough space for air to flow
without much audible turbulence, e.g. \w\, \l\.
30
Speech Production (cont.)
Consonant phonemes in BBC English:
31
Speech Production (cont.)
Vowels, consonants, and formant frequencies are depicted in the figure on the
right: waveform and spectrogram of the word "zoo".
32
Modelling of speech production
Accurate model
The production of speech is rather complex; to model it accurately, a
physical model would have to involve the following:
The nature of the glottal excitation, e.g. periodic/aperiodic
Time variation of the vocal tract shape
Losses due to heat conduction, viscous friction and absorption
characteristics of the vocal tract walls
Nasal coupling
Radiation of sound from the lips
Reference:
L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals,
Prentice-Hall, 1978.
33
Modelling of speech production
Lossless tube model
One of the simplest models of the vocal tract is a tube of non-uniform, time-
varying cross-section with plane wave propagation along the axis of the
tube, assuming no losses due to viscosity or thermal conduction:
34
Modelling of speech production
Source-filter model
The source or excitation is the signal arriving from the glottis, either a quasi-periodic
sequence of glottal pulses or a broad-band noise signal (typically treated as
Gaussian noise).
The combined response of the vocal tract, nasal tract, and lips is modelled as a time-
varying linear filter.
The output of the model is a convolution of the excitation with the impulse response of
the linear filter. The filter response typically has a number of poles and zeros, or may
be an all-pole filter as in LPC.
It is an approximation and is widely used in speech coding for its simplicity. Filter
parameters can be estimated easily from real speech, and can subsequently be
used for speech synthesis with the same source-filter model at the receiving end
of the transmission channel.
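The source-filter model can be sketched in a few lines: choose an excitation (a quasi-periodic pulse train for voiced speech, Gaussian noise for unvoiced) and pass it through an all-pole filter. A minimal illustration with made-up filter coefficients (real coefficients would come from LPC analysis of speech):

import numpy as np
from scipy.signal import lfilter

fs = 8000
f0 = 120                                          # assumed pitch for the voiced case
n = np.arange(fs)                                 # one second of samples
voiced_exc = (n % (fs // f0) == 0).astype(float)  # quasi-periodic glottal pulse train
unvoiced_exc = np.random.randn(fs)                # broad-band Gaussian noise excitation
a = [1.0, -1.3, 0.8]                              # illustrative all-pole (LPC-style) coefficients
voiced_out = lfilter([1.0], a, voiced_exc)        # excitation convolved with the filter response
unvoiced_out = lfilter([1.0], a, unvoiced_exc)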
35
Modelling of speech production
Model fit to the real data (where PSD denotes the power spectral density
function) [Sources: from Rice/PROJECTS00/vocode/]
[Figure: PSD of real speech overlaid with the model fit; horizontal axis: FFT size, vertical axis: magnitude in dB.]
36
Speech perception
The auditory system, together with subsequent higher-level cognitive
processing, is responsible for speech perception. It has an
entirely different structure from the organs of speech production,
and is not at all an inverse model of speech production.
The study of the auditory system is split into physiological aspects
(i.e. relating to the physical/mechanical processing of sound) and
psychological aspects (i.e. relating to processing in the brain).
Psychoacoustics is a general term for the study of how humans
perceive sound, covering both physiological and psychological
aspects (to be covered in wk 9-10).
37
Speech perception (cont.)
Outer ear
It contains the pinna and the auditory canal. The pinna helps to direct sound into the
auditory canal, and is used to localise the direction of a sound source. The sound
travelling through the canal causes the tympanic membrane (eardrum) to vibrate.
Middle ear
It contains three bones (or ossicles): the malleus (or hammer), incus (or anvil) and
stapes (or stirrup). The arrangement of these bones amplifies the sound being
transmitted to the fluid-filled cochlea in the inner ear.
Since the surface area of the eardrum is many times that of the stapes footplate,
sound energy striking the eardrum is concentrated on the smaller footplate. The
angles between the ossicles are such that a greater force is applied to the cochlea
than is applied to the hammer. The middle ear can also be considered as an
impedance matching device.
Inner ear
It contains two sensory systems: the vestibular apparatus, and the cochlea. The
former is responsible for balance and contains the vestibule and semi-circular canals.
Sound transmitted to the inner ear causes movement of fluid within the cochlea. The
hair cells within the cochlea are stimulated by this movement and convert the
vibration into electrical potentials, which are then transmitted as neural impulses
along the auditory nerve towards the brain.
38
Speech perception (cont.)
Anatomy of the human ear
39
Speech perception (cont.)
Hair cells along the cochlea are frequency-selective, with hair cells at
the end near the oval window being receptive to high frequencies,
and those near the apex being receptive to low frequencies.
The cochlea performs a kind of spectral analysis, whose resolution is non-linear. In
other words, a difference of Δf = 10Hz between two sinusoidal
components around 100Hz is easily noticeable, whereas the same difference
around 5kHz is imperceptible.
The frequency sensitivity of the cochlea is roughly logarithmic above
around 500Hz, i.e. the relative frequency resolution Δf/f of the cochlea is
roughly constant.
As a consequence of the non-linearity of the frequency selectivity of the
ear, the mel-frequency scale was designed as a perceptual scale in which
pitches judged by listeners to be equally spaced are equally spaced in mels. In
other words, any three frequencies equally spaced in mels will appear
to be roughly equally spaced in perceived pitch.


40
Speech perception (cont.)
Cognitive processing of sound is still an active area of research with
many questions to be answered.
The study of how humans process and interpret their sound environment
is termed auditory scene analysis.
Computer simulation of the auditory scene analysis process is known as
computational auditory scene analysis.

41
Speech perception (cont.)
To convert frequency in Hz into mel, we use:

f_mel = 1127.01048 · ln(1 + f_Hz / 700)

And vice versa:

f_Hz = 700 · (e^(f_mel / 1127.01048) - 1)
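Both mappings are one-liners in code; a quick sketch:

import math

def hz_to_mel(f_hz):
    # f_mel = 1127.01048 ln(1 + f_Hz/700)
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    # Inverse mapping: f_Hz = 700 (e^(f_mel/1127.01048) - 1)
    return 700.0 * (math.exp(f_mel / 1127.01048) - 1.0)

# e.g. hz_to_mel(1000) is approximately 1000 mel; three frequencies
# equally spaced in mels sound roughly equally spaced in pitch.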
42
Speech perception (cont.)
Mel-frequency scale
43
Speech perception (cont.)
Sensation of loudness is also frequency-dependent.
When listening to two equal amplitude sinusoids/pure tones at 50Hz
and 1kHz, the 1kHz sinusoid will be heard as louder.
The range of hearing is roughly between 20Hz and 20kHz (although
these limits tend to reduce with age, especially at the high-frequency
end). Outside these limits, nothing is heard at all.
The unit of measurement of loudness level is the phon; by definition, two
sine waves that have equal loudness levels in phons are equally loud.
44
Speech perception (cont.)
45
Digital encoding of speech
Processing of speech has moved almost entirely into the digital
domain.
Speech is initially a variation in air pressure which is converted into a
continuous voltage by a microphone.
Digital encoding of speech has several advantages, such as:
Digital signals can be stored for periods of time and transmitted over noisy
channels relatively uncorrupted.
Digital signals can be encrypted by scrambling the bits, which are then
unscrambled at the receiver.
Digital speech can be encoded and compressed for efficient transmission and
storage.
46
Analog-to-digital (A/D) conversion
A/D conversion consists of two stages:
Sampling
A continuous signal x(t) can be sampled into a discrete signal x[n]
= x(nT), taking one sample every T seconds, using a sample-and-hold
circuit. The sampling rate is defined as the number of samples obtained
in one second, and is measured in hertz (Hz), i.e. f_s = 1/T.

Quantisation
The value of each sample is represented using a finite number of bits.
Each possible combination of n bits denotes a quantisation level. The
difference between the sampled value and the quantised value is the
quantisation error.
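Both stages can be demonstrated numerically; a small sketch that samples a tone and applies a uniform quantiser (the tone frequency and bit depth are arbitrary choices):

import numpy as np

fs = 8000                            # sampling rate f_s = 1/T
t = np.arange(fs) / fs               # sample instants nT over one second
x = np.sin(2 * np.pi * 440 * t)      # sampled 440 Hz tone, x[n] = x(nT)
bits = 12
levels = 2 ** bits                   # number of quantisation levels
# Uniform quantiser on [-1, 1]
xq = np.round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
err = x - xq                         # quantisation error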

47
A/D conversion (cont.)
In practice, different bit-depth usually used for different audio
signals
Digital speech
The dynamic range of clean speech is around 40dB (between the
threshold of hearing and the loudest normal speech sound).
Background noise becomes obtrusive when the SNR is worse than
about 30dB. Therefore, a 70dB dynamic range (40dB of speech plus
30dB of SNR headroom) provides reasonable quality, which is
equivalent to 12-bit resolution (roughly 6dB/bit).
Commercial CD quality music
16 bits are usually used, i.e. 65536 levels, which correspond to
96dB dynamic range.
Digital mixing consoles, music effects units, and audio processing
software
It is common to use 24-bits or higher.
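The rule of roughly 6dB per bit used above follows from 20·log10(2) ≈ 6.02dB; a quick check for the bit depths mentioned:

import math

for bits in (12, 16, 24):
    dynamic_range_db = 20 * math.log10(2 ** bits)  # about 6.02 dB per bit
    print(bits, round(dynamic_range_db, 1))        # 72.2, 96.3, 144.5 dB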
48
A/D conversion (cont.)
Reconstruction conditions
Aliasing effect and Nyquist criterion
To allow the perfect reconstruction of the original signal, the sampling rate
should be at least twice the highest frequency in the signal.
A lower sampling rate causes aliasing.
49
Aliasing effect in spectral domain
50
Anti-aliasing
To avoid aliasing, the A/D converter usually incorporates an anti-
aliasing filter (a low-pass filter) before sampling, with a cut-off frequency near
the Nyquist frequency (half of the sampling rate).
In practice, it is difficult to design a steep cut-off low-pass filter. A non-ideal
filter is used instead, and the sampling rate is usually chosen to be more than
twice the highest frequency in the signal. For some typical applications, the
sampling rates are usually chosen as follows:
In telecommunication networks, 8kHz (the signal is band-limited to [300, 3400] Hz)
Wideband speech coding, 16kHz (natural-quality speech is band-limited to
[50, 7000] Hz)
Commercial CD music, 44.1kHz (the audible frequency range reaches up to 20kHz)
Oversampling can be advantageous in some applications to relax the sharp
cut-off frequency requirements for anti-aliasing filters.
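As an illustration of the non-ideal filter point, a minimal SciPy sketch of band-limiting before downsampling; the filter order and the 0.9 safety margin below the new Nyquist frequency are illustrative choices:

import numpy as np
from scipy.signal import butter, filtfilt

def downsample_with_antialias(x, fs_in, factor):
    # Low-pass below the new Nyquist frequency before discarding samples
    new_nyquist = fs_in / factor / 2
    b, a = butter(8, 0.9 * new_nyquist, btype='low', fs=fs_in)
    y = filtfilt(b, a, x)       # zero-phase filtering
    return y[::factor]          # keep every 'factor'-th sample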


51
D/A converter
D/A conversion consists of two stages:
Deglitching
A process to convert the digital speech represented by bits into a continuous
voltage signal, similar to the sample and hold operation in A/D conversion.
Interpolating filter
A low-pass filter is then used to remove the sharp edges (causing high-frequency
noise) in the output voltage.
According to the sampling theorem, the ideal low-pass filter with which the analog
signal can be perfectly recovered has a sinc function as its impulse response:
g(t) = sin(πt/T) / (πt/T)

x(t) = Σ_n x[n] · g(t - n/f_s)

In practice, the sinc function is truncated to a limited interval, so the infinite
sum becomes a finite one.
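The truncated-sinc reconstruction is compact in NumPy, whose np.sinc(x) computes sin(πx)/(πx); a small sketch (the function name is illustrative, and the sum is truncated to the available samples):

import numpy as np

def reconstruct(x_samples, fs, t):
    # x(t) = sum_n x[n] g(t - n/fs), with g(t) = sinc(fs * t);
    # x_samples is a 1-D array of samples, t an array of output times.
    n = np.arange(len(x_samples))
    return np.sum(x_samples[:, None] * np.sinc(fs * t[None, :] - n[:, None]), axis=0)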
52
Compressed sensing
Many signals in practice are redundant
Information rate versus Nyquist rate
Signals can be perfectly reconstructed from a small number of (non-uniform)
random samples, i.e. fewer than required by the Nyquist sampling
theorem.
Sparse representation
This concept is based on the so-called sparse representation of signals, i.e. signals can
be decomposed as a linear combination of a small number of atoms (the signal
components) selected from a dictionary (i.e. the collection of all the atoms).
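A toy sketch of the idea, assuming scikit-learn is available: a k-sparse signal is measured with far fewer random projections than its length, then recovered by orthogonal matching pursuit (all sizes are illustrative):

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n, m, k = 256, 64, 5                           # signal length, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)  # k-sparse signal
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random sensing matrix (the dictionary)
y = A @ x                                      # m << n random measurements
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(A, y)
x_hat = omp.coef_                              # recovered sparse signal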
