MM 103102

Sound and Speech Recognition
What is Sound?
Acoustics is the study of sound.
Physical - sound as a disturbance in the air

Psychophysical - sound as perceived by the ear
Sound as stimulus (physical event) & sound as a sensation.
Pressure changes (in the band from 20 Hz to 20 kHz)
Physical terms
Amplitude
Frequency
Spectrum
Copyright 2002 Michael G. Christel and Alexander G. Hauptmann, Carnegie Mellon

Sound Waves
In a free field, an ideal source of acoustical energy sends out sound of uniform intensity in all directions.
=> Sound is propagating as a spherical wave.
Intensity of sound is inversely proportional to the square of the distance (inverse square law); sound pressure is inversely proportional to the distance (inverse distance law).
6 dB decrease of sound pressure level per doubling of the distance.
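The 6 dB figure follows from the distance laws above. A minimal sketch, assuming an ideal point source in a free field; the function names are illustrative and not part of the lecture:

```python
import math

def intensity_ratio(r1: float, r2: float) -> float:
    """Ratio I(r2) / I(r1): intensity falls with the square of the distance."""
    return (r1 / r2) ** 2

def spl_change_db(r1: float, r2: float) -> float:
    """Change in sound pressure level (dB) when moving from distance r1 to r2.

    Pressure is proportional to 1/r, so the level change is 20*log10(r1/r2).
    """
    return 20.0 * math.log10(r1 / r2)

# Doubling the distance from 1 m to 2 m:
print(intensity_ratio(1.0, 2.0))          # 0.25
print(round(spl_change_db(1.0, 2.0), 2))  # -6.02
```

Each further doubling subtracts another 6 dB, so going from 1 m to 4 m lowers the level by about 12 dB.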

Sound Waves

What is Sound

How we hear
Ear connected to the brain
left brain: speech
right brain: music
Ear's sensitivity to frequency is logarithmic
Varying frequency response
Dynamic range is about 120 dB (at 3-4 kHz)
Frequency discrimination 2 Hz (at 1 kHz)
Intensity change of 1 dB can be detected.

Digitizing Sound

Digitally Sampling

Undersampling

Clipping

Quantization

Digital Sampling
Sampling is dictated by the Nyquist sampling theorem, which states how quickly samples must be taken to ensure an accurate representation of the analog signal.
$f_s \ge 2f$ or $T_s \le \frac{T}{2}$
(f is the highest frequency in the analog signal and T = 1/f its period; f_s is the sampling frequency and T_s = 1/f_s the sampling period.)
The Nyquist sampling theorem states that the sampling frequency must be at least two times the highest frequency in the original analog signal.
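What goes wrong when the sampling frequency is too low can be sketched numerically. The helper names below are illustrative assumptions; the folding rule is the standard result for a sampled pure tone:

```python
def nyquist_rate(f_max_hz: float) -> float:
    """Minimum sampling frequency for a signal whose highest frequency is f_max_hz."""
    return 2.0 * f_max_hz

def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a pure tone f_hz after sampling at fs_hz.

    Frequencies above fs/2 fold back into the band 0..fs/2.
    """
    folded = f_hz % fs_hz
    return min(folded, fs_hz - folded)

print(nyquist_rate(20_000))           # 40000.0
print(alias_frequency(3_000, 8_000))  # 3000 (below fs/2, preserved)
print(alias_frequency(5_000, 8_000))  # 3000 (above fs/2, aliased)
```

A 5 kHz tone sampled at 8 kHz is indistinguishable from a 3 kHz tone, which is the undersampling artifact the theorem guards against.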

Dithering a Sampled Signal
Dither: an analog signal added to the input signal to remove the artifacts of quantization error.
Dither causes the audio signal to always move between quantization levels.
Otherwise, a low-level signal would be encoded as a square wave => granulation noise.
Dithered, the A/D converter output is signal + noise => perceptually preferred, since noise is better tolerated than distortion.
Amplitude of dither signal:
high dither amplitudes more easily remove quantization artifacts
too much dither decreases the signal-to-noise ratio
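The effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not the lecture's method: the quantizer step is 1.0 (one LSB), the dither is triangular-PDF noise spanning +/-1 LSB (one common choice), and the test signal is a sine with an amplitude of 0.4 LSB.

```python
import math
import random

def quantize(x: float) -> int:
    """Round to the nearest quantization level (step size = 1 LSB)."""
    return math.floor(x + 0.5)

def tpdf_dither(rng: random.Random) -> float:
    """Triangular-PDF dither in [-1, 1] LSB: the sum of two uniform draws."""
    return rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)

rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [0.4 * math.sin(2 * math.pi * i / 64) for i in range(64)]

# Without dither the low-level signal never crosses a decision threshold,
# so every sample quantizes to the same level and the signal is lost.
plain = [quantize(s) for s in signal]

# With dither each sample toggles between levels; averaging many dithered
# conversions shows that the signal survives as "signal + noise".
trials = 4000
averaged = [
    sum(quantize(s + tpdf_dither(rng)) for _ in range(trials)) / trials
    for s in signal
]
max_error = max(abs(a - s) for a, s in zip(averaged, signal))

print(set(plain))        # {0}
print(max_error < 0.05)  # True
```

The averaging stands in for what the ear does over time: the dithered output is noisy sample by sample, but it is no longer a distorted copy of the input.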

Common Sound Sampling Parameters
Common Sampling Rates
8 kHz (Phone) or 8.012820513 kHz (Phone, NeXT)
11.025 kHz (1/4 CD std)
16 kHz (G.722 std)
22.05 kHz (1/2 CD std)
44.1 kHz (CD, DAT)
48 kHz (DAT)
Bits per Sample
8 or 16
Number of Channels
mono/stereo/quad/ etc.

Audio Data Rates
Quality                  Format (examples)             Transfer Rate  Disk Space (1 hour)  Disk Space (100,000 hours)
Netcasting               RealAudio                     20 Kbit/s      8.8 MByte            0.9 TByte
Preview                  RealAudio                     80 Kbit/s      35.2 MByte           3.5 TByte
Preview                  MPEG Layer 3 (MP3)            192 Kbit/s     84.4 MByte           8.4 TByte
Broadcasting or Editing  MPEG Layer 2                  384 Kbit/s     168.8 MByte          16.9 TByte
Archive                  Waveform (uncompressed) PCM   1538 Kbit/s    675.9 MByte          67.6 TByte
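The one-hour figures can be reproduced from the transfer rates. The sketch below assumes binary prefixes (1 Kbit = 1024 bits, 1 MByte = 1024 x 1024 bytes), an assumption that matches the per-hour column of the table; the function name is illustrative.

```python
def mbytes_per_hour(kbit_per_s: float) -> float:
    """Disk space in MByte for one hour of audio at the given transfer rate.

    Assumes 1 Kbit = 1024 bits and 1 MByte = 1024 * 1024 bytes.
    """
    bits = kbit_per_s * 1024 * 3600
    return bits / 8 / (1024 * 1024)

for rate in (20, 80, 192, 1538):
    print(rate, round(mbytes_per_hour(rate), 1))
# 20 8.8
# 80 35.2
# 192 84.4
# 1538 675.9
```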
Space/Storage Requirements
1 Minute of Sound (sizes in bytes)
Sampling Rate  Mono 8 bit  Mono 16 bit  Stereo 8 bit  Stereo 16 bit
44.1k          2646k       5292k        5292k         10584k
22.05k         1323k       2646k        2646k         5292k
11.025k        661.5k      1323k        1323k         2646k
8k             480k        960k         960k          1920k
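Every entry in the table is the product of sampling rate, sample size, channel count, and 60 seconds. A minimal sketch (the function name is illustrative):

```python
def bytes_per_minute(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM storage for one minute of sound, in bytes."""
    return sample_rate_hz * 60 * (bits_per_sample // 8) * channels

print(bytes_per_minute(44_100, 16, 2))  # 10584000 (10584k)
print(bytes_per_minute(11_025, 8, 1))   # 661500 (661.5k)
print(bytes_per_minute(8_000, 16, 1))   # 960000 (960k)
```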

Many (!) Sound File Formats
Mulaw (Sun, NeXT) .au
RIFF (Resource Interchange File Format): MS WAV (.wav) and AVI (.avi)

MPEG Audio Layer (MPEG) .mpa .mp3
AIFC (Apple, SGI) .aiff .aif
HCOM (Mac) .hcom
SND (Sun, NeXT) .snd
VOC (Soundblaster card proprietary standard) .voc
AND MANY OTHERS!

What's in a Sound File Format
Header Information
Magic Cookie
Sampling Rate
Bits/Sample
Channels
Byte Order (Endianness)
Compression type
Data

Example File Format (NIST SPHERE)
NIST_1A
1024
sample_rate -i 16000
channel_count -i 1
sample_n_bytes -i 2
sample_byte_format -s2 10
sample_sig_bits -i 16
sample_count -i 594400
sample_coding -s3 pcm
sample_checksum -i 20129
end_head
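A header like this is easy to read programmatically. The parser below is a sketch that handles only the text fields shown in the example (type flags -i for integers, -r for reals, -sN for strings of length N); the function name is an assumption, and a real reader would also have to skip to the byte offset given on the second line before reading samples.

```python
def parse_sphere_header(text: str) -> dict:
    """Parse the text part of a NIST SPHERE header into a dictionary."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, type_flag, value = line.split(None, 2)
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        else:  # "-sN": a string of N characters
            fields[name] = value
    return fields

example = """NIST_1A
   1024
sample_rate -i 16000
channel_count -i 1
sample_n_bytes -i 2
sample_byte_format -s2 10
sample_sig_bits -i 16
sample_count -i 594400
sample_coding -s3 pcm
sample_checksum -i 20129
end_head
"""

header = parse_sphere_header(example)
duration_s = header["sample_count"] / header["sample_rate"]
print(header["sample_rate"], header["sample_coding"], duration_s)
```

For the example above, the sample count and sample rate imply a recording of about 37 seconds.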

WAV file format (Microsoft) RIFF
A collection of data chunks.
Each chunk has a 32-bit ID, followed by a 32-bit chunk length, followed by the chunk data.
Offset  Field
0x00    chunk id 'RIFF'
0x04    chunk size (32-bits)
0x08    wave chunk id 'WAVE'
0x0C    format chunk id 'fmt '
0x10    format chunk size (32-bits)
0x14    format tag (currently pcm)
0x16    number of channels 1=mono, 2=stereo
0x18    sample rate in hz
0x1C    average bytes per second
0x20    number of bytes per sample:
        1 = 8-bit mono
        2 = 8-bit stereo or 16-bit mono
        4 = 16-bit stereo
0x22    number of bits in a sample
0x24    data chunk id 'data'
0x28    length of data chunk (32-bits)
0x2C    Sample data
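The layout above maps directly onto a fixed 44-byte structure. The sketch below uses Python's standard `struct` module and assumes exactly this canonical layout (a 16-byte PCM format chunk immediately followed by the data chunk); real files may contain extra chunks that this reader does not handle. The function and key names are illustrative.

```python
import struct

# Little-endian fields in the order of the offset table (0x00 .. 0x2B).
WAV_HEADER = struct.Struct("<4sI4s4sIHHIIHH4sI")

def parse_wav_header(data: bytes) -> dict:
    """Decode the canonical 44-byte RIFF/WAVE header."""
    (riff_id, riff_size, wave_id, fmt_id, fmt_size, format_tag, channels,
     sample_rate, bytes_per_second, block_align, bits_per_sample,
     data_id, data_size) = WAV_HEADER.unpack(data[:WAV_HEADER.size])
    if riff_id != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    if fmt_id != b"fmt " or data_id != b"data":
        raise ValueError("unexpected chunk layout")
    return {
        "format_tag": format_tag,        # 1 = PCM
        "channels": channels,
        "sample_rate": sample_rate,
        "bytes_per_second": bytes_per_second,
        "block_align": block_align,      # bytes per sample across all channels
        "bits_per_sample": bits_per_sample,
        "data_size": data_size,
    }

# Build a header for 1 second of 16 kHz, 16-bit mono audio and read it back.
demo = WAV_HEADER.pack(b"RIFF", 36 + 32000, b"WAVE", b"fmt ", 16, 1, 1,
                       16000, 32000, 2, 16, b"data", 32000)
info = parse_wav_header(demo)
print(info["sample_rate"], info["channels"], info["bits_per_sample"])  # 16000 1 16
```

For files of unknown provenance, the standard library's `wave` module walks the chunk list properly and is the safer choice.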

Digital Audio Today
Analog elements in the audio chain are replaced with digital elements.
16-bit wordlength, 32/44.1/48 kHz sampling rates.
Mostly linear signal processing.
Wide range of digital formats and storage media.
Rapid development of technology
=> better SNR, phase and linearity.
Rapid increase of signal processing power
=> possibility to implement new, complex features.
Soon: Digital radio (satellite), HDTV

Digital (CD) vs Analog (LP or cassette tape)
Information is stored digitally.
The length of its data pits represents a series of 1s and 0s.
Both audio channels are stored along the same pit track.
Data is read using a laser beam.
Information density about 100 times greater than in LP.
CD player can correct disc errors.

Benefits of Digital Representation (CD)
Robust
No degradation from repeated playings because data is read by the laser beam.
Error correction
Transport's performance does not affect the quality of audio reproduction.
Digital circuitry more immune to aging and temperature problems
Data conversion is independent of variations in disc rotational speed, hence wow and flutter are negligible.
SNR over 90 dB.
Subcode for display, control and user information

CD Format
Sampling
44.1 kHz => 10 % margin with respect to the Nyquist frequency (audible frequencies below 20 kHz)
16-bit linear
=> theoretical SNR about 98 dB (for sinusoidal signal with maximum amplitude)
audio bit rate 1.41 Mbit/s (44.1 kHz * 16 bits * 2 channels)
Cross Interleaved Reed-Solomon Code (CIRC) for error correction
Subcode
Original Specifications
Playing time max. 74.7 min
Disc diameter 120 mm
Disc thickness 1.2 mm
One sided medium, rotates clockwise
Signal is recorded from inside to outside
Pit is about 0.5 µm wide
Pit edge is 1 and all other areas, whether inside or outside a pit, are 0s
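The sampling figures quoted for the CD format can be checked with a few lines of arithmetic. This is a sketch; the function names are illustrative, and the SNR formula is the standard result for a full-scale sine wave against uniform quantization noise (roughly 6.02 dB per bit plus 1.76 dB).

```python
import math

def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw audio bit rate in bit/s."""
    return sample_rate_hz * bits_per_sample * channels

def ideal_snr_db(bits_per_sample: int) -> float:
    """Theoretical SNR of linear PCM for a full-scale sinusoid."""
    return 20 * math.log10(2 ** bits_per_sample) + 10 * math.log10(1.5)

def nyquist_margin(sample_rate_hz: float, f_max_hz: float) -> float:
    """Fraction by which the sampling rate exceeds twice the highest frequency."""
    return sample_rate_hz / (2 * f_max_hz) - 1

print(pcm_bit_rate(44_100, 16, 2))                # 1411200 bit/s, about 1.41 Mbit/s
print(round(ideal_snr_db(16), 1))                 # 98.1
print(round(nyquist_margin(44_100, 20_000), 4))   # 0.1025, about 10 %
```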

Speech Recognition in Brief

Acoustic Origins
A wave for the words "speech lab" looks like:
[Figure: waveform segmented into s p ee ch l a b, with a close-up of the "l" to "a" transition]
Graphs from Simon Arnfield's web tutorial on speech, Sheffield:
http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/

Speech Recognition Knowledge Sources
Speech recognition draws on three knowledge sources:
Acoustic Modeling: describes the sounds that make up speech
Lexicon: describes which sequences of speech sounds make up valid words
Language Model: describes the likelihood of various sequences of words being spoken
Speech Recognition
THE FUNDAMENTAL EQUATION
O is an acoustical observation
W is the word we are trying to recognize
Maximize: $\hat{W} = \arg\max_{W} P(W \mid O)$
P(W|O) is unknown, so by Bayes' rule:
$P(W \mid O) = \frac{P(O \mid W)\, P(W)}{P(O)}$
P(O) does not depend on W, so it is enough to maximize P(O|W) P(W).
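The decision rule can be sketched with a toy example. All probabilities and candidate phrases below are hypothetical values made up for illustration; a real recognizer obtains P(O|W) from its acoustic model and lexicon and P(W) from its language model.

```python
import math

# Hypothetical scores for three candidate transcriptions of one utterance.
acoustic_likelihood = {   # P(O | W), invented numbers
    "recognize speech": 0.0020,
    "wreck a nice beach": 0.0025,
    "recognized peach": 0.0015,
}
prior = {                 # P(W), invented numbers
    "recognize speech": 0.0100,
    "wreck a nice beach": 0.0001,
    "recognized peach": 0.0005,
}

def decode(likelihood: dict, prior: dict) -> str:
    """Return argmax_W P(O|W) P(W), computed in the log domain for stability."""
    return max(prior, key=lambda w: math.log(likelihood[w]) + math.log(prior[w]))

best = decode(acoustic_likelihood, prior)
acoustic_only = max(acoustic_likelihood, key=acoustic_likelihood.get)
print(best)           # recognize speech
print(acoustic_only)  # wreck a nice beach
```

The acoustically best candidate loses once the prior is taken into account, which is why P(W) matters in the product.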
Mechanism of state-of-the-art speech recognizers
Speech in -> Acoustic analysis -> x1 ... xT -> Recognition -> Recognized sentence
Recognition: maximize P(x1 ... xT | w1 ... wk) P(w1 ... wk)
Pronunciation lexicon: P(x1 ... xT | w1 ... wk)
Language model: P(w1 ... wk)
Acoustic Sampling
10 ms frame (ms = millisecond = 1/1000 second)
~25 ms window around frame to smooth signal processing
[Figure: overlapping 25 ms windows advancing in 10 ms steps]
Result: acoustic feature vectors a1, a2, a3, ...
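The framing step can be sketched as follows, assuming a 16 kHz sample rate and plain rectangular slices; practical front ends usually also apply a tapering window to each frame before analysis. The function name is illustrative.

```python
def frame_signal(samples, sample_rate_hz, window_ms=25, step_ms=10):
    """Cut a signal into overlapping windows: window_ms long, one every step_ms."""
    window = sample_rate_hz * window_ms // 1000   # samples per window
    step = sample_rate_hz * step_ms // 1000       # samples between window starts
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, step)
    ]

one_second = list(range(16_000))   # stand-in for 1 s of audio at 16 kHz
frames = frame_signal(one_second, 16_000)
print(len(frames), len(frames[0]), frames[1][0])   # 98 400 160
```

One second of speech therefore yields roughly 100 feature vectors, each computed from 400 samples that overlap heavily with their neighbors.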

Spectral Analysis
Frequency gives pitch; amplitude gives volume
sampling at ~8 kHz phone, ~16 kHz mic (kHz = 1000 cycles/sec)
[Figure: waveform of "speech lab" (s p ee ch l a b); vertical axis: amplitude]
Fourier transform of wave yields a spectrogram
darkness indicates energy at each frequency
hundreds to thousands of frequency samples
[Figure: spectrogram; vertical axis: frequency]
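One column of a spectrogram is the magnitude spectrum of a single frame. The sketch below uses a direct discrete Fourier transform for clarity (real systems use an FFT) and a synthetic 1 kHz tone sampled at 8 kHz as an assumed test input; the function names are illustrative.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. N/2 for one frame of samples."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def peak_frequency(frame, sample_rate_hz):
    """Frequency (Hz) of the strongest DFT bin in the frame."""
    mags = magnitude_spectrum(frame)
    k = max(range(len(mags)), key=mags.__getitem__)
    return k * sample_rate_hz / len(frame)

fs = 8_000
tone = [math.sin(2 * math.pi * 1_000 * t / fs) for t in range(64)]
print(peak_frequency(tone, fs))   # 1000.0
```

Stacking such spectra for successive frames, and drawing larger magnitudes darker, gives the spectrogram described above.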

Features for Speech Recognition
Coding scheme (typical)
10 millisecond step size; 25 millisecond window
~39 coefficients each step:
mel-scale cepstra derived from frequency representation
delta and delta-delta coefficients
power

The Markov Assumption
Only immediately preceding history matters
Chain rule:
$P(X_1, X_2, X_3, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, X_2, X_3, \ldots, X_{i-1})$
Markov assumption:
$P(X_i \mid X_1, X_2, X_3, \ldots, X_{i-1}) = P(X_i \mid X_{i-1})$
so that
$P(X_1, X_2, X_3, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i-1})$
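Under the Markov assumption the probability of a whole sequence is an initial probability times a chain of one-step transition probabilities. A minimal sketch with a hypothetical two-state weather model; the states and numbers are invented for illustration, and the initial distribution plays the role of the i = 1 term:

```python
def sequence_probability(sequence, initial, transition):
    """P(X_1, ..., X_n) = P(X_1) * product of P(X_i | X_{i-1}) for i = 2 .. n."""
    probability = initial[sequence[0]]
    for previous, current in zip(sequence, sequence[1:]):
        probability *= transition[previous][current]
    return probability

# Hypothetical model: each row of `transition` sums to 1.
initial = {"rain": 0.5, "sun": 0.5}
transition = {
    "rain": {"rain": 0.7, "sun": 0.3},
    "sun": {"rain": 0.2, "sun": 0.8},
}

p = sequence_probability(["sun", "sun", "rain"], initial, transition)
print(round(p, 4))   # 0.08
```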

Hidden Markov Models
In speech recognition the number of states is very large; we can simplify the problem by factoring it into two components, a transition probability and an output probability:
$p(s_2 \mid s_1)\; q(y_1 \mid s_2, s_1)$
[Figure: chain of states S1 -> S2 -> S3]
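One common way to search such a model for the most likely state sequence is the Viterbi algorithm; the slides do not name the search method, so treat this as an illustrative sketch. It simplifies the output probability to depend only on the destination state, and every state name and number below is hypothetical.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] holds the best score and path among all paths ending in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        updated = {}
        for s in states:
            # Pick the predecessor that maximizes score * transition probability.
            score, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            updated[s] = (score * emit_p[s][obs], path + [s])
        best = updated
    return max(best.values(), key=lambda item: item[0])

states = ["S1", "S2"]
start_p = {"S1": 1.0, "S2": 0.0}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.0, "S2": 1.0}}
emit_p = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

probability, path = viterbi(["a", "a", "b"], states, start_p, trans_p, emit_p)
print(path)   # ['S1', 'S1', 'S2']
```

The work grows linearly with the number of observations because only the best path into each state is kept at every step.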

Hidden Markov Model

Searching the Speech Signal Trellis

Lexicon - links words to phones in acoustic model
Aaron EH R AX N
Aaron(2) AE R AX N
abandon AX B AE N D AX N
abandoned AX B AE N D AX N DD
abandoning AX B AE N D AX N IX NG
abandonment AX B AE N D AX N M AX N TD
abated AX B EY DX IX DD
abatement AX B EY TD M AX N TD
abbey AE B IY
Abbott AE B AX TD
Abboud AA B UW DD
abby AE B IY
abducted AE BD D AH KD T IX DD
Abdul AE BD D UW L
When Language Modeling Goes Wrong

When P(w) is incorrect

Language Modeling

Language Models
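A language model is a probability distribution over word sequences
$p(W) = p(w_1, \ldots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_0, \ldots, w_{i-1})$
N-gram models keep only the most recent words of context: n = 3, 4, 5 [lose the rest of the context]
Hard to estimate large contexts: with a 64,000-word vocabulary there are 64,000^3 (about 2.6 x 10^14) possible trigrams
Need large collections of text
Smoothing P(w_i | w_{i-2}, w_{i-1}) is necessary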
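A trigram model with the simplest form of smoothing can be sketched in a few lines. The training sentences are hypothetical, and add-k smoothing is used here only because it is short; it is one of several smoothing schemes and not necessarily the one used in practice.

```python
from collections import Counter

def train_trigrams(sentences):
    """Count trigrams and their two-word contexts, padding with <s> and </s>."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        vocab.update(padded)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts, vocab

def trigram_prob(trigrams, contexts, vocab, a, b, c, k=1.0):
    """Add-k smoothed estimate of P(c | a, b); unseen trigrams get a small share."""
    return (trigrams[(a, b, c)] + k) / (contexts[(a, b)] + k * len(vocab))

corpus = [
    "recognize speech".split(),
    "recognize speech today".split(),
    "wreck a nice beach".split(),
]
trigrams, contexts, vocab = train_trigrams(corpus)

seen = trigram_prob(trigrams, contexts, vocab, "<s>", "recognize", "speech")
unseen = trigram_prob(trigrams, contexts, vocab, "<s>", "recognize", "beach")
print(seen > unseen > 0)   # True
```

Without smoothing the unseen trigram would get probability zero, and any sentence containing it could never be recognized.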

Creating models for recognition
Speech data -> Transcribe* -> Train -> Acoustic models
Text data -> Train -> Language models

Continual Progress in Speech Recognition
Increasingly Difficult Tasks, Steadily Declining Error Rates
[Figure: word error rate (%), axis from 1 to 100, by year from 1988 to 1998, for read speech (1000-word, 5000-word, and 20,000-word vocabularies), broadcast news, and conversational speech (English and non-English). Annotations: standard microphone, varied microphones, noisy environment, unlimited vocabulary.]
All results are speaker-independent.
Source: NSA/Wayne/Doddington
References
Speech Recognition resource links can be found at:
http://svr-www.eng.cam.ac.uk/comp.speech/Section2/speechlinks.html
An excellent tutorial on speech recognition by Wayne Ward:
http://www-2.cs.cmu.edu/~roni/11761-s01/Presentations/whw%20hmm's%20in%20speech%20recognition%203.0.pdf

Sound + Speech Recognition
That's all for today
