
Sound and Speech Recognition

What is Sound?

Acoustics is the study of sound.

Physical - sound as a disturbance in the air
Psychophysical - sound as perceived by the ear
Sound as a stimulus (physical event) and sound as a sensation.
Pressure changes (in the band from 20 Hz to 20 kHz)

Physical terms
Amplitude
Frequency
Spectrum

Copyright 2002 Michael G. Christel and Alexander G. Hauptmann 2 CarnegieMellon


Sound Waves

In a free field, an ideal source of acoustical energy sends out sound of uniform intensity in all directions.
=> Sound propagates as a spherical wave.
Intensity of sound is inversely proportional to the square of the distance (inverse square law); sound pressure is inversely proportional to the distance (inverse distance law).
Sound pressure level decreases by 6 dB per doubling of the distance.
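The 6 dB figure follows directly from the inverse square law. A minimal sketch, assuming an ideal point source in a free field; the function name `level_change_db` is illustrative and not from the slides:

```python
import math

def level_change_db(r1: float, r2: float) -> float:
    """Change in sound pressure level (dB) when moving from distance r1 to r2."""
    # Intensity ~ 1/r^2, so the level change is 10*log10((r1/r2)^2) = 20*log10(r1/r2).
    return 20.0 * math.log10(r1 / r2)

# Doubling the distance lowers the level by about 6 dB.
print(round(level_change_db(1.0, 2.0), 2))  # -6.02
```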



Sound Waves



What is Sound



How we hear
Ear connected to the brain
left brain: speech
right brain: music
Ear's sensitivity to frequency is logarithmic
Varying frequency response
Dynamic range is about 120 dB (at 3-4 kHz)
Frequency discrimination 2 Hz (at 1 kHz)
Intensity change of 1 dB can be detected.



Digitizing Sound



Digitally Sampling



Undersampling



Clipping



Quantization



Digital Sampling

Sampling is dictated by the Nyquist sampling theorem, which states how quickly samples must be taken to ensure an accurate representation of the analog signal:

$f_s \ge 2f$ or $T_s \le \frac{T}{2}$

Here $f$ is the highest frequency in the original analog signal and $T = 1/f$ is its period. In words, the sampling frequency must be at least twice the highest frequency in the original analog signal.
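When the condition is violated (undersampling), a tone shows up at a lower, "folded" frequency. A minimal sketch of that folding for a single pure tone; `alias_frequency` is an illustrative name, and the 8 kHz rate is just the telephone rate used as an example:

```python
def alias_frequency(f: float, fs: float) -> float:
    """Apparent frequency (Hz) of a pure tone f after sampling at rate fs."""
    f_mod = f % fs
    # Frequencies above fs/2 fold back into the range 0..fs/2.
    return min(f_mod, fs - f_mod)

print(alias_frequency(3000, 8000))  # 3000: below fs/2, preserved
print(alias_frequency(5000, 8000))  # 3000: above fs/2, aliased
```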



Dithering a Sampled Signal
Dither is an analog (noise) signal added to the input signal to remove the artifacts of quantization error.
Dither causes the audio signal to always move between quantization levels.
Otherwise, a low-level signal would be encoded as a square wave => granulation noise.
Dithered, the A/D converter output is signal + noise => perceptually preferred, since noise is better tolerated than distortion.
Amplitude of dither signal:
high dither amplitudes more easily remove quantization artifacts
too much dither decreases the signal-to-noise ratio
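The effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not the slides' method: the quantizer step is 1 LSB, the input is a sine of amplitude 0.4 LSB, and the dither is uniform noise of 1 LSB peak-to-peak (one common choice). Without dither every sample rounds to zero and the signal is lost; with dither, the average output at each phase tracks the original sine.

```python
import math
import random

def quantize(x: float) -> int:
    """Round to the nearest quantization level (step size = 1 LSB)."""
    return math.floor(x + 0.5)

def simulate(amplitude=0.4, period=50, n_periods=200, seed=0):
    """Return per-phase averages of the quantized sine, without and with dither."""
    rng = random.Random(seed)
    plain_sum = [0.0] * period
    dithered_sum = [0.0] * period
    for n in range(period * n_periods):
        phase = n % period
        x = amplitude * math.sin(2 * math.pi * phase / period)
        plain_sum[phase] += quantize(x)
        # Assumed dither: uniform noise in [-0.5, 0.5] LSB added before quantizing.
        dithered_sum[phase] += quantize(x + rng.uniform(-0.5, 0.5))
    plain = [s / n_periods for s in plain_sum]
    dithered = [s / n_periods for s in dithered_sum]
    return plain, dithered

plain, dithered = simulate()
print(max(abs(p) for p in plain))  # 0.0: the undithered signal vanished
print(max(dithered))               # close to the 0.4 LSB amplitude
```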



Common Sound Sampling Parameters
Common Sampling Rates
8 kHz (Phone) or 8.012820513 kHz (Phone, NeXT)
11.025 kHz (1/4 CD std)
16 kHz (G.722 std)
22.05 kHz (1/2 CD std)
44.1 kHz (CD, DAT)
48 kHz (DAT)
Bits per Sample
8 or 16
Number of Channels
mono/stereo/quad/ etc.



Audio Data Rates

Quality                  Format (examples)    Transfer rate  Disk space, 1 hour  Disk space, 100,000 hours
Netcasting               RealAudio            20 Kbit/s      8.8 MByte           0.9 TByte
Preview                  RealAudio            80 Kbit/s      35.2 MByte          3.5 TByte
Preview                  MPEG Layer 3 (MP3)   192 Kbit/s     84.4 MByte          8.4 TByte
Broadcasting or Editing  MPEG Layer 2         384 Kbit/s     168.8 MByte         16.9 TByte
Archive (uncompressed)   Waveform PCM         1538 Kbit/s    675.9 MByte         67.6 TByte

Space/Storage Requirements

1 Minute of Sound (sizes in bytes, k = 1000)

Sampling rate  Mono 8 bit  Mono 16 bit  Stereo 8 bit  Stereo 16 bit
44.1k          2646k       5292k        5292k         10584k
22.05k         1323k       2646k        2646k         5292k
11.025k        661.5k      1323k        1323k         2646k
8k             480k        960k         960k          1920k
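Every entry in the table is sampling rate x bytes per sample x channels x 60 seconds. A short sketch that reproduces the figures; the function name is illustrative:

```python
def bytes_per_minute(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Uncompressed PCM storage needed for one minute of sound, in bytes."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * 60

for rate in (44100, 22050, 11025, 8000):
    row = [bytes_per_minute(rate, bits, ch) / 1000
           for ch in (1, 2) for bits in (8, 16)]
    # Columns: mono 8 bit, mono 16 bit, stereo 8 bit, stereo 16 bit (in k bytes)
    print(rate, row)
```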



Many (!) Sound File Formats
Mu-law (Sun, NeXT) .au
RIFF (Resource Interchange File Format): MS .wav and .avi
MPEG Audio Layer (MPEG) .mpa .mp3
AIFC (Apple, SGI) .aiff .aif
HCOM (Mac) .hcom
SND (Sun, NeXT) .snd
VOC (Soundblaster card proprietary standard) .voc
AND MANY OTHERS!



What's in a Sound File Format

Header Information

Magic Cookie
Sampling Rate
Bits/Sample
Channels
Byte order (endianness)
Compression type
Data



Example File Format (NIST SPHERE)

NIST_1A
1024
sample_rate -i 16000
channel_count -i 1
sample_n_bytes -i 2
sample_byte_format -s2 10
sample_sig_bits -i 16
sample_count -i 594400
sample_coding -s3 pcm
sample_checksum -i 20129
end_head
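A header like this is plain text and easy to read programmatically. The sketch below parses only the textual fields shown above, assuming `-i` marks an integer, `-r` a real number, and `-sN` a string of N characters; the function name and the `header_bytes` key are illustrative, and a real reader would also have to skip to the byte offset given on the second line before reading samples.

```python
def parse_sphere_header(text: str) -> dict:
    """Parse the text portion of a NIST SPHERE header into a dict."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_bytes": int(lines[1])}  # total header size in bytes
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line:
            continue
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return fields

header = """NIST_1A
1024
sample_rate -i 16000
channel_count -i 1
sample_n_bytes -i 2
sample_byte_format -s2 10
sample_sig_bits -i 16
sample_count -i 594400
sample_coding -s3 pcm
sample_checksum -i 20129
end_head
"""
info = parse_sphere_header(header)
print(info["sample_count"] / info["sample_rate"], "seconds")  # 37.15 seconds
```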



WAV file format (Microsoft) RIFF

A collection of data chunks.
Each chunk has a 32-bit ID, followed by a 32-bit chunk length, followed by the chunk data.

Offset  Contents
0x00    chunk id 'RIFF'
0x04    chunk size (32 bits)
0x08    wave chunk id 'WAVE'
0x0C    format chunk id 'fmt '
0x10    format chunk size (32 bits)
0x14    format tag (currently PCM)
0x16    number of channels: 1 = mono, 2 = stereo
0x18    sample rate in Hz
0x1C    average bytes per second
0x20    number of bytes per sample (all channels): 1 = 8-bit mono; 2 = 8-bit stereo or 16-bit mono; 4 = 16-bit stereo
0x22    number of bits in a sample
0x24    data chunk id 'data'
0x28    length of data chunk (32 bits)
0x2C    sample data
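The layout above maps directly onto a fixed-size binary record. A minimal sketch using Python's `struct` module, assuming the simplest case: a 44-byte header with little-endian fields and the 'data' chunk immediately after a 16-byte 'fmt ' chunk. Real files may contain extra chunks, which this sketch does not handle; the function and key names are illustrative.

```python
import struct

# Little-endian layout of the 44-byte header described in the table above.
WAV_HEADER = "<4sI4s4sIHHIIHH4sI"

def parse_wav_header(header: bytes) -> dict:
    """Decode a canonical 44-byte PCM WAV header."""
    (riff_id, _riff_size, wave_id, fmt_id, _fmt_size, format_tag, channels,
     sample_rate, byte_rate, block_align, bits_per_sample,
     data_id, data_size) = struct.unpack(WAV_HEADER, header[:44])
    if (riff_id, wave_id, fmt_id, data_id) != (b"RIFF", b"WAVE", b"fmt ", b"data"):
        raise ValueError("not a canonical WAV header")
    return {
        "format_tag": format_tag,
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits_per_sample,
        "data_size": data_size,
    }

# Build a header for 16-bit stereo at 44.1 kHz with no sample data, then parse it.
example = struct.pack(WAV_HEADER, b"RIFF", 36, b"WAVE", b"fmt ", 16, 1, 2,
                      44100, 176400, 4, 16, b"data", 0)
print(parse_wav_header(example))
```

For everyday use, the standard-library `wave` module reads these fields without manual unpacking.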



Digital Audio Today

Analog elements in the audio chain are replaced with digital elements.
16-bit word length, 32/44.1/48 kHz sampling rates.
Mostly linear signal processing.
Wide range of digital formats and storage media.
Rapid development of technology => better SNR, phase and linearity.
Rapid increase of signal processing power => possibility to implement new, complex features.
Soon: Digital radio (satellite), HDTV



Digital (CD) vs Analog (LP or cassette tape)

Information is stored digitally.
The length of its data pits represents a series of 1s and 0s.
Both audio channels are stored along the same pit track.
Data is read using a laser beam.
Information density is about 100 times greater than in an LP.
CD player can correct disc errors.



Benefits of Digital Representation (CD)
Robust
No degradation from repeated playings because data is read by the laser beam.
Error correction
Transport's performance does not affect the quality of audio reproduction.
Digital circuitry more immune to aging and temperature problems
Data conversion is independent of variations in disc rotational speed, hence wow and flutter are negligible.
SNR over 90 dB.
Subcode for display, control and user information



CD Format
Sampling
44.1 kHz => 10 % margin with respect to the Nyquist frequency (audible frequencies below 20 kHz)
16-bit linear
=> theoretical SNR about 98 dB (for sinusoidal signal with maximum amplitude)
audio bit rate 1.41 Mbit/s (44.1 kHz * 16 bits * 2 channels)
Cross Interleaved Reed-Solomon Code (CIRC) for error correction
Subcode
Original Specifications
Playing time max. 74.7 min
Disc diameter 120 mm
Disc thickness 1.2 mm
One sided medium, rotates clockwise
Signal is recorded from inside to outside
Pit is about 0.5 µm wide
A pit edge is a 1; all other areas, whether inside or outside a pit, are 0s
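The sampling figures on this slide can be checked with a few lines of arithmetic. A sketch, assuming the standard result for an ideal N-bit quantizer driven by a full-scale sine (SNR of about 6.02 N + 1.76 dB); the function names are illustrative:

```python
import math

def theoretical_snr_db(bits: int) -> float:
    """SNR of an ideal quantizer for a full-scale sinusoid."""
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

def audio_bit_rate(sample_rate_hz: int, bits: int, channels: int) -> int:
    """Raw audio bit rate in bit/s."""
    return sample_rate_hz * bits * channels

def nyquist_margin(sample_rate_hz: float, highest_audible_hz: float = 20_000) -> float:
    """Fraction by which the Nyquist frequency exceeds the highest audible frequency."""
    return (sample_rate_hz / 2) / highest_audible_hz - 1

print(round(theoretical_snr_db(16), 2))  # 98.09 (dB)
print(audio_bit_rate(44_100, 16, 2))     # 1411200 (bit/s, about 1.41 Mbit/s)
print(round(nyquist_margin(44_100), 4))  # 0.1025 (about 10 %)
```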



Speech Recognition in Brief



Acoustic Origins

A wave for the words "speech lab" looks like:

[Figure: waveform segmented as s p ee ch l a b, with a close-up of the "l" to "a" transition]
Graphs from Simon Arnfield's web tutorial on speech, Sheffield:
http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/



Speech Recognition Knowledge Sources

Speech recognition draws on three knowledge sources:
Acoustic Modeling: describes the sounds that make up speech
Lexicon: describes which sequences of speech sounds make up valid words
Language Model: describes the likelihood of various sequences of words being spoken
Speech Recognition
THE FUNDAMENTAL EQUATION
O is an acoustical observation
W is a word (sequence) we are trying to recognize

Maximize: $\hat{W} = \arg\max_W P(W \mid O)$

P(W|O) is unknown, so by Bayes' rule:

$P(W \mid O) = \frac{P(O \mid W)\, P(W)}{P(O)}$
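Because P(O) is the same for every candidate W, the maximization only needs the product P(O|W) P(W). A toy sketch of that decision rule; the two candidate phrases and all probabilities are invented for illustration and do not come from the slides:

```python
def recognize(acoustic_likelihood: dict, prior: dict) -> str:
    """Return the candidate W maximizing P(O|W) * P(W); P(O) is constant and ignored."""
    return max(prior, key=lambda w: acoustic_likelihood[w] * prior[w])

# Hypothetical scores for one observation O.
acoustic_likelihood = {"recognize speech": 0.4, "wreck a nice beach": 0.6}  # P(O|W)
prior = {"recognize speech": 0.009, "wreck a nice beach": 0.001}            # P(W)

print(recognize(acoustic_likelihood, prior))  # recognize speech
```

Here the acoustically better-matching candidate loses because the language model considers it far less likely.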
Mechanism of state-of-the-art speech recognizers

Speech in -> Acoustic analysis -> $x_1 \dots x_T$
Recognition: maximize $P(x_1 \dots x_T \mid w_1 \dots w_k)\, P(w_1 \dots w_k)$ -> Recognized sentence
$P(x_1 \dots x_T \mid w_1 \dots w_k)$: supplied via the pronunciation lexicon
$P(w_1 \dots w_k)$: supplied by the language model

Acoustic Sampling
10 ms frame (ms = millisecond = 1/1000 second)
~25 ms window around each frame to smooth signal processing
[Figure: overlapping 25 ms windows advancing in 10 ms steps]
Result: a sequence of acoustic feature vectors a1, a2, a3, ...
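The framing step can be sketched as a sliding window over the sample array. This is a simplified illustration: each window here starts at its frame boundary rather than being centered around it, partial windows at the end are dropped, and the function name is an assumption.

```python
def frame_signal(samples, sample_rate, step_ms=10, window_ms=25):
    """Split samples into overlapping windows: window_ms long, advancing by step_ms."""
    step = sample_rate * step_ms // 1000      # samples per 10 ms step
    window = sample_rate * window_ms // 1000  # samples per 25 ms window
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, step)]

# One second of audio at 16 kHz: 160-sample steps, 400-sample windows.
frames = frame_signal(list(range(16000)), 16000)
print(len(frames), len(frames[0]))  # 98 400
```

Each window would then be turned into one acoustic feature vector.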



Spectral Analysis
Frequency gives pitch; amplitude gives volume
Sampling at ~8 kHz for telephone, ~16 kHz for microphone (kHz = 1000 cycles/sec)
[Figure: waveform of "speech lab" (s p ee ch l a b); axis: amplitude]
Fourier transform of the wave yields a spectrogram
darkness indicates energy at each frequency
hundreds to thousands of frequency samples
[Figure: spectrogram; axis: frequency]



Features for Speech Recognition
Coding scheme (typical)
10 millisecond step size; 25 millisecond window
~39 coefficients each step:
mel-scale cepstra derived from the frequency representation
Δ and ΔΔ coefficients
power



The Markov Assumption
Only immediately preceding history matters

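Chain rule (exact):
$P(X_1, X_2, X_3, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, X_2, X_3, \dots, X_{i-1})$

Markov assumption:
$P(X_i \mid X_1, X_2, X_3, \dots, X_{i-1}) \approx P(X_i \mid X_{i-1})$

Therefore (reading the $i = 1$ term as $P(X_1)$):
$P(X_1, X_2, X_3, \dots, X_n) \approx \prod_{i=1}^{n} P(X_i \mid X_{i-1})$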
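Under the Markov assumption, the probability of a whole sequence is an initial probability times a chain of one-step transition probabilities. A small sketch with an invented two-state chain; the states and numbers are hypothetical:

```python
def sequence_probability(seq, initial, transition):
    """P(X1..Xn) = P(X1) * product of P(Xi | Xi-1) under the Markov assumption."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

# Hypothetical model, for illustration only.
initial = {"sun": 0.5, "rain": 0.5}
transition = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}

# 0.5 * 0.8 * 0.2, roughly 0.08
print(sequence_probability(["sun", "sun", "rain"], initial, transition))
```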



Hidden Markov Models
In speech recognition the number of states is very large; we can simplify the problem by factoring it into two components, a transition probability and an output probability:

$p(s_2 \mid s_1)\; q(y_1 \mid s_2, s_1)$

[Figure: state sequence S1 -> S2 -> S3]
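One common way to find the most likely state sequence in such a model is Viterbi decoding. The slides do not name the search algorithm, so treat this as an assumed illustration: it uses the factorization above, with a transition probability p(s2|s1) and an output probability q(y|s2, s1) attached to each transition, and a two-state model whose numbers are invented.

```python
def viterbi(observations, states, initial, p, q):
    """Return (probability, path) of the most likely state path.

    p[s1][s2] is the transition probability; q[(s1, s2)][y] is the probability
    of emitting symbol y while moving from s1 to s2.
    """
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (initial[s], [s]) for s in states}
    for y in observations:
        new_best = {}
        for s2 in states:
            prob, path = max(
                ((best[s1][0] * p[s1][s2] * q[(s1, s2)].get(y, 0.0), best[s1][1])
                 for s1 in states),
                key=lambda item: item[0],
            )
            new_best[s2] = (prob, path + [s2])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical two-state model; outputs here depend only on the destination state.
states = ["S1", "S2"]
initial = {"S1": 1.0, "S2": 0.0}
p = {"S1": {"S1": 0.5, "S2": 0.5}, "S2": {"S1": 0.0, "S2": 1.0}}
emit = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}
q = {(s1, s2): emit[s2] for s1 in states for s2 in states}

prob, path = viterbi(["a", "b"], states, initial, p, q)
print(path)  # ['S1', 'S1', 'S2']
```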



Hidden Markov Model



Searching the Speech Signal Trellis



Lexicon - links words to phones in acoustic model
Aaron EH R AX N
Aaron(2) AE R AX N
abandon AX B AE N D AX N
abandoned AX B AE N D AX N DD
abandoning AX B AE N D AX N IX NG
abandonment AX B AE N D AX N M AX N TD
abated AX B EY DX IX DD
abatement AX B EY TD M AX N TD
abbey AE B IY
Abbott AE B AX TD
Abboud AA B UW DD
abby AE B IY
abducted AE BD D AH KD T IX DD
Abdul AE BD D UW L
When Language Modeling Goes Wrong



When P(w) is incorrect



Language Modeling



Language Models

A language model is a probability distribution over word sequences:

$p(W) = p(w_1, \dots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_0, \dots, w_{i-1})$

In practice, use n-grams with n = 3, 4, 5 [lose the rest of the context]
Hard to estimate large contexts: with a 64,000-word vocabulary there are 64,000^3 possible trigrams
Need large collections of text
Smoothing P(wi | wi-2, wi-1) is necessary
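A trigram model with smoothing can be sketched in a few lines. The slides do not say which smoothing method is meant; this example uses add-one (Laplace) smoothing purely because it is the simplest, and the sentence markers `<s>` and `</s>`, the function names, and the two-sentence corpus are all assumptions for illustration.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, vocab

def prob(w, u, v, tri, bi, vocab):
    """Add-one smoothed estimate of P(w | u, v)."""
    return (tri[(u, v, w)] + 1) / (bi[(u, v)] + len(vocab))

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
tri, bi, vocab = train_trigram(corpus)
print(prob("sat", "the", "cat", tri, bi, vocab))  # 0.25   seen trigram
print(prob("the", "the", "cat", tri, bi, vocab))  # 0.125  unseen, but not zero
```

Without smoothing, the unseen trigram would get probability zero and any sentence containing it could never be recognized.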



Creating models for recognition

Speech data -> Transcribe* -> Train -> Acoustic models
Text data -> Train -> Language models



Continual Progress in Speech Recognition
Increasingly Difficult Tasks, Steadily Declining Error Rates

[Figure: Word Error Rate (%), axis from 1 to 100, plotted by year from 1988 to 1998. Task labels on the chart: READ SPEECH (1000 word, 5000 word, and 20,000 word vocabulary; standard microphone, varied microphones, noisy environment), BROADCAST NEWS (unlimited vocabulary), CONVERSATIONAL SPEECH (English, non-English). All results are speaker-independent. Source: NSA/Wayne/Doddington]
References
Speech Recognition resource links can be found at:
http://svr-www.eng.cam.ac.uk/comp.speech/Section2/speechlinks.html
An excellent tutorial on speech recognition by Wayne Ward:
http://www-2.cs.cmu.edu/~roni/11761-s01/Presentations/whw%20hmm's%20in%20speech%20recognition%203.0.pdf



Sound + Speech Recognition

That's all for today
