
SPEECH RECOGNITION
USING HIDDEN MARKOV MODELS

OUTLINE

I. THE SPEECH SIGNAL

II. THE HIDDEN MARKOV MODEL

III. SPEECH RECOGNITION USING HMM

INTRODUCTION

APPLICATIONS:
I. HANDS-FREE COMPUTING
II. AUTOMATIC TRANSLATION

EARLY HISTORY

1952  Isolated digit recognition for a single speaker.
1959  Vowel recognition program.
1970s Isolated word recognition became a usable technology.
      Pattern recognition ideas were applied to speech recognition.
      Ideas of LPC were employed in speech recognition.
1980s Introduction of HMMs.

I. THE SPEECH SIGNAL

OUTLINE:
SPEECH PRODUCTION (3-STATE REPRESENTATION)
SPEECH REPRESENTATION (SPECTRAL REPRESENTATION)
SPEECH TO FEATURE VECTORS (PRE-PROCESSING, WINDOWING, FEATURE EXTRACTION, POST-PROCESSING)

SPEECH PRODUCTION

What does each block represent?

[Block diagram: an impulse-train generator and glottal pulse model produce the voiced excitation, a random-noise generator produces the unvoiced excitation, and the excitation passes through the vocal tract model and the radiation model. Physically these correspond to the lungs, epiglottis, vocal tract and lips.]

SPEECH REPRESENTATION

Speech is short-time stationary (quasi-stationary).

Types:
Time-domain representation
Frequency-domain representation

Time-domain representation: [waveform plot]

Frequency-domain representation: [spectrogram plot]

OBTAINING FEATURE VECTORS

Why do we need feature vectors?

Pre-processing -> Frame blocking and windowing -> Feature extraction -> Post-processing

Pre-processing:
Purpose: to modify the raw speech signal so that it is more suitable for feature extraction.

Steps: noise cancellation, pre-emphasis, voice activation detection (VAD).

Noise Cancelling and Pre-emphasis

Methods for noise cancellation:
Spectral subtraction
Adaptive noise cancellation

Pre-emphasis:
Emphasizes the high-frequency components, because high-frequency components often have a low SNR.
H(z) = 1 - 0.5 z^(-1);  S1(z) = H(z) S(z)
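A minimal sketch of this pre-emphasis filter in Python, using the coefficient 0.5 from the transfer function above (0.95-0.97 is also common in practice):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.5):
    """Apply the first-order pre-emphasis filter H(z) = 1 - alpha*z^-1.

    The slide uses alpha = 0.5; values around 0.95-0.97 are also common.
    """
    # s1[n] = s[n] - alpha * s[n-1]; the first sample is passed through unchanged
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```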

Voice Activation Detection (VAD)

Finds the end-points of the utterances; the signal outside them is chopped off.

Why?

For a single chunk (frame) m:

W_s1(m) = P_s1(m) * (1 - Z_s1(m)) * Sc

P_s1 = short-term power estimate
Z_s1 = zero-crossing rate
Sc = scaling factor

The threshold t_w is decided by some function of the mean and variance of W_s1 itself.
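A sketch of this VAD measure, assuming the signal has already been split into frames; the scale factor value and the exact threshold rule are illustrative assumptions, not values from the slides:

```python
import numpy as np

def vad_measure(frames, sc=1000.0):
    """Per-frame weighted measure W(m) = P(m) * (1 - Z(m)) * Sc.

    `frames` is a 2-D array (num_frames x frame_length). The scale factor
    `sc` and the threshold rule below are illustrative choices.
    """
    power = np.mean(frames ** 2, axis=1)                      # short-term power P(m)
    signs = np.sign(frames)
    zcr = np.mean(np.abs(np.diff(signs, axis=1)) / 2, axis=1)  # zero-crossing rate Z(m)
    w = power * (1.0 - zcr) * sc
    # Threshold t_w chosen from the mean and standard deviation of W itself
    t_w = w.mean() + 0.5 * w.std()
    return w, w > t_w   # measure and a boolean mask of frames declared "speech"
```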

Windowing

A window function such as the Hamming window is applied to reduce the discontinuity at the edges of blocks.

Hamming window:
w(k) = 0.54 - 0.46 cos( 2*pi*k / (K - 1) ),  k = 0, ..., K-1
K = no. of samples in a speech frame
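A sketch of frame blocking plus Hamming windowing; the 25 ms frame length and 10 ms hop at 16 kHz are assumed, typical values rather than ones given here:

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split the signal into overlapping frames and apply a Hamming window.

    w(k) = 0.54 - 0.46*cos(2*pi*k / (K-1)), with K samples per frame.
    frame_len/hop correspond to 25 ms / 10 ms at 16 kHz (assumed values).
    """
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(num_frames)])
    return frames * window   # (num_frames x frame_len) windowed blocks
```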

Feature Extraction:
Two common feature sets: LPC and MFCC.

Linear Predictive Coding (LPC)
Encodes at a low bit-rate.
Assumption: the speech sample at the current time can be approximated as a linear combination of past samples.
The glottal, vocal-tract and lip-radiation transfer functions are integrated into an all-pole LPC filter.
The feature vectors are the predictor coefficients a_k.
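A sketch of LPC analysis for one windowed frame; it solves the autocorrelation normal equations with a generic linear solver rather than the usual Levinson-Durbin recursion, and the order 12 is an assumed value:

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """Estimate the LPC predictor coefficients a_1..a_p for one windowed frame."""
    # Autocorrelation sequence r[0], r[1], ...
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz system R a = r (normal equations of linear prediction)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])   # predictor coefficients a_k
    return a
```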

Mel Frequency Cepstral Coefficients (MFCC)

A non-linear frequency scale is used:
Linear up to 1 kHz
Logarithmic afterwards

This is similar to the frequency resolution of the human cochlea.

The output of the m-th mel filter for frame t is

sum over n = 0..N-1 of |X_t[n]|^2 * H_m[n],   m = 1, ..., M

where X_t[n] is the DFT of the t-th input speech frame, H_m[n] is the frequency response of the m-th filter in the filter bank, N is the window size of the transform and M is the total number of filters.

The DCT of the log filter-bank energies gives the cepstral coefficients. Since the human auditory system is sensitive to the time evolution of the spectral content of the signal, an effort is often made to include the extraction of this information as part of feature analysis. The final feature vectors combine the static cepstral coefficients with these dynamic (delta) features.
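A sketch of the MFCC computation for one frame, assuming a pre-built matrix of mel filter responses H_m[n] (constructing the triangular filter bank is omitted); num_ceps = 13 is an assumed, typical choice:

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_power_spectrum(power_spec, mel_filters, num_ceps=13):
    """MFCCs for one frame from its power spectrum |X_t[n]|^2.

    `mel_filters` is an (M x N) matrix of filter responses H_m[n] on the mel
    scale (linear below ~1 kHz, logarithmic above). The log filter-bank
    energies are decorrelated with a DCT and the first coefficients kept.
    """
    energies = mel_filters @ power_spec                  # sum_n |X_t[n]|^2 * H_m[n]
    log_energies = np.log(np.maximum(energies, 1e-10))   # floor to avoid log(0)
    return dct(log_energies, type=2, norm="ortho")[:num_ceps]
```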

Advantages
MFCC reduces the information in speech to a small number of coefficients.
MFCC tries to model loudness.
MFCC resembles the human auditory model, and it is easy to compute.

But for better accuracy in speech recognition, both feature sets (LPC and MFCC) are often used simultaneously.

Post Processing
Weight function: to give more weight to certain features.
Normalization: to re-scale the numerical values of the features so that they stay in the same numerical range.
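A sketch of this post-processing step; the optional `weights` vector stands in for the weight function, and mean/variance normalization is one common way to keep the features in the same numerical range:

```python
import numpy as np

def post_process(features, weights=None):
    """Optional per-feature weighting followed by mean/variance normalization.

    `features` is (num_frames x num_features); `weights` is an optional
    per-feature weight vector (an illustrative stand-in for a weight function).
    """
    if weights is not None:
        features = features * weights
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-10   # avoid division by zero
    return (features - mean) / std
```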

HIDDEN MARKOV MODEL

MARKOV CHAINS:
What is a Markov process?
What is a first-order Markov process?

Markov chain: a Markov process with finite states.

HIDDEN MARKOV MODEL

HMM: the states cannot be observed directly.

If the states are visible, the model is termed an observable Markov model.

In a hidden Markov model the state is not directly visible, but the output, which depends on the state, is visible.

HMM example

Imagine that you are a climatologist in the year 2999 studying the history of global warming. You cannot find any records of the weather for the summer of 2007, but you do find Jason's diary, which lists how many ice-creams Jason ate every day that summer. Our goal is to use these observations to estimate the temperature every day. Assume there are only two kinds of days: cold (C) and hot (H).

In the previous example:

Hot (H) and Cold (C) are the hidden states.

The number of ice-creams eaten by Jason each day is the observation.

Notation:

T = length of the observation sequence
N = number of states in the model
M = number of distinct observation symbols
Q = {q_0, q_1, ..., q_(N-1)} = distinct states of the Markov process
V = {0, 1, ..., M-1} = discrete set of possible observations
A = {a_ij}, where a_ij = P(state j at t+1 | state i at t), the probability of being in state j at time t+1 given that we were in state i at time t. We assume the a_ij are independent of time. These are also referred to as the state transition probabilities.
B = {b_j(k)}, where b_j(k) = P(v_k at t | state j at t), the probability of observing symbol v_k given that we are in state j. Also termed the observation probability matrix.
π = {π_i}, π_i = P(state i at t=0) = initial state distribution, the probability of being in state i at the beginning of the experiment.
O = (O_0, O_1, ..., O_(T-1)) = observation sequence. O_t denotes the observation symbol observed at time t.
λ = (A, B, π) will be used as a compact notation to denote the HMM.
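Using this notation, λ = (A, B, π) for the two-state ice-cream example can be written down directly; the probability values below are illustrative, not taken from the slides:

```python
import numpy as np

# lambda = (A, B, pi) for the two-state ice-cream example.
# States: 0 = Hot (H), 1 = Cold (C); observation symbols: 1, 2 or 3 ice-creams,
# stored as indices 0, 1, 2. All probability values are illustrative.
A = np.array([[0.8, 0.2],       # P(next state | current = H)
              [0.2, 0.8]])      # P(next state | current = C)
B = np.array([[0.2, 0.4, 0.4],  # P(1, 2, 3 ice-creams | H)
              [0.5, 0.4, 0.1]]) # P(1, 2, 3 ice-creams | C)
pi = np.array([0.5, 0.5])       # initial state distribution
```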

The three problems for HMMs

Problem I

Problem I: Given the observation sequence O = O_1, O_2, ..., O_T and a model λ = (A, B, π), how do we compute P(O|λ), the probability of the observation sequence given the model?

Problem I

The Evaluation Problem

It tells us how well a given model matches the observation sequence.

Application in speech recognition?

Problem II

Given the observation sequence O = O_1, O_2, ..., O_T and a model λ = (A, B, π), how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T which is optimal in some meaningful sense (i.e., best explains the observation sequence)?

Problem II

We attempt to uncover the hidden state sequence.

We can never uncover the exact hidden state sequence.

Application in speech recognition?

What if a phoneme is lost in a word?

Problem III

How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?

This is associated with training the HMM.

Solution to Problem I

Recall the ice-cream example: Hot (H) and Cold (C) are the hidden states, and the number of ice-creams Jason ate each day is the observation.

[Figure: two-state HMM over H and C with transition probabilities (e.g. 0.8 and 0.2) and the ice-cream emission probabilities.]

Given the HMM, what is the probability of the sequence {3, 1, 3}?

We want to compute P(O|λ), or simply P(O).

This task is not straightforward, because we don't know the states that produced this observation sequence.

For the state sequence Q = {H, H, C} and the given O = {3, 1, 3}, compute the joint probability P(O, Q)?

We have shown this for one particular case, but there are 8 different state sequences, such as {C,C,C}, {C,C,H}, etc.
We would sum over all 8 possible state sequences, i.e., P(O|λ) = sum over Q of P(O, Q|λ).

This is a brute-force enumeration.

For N hidden states and T observations there are N^T combinations of state sequences.

So we move on to a recursive algorithm called the Forward Algorithm.
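A sketch of the Forward Algorithm; with the illustrative A, B, pi arrays defined earlier, forward(A, B, pi, [2, 0, 2]) evaluates P(O|λ) for the ice-cream sequence {3, 1, 3}:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: P(O | lambda) in O(N^2 * T) instead of O(N^T) work.

    alpha[t, i] = P(o_1..o_t, state i at t | lambda); `obs` is a sequence of
    observation indices, e.g. [2, 0, 2] for the ice-cream counts {3, 1, 3}.
    """
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction step
    return alpha[-1].sum()                            # termination: sum over final states
```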

Solution to Problem II
Given an HMM, we are trying to find the most likely state sequence for a particular observation sequence.
A brute-force approach would find the sequence of hidden states that maximizes

Pr(observed seq., hidden state comb. | λ)

Problem: computationally expensive!
Solution: Viterbi decoding.

Logic: It is an inductive algorithm in which at each instant you keep the best possible state sequence for each of the N states as the intermediate state for the desired observation sequence O = o_1, o_2, ..., o_T.

Our goal is to maximize P(O, Q|λ).

P(O, Q|λ) = P(O|Q, λ) . P(Q|λ)
          = π_q1 . b_q1(o_1) . a_q1q2 . b_q2(o_2) ... a_q(T-1)qT . b_qT(o_T)

Now define

U(q_1, q_2, ..., q_T) = -[ ln( π_q1 b_q1(o_1) ) + sum over t = 2..T of ln( a_q(t-1)qt b_qt(o_t) ) ]

It can be seen that

P(O, Q|λ) = exp( -U(q_1, q_2, ..., q_T) )

Initially our goal was to maximize P(O, Q|λ); now we want to minimize U(Q).
U(Q) is an attempt to re-scale the probability values.

-ln( a_qjqk b_qk(o_t) ) can be viewed as a cost function.
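A sketch of Viterbi decoding in this negative-log (cost) formulation; it returns the state index sequence Q* and assumes the same A, B, pi arrays as before:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding: the state sequence Q* maximizing P(O, Q | lambda).

    Works with the costs -ln(a_{q(t-1) qt} * b_{qt}(o_t)), so maximizing the
    probability becomes minimizing U(Q).
    """
    N, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    cost = np.zeros((T, N))            # cost[t, j] = min U over partial paths ending in state j
    back = np.zeros((T, N), dtype=int)
    cost[0] = -(logpi + logB[:, obs[0]])
    for t in range(1, T):
        # candidate cost of arriving in state j from every state i
        trans = cost[t - 1][:, None] - logA - logB[:, obs[t]][None, :]
        back[t] = trans.argmin(axis=0)
        cost[t] = trans.min(axis=0)
    # backtrack the best path
    path = [int(cost[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```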

Solution to Problem III

Deals with training the HMM.
Adjusts the HMM parameters to fit the observations.
Two methods to solve this:

Segmental K-means algorithm
Baum-Welch re-estimation formulae

Segmental K-means algorithm:
Tries to adjust the model parameters to maximize P(O, Q*|λ), where Q* is the optimum state sequence found by the solution to Problem II.

Baum-Welch re-estimation formulae:
Tries to adjust the model parameters to maximize P(O|λ).
Finds a more general solution.

So which is preferred?

Segmental K-means algorithm

Let T = length of each observation sequence and D = dimension of each observation symbol; we are given some number of training observation sequences.

[Figure: a single observation sequence shown as a T x D grid of observation symbols.]

Choose N symbols (each of dimension D), and assign the remaining symbols to the nearest of the N chosen ones according to Euclidean distance (a sketch of this assignment step follows the algorithm description below).
Calculate the initial and transition probabilities from these assignments.

Calculate the observation symbol probabilities.
Assumption: the symbol probability distributions are assumed to be Gaussian.

Find the optimal state sequence Q* (as given by the solution to Problem II) for each training sequence using the model parameters computed above. A vector is reassigned to a state if its original assignment differs from the corresponding estimated optimum state.

This process is continued until there are no new reassignments.
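A sketch of the Euclidean-distance assignment step referred to above; choosing the N initial symbols and re-estimating the probabilities are not shown:

```python
import numpy as np

def assign_to_states(vectors, centroids):
    """Assign each D-dimensional feature vector to the nearest state centroid.

    `vectors` is (T x D), `centroids` is (N x D). The returned labels can then
    be used to count initial and transition probabilities and to fit a
    Gaussian observation distribution per state.
    """
    # distances[t, j] = Euclidean distance between vectors[t] and centroids[j]
    distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)   # state index for every time step
```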

Isolated word recognizer:

Assume we have a vocabulary of V words, and K utterances of each word.

Training an HMM:
For each word v in the vocabulary, we must build an HMM λ_v, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training-set observation vectors for the v-th word.

Testing:
For each unknown word to be recognized, first measure the observation sequence O = O_1, O_2, ..., O_T via feature analysis of the speech corresponding to the word, then calculate the model likelihoods P(O|λ_v) for all candidate models, and finally select the word whose model likelihood is highest.
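A sketch of this selection rule: score the observation sequence against every word model with the evaluation routine from Problem I and pick the best; `word_models` is a hypothetical dictionary mapping each word to its trained (A, B, pi):

```python
def recognize(obs, word_models):
    """Pick the vocabulary word whose HMM best explains the observations.

    `word_models` maps each word v to its trained parameters (A, B, pi);
    `forward` is the evaluation routine sketched for Problem I. This is only
    the decision rule argmax_v P(O | lambda_v), not a full system.
    """
    scores = {word: forward(A, B, pi, obs)
              for word, (A, B, pi) in word_models.items()}
    return max(scores, key=scores.get)
```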

A simple yes/no example.

Continuous speech recognition

We connect the HMMs in a sequence.

Instead of taking the single word with maximum probability, we try to minimize the expected value of a given loss function.

Reason: we are predicting multiple words here.

THANK YOU
