
SPEECH RECOGNITION
USING HIDDEN MARKOV MODELS

OUTLINE

I. THE SPEECH SIGNAL

II. THE HIDDEN MARKOV MODEL

III. SPEECH RECOGNITION USING HMM

INTRODUCTION

APPLICATIONS:
I. HANDS-FREE COMPUTING
II. AUTOMATIC TRANSLATION

EARLY HISTORY

1952  Isolated digit recognition for a single speaker.
1959  Vowel recognition program.
1970s Isolated word recognition became a usable technology.
      Pattern recognition ideas were applied to speech recognition.
      Ideas of LPC were employed in speech recognition.
1980s Introduction of HMMs.

I. THE SPEECH SIGNAL

OUTLINE:
SPEECH PRODUCTION (3-STATE REPRESENTATION)
SPEECH REPRESENTATION (SPECTRAL REPRESENTATION)
SPEECH TO FEATURE VECTORS (PRE-PROCESSING, WINDOWING, FEATURE EXTRACTION, POST-PROCESSING)

SPEECH PRODUCTION

What does each block represent?

[Block diagram: an impulse-train generator and glottal pulse model produce the voiced excitation, a random-noise generator produces the unvoiced excitation, and the excitation passes through the vocal tract model and the radiation model. Physically these correspond to the lungs, epiglottis, vocal tract and lips.]

SPEECH REPRESENTATION

Speech is short-time stationary (quasi-stationary).

Types:
Time-domain representation
Frequency-domain representation

Time-domain representation: [waveform plot]

Frequency-domain representation: [spectrogram plot]

OBTAINING FEATURE VECTORS

Why do we need feature vectors?

Pre-processing -> Frame blocking and windowing -> Feature extraction -> Post-processing

Pre-processing:
Purpose: to modify the raw speech signal so that it is more suitable for feature extraction.

Steps: noise cancellation, pre-emphasis, voice activation detection (VAD).

Noise Cancelling and Pre-emphasis

Methods for noise cancellation:
Spectral subtraction
Adaptive noise cancellation

Pre-emphasis:
Emphasizes the high-frequency components, because high-frequency components often have a low SNR.
H(z) = 1 - 0.5 z^(-1);  S1(z) = H(z) S(z)
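A minimal sketch of this pre-emphasis filter in Python, using the coefficient 0.5 from the transfer function above (0.95-0.97 is also common in practice):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.5):
    """Apply the first-order pre-emphasis filter H(z) = 1 - alpha*z^-1.

    The slide uses alpha = 0.5; values around 0.95-0.97 are also common.
    """
    # s1[n] = s[n] - alpha * s[n-1]; the first sample is passed through unchanged
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```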

Voice Activation Detection (VAD)

Finds the end-points of the utterances; the signal outside them is chopped off.

Why?

For a single chunk (frame) m:

W_s1(m) = P_s1(m) * (1 - Z_s1(m)) * Sc

P_s1 = short-term power estimate
Z_s1 = zero-crossing rate
Sc = scaling factor

The threshold t_w is decided by some function of the mean and variance of W_s1 itself.
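A sketch of this VAD measure, assuming the signal has already been split into frames; the scale factor value and the exact threshold rule are illustrative assumptions, not values from the slides:

```python
import numpy as np

def vad_measure(frames, sc=1000.0):
    """Per-frame weighted measure W(m) = P(m) * (1 - Z(m)) * Sc.

    `frames` is a 2-D array (num_frames x frame_length). The scale factor
    `sc` and the threshold rule below are illustrative choices.
    """
    power = np.mean(frames ** 2, axis=1)                      # short-term power P(m)
    signs = np.sign(frames)
    zcr = np.mean(np.abs(np.diff(signs, axis=1)) / 2, axis=1)  # zero-crossing rate Z(m)
    w = power * (1.0 - zcr) * sc
    # Threshold t_w chosen from the mean and standard deviation of W itself
    t_w = w.mean() + 0.5 * w.std()
    return w, w > t_w   # measure and a boolean mask of frames declared "speech"
```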

Windowing

A window function such as the Hamming window is applied to reduce the discontinuity at the edges of blocks.

Hamming window:
w(k) = 0.54 - 0.46 cos( 2*pi*k / (K - 1) ),  k = 0, ..., K-1
K = no. of samples in a speech frame
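A sketch of frame blocking plus Hamming windowing; the 25 ms frame length and 10 ms hop at 16 kHz are assumed, typical values rather than ones given here:

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split the signal into overlapping frames and apply a Hamming window.

    w(k) = 0.54 - 0.46*cos(2*pi*k / (K-1)), with K samples per frame.
    frame_len/hop correspond to 25 ms / 10 ms at 16 kHz (assumed values).
    """
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(num_frames)])
    return frames * window   # (num_frames x frame_len) windowed blocks
```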

Feature Extraction:
Two common feature sets: LPC and MFCC.

Linear Predictive Coding (LPC)
Encodes at a low bit-rate.
Assumption: the speech sample at the current time can be approximated as a linear combination of past samples.
The glottal, vocal-tract and lip-radiation transfer functions are integrated into an all-pole LPC filter.
The feature vectors are the predictor coefficients a_k.
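A sketch of LPC analysis for one windowed frame; it solves the autocorrelation normal equations with a generic linear solver rather than the usual Levinson-Durbin recursion, and the order 12 is an assumed value:

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """Estimate the LPC predictor coefficients a_1..a_p for one windowed frame."""
    # Autocorrelation sequence r[0], r[1], ...
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz system R a = r (normal equations of linear prediction)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])   # predictor coefficients a_k
    return a
```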

Mel Frequency Cepstral Coefficients (MFCC)

A non-linear frequency scale is used:
Linear up to 1 kHz
Logarithmic afterwards

This is similar to the frequency resolution of the human cochlea.

The output of the m-th mel filter for frame t is

sum over n = 0..N-1 of |X_t[n]|^2 * H_m[n],   m = 1, ..., M

where X_t[n] is the DFT of the t-th input speech frame, H_m[n] is the frequency response of the m-th filter in the filter bank, N is the window size of the transform and M is the total number of filters.

The DCT of the log filter-bank energies gives the cepstral coefficients. Since the human auditory system is sensitive to the time evolution of the spectral content of the signal, an effort is often made to include the extraction of this information as part of feature analysis. The final feature vectors combine the static cepstral coefficients with these dynamic (delta) features.
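A sketch of the MFCC computation for one frame, assuming a pre-built matrix of mel filter responses H_m[n] (constructing the triangular filter bank is omitted); num_ceps = 13 is an assumed, typical choice:

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_power_spectrum(power_spec, mel_filters, num_ceps=13):
    """MFCCs for one frame from its power spectrum |X_t[n]|^2.

    `mel_filters` is an (M x N) matrix of filter responses H_m[n] on the mel
    scale (linear below ~1 kHz, logarithmic above). The log filter-bank
    energies are decorrelated with a DCT and the first coefficients kept.
    """
    energies = mel_filters @ power_spec                  # sum_n |X_t[n]|^2 * H_m[n]
    log_energies = np.log(np.maximum(energies, 1e-10))   # floor to avoid log(0)
    return dct(log_energies, type=2, norm="ortho")[:num_ceps]
```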

Advantages
MFCC reduces the information in speech to a small number of coefficients.
MFCC tries to model loudness.
MFCC resembles the human auditory model, and it is easy to compute.

But for better accuracy in speech recognition, both feature sets (LPC and MFCC) are often used simultaneously.

Post Processing
Weight function: to give more weight to certain features.
Normalization: to re-scale the numerical values of the features so that they stay in the same numerical range.
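A sketch of this post-processing step; the optional `weights` vector stands in for the weight function, and mean/variance normalization is one common way to keep the features in the same numerical range:

```python
import numpy as np

def post_process(features, weights=None):
    """Optional per-feature weighting followed by mean/variance normalization.

    `features` is (num_frames x num_features); `weights` is an optional
    per-feature weight vector (an illustrative stand-in for a weight function).
    """
    if weights is not None:
        features = features * weights
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-10   # avoid division by zero
    return (features - mean) / std
```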

HIDDEN MARKOV MODEL

MARKOV CHAINS:
What is a Markov process?
What is a first-order Markov process?

Markov chain: a Markov process with finite states.

HIDDEN MARKOV MODEL

HMM: the states cannot be observed directly.

If the states are visible, the model is termed an observable Markov model.

In a hidden Markov model the state is not directly visible, but the output, which depends on the state, is visible.

HMM example

Imagine that you are a climatologist in the year 2999 studying the history of global warming. You cannot find any records of the weather for the summer of 2007, but you do find Jason's diary, which lists how many ice-creams Jason ate every day that summer. Our goal is to use these observations to estimate the temperature every day. Assume there are only two kinds of days: cold (C) and hot (H).

In the previous example:

Hot (H) and Cold (C) are the hidden states.

The number of ice-creams eaten by Jason each day is the observation.

Notation:

T = length of the observation sequence
N = number of states in the model
M = number of distinct observation symbols
Q = {q_0, q_1, ..., q_(N-1)} = distinct states of the Markov process
V = {0, 1, ..., M-1} = discrete set of possible observations
A = {a_ij}, where a_ij = P(state j at t+1 | state i at t), the probability of being in state j at time t+1 given that we were in state i at time t. We assume the a_ij are independent of time. These are also referred to as the state transition probabilities.
B = {b_j(k)}, where b_j(k) = P(v_k at t | state j at t), the probability of observing symbol v_k given that we are in state j. Also termed the observation probability matrix.
π = {π_i}, π_i = P(state i at t=0) = initial state distribution, the probability of being in state i at the beginning of the experiment.
O = (O_0, O_1, ..., O_(T-1)) = observation sequence. O_t denotes the observation symbol observed at time t.
λ = (A, B, π) will be used as a compact notation to denote the HMM.
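Using this notation, λ = (A, B, π) for the two-state ice-cream example can be written down directly; the probability values below are illustrative, not taken from the slides:

```python
import numpy as np

# lambda = (A, B, pi) for the two-state ice-cream example.
# States: 0 = Hot (H), 1 = Cold (C); observation symbols: 1, 2 or 3 ice-creams,
# stored as indices 0, 1, 2. All probability values are illustrative.
A = np.array([[0.8, 0.2],       # P(next state | current = H)
              [0.2, 0.8]])      # P(next state | current = C)
B = np.array([[0.2, 0.4, 0.4],  # P(1, 2, 3 ice-creams | H)
              [0.5, 0.4, 0.1]]) # P(1, 2, 3 ice-creams | C)
pi = np.array([0.5, 0.5])       # initial state distribution
```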

The three problems for HMMs

Problem I

Problem I: Given the observation sequence O = O_1, O_2, ..., O_T and a model λ = (A, B, π), how do we compute P(O|λ), the probability of the observation sequence given the model?

Problem I

The Evaluation Problem

It tells us how well a given model matches the observation sequence.

Application in speech recognition?

Problem II

Given the observation sequence O = O_1, O_2, ..., O_T and a model λ = (A, B, π), how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T which is optimal in some meaningful sense (i.e., best explains the observation sequence)?

Problem II

We attempt to uncover the hidden state sequence.

We can never uncover the exact hidden state sequence.

Application in speech recognition?

What if a phoneme is lost in a word?

Problem III

How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?

This is associated with training the HMM.

Solution to Problem I

Recall the ice-cream example: Hot (H) and Cold (C) are the hidden states, and the number of ice-creams Jason ate each day is the observation.

[Figure: two-state HMM over H and C with transition probabilities (e.g. 0.8 and 0.2) and the ice-cream emission probabilities.]

Given the HMM, what is the probability of the sequence {3, 1, 3}?

We want to compute P(O|λ), or simply P(O).

This task is not straightforward, because we don't know the states that produced this observation sequence.

For the state sequence Q = {H, H, C} and the given O = {3, 1, 3}, compute the joint probability P(O, Q)?

We have shown this for one particular case, but there are 8 different state sequences, such as {C,C,C}, {C,C,H}, etc.
We would sum over all 8 possible state sequences, i.e., P(O|λ) = sum over Q of P(O, Q|λ).

This is a brute-force enumeration.

For N hidden states and T observations there are N^T combinations of state sequences.

So we move on to a recursive algorithm called the Forward Algorithm.
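A sketch of the Forward Algorithm; with the illustrative A, B, pi arrays defined earlier, forward(A, B, pi, [2, 0, 2]) evaluates P(O|λ) for the ice-cream sequence {3, 1, 3}:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: P(O | lambda) in O(N^2 * T) instead of O(N^T) work.

    alpha[t, i] = P(o_1..o_t, state i at t | lambda); `obs` is a sequence of
    observation indices, e.g. [2, 0, 2] for the ice-cream counts {3, 1, 3}.
    """
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction step
    return alpha[-1].sum()                            # termination: sum over final states
```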

Solution to Problem II
Given an HMM, we are trying to find the most likely state sequence for a particular observation sequence.
A brute-force approach would find the sequence of hidden states that maximizes

Pr(observed seq., hidden state comb. | λ)

Problem: computationally expensive!
Solution: Viterbi decoding.

Logic: It is an inductive algorithm in which at each instant you keep the best possible state sequence for each of the N states as the intermediate state for the desired observation sequence O = o_1, o_2, ..., o_T.

Our goal is to maximize P(O, Q|λ).

P(O, Q|λ) = P(O|Q, λ) . P(Q|λ)
          = π_q1 . b_q1(o_1) . a_q1q2 . b_q2(o_2) ... a_q(T-1)qT . b_qT(o_T)

Now define

U(q_1, q_2, ..., q_T) = -[ ln( π_q1 b_q1(o_1) ) + sum over t = 2..T of ln( a_q(t-1)qt b_qt(o_t) ) ]

It can be seen that

P(O, Q|λ) = exp( -U(q_1, q_2, ..., q_T) )

Initially our goal was to maximize P(O, Q|λ); now we want to minimize U(Q).
U(Q) is an attempt to re-scale the probability values.

-ln( a_qjqk b_qk(o_t) ) can be viewed as a cost function.
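A sketch of Viterbi decoding in this negative-log (cost) formulation; it returns the state index sequence Q* and assumes the same A, B, pi arrays as before:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding: the state sequence Q* maximizing P(O, Q | lambda).

    Works with the costs -ln(a_{q(t-1) qt} * b_{qt}(o_t)), so maximizing the
    probability becomes minimizing U(Q).
    """
    N, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    cost = np.zeros((T, N))            # cost[t, j] = min U over partial paths ending in state j
    back = np.zeros((T, N), dtype=int)
    cost[0] = -(logpi + logB[:, obs[0]])
    for t in range(1, T):
        # candidate cost of arriving in state j from every state i
        trans = cost[t - 1][:, None] - logA - logB[:, obs[t]][None, :]
        back[t] = trans.argmin(axis=0)
        cost[t] = trans.min(axis=0)
    # backtrack the best path
    path = [int(cost[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```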

Solution to Problem III

Deals with training the HMM.
Adjusts the HMM parameters to fit the observations.
Two methods to solve this:

Segmental K-means algorithm
Baum-Welch re-estimation formulae

Segmental K-means algorithm:
Tries to adjust the model parameters to maximize P(O, Q*|λ), where Q* is the optimum state sequence found by the solution to Problem II.

Baum-Welch re-estimation formulae:
Tries to adjust the model parameters to maximize P(O|λ).
Finds a more general solution.

So which is preferred?

Segmental K-means algorithm

Let T = length of each observation sequence and D = dimension of each observation symbol; we are given some number of training observation sequences.

[Figure: a single observation sequence shown as a T x D grid of observation symbols.]

Choose N symbols (each of dimension D), and assign the remaining symbols to the nearest of the N chosen ones according to Euclidean distance (a sketch of this assignment step follows the algorithm description below).
Calculate the initial and transition probabilities from these assignments.

Calculate the observation symbol probabilities.
Assumption: the symbol probability distributions are assumed to be Gaussian.

Find the optimal state sequence Q* (as given by the solution to Problem II) for each training sequence using the model parameters computed above. A vector is reassigned to a state if its original assignment differs from the corresponding estimated optimum state.

This process is continued until there are no new reassignments.
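A sketch of the Euclidean-distance assignment step referred to above; choosing the N initial symbols and re-estimating the probabilities are not shown:

```python
import numpy as np

def assign_to_states(vectors, centroids):
    """Assign each D-dimensional feature vector to the nearest state centroid.

    `vectors` is (T x D), `centroids` is (N x D). The returned labels can then
    be used to count initial and transition probabilities and to fit a
    Gaussian observation distribution per state.
    """
    # distances[t, j] = Euclidean distance between vectors[t] and centroids[j]
    distances = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)   # state index for every time step
```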

Isolated word recognizer:

Assume we have a vocabulary of V words, and K utterances of each word.

Training an HMM:
For each word v in the vocabulary, we must build an HMM λ_v, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training-set observation vectors for the v-th word.

Testing:
For each unknown word to be recognized, first measure the observation sequence O = O_1, O_2, ..., O_T via feature analysis of the speech corresponding to the word, then calculate the model likelihoods P(O|λ_v) for all candidate models, and finally select the word whose model likelihood is highest.
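A sketch of this selection rule: score the observation sequence against every word model with the evaluation routine from Problem I and pick the best; `word_models` is a hypothetical dictionary mapping each word to its trained (A, B, pi):

```python
def recognize(obs, word_models):
    """Pick the vocabulary word whose HMM best explains the observations.

    `word_models` maps each word v to its trained parameters (A, B, pi);
    `forward` is the evaluation routine sketched for Problem I. This is only
    the decision rule argmax_v P(O | lambda_v), not a full system.
    """
    scores = {word: forward(A, B, pi, obs)
              for word, (A, B, pi) in word_models.items()}
    return max(scores, key=scores.get)
```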

A simple yes/no example.

Continuous speech recognition

We connect the HMMs in a sequence.

Instead of taking the single word with maximum probability, we try to minimize the expected value of a given loss function.

Reason: we are predicting multiple words here.

THANK YOU
