
Speech Recognition

From: Judith A. Markowitz, Using Speech Recognition, Prentice Hall, NJ, 1996; Guojun Lu, Multimedia Database Management Systems, Chapter 5, Artech House, 1999

Speech Recognition


Digitize and represent waveforms
Feature extraction (10 ms frame)

Important feature:
Mel-frequency cepstral coefficients (MFCC), developed based on how humans hear sound
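The MFCC pipeline can be sketched directly in NumPy. This is a minimal sketch, assuming a 16 kHz signal, 25 ms frames with a 10 ms hop, 26 mel filters, and 13 cepstral coefficients; all parameter choices here are illustrative, not from the text.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale approximates how humans perceive pitch differences.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    # Pre-emphasis flattens the spectral tilt of speech.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len, hop = sr * frame_ms // 1000, sr * hop_ms // 1000
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)          # windowed 25 ms frames
    n_fft = 512
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filterbank spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(np.maximum(power @ fbank.T, 1e-10))
    # DCT-II decorrelates the log filterbank energies (the "cepstrum").
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_filters))
    return logmel @ dct.T                            # (frames, n_ceps)

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
print(feats.shape)   # one 13-dim vector per 10 ms hop
```

One second of audio at a 10 ms hop yields 98 frames, so the result is a (98, 13) feature matrix: the per-frame representation the recognizer works with.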

Recognition: Identify what the user has said

Three approaches

Template matching
Acoustic-phonetic recognition (e.g., FastTalk)
Stochastic processing


Phoneme: the smallest unit of sound that is unique (distinguishing one word from another for a given language)

Example:

The words seat, meat, beat, cheat are different words since each initial sound (s, m, b, ch) is a separate phoneme in English.
There are about 40-50 phonemes in English.
Abnormal: AE B N AO R M AX L

Terminology (Cont'd)

The simplest sound is a pure tone, which has a sine waveform. Pure tones are rare. Most sounds, including speech phonemes, are complex waves, having a dominant or primary frequency called the fundamental frequency overlaid with secondary frequencies.

Fundamental frequency for speech is the rate at which the vocal cords flap against each other when producing a voiced phoneme.

Examples of Complex Waves for Phoneme



Terminology (Cont'd)

Formants: bands of secondary frequencies that distinguish one phoneme from another
Multi-frequency sounds like the phonemes of speech can be represented as complex waves.
Bandwidth of a complex wave: the range of frequencies in the waveform.
Sounds that produce acyclic waves are often called noise.


Co-articulation effects: inter-phoneme influences
Neighboring phonemes, the position of a phoneme within words, and the position of the word in the sentence all influence the way a phoneme is uttered.
Because of co-articulation effects, a specific utterance or instance of a phoneme is called a phone.

Template Matching


Each word or phrase is stored as a separate template.
Idea: Select the template that best matches the spoken input (frame-by-frame comparison), provided the dissimilarity is within a predetermined threshold.
Template matching is performed at the word level.
Temporal alignment is used to ensure that fast or slow utterances of the same word are not identified as different words.

Dynamic Time Warping is used for temporal alignment.

Dynamic Time Warping
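A minimal DTW sketch in Python, assuming frame-level feature vectors; the quadratic dynamic program below is the textbook form, not any particular system's implementation.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature sequences of shape (frames, dims).

    Stretches or compresses time so slow and fast utterances of the
    same word align frame to frame before being compared.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # template frame repeated
                                 cost[i, j - 1],       # input frame repeated
                                 cost[i - 1, j - 1])   # one-to-one match
    return cost[n, m]

fast = np.array([[0.0], [1.0], [2.0]])
slow = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])  # same word, slower
print(dtw_distance(fast, slow))   # 0.0: warping absorbs the tempo difference
```

The warping path is what makes a three-frame and a six-frame rendition of the same word comparable; a plain frame-by-frame distance would penalize the tempo difference.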

Robust Template

In early systems, there was one template per example (token).
To handle variability, many templates of the same word are stored.
A robust template is created from more than one token of the same word using mathematical averages and statistical clustering techniques.

Template Matching


Performs well with small vocabularies of phonetically distinct words.
Midsize vocabularies in the range of 1000-10000 words are possible if the number of vocabulary choices at any one time is kept minimal.
Must have at least one template for each word in the application vocabulary.
Not good with large vocabularies containing words that have similar sounds (confusable words, e.g., to and two).


Acoustic-Phonetic Recognition

Store only representations of phonemes for a language
Three steps

I. Feature extraction
II. Segmentation and labeling:

Segmentation determines when one phoneme ends and another begins.
Labeling identifies phonemes.
The output is a set of phoneme hypotheses that can be represented by a phoneme lattice, a decision tree, etc.

III. Word-level recognition:
Search for words matching the phoneme hypotheses. The word best matching a sequence of hypotheses is identified.

Stochastic Processing

Use Hidden Markov Model (HMM) to store the model of each of the items that will be recognized.

Items: phonemes or subwords.

A 3-state HMM of a triphone is obtained from training.

Each state of the HMM has statistics for a segment of the word.

The statistics describe the parameter values and variation that were found in samples of the word.

A recognition system may have numerous HMMs or may combine them into one network of states and transitions.

Stochastic processing using HMM is accurate and flexible.

Subword Units

Training whole-word models is not practical for large vocabularies, so subword units are used.
The most popular subword unit is the triphone.
A triphone (phoneme in context, PIC) consists of the current phoneme and its left and right phonemes.
A triphone is generally represented by a 3-state HMM.

The first state represents the left phoneme.
The middle state represents the current phoneme.
The last state represents the following phoneme.

The number of triphones for English is much larger than the number of phonemes

The recognition system compares the input with stored models.
Two comparison approaches:

The Baum-Welch maximum likelihood algorithm computes probability scores between the input and the stored models and selects the best match.
The Viterbi algorithm looks for the best path.
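A minimal Viterbi sketch over a discrete-observation HMM. The 2-state model and its probabilities are made up purely to show the best-path dynamic program; real recognizers score continuous MFCC frames, typically with Gaussian mixture emissions.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete observation sequence.

    pi: (S,)   initial state probabilities
    A:  (S, S) transitions, A[i, j] = P(state j | state i)
    B:  (S, V) emissions,   B[s, o] = P(observation o | state s)
    """
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))              # best log-score ending in each state
    psi = np.zeros((T, S), dtype=int)     # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # (from-state, to-state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):         # backtrack along the best path
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])   # states tend to persist
B = np.array([[0.9, 0.1], [0.1, 0.9]])   # each state favors one symbol
print(viterbi([0, 0, 1, 1], pi, A, B))   # [0, 0, 1, 1]
```

Working in log probabilities keeps long sequences from underflowing; the backpointers recover the single best state path rather than the summed likelihood that Baum-Welch computes.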

Evaluation of Speech Recognition System


Vocabulary size and flexibility
Required sentence and application structures
The end users
Type and amount of noise
Stress placed upon the person using the application

Basic classes of errors


Deletion: dropping a word
Substitution: replacing a word with another word
Insertion: adding a word
Rejection: the utterance cannot be recognized by the program

A high threshold causes more rejections; a low threshold causes more substitution or insertion errors.
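Deletions, substitutions, and insertions are exactly the operations of word-level edit distance, which is how word error rate is commonly computed. A minimal sketch; the function name and the normalization by reference length are illustrative conventions, not from the text.

```python
def word_error_rate(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimal edits aligning the first i ref words with j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])   # match or substitute
            dp[i][j] = min(sub,
                           dp[i - 1][j] + 1,                  # deletion
                           dp[i][j - 1] + 1)                  # insertion
    return dp[len(r)][len(h)] / max(len(r), 1)

print(word_error_rate("take the toll road", "take the road"))  # 0.25 (one deletion)
```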


Co-articulation
Inter-speaker differences
Intra-speaker inconsistencies
Robustness of a system: how the system performs under variability

Corpus: reference database for training; it includes machine-readable dictionaries, word lists, and published materials from specific professions.
Homophones: same pronunciation, different spelling (e.g., one and won)
Active vocabulary: the set of words the application expects to be spoken at any one time.

Grammars (models, scripts) are used to structure words to reduce perplexity, increase speed and accuracy, and enhance vocabulary flexibility.

Finite state grammars
Probabilistic models
Linguistics-based grammars

Search through the vocabularies to find the best match.
Branching factor: the number of items in the active vocabulary at a single point in the recognition process.
Perplexity is often used to refer to the average branching factor.

A high branching factor means high recognition time.


A total vocabulary of 1000 words
Input: "Take the toll road to Milwaukee"
With no grammar, the branching factor at each point is 1000, so the perplexity is 1000.
With a finite state grammar:

Take the TYPE road to PLACE
TYPE = high OR toll OR back OR rocky OR long
PLACE = Milwaukee OR Kokomo OR nowhere OR Arby OR Rio

The branching factor of the first, second, fourth, and fifth positions is one; the branching factor of the third and sixth is five.
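The branching-factor arithmetic above can be checked with a few lines of Python. The slot lists come from the grammar; the arithmetic-mean reading of "average branching factor" is one simple interpretation, chosen here for illustration.

```python
# Each slot of "Take the TYPE road to PLACE" as a list of alternatives.
grammar = [
    ["Take"],
    ["the"],
    ["high", "toll", "back", "rocky", "long"],           # TYPE
    ["road"],
    ["to"],
    ["Milwaukee", "Kokomo", "nowhere", "Arby", "Rio"],   # PLACE
]
branching = [len(slot) for slot in grammar]
print(branching)                     # [1, 1, 5, 1, 1, 5]
avg_branching = sum(branching) / len(branching)
print(avg_branching)                 # 14/6, about 2.33, vs. 1000 with no grammar
```

Constraining each position from 1000 candidates to at most five is what makes the search tractable and accurate.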

Weaknesses of Finite-State Grammars

Users cannot deviate from the patterns
Cannot rank the probability of occurrence to improve speed and accuracy

Statistical Models

Often used in dictation systems
Specify what is likely instead of what is allowed
Two forms of statistical modeling:

N-gram models
N-class models

N-gram Model

Identify the current (unknown) word by assuming that its identity depends on the previous N-1 words and the acoustic information of the unknown word.

Example: trigram (N=3), i.e., the two words prior to the unknown word
"This is my printer [unknown word]"
The unknown word would be identified using the two prior words "my printer" and the acoustic information of the current word.


Good for large vocabulary dictation applications.
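A toy trigram predictor makes the idea concrete. The corpus sentences and function names are invented for illustration, and a real dictation system would combine these counts with the acoustic score rather than use them alone.

```python
from collections import Counter, defaultdict

def train_trigrams(corpus):
    """Count trigrams: P(w | u, v) is estimated as count(u, v, w) / count(u, v)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for u, v, w in zip(words, words[1:], words[2:]):
            counts[(u, v)][w] += 1
    return counts

def predict(counts, u, v):
    """Most likely next word given the two previous words (None if unseen)."""
    if (u, v) not in counts:
        return None
    return counts[(u, v)].most_common(1)[0][0]

corpus = [
    "this is my printer cable",
    "where is my printer driver",
    "my printer driver failed",
]
print(predict(train_trigrams(corpus), "my", "printer"))  # driver
```

Here "my printer" was followed by "driver" twice and "cable" once, so the language model ranks "driver" highest; the acoustic score of the unknown word would then confirm or overrule that prior.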

N-class model

Extends the concept of N-gram modeling to syntactic categories.
Bi-class modeling calculates the probability that two categories will appear in succession.
Example of bi-class:

Article: a, an, the Countable noun: table, book, shoe


The probability of an article followed by a countable noun
Good for corpora much smaller than N-gram modeling requires

Linguistics-Based Grammars

Aim to understand what a user has said as well as identify the spoken words.
Context-free grammars are often used.