

DSP Implementation of Voice Recognition Using Dynamic Time Warping Algorithm

Fariha Muzaffar, Bushra Mohsin, Farah Naz
Lecturer Farooq Jawed
Department of Electronics Engineering, NED University of Engineering & Technology, Karachi

Abstract

Voice recognition is the process by which an automatic system perceives speech. This paper discusses voice recognition using Cepstral Analysis and Dynamic Time Warping (DTW) on a set of five words. The software model was designed using the DSP block library in Simulink. The developed model provides the necessary tools to record, filter, and analyze different voice samples and compare them with the archived samples. The paper concludes with a discussion of the implementation of the speech recognition algorithm on a DSP processor. The algorithm was implemented on the target DSP via the Embedded Target for TI DSP toolbox and Real-Time Workshop (RTW).

Keywords: DTW, Cepstrum analysis

1. INTRODUCTION

Voice recognition is the ability of machines to respond to spoken commands. Among the earliest applications for speech recognition were automated telephone systems and medical dictation (transcription) software. Voice recognition is used today by thousands of people every day. Systems such as calling cards and phone banking services use speech recognition by prompting the user to answer questions by voice rather than by pressing digits. Voice recognition information systems have become so advanced and mainstream that business and health care professionals are turning to voice recognition solutions for everything from providing telephone support to writing medical reports. Technological advances have made voice recognition software and devices more functional and user friendly, with most contemporary products performing tasks with over 90 percent accuracy.
Voice recognition systems can be classified into two categories: speaker-dependent and speaker-independent. Speaker-dependent systems work by comparing a whole-word input with a user-supplied pattern. These patterns are developed by the user during a training exercise. Accuracy rates are typically less than 90%. Speaker-independent systems require no training sessions.
A voice recognition system performs two fundamental operations: signal modeling and pattern matching. Signal modeling is the process of converting the speech signal into a set of parameters. Pattern matching is the task of finding the parameter set from memory which most closely matches the parameter set obtained from the input speech signal. We have designed a very basic speaker-dependent voice recognition system that identifies isolated spoken words using a limited vocabulary of five words. In this paper we briefly discuss the signal modeling approach for speech recognition. It is followed by an overview of the basic operations involved in signal modeling. Further, commonly used temporal and spectral analysis techniques for feature extraction are discussed in detail.

2. VOICE PRODUCTION

We speak using a pulmonic egressive airstream (i.e. when we breathe out). This airstream sets the vocal folds in motion, producing voicing. Moving the active articulators (tongue, lips, mandible, uvula, and posterior pharyngeal wall) against the non-movable structures, or passive articulators (teeth, palate), changes the shape of the supralaryngeal vocal tract and modifies the sound being produced. The result is sounds that are very similar to other sounds found in nature but which humans perceive as speech because other humans articulate them.

Fig. 1 Graphical model of human speech production through the vocal tract

3. ANNOTATION

Some fundamental ideas about the articulatory production of the numbers from 0 to 9, "call" and "end", and the major classes into which their sounds are divided, are given below.

One:
Rounded half-close back vowel. There are three possible resonators involved in the articulation of a vowel: the oral cavity, the labial cavity, and the nasal cavity. If the lips are pushed forward and rounded, a third, labial resonator is formed. A rounded vowel means the labial resonator is active, and a back vowel means the tongue body is in the post-palatal or velar region.

Two:
Voiceless dental or alveolar stop. The tongue makes contact with the alveolar ridge directly above the front teeth. "Two" belongs to the class of aspirated sounds, which means you feel a breath of air as you say the word.

Three:
Voiceless dental or alveolar stop. The tongue makes contact with the front teeth. "Three" is also aspirated.

Four & Five:
Voiceless labiodental fricative. The lower lip is brought close to the upper teeth, occasionally even grazing the teeth with its outer surface, or with its inner surface, imparting in this case a slight hushing sound.

4. VOICE ANALYSIS

Speech analysis is done after taking an input from the user through a microphone. The design of the system involves manipulation of the input audio signal. At different levels, different operations are performed on the input signal, such as pre-emphasis, normalization, windowing, cepstrum analysis, and recognition of the spoken word.

Fig. 2 Block diagram of the speech recognition system

4.1 Pre-Emphasis

The speech signal obtained from the microphone is digitized and put through a filter to spectrally flatten the signal. After that the signal is run through a pre-emphasis network. The pre-emphasizer can be either fixed or slowly adaptive. For pre-emphasis a fixed first-order system is recommended. The most widely used pre-emphasis network is

H(z) = 1 - a*z^(-1)

where 'a' is the pre-emphasis coefficient. Let x(k) be the digitized filtered input signal. The output y(k) is then related to the input signal by the following equation:

y(k) = x(k) - a*x(k-1)

The value of the variable 'a' is usually chosen to be around 0.95. This means that about 95% of any one sample is presumed to originate from the previous sample.
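As an illustration, this pre-emphasis step can be sketched in a few lines of MATLAB; the placeholder input vector and the variable names below are ours, not part of the paper's Simulink model:

% Pre-emphasis sketch (illustrative, not the paper's Simulink implementation)
x = randn(8000, 1);           % placeholder for one digitized spoken word
a = 0.95;                     % pre-emphasis coefficient; ~95% of the previous sample
y = filter([1 -a], 1, x);     % y(k) = x(k) - a*x(k-1)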
4.2 Normalization

After pre-emphasis, each word has its energy normalized. Based on the energy distribution along the temporal axis, the center of gravity is computed, and this information is used as a reference for the temporal alignment of the words. In practice, normalization is done so that the power in the signal after windowing is approximately equal to the power of the signal before windowing.
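The paper does not give the exact normalization formula, but a simple energy normalization consistent with this description might look as follows in MATLAB (an assumed form, continuing the pre-emphasis sketch above):

% Energy normalization sketch (assumed form)
y = y / sqrt(mean(y.^2));     % scale the word to unit average power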
4.3 Windowing

When we analyze signals, we tend to do so on only small portions at a time. The reason for this is that we typically assume the signal is constant over the time span of our analysis. Since speech actually changes rapidly, we have to cut it into small parts for this assumption to hold. Hence we windowed our speech signal before analysis. Windowing can be seen as multiplying a signal by a window which is zero everywhere except for the region of interest, where it is one. Since we pretend that our signals are infinite, we can discard all of the resulting zeros and concentrate on just the windowed portion of the signal. A rectangular window is called so because of its shape. One problem with this kind of window is the abrupt change at its edges, which can cause distortion in the signal being analyzed. To reduce this distortion we used a smoother window shape called a Hamming window. This window is zero at the edges and rises gradually to one in the middle. When this window is used, the edges of the signal are de-emphasized and the edge effects are reduced.
Each individual frame is windowed to minimize the signal discontinuities at the borders of each frame. If the window is defined as w[n], 0 <= n <= N-1, then the windowed signal is

y[n] = x[n] w[n],    0 <= n <= N-1.
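Continuing the sketch above, framing and Hamming windowing might be written in MATLAB as follows; the frame length and overlap are assumptions, since the paper does not state the values it used:

% Frame the normalized word and apply a Hamming window (illustrative parameters)
N    = 256;                              % frame length in samples (assumed)
step = 128;                              % 50% frame overlap (assumed)
w    = hamming(N);                       % zero at the edges, rises to one in the middle
nFrames = floor((length(y) - N) / step) + 1;
frames  = zeros(N, nFrames);
for k = 1:nFrames
    seg = y((k-1)*step + 1 : (k-1)*step + N);   % k-th frame x[n]
    frames(:, k) = seg .* w;                    % windowed frame y[n] = x[n] w[n]
end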
4.4 Cepstral Analysis

The source in voiced speech is the vibration of the vocal folds in response to airflow from the lungs. In unvoiced speech the sound source is not a regular vibration; rather, vibrations are caused by turbulent airflow due to a constriction in the vocal tract. The filter in speech production is the vocal tract. As with any other filter, this tube has a characteristic spectrum. In a real speech spectrum the overall filter shape and the location of the formants are often drowned out by the effects of the source spectrum. Therefore, we removed the source effects by taking the cepstrum of the signal and then studied the two spectra separately to achieve a more accurate picture of the precursors of the speech sound.
The cepstral analysis technique is very useful as it provides a methodology for separating the excitation from the vocal tract shape. In the linear acoustic model of speech production, the composite speech spectrum consists of an excitation signal filtered by a time-varying linear filter representing the vocal tract shape, as shown in the figure.

Fig. 3 Vocal tract filter

Hence, in the log domain the excitation and the vocal tract shape are superimposed and can be separated. The cepstrum is computed by taking the inverse discrete Fourier transform (IDFT) of the logarithm of the magnitude of the discrete Fourier transform of the finite-length input signal.

Fig. 4 Block diagram of the Cepstrum Analyzer
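Following the definition above (IDFT of the log magnitude of the DFT), the cepstral coefficients of each windowed frame might be computed in MATLAB as follows; keeping 12 coefficients per frame is an assumption, not a figure taken from the paper:

% Real cepstrum per frame (continuing the framing sketch above)
nCep = 12;                                    % number of cepstral coefficients kept (assumed)
C = zeros(nCep, nFrames);
for k = 1:nFrames
    spec = fft(frames(:, k));                 % DFT of the windowed frame
    ceps = real(ifft(log(abs(spec) + eps)));  % cepstrum; eps guards against log(0)
    C(:, k) = ceps(1:nCep);                   % low-quefrency part carries the vocal tract shape
end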
4.5 Recognition block

Several techniques can be used for the recognition of speech. The recognition technique we have used is Dynamic Time Warping.

4.5.1 Dynamic Time Warping

This technique allows the incoming word of speech and the template to be of different sizes. The size difference is due to variations in the length of words in normal spoken speech. These variations cause the frames to be out of alignment. Every word or phrase is stored in its own separate template. Speech input must be stored in these templates before recognition can take place. DTW operates by selecting which frames of the reference template best match each frame of the input such that the resulting error between them is minimized. Therefore the speaker must pre-program the recognizer with his or her own speech. By allowing multiple frames of one to be matched against a single repeated frame of the other, DTW can compress or expand relative time.
The difference between a new and a stored sample is called the Euclidean distance and is given by:

min_i [ d_i = (x - Z_i)^T (x - Z_i) ]

where x is the new sample and Z_i the i-th stored sample.

Dynamic programming is an approach to the implicit storage of all possible solutions to a problem requiring minimization of a global error. Applied to template matching for speech recognition, imagine a matrix D in which the rows correspond to frames of the reference template and the columns correspond to frames of the input template. For each matrix element in D a cumulative distortion measure is given by:

D(i,j) = d(i,j) + min{ D[p(i,j)] + T[(i,j), p(i,j)] }

where
'd' is the local distance measure between frame 'i' of the input and frame 'j' of the reference stored template,
p(i,j) is the coordinates of the possible previous points on the matching trajectory between the two templates, and
T( ) is a term for the cost associated with any particular transition.
5. SOFTWARE & HARDWARE PLATFORM

5.1 Software Approach

The following tools were used for the design and implementation of the speech recognition algorithm.

5.1.1 MATLAB

The name MATLAB stands for matrix laboratory. MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. MATLAB has extensive facilities for displaying vectors and matrices as graphs, as well as annotating and printing these graphs. It includes high-level functions for two-dimensional and three-dimensional data visualization, image processing, animation, and presentation graphics.

5.1.2 Simulink

Simulink is a software package that enables you to model, simulate, and analyze systems whose outputs change over time. Such systems are often referred to as dynamic systems. Simulink can be used to explore the behavior of a wide range of real-world dynamic systems. The DSP Blockset brings the full power of Simulink to DSP system design and prototyping by providing key DSP algorithms and components in the adaptable block format of Simulink.

5.1.3 Data Acquisition Toolbox

The Data Acquisition Toolbox is a collection of M-file functions and MEX-file dynamic link libraries (DLLs) built on the MATLAB technical computing environment. The toolbox provides a framework for bringing live, measured data into MATLAB using PC-compatible, plug-in data acquisition hardware. It provides support for analog input (AI), analog output (AO), and digital I/O (DIO) subsystems, including simultaneous analog I/O conversions. It also supports popular hardware vendors and devices such as Windows sound cards.

5.1.3a Softscope

Softscope is the Data Acquisition oscilloscope. It is an interactive graphical user interface (GUI) for streaming data into a display. When the Softscope is opened, hardware, math, and reference channels are displayed. With the Scale Channel option, data can be scaled horizontally and vertically according to the requirement. The Triggering option helps to control the initialization of data acquisition. Using the Export Data option we can save channel data or measurements to the workspace, a figure, or a MAT-file.

5.1.4 Embedded Target for TI C6000 DSP

With the Embedded Target for TI C6000 DSP, Simulink is used to model digital signal processing algorithms from blocks in the DSP Blockset, and Real-Time Workshop is then used to generate C code targeted to the Texas Instruments Code Composer Studio Integrated Development Environment (CCS IDE). The Embedded Target for TI C6000 DSP takes the generated C code and uses Texas Instruments tools to build specific machine code depending on the TI board in use. The build process downloads the targeted machine code to the selected hardware and runs the executable on the digital signal processor. After the code is downloaded to the board, the digital signal processing application runs automatically on the target. When this target is used, the build process creates a new project in Code Composer Studio and populates the project with the required files. The blocks of this toolbox include ADC, DAC, LED, and Switch.

5.2 Hardware Approach

Digital signal processors such as the TMS320C6x family are fast special-purpose microprocessors with a specialized type of architecture and an instruction set appropriate for signal processing. The architecture of the C6x digital signal processor, which is based on a VLIW (very long instruction word) architecture, is very well suited for numerically intensive calculations.

5.2.1 C6x DSP Architecture

The C67x processor consists of three main parts: CPU, peripherals, and memory. The CPU consists of eight functional units, which operate in parallel. It is divided into two sides, A and B.
Each side has a so-called:
M unit (used for multiplication operations)
L unit (used for logical and arithmetic operations)
S unit (used for branch, bit manipulation, and arithmetic operations)
D unit (used for loading, storing, and arithmetic operations)

Fig. 5 DSK TI C6711

5.2.2 HILS (Hardware-in-the-Loop Simulation)

The purpose of HILS is to test an embedded system. The proof that the embedded system passed its test is that its outputs were correct for the inputs that it was given. HILS can be applied very effectively in other areas too. For instance, machine control and motion control are two areas where it is hard to completely test the software before the expensive, fragile, and often unique machine is built. HILS proves to be quite useful in testing the software in such cases. One limitation of HILS is that it cannot easily stop: if you pause the hardware-in-the-loop simulator, all the components that it is attached to, including the embedded program in your system under test, keep running. Furthermore, a HILS cannot tell what is going on inside the embedded software; it can only read the embedded system's outputs.

6. SIMULINK MODELS

6.1 Simulation Model

The figure shows the simulation model of the voice recognizer. Input words, for training and testing, are taken through the Softscope present in the Data Acquisition Toolbox. These words are saved in the MATLAB workspace as arrays. During simulation these words from the workspace are taken as input. Different operations are performed on them at different stages and their cepstral coefficients are obtained. Finally, using the Euclidean distance formula, the difference between the words is calculated. The single word of the testing phase is compared with the five words which were stored during the training phase, and five Euclidean distances are obtained in a similar manner. The Euclidean distances are then compared amongst themselves to determine the smallest value, which is then displayed as the recognized word.

Fig. 6 Simulink simulation model of the voice recognizer
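This decision step might be sketched in MATLAB as follows, using the dtw_distance function sketched in Section 4.5.1; the template and test variable names are assumptions introduced here for illustration:

% Recognition sketch: choose the stored word whose template is closest to the test word
words     = {'one', 'two', 'three', 'four', 'five'};
templates = {C_one, C_two, C_three, C_four, C_five};   % cepstral templates from training (assumed names)
dists = zeros(1, numel(templates));
for k = 1:numel(templates)
    dists(k) = dtw_distance(C_test, templates{k});     % distance of the test word to each template
end
[~, idx] = min(dists);                                  % the smallest distance wins
disp(['Recognized word: ' words{idx}]);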
6.2 Embedded Model

Fig. 7 Embedded model of the voice recognizer

The above figure shows the embedded model. This voice recognition model is embedded on the target DSP C6711. Results from the hardware are obtained using hardware-in-the-loop simulation (HILS). The DSP with the program embedded on it is plugged into the backplane of the PC. The PC runs the user interface and data-logging code, and the DSP runs the simulation and the analog I/O. The input of the DSP is connected to the "line out" of the PC and the output of the DSP is connected to the "line in" of the PC. During the training phase, wav files of the designed vocabulary are played on the PC. The DSP takes them as input and calculates their cepstral coefficients. These coefficients are fed into the line in of the PC and are acquired with the help of the
Softscope present in the Data Acquisition Toolbox of MATLAB. Similarly, another wav file is played during the testing phase and its cepstral coefficients are obtained in the same manner. These coefficients are saved in the MATLAB workspace. In order to recognize the word spoken during the testing phase, the Euclidean distance formula is applied to these coefficients and they are fed into the decision-making block.

7. RESULTS

Fig. 8 Waveform of the digit "one"
Fig. 9 Waveform of the digit "one" after pre-emphasis
Fig. 10 Waveform of the digit "one" after overlap analysis
Fig. 11 Waveform of the digit "one" after windowing
Fig. 12 Waveform of the digit "one" after cepstrum
Fig. 13 Waveform after calculating the Euclidean distance

8. TESTING

For the tests we used a training set consisting of 15 occurrences of the digit "1" by 3 speakers (i.e., 5 occurrences per speaker). All the speakers were female. A range was set and it was found that the error rate was less than 30% (more than 70% correct classifications). Tables 8.1 and 8.2 display the error and efficiency for every speaker.
DIGITS    SPEAKER 1    SPEAKER 2    SPEAKER 3
ONE       50%          75%          75%
TWO       75%          75%          75%
THREE     75%          75%          75%
FOUR      75%          75%          50%
FIVE      75%          50%          50%

Table 8.1 Recognition accuracy for each digit and speaker

             TOTAL EFFICIENCY
SPEAKER 1    70%
SPEAKER 2    70%
SPEAKER 3    65%
SYSTEM       68.33%

Table 8.2 Overall efficiency per speaker and for the whole system
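For reference, the per-speaker figures in Table 8.2 are consistent with the columns of Table 8.1 taken as simple averages: Speaker 1 gives (50 + 75 + 75 + 75 + 75)/5 = 70%, Speaker 2 gives (75 + 75 + 75 + 75 + 50)/5 = 70%, Speaker 3 gives (75 + 75 + 75 + 50 + 50)/5 = 65%, and the overall system efficiency is the mean of the three speakers, (70 + 70 + 65)/3 ≈ 68.33%.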

9. CONCLUSION

We have discussed the implementation of speech recognition algorithms using Simulink rather than hand-written C code, since working directly in C is tedious and time-consuming. We have used Cepstrum Analysis and Dynamic Time Warping, which give about 68% accurate results.

