1. INTRODUCTION

Voice recognition is the ability of machines to respond to spoken commands. Among the earliest applications for speech recognition were automated telephone systems and medical dictation software (transcription). Voice recognition is used today by thousands of people every day. Systems such as calling cards and phone banking services use speech recognition by prompting the user to answer questions by voice rather than by pressing digits. Voice recognition information systems have become so advanced and mainstream that business and health care professionals are turning to voice recognition solutions for everything from providing telephone support to writing medical reports. Technological advances have made voice recognition software and devices more functional and user friendly, with most contemporary products performing tasks with over 90 percent accuracy.

Voice recognition systems can be classified into two categories: speaker-dependent and speaker-independent. Speaker-dependent systems work by comparing a whole-word input with a user-supplied pattern. These patterns are developed by the user during the training exercise. Accuracy rates are typically less than 90%. Speaker-independent systems require no training sessions.

2. VOICE PRODUCTION

We speak using a pulmonic egressive airstream (i.e. when we breathe out). This airstream sets the vocal folds in motion, producing voicing. Moving the active articulators (tongue, lips, mandible, uvula, and posterior pharyngeal wall) against the non-movable structures, or passive articulators (teeth, palate), changes the shape of the supralaryngeal vocal tract and modifies the sound being produced. The result is sounds that are very similar to other sounds found in nature but which humans perceive as speech because other humans articulate them.

Fig. 1 Graphical Model of Human Speech Production through Vocal Tract
3. ANNOTATION

Some fundamental ideas about the articulatory production of the numbers from 0 to 9 and of the words "call" and "end", together with the major classes into which their sounds are divided:

One:
Rounded half-close back vowel: There are three possible resonators involved in the articulation of a vowel: the oral cavity, the labial cavity, and the nasal cavity. If the lips are pushed forward and rounded, a third, labial resonator is formed. A rounded vowel means the labial resonator is active, and a back vowel means the tongue body is in the post-palatal or velar region.

Two:
Voiceless dental or alveolar stop: The tongue makes contact with the alveolar ridge directly above the front teeth. "Two" belongs to the class of aspirated words, which means you feel a breath of air as you say the word.

Three:
Voiceless dental or alveolar stop: The tongue makes contact with the front teeth. "Three" is also an aspirate.

Four & Five:
Voiceless labiodental fricative: The lower lip is brought close to the upper teeth, occasionally even grazing the teeth with its outer surface, or with its inner surface, imparting in this case a slight hushing sound.

... the signal. After that, the signal is run through a pre-emphasis network. The pre-emphasizer can be either fixed or slowly adaptive. For pre-emphasis, a fixed first-order system is recommended. The most widely used pre-emphasis network is

$$H(z) = 1 - a z^{-1}$$

where $a$ is the pre-emphasis coefficient.

Let $x(k)$ be the digitized filtered input signal. The output $y(k)$ is then related to the input signal by the following equation:

$$y(k) = x(k) - a\,x(k-1)$$

The value for the variable $a$ is usually chosen to be around 0.95. This means that about 95% of any one sample is presumed to originate from the previous sample.
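
The pre-emphasis step can be illustrated with a few lines of code. This is a minimal Python sketch, assuming the common first-order form y(k) = x(k) - a*x(k-1) with a = 0.95. The function name `pre_emphasis` and the choice to treat the sample before the first one as zero are illustrative assumptions, not details of the Simulink model used in this work.

```python
def pre_emphasis(x, a=0.95):
    """Apply the first-order filter y[k] = x[k] - a * x[k-1].

    Assumption: the (non-existent) sample before x[0] is taken to be 0.
    """
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - a * prev)
        prev = sample
    return y


if __name__ == "__main__":
    # A constant (low-frequency) input is strongly attenuated after the
    # first sample, while a rapidly alternating input is boosted.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
    print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))
```

With a close to 1, slowly varying components are largely removed, which is the intended effect of flattening the speech spectrum before further analysis.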
4.2 Normalization
The name MATLAB stands for matrix laboratory. MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. MATLAB has extensive facilities for displaying vectors and matrices as graphs, as well as annotating and printing these graphs. It includes high-level functions for two-dimensional and three-dimensional data visualization, image processing, animation, and presentation graphics.

5.1.2 Simulink.

Simulink is a software package that enables you to model, simulate, and analyze systems whose outputs change over time. Such systems are often referred to as dynamic systems. Simulink can be used to explore the behavior of a wide range of real-world dynamic systems. The DSP Blockset brings the full power of Simulink to DSP system design and prototyping by providing key DSP algorithms and components in the adaptable block format of Simulink.

5.1.3 Data Acquisition Toolbox

The Data Acquisition Toolbox is a collection of M-file functions and MEX-file dynamic link libraries (DLLs) built on the MATLAB® technical computing environment. The toolbox provides a framework for bringing live, measured data into MATLAB using PC-compatible, plug-in data acquisition hardware. It provides support for analog input (AI), analog output (AO), and digital I/O (DIO) subsystems, including simultaneous analog I/O conversions. It also supports popular hardware vendors/devices such as the Windows ...

Embedded Target for TI C6000 DSP lets you use Simulink to model digital signal processing algorithms from blocks in the DSP Blockset, and then use Real-Time Workshop to generate C code targeted to the Texas Instruments Code Composer Studio Integrated Development Environment (CCS IDE). The Embedded Target for TI C6000 DSP takes the generated C code and uses Texas Instruments tools to build specific machine code depending on the TI board you use. The build process downloads the targeted machine code to the selected hardware and runs the executable on the digital signal processor. After the code is downloaded to the board, the digital signal processing application runs automatically on the target. When using this target, the build process creates a new project in Code Composer Studio and populates the project with the required files.
The following are blocks of this toolbox: ADC, DAC, LED, Switch.

5.2 Hardware Approach

Digital signal processors such as the TMS320C6x family of processors are like fast special-purpose microprocessors with a specialized type of architecture & instruction set appropriate for signal processing. The architecture of the C6x digital signal processor, which is based on the VLIW architecture, is very well suited for numerically intensive calculations.

5.2.1 C6x DSP's Architecture.

The C67x processor consists of three main parts: CPU, peripherals, and memory. The CPU consists of eight functional units, which operate in parallel. It is divided into two sides, A & B. Each side has a so-called
M unit (used for multiplication operations)
L unit (used for logical & arithmetic operations)
S unit (used for branch, bit manipulation, & arithmetic operations)
D unit (used for loading, storing & arithmetic operations)

Fig. 5 DSK TI C6711

... coefficients are obtained. Finally, using the Euclidean distance formula, the difference between the words is calculated. The single word of the testing phase is compared with the five words that were stored during the training phase, and five Euclidean distances are obtained in a similar manner. The Euclidean distances are then compared amongst themselves to determine the smallest value, and the word corresponding to that smallest distance is displayed as the recognized word.
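
The nearest-template comparison by Euclidean distance can be sketched as follows. This is a minimal Python illustration, assuming each word is represented by a fixed-length vector of coefficients. The names `euclidean_distance` and `recognize`, and the numbers in the demo, are hypothetical and are not data from this work.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length coefficient vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def recognize(test_vec, templates):
    """Return (best_word, distances) for a test vector.

    templates maps each stored word to its training-phase vector.
    The recognized word is the one with the smallest distance.
    """
    distances = {word: euclidean_distance(test_vec, vec)
                 for word, vec in templates.items()}
    best_word = min(distances, key=distances.get)
    return best_word, distances


if __name__ == "__main__":
    # Made-up three-coefficient templates for five stored words.
    templates = {
        "one": [1.0, 0.2, 0.1],
        "two": [0.3, 1.1, 0.4],
        "three": [0.2, 0.5, 1.3],
        "four": [0.9, 0.9, 0.1],
        "five": [0.1, 0.1, 0.1],
    }
    word, dists = recognize([0.95, 0.25, 0.12], templates)
    print(word)
```

Each test word yields one distance per stored template, so five templates give five distances, and the minimum selects the output.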
7. RESULTS:

Fig. 10 Waveform of the digit "one" after overlap analysis

8. TESTING:

For the tests we used a training set consisting of 15 occurrences of the digit "1" by 3 speakers (i.e., 5 occurrences per speaker). All the speakers were female. A range was set and it was found that the error rate was less than 30% (more than 70% correct classifications). Tables 8.1 and 8.2 display the error and efficiency for every speaker.
SPEAKER      TOTAL EFFICIENCY
SPEAKER 1    70%
SPEAKER 2    70%
SPEAKER 3    65%
SYSTEM       68.33%

Table 8.2
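
The system figure in Table 8.2 is consistent with an unweighted mean of the three per-speaker efficiencies. The check below assumes that this is how it was computed, which the text does not state explicitly:

```latex
\text{System efficiency} = \frac{70 + 70 + 65}{3} = \frac{205}{3} \approx 68.33\%
```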
9. CONCLUSION

We have discussed the implementation of speech recognition algorithms using Simulink rather than C code, since working with C is tedious and time consuming. We have used cepstrum analysis and dynamic time warping, which give about 68% accurate results.
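
For readers unfamiliar with dynamic time warping, the idea can be sketched in a few lines. This is a generic textbook-style Python sketch, not the Simulink implementation used in this work. It assumes one-dimensional sequences and an absolute-difference local cost, whereas a practical recognizer would typically compare per-frame vectors of cepstral coefficients. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences.

    cost[i][j] holds the minimum accumulated cost of aligning the
    first i samples of a with the first j samples of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed steps: advance in a only, in b only, or in both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # The second sequence is a time-stretched copy of the first,
    # so the warped alignment has zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The warping lets two utterances of the same word spoken at different speeds be compared sample against sample, which a plain Euclidean distance on unequal-length sequences cannot do.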