2. Related Works
Research on speech recognition has been ongoing since the 1930s; however, work on recognizing Bangla speech began only around 2000. Here we mention the research works found so far on Bangla speech recognition.

Figure 1: Block diagram for Speech Recognition
3.1 Speech Data Extraction or Preprocessing
In this stage, the first step is to record the speech data with a microphone in a specified format (wav file, 16000 Hz, 16 bits). This wav data is then converted into a form suitable for further computer processing and analysis through a series of processes involving noise elimination and speech end-point detection.

3.1.1 Noise Elimination
We used an adaptive filter to eliminate noise from the recorded speech signal. A sample of the surrounding environment is used as input to the adaptive filter (M. R. Hassan, 2003). The extracted voice data is simply the subtraction of the predicted noise from the speech signal.

3.1.2 End Point Detection
We used the generalized end-point detection algorithm presented in (K. Roy, 2002; K. J. Rahman, 2003; M. R. Islam, 2005) to identify the start and end points of the speech signal. The goal of this process is to detect the presence of voice and to remove pauses and silences in the background noise. For continuous speech we use this algorithm only for start- and end-point detection; for isolated speech, intermediate noisy and unwanted segments within the speech are eliminated as well. A background unvoiced sample of the current recording environment was taken, split into frames of known size, and the energy of each frame was calculated. The frame with the maximum energy was taken as the threshold for end-point detection. The recorded voice sample was then split into frames in the same manner, and each frame was compared with the maximum-energy noise frame; the point where the voice frame's energy fell below that of the maximum noise frame was taken as the end point.

3.2 Feature Extraction
In this stage we extract meaningful, unique features from the preprocessed speech data. From the comparison of features described in (M. F. Khan, 2002) we decided to use MFCC features. In addition to the MFCCs we calculate energy, delta coefficients and acceleration coefficients. The total number of features is therefore 39: 12 MFCCs, 1 energy, 13 first-order derivatives and 13 second-order derivatives. Feature vectors are extracted with a Hamming window function of window length 25 ms. For the elimination of unwanted frequency components, a 26-channel filter bank is used with a pre-emphasis coefficient of 0.97.

3.3 Pattern Recognition
The tasks in this step are divided into two phases, Training and Recognition. We used a Hidden Markov Model (HMM) based classifier; an elaborate discussion of the HMM for speech recognition is given in (L. Rabiner, 1989). In our training methodology we created word-based HMM models for isolated speech and phoneme-based HMM models for continuous speech recognition. We used the following training algorithm for creating the HMM models:
Step 1: Initialize λ = (π, A, B)
Step 2: Compute probabilities using λ
Step 3: Adjust λ = λ'
Step 4: Repeat steps 2-3 until convergence

3.4 Dictionary and Language model
Our approach to isolated speech recognition is quite simple: we use just a plain dictionary containing only input-output-HMMmodelName entries, and no language model is necessary. For continuous speech recognition, however, we created a pronunciation dictionary containing an input-output-pronunciation entry for each word, where the pronunciation describes the sequence of HMMs that constitute the word. For each word the output is provided as a Unicode sequence and the pronunciation is given with the phoneme as its unit. Since continuous speech recognition recognizes a sequence of words, it is necessary to incorporate a language model. We used the regular-grammar modeling technique as our language model, which has properties such as a finite-state model, a small vocabulary and a restricted grammar. We create a word-level network that represents the grammar, which defines all the legal word sequences explicitly. The regular grammar model outputs only two probability values: P(W) = 1 where W is valid in the word network, and P(W) = 0 otherwise. A Task Grammar defines all of the legal word sequences explicitly, whereas a Word Loop simply puts all words of the vocabulary in a loop and therefore allows any word to follow any other word. Word-loop networks are often augmented by a stochastic language model (S. Young, 2001-2005).

4. Implementation
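Before detailing the HTK-based implementation, the training loop of Section 3.3 (Steps 1-4) can be made concrete with a small example. The sketch below is a minimal pure-Python Baum-Welch re-estimation for a discrete-observation HMM, for illustration only: the actual system uses HTK's tools with continuous MFCC observations, and the function names (`forward`, `backward`, `baum_welch`) are ours, not HTK's.

```python
import math

def forward(A, B, pi, obs):
    """Forward pass: returns alpha vectors and P(obs | lambda)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
                      for j in range(N)])
    return alpha, sum(alpha[-1])

def backward(A, B, obs):
    """Backward pass: beta[t][i] = P(obs[t+1:] | state i at t, lambda)."""
    N = len(A)
    beta = [[1.0] * N]
    for t in range(len(obs) - 2, -1, -1):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta

def baum_welch(A, B, pi, obs, iters=20, tol=1e-6):
    """Steps 1-4: re-estimate lambda = (pi, A, B) until convergence."""
    prev_ll = float("-inf")
    for _ in range(iters):
        alpha, p_obs = forward(A, B, pi, obs)          # Step 2: probabilities
        beta = backward(A, B, obs)
        N, T, M = len(pi), len(obs), len(B[0])
        # gamma[t][i]: probability of being in state i at time t
        gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
                 for t in range(T)]
        # xi[t][i][j]: probability of transition i -> j at time t
        xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
                for j in range(N)] for i in range(N)] for t in range(T - 1)]
        pi = gamma[0][:]                               # Step 3: adjust lambda
        for i in range(N):
            denom = sum(gamma[t][i] for t in range(T - 1))
            for j in range(N):
                A[i][j] = sum(xi[t][i][j] for t in range(T - 1)) / denom
            total = sum(gamma[t][i] for t in range(T))
            for k in range(M):
                B[i][k] = sum(gamma[t][i] for t in range(T) if obs[t] == k) / total
        ll = math.log(p_obs)
        if ll - prev_ll < tol:                         # Step 4: converged
            break
        prev_ll = ll
    return A, B, pi
```

Each iteration provably never decreases the data likelihood, which is why the "repeat until convergence" loop of Step 4 terminates in practice; HTK's HInit and HRest perform the analogous initialization and re-estimation for the continuous-density models used below.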
We used HTK as the core engine for our speech recognizer because of its availability, portability and sophisticated facilities for speech analysis, HMM training, testing and result analysis (S. Young, 2001-2005). We used our own implementations of the algorithms discussed earlier for the preprocessing task and used HTK for the rest, with some specified parameters. The preprocessing task is the same for both types of recognition and outputs the valid speech data from the recorded speech signal. Here we briefly discuss the implementation differences between isolated and continuous speech recognition from a pattern-recognition point of view. We followed (A K M Mahmudul Hoque, 2006) for the implementation of isolated speech recognition.

4.1 Isolated speech recognition
4.1.1 Prepare Training Data
The first step is to label the speech data for each word of the dictionary. The label is the same as the text that represents the spoken word. We take 5 to 10 samples for each word and save each labeled file as the word followed by the sample number (e.g., "kolom_2.lab" is the labeled file of the word "kolom" for sample number 2). We use the HSLab tool (S. Young, 2001-2005) for this task. The second step is to extract the feature vectors from the sample files. All the specifications for feature extraction (described in the previous section) are written into a configuration file. The HCopy tool (S. Young, 2001-2005) is used, which automatically extracts the features according to the configuration and saves them into a file.
The next step is model training. For each word V in the vocabulary we must build an HMM model, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training-set observation vectors of the Vth word (L. Rabiner, 1989). To create a model, we first have to choose a priori a topology for each HMM. We chose an HMM prototype with 4 active states and 2 non-emitting states; the prototype is depicted in figure-2. Before beginning the training, we initialized the HMM model for each word with the HInit tool (S. Young, 2001-2005). After initialization, the models are trained with the feature data set. Training is an iterative process that continues until all the models reach convergence. We used the HRest tool to re-estimate the model parameters iteratively. After the completion of all these tasks, we have a separate model for each word in the dictionary.

Figure 2: HMM prototype for each model

4.1.2 Recognition
The recognizer can recognize only the words defined in the dictionary. For each unknown word to be recognized, preprocessing and feature analysis must first be carried out, that is, measurement of the observation sequence via a feature analysis of the speech corresponding to the word; followed by calculation of the model likelihoods for all possible HMM models; followed by selection of the model whose likelihood is highest. The probability computation step is generally performed using the Viterbi algorithm. We used the HVite tool (S. Young, 2001-2005) to perform the recognition task and provide the recognized output as text. In addition to the dictionary, the extracted features and the HMM list, HVite requires a word network, as is the HTK convention. So, to build a word network, we include a very simple grammar (see figure-5) with a single state containing just the words in the dictionary. We used the HParse tool (S. Young, 2001-2005) to create the word network.

4.2 Continuous speech recognition
4.2.1 Prepare Training Data
Continuous Speech Recognition (CSR) involves connecting HMMs together in sequence, where each model in the sequence corresponds directly to the assumed underlying symbol. For CSR we moved to a phoneme-based HMM model. For this, we first designed the phoneme set for the Bangla language, taking the mono-phone as the phoneme unit. We selected 47 mono-phones from our IPA chart (M. A. Hai, 2004). We then created a set of phonetically balanced (PB) sentences consisting of 52 sentences in total, transcribed those sentences with the appropriate phone labels, and saved them in the HTK-specified Master Label File (mlf) format (S. Young, 2001-2005). The recorded speech of the PB-labeled sentences is taken as the training data for CSR. Our PB sentences contain a total of 1814 phonemes. Table-1 illustrates the phoneme distribution.
The table lists each monophone in Bangla, the corresponding IPA symbol, and the frequency of the monophone in the PB sentences. Next, acoustic analysis is performed on these training data to extract features. For this, the number of features, the configuration parameters and the tool are exactly the same as for isolated speech recognition.

Monophone | IPA | Frequency
-     | sil | 101
a     | ɔ   | 161
আ     | a   | 217
i     | i   | 117
ঈ     | iː  | 3
u     | u   | 42
ঊ     | uː  | 1
eয্া   | ӕ   | 1
e     | e   | 165
o     | o   | 61
আঁ     | ã   | 5
iঁ     | ĩ   | 2
uঁ     | ũ   | 2
oঁ     | õ   | 2
eঁ     | ẽ   | 4
ক     | k   | 102
খ     | kʰ  | -
গ     | g   | -
ঘ     | gʰ  | -
ঙ     | ŋ   | -
চ     | c   | -
ছ     | cʰ  | -
জ     | ɟ   | -
ঝ     | ɟʰ  | -
ঞ     | No  | -
ত     | t̪   | -
থ     | t̪ʰ  | -
দ     | d̪   | -
ধ     | d̪ʰ  | -
ন     | n   | -
ট     | t   | -
ঠ     | tʰ  | 2
da দ  | d̪   | 33
dah ধ | d̪ʰ  | 12
p প   | p   | 39
ph ফ  | pʰ  | 12
b ব   | b   | 67
bh ভ  | bʰ  | 14
m ম   | m   | 36
z য   | ʤ   | 13
r র   | r   | 135
l ল   | l   | 52
s শ   | ʃ   | 30
sh স  | s   | 23
h হ   | h   | 28
ra ঢ়  | ɾ   | 4
y য়   | j   | 22

Table 1: Phoneme Distribution

The next step is model training. To create a model, here we chose an HMM prototype with 3 active states and 2 non-emitting states. We initialize the HMM models using HInit. Then we create an HTK-specified Master Macro File (mmf) for all monophones using the prototype HMM file. Next we re-estimate the parameters using the HRest tool. After the completion of all these tasks, we have a separate model for each phoneme.

4.2.2 Recognition
For a continuous speech signal to be recognized, preprocessing and feature extraction (using the HCopy tool) are done first. Then the signal is recognized using the HVite tool with the assistance of the regular-grammar-based language modeling technique. We create a regular grammar and convert it to an intermediate decoding-network form using the HParse tool. Networks are specified using the HTK Standard Lattice Format (SLF). In the grammar we define either the legal word sequences explicitly, for constructing valid sentences of the language, or a word loop, which simply puts in all words of the vocabulary.

5. Result Analysis
In this research work we emphasize the application of the HMM technique to recognizing Bangla speech, as no such work has been reported, and we evaluate the performance from several aspects. We have taken a vocabulary of 100 words and test samples from 5 different speakers to observe the performance. For isolated speech recognition we recorded the words for training in a normal office environment, where several samples (5-10) with little variation were taken for each word. The recognizer is capable of recognizing each spoken word existing in the dictionary only when the words are spoken by the same speaker and the mood of the speaker is the same. However, for a different speaker the performance decreases by almost 20%. For continuous speech recognition we used the same 100 words for building the regular grammar. Table-2 shows the performance of both recognition systems.

SR Type    | Speaker Dependent | Speaker Independent
Isolated   | 90%               | 70%
Continuous | 80%               | 60%

Table 2: Performance analysis

Recognizing continuous speech with an ANN classifier has an average accuracy rate of 73.36% (K. J. Rahman, 2003); a three-layer Back-Propagation Neural Network has a maximum accuracy rate of 86.67% (M. R. Islam, 2005); and spoken-letter recognition by measuring Euclidean distance, which can recognize only the vowels, has an 80% accuracy rate (A H M. Rezaul Karim, 2002). In comparison, the recognizer presented in this paper has an average accuracy rate of 85%. The performance analysis reveals the importance of improving recognition across different speakers. Several studies on SR systems emphasize training data with a variety of speakers to increase performance. So, next we should put our effort into collecting training data from different speakers and observing the performance.

6. Applications
The domains where speech recognition technology can be applied include automatic translation, automotive speech recognition,
dictation, hands-free computing (voice command recognition as a computer user interface), home automation, interactive voice response, medical transcription, mobile telephony, pronunciation evaluation in computer-aided language learning applications, and robotics. In our research work we are considering isolated speech recognition for command and control, data entry, mobile telephony and home automation tasks. On the other hand, continuous speech recognition can be used for speech-to-text conversion.

7. Conclusion
In this paper we concentrated on the research and development of a Bangla speech recognizer using appropriate techniques and tools. We have studied the past works, and to the best of our knowledge this work is the first reported attempt to recognize Bangla speech using the HMM technique with the assistance of a stochastic language model. We observed that the language specification is not significant for ISR; however, it is of great importance for CSR, especially the language-specific issues of constructing the phoneme set, the phonetically balanced sentences and the regular grammar for the Bangla language. This paper describes the theory and implementation details of our entire development task using the HTK tool. This work can be extended to further research on connected-word recognition as an extension of isolated speech recognition, and to performance measurement of diphone- or triphone-based HMM models as an extension of continuous speech recognition.

References

K. J. Rahman, M. A. Hossain, D. Das, A. Z. M. Touhidul Islam and M. G. Ali, "Continuous Bangla Speech Recognition System", Proc. 6th Int. Conf. on Computer and Information Technology (ICCIT), Dhaka, 2003.

K. Roy, D. Das and M. G. Ali, "Development of the Speech Recognition System Using Artificial Neural Network", Proc. 5th ICCIT, 2002.

L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

L. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition", 1st edition, Prentice Hall, New Jersey, 1993.

M. A. Hai, "Dhonibiggan O Bangla Dhonitotto", 8th edition, Mollik Brothers, Dhaka, 2004.

M. F. Khan and R. C. Debnath, "Comparative Study of Feature Extraction Methods for Bangla Phoneme Recognition", Proc. 6th ICCIT, Dhaka, 2002.

M. R. Hassan, B. Nath and M. Ala Uddin Bhuiyan, "Bengali Phoneme Recognition: A New Approach", Proc. 6th ICCIT, Dhaka, 2003.

M. R. Islam, A. Sayeed M. Sohail, M. W. H. Sadid and M. A. Mottalib, "Bangla Speech Recognition using Three-Layer Back-Propagation Neural Network", Proc. NCCPB, Dhaka, 2005.