
Isolated and Continuous Bangla Speech Recognition: Implementation, Performance and Application Perspective


Md. Abul Hasnat, Jabir Mowla, Mumit Khan
Department of Computer Science and Engineering, BRAC University,
66 Mohakhali, Dhaka, Bangladesh
e-mail: mhasnat@gmail.com, jabir_jr@yahoo.com, mumit@bracu.ac.bd

Abstract

Research on automatic speech recognition has progressed steadily since the 1930s, and the major advances have come since 1980 with the introduction of statistical modeling of speech, whose key technologies are the Hidden Markov Model (HMM) and the stochastic language model (B. H. Juang, 2005). However, the existing reported research on Bangla speech recognition has not yet incorporated the HMM technique or a language model. This paper presents two different types of Bangla speech recognition from the implementation, performance and application perspectives. We use the HMM technique for pattern classification and also incorporate a stochastic language model into the system. At the signal preprocessing level we perform adaptive noise elimination and end point detection. Spectral feature vectors, namely Mel Frequency Cepstral Coefficients (MFCC) with the addition of first and second order coefficients, are extracted from each speech waveform. The system is implemented using the Cambridge Hidden Markov Model Toolkit (HTK) (S. Young, 2001-2005).

1. Introduction

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words. Research in this area has attracted a great deal of attention over the past five decades; several technologies have been applied, and efforts have been made to raise performance to a marketplace standard so that users can benefit in a variety of ways. During this long research period several key technologies were applied, among which the combination of the hidden Markov model (HMM) and the stochastic language model produces high performance (B. H. Juang, 2005). Most of the research effort on recognizing Bangla speech has used ANN based classifiers. No research work has been reported yet that uses the DTW technique or an HMM based classifier, and no language model is included in the existing research works.

An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Continuous speech consists of continuous utterances, which are representative of real speech. On the other hand, a sentence constructed from connected words does not represent real speech, as it is actually a concatenation of isolated words. For isolated words the assumption is that the speech to be recognized comprises a single word or phrase and is to be recognized as a complete entity, with no explicit knowledge of or regard for the phonetic content of the word or phrase. Hence, for a vocabulary of V words (or phrases), the recognition algorithm consists of matching the measured sequence of spectral vectors of the unknown spoken input against each of the stored spectral patterns for the V words, and selecting as the recognized word the pattern whose accumulated time-aligned spectral distance is smallest (L. Rabiner, 1993). The notion of isolated speech recognition can be extended to connected speech recognition if we consider a small vocabulary and solve the co-articulation problem that arises between words (S. Furui, 2001). In continuous speech recognition, continuously uttered sentences are recognized. The standard approach to continuous speech recognition is to assume a simple probabilistic model of speech production whereby a specified word sequence, W, produces an acoustic observation sequence, so that the decoded string has the maximum a posteriori probability (L. Rabiner, 1993). In continuous speech recognition it is very important to use sophisticated linguistic knowledge. The most appropriate units for enabling recognition success depend on the type of recognition and on the size of the vocabulary. Various units of reference templates/models, from phonemes to words, have been studied. When words are used as units, word recognition can be expected to be highly accurate; however, it requires larger memory and more computation. Using phonemes as units does not greatly increase memory size requirements or computation (S. Furui, 2001). In our experiment we used the word as the unit for isolated speech recognition and the phoneme as the unit for continuous speech recognition.

The Hidden Markov Model is a powerful modeling technique for discrete state processes. The basic idea behind the HMM is that the observation sequence generated by the system exists in a finite number of states in the model, and at each time step the model makes a state transition and gives a probability as the output. More precisely, a Hidden Markov Model is defined as the triple λ := (π, A, B). For a real-world implementation of the Hidden Markov Model, three problems must be solved: the evaluation problem, the decoding problem and the learning problem (S. Furui, 2001). The forward algorithm is used to solve the evaluation problem, the Viterbi algorithm is used for the decoding problem, and all parameters are adjusted to solve the learning problem.

The outline of the paper is as follows. We begin in section 2 with the related works that describe the past efforts on Bangla speech recognition. Section 3 discusses the details of the overall system for both isolated and continuous speech recognition. Section 4 describes the implementation details. Section 5 presents the result analysis. Section 6 discusses the applications, and we end the discussion with the conclusion in section 7.

2. Related Works

Research on speech recognition started in the 1930s; however, research work on recognizing Bangla speech started only around 2000. Here we mention the research works found so far on Bangla speech recognition. K. Roy (2002) performed the recognition with an Artificial Neural Network (ANN) using a back-propagation neural network; they used DSP techniques to extract the features of the speech signal. M. R. Hassan (2003) presents a phoneme recognition approach using an ANN as the classifier; they calculated the RMS energy level as the feature from the filtered digitized signal. A. H. M. Rezaul Karim (2002) presents a technique to recognize Bangla phonemes using the Euclidean distance measure; reflection coefficients and autocorrelations were used as features. K. J. Rahman (2003) presents a continuous Bangla speech recognition system using an ANN; they employed a word separation algorithm to separate the words and applied Fourier-transform-based spectral analysis to generate the feature vectors from each isolated word. M. R. Islam (2005) presents a Bangla ASR system that employed a three-layer back-propagation neural network as the classifier. S. A. Hossain (2004) presents a brief overview of Bangla speech synthesis and recognition. A comparative study of feature extraction methods is presented by M. F. Khan (2002).

3. Methodology / Overview of the Systems

The block diagram of a canonical speech recognition system is shown in figure 1. We can subdivide the entire model into three major parts: speech data extraction or preprocessing, feature extraction, and pattern recognition. Although the basic theory of both types of speech recognition system in the pattern recognition approach is quite similar, they apply different strategies for incorporating the language model and dictionary into the overall system model. They also use different styles of data representation in both their training and recognition systems.

Figure 1: Block diagram for Speech Recognition
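The three-stage decomposition above (preprocessing, feature extraction, pattern recognition) can be sketched as a simple pipeline. This is an illustrative toy, not the authors' code: the stage functions are hypothetical placeholders, and the "models" are stand-in scoring functions rather than real HMMs.

```python
# Illustrative sketch of the three-stage pipeline: preprocessing ->
# feature extraction -> pattern recognition. All functions are
# hypothetical placeholders, not the system's actual implementation.

def preprocess(signal):
    """Noise elimination and end point detection (cf. section 3.1).

    Placeholder: keep only samples above a small energy threshold."""
    return [s for s in signal if abs(s) > 0.01]

def extract_features(frames):
    """Feature extraction (cf. section 3.2, where MFCCs are used).

    Placeholder: one rounded 'feature' per retained sample."""
    return [round(f, 3) for f in frames]

def recognize(features, models):
    """Pattern recognition (cf. section 3.3): pick the model that
    scores the feature sequence highest."""
    return max(models, key=lambda word: models[word](features))

# Toy usage: two stand-in 'models' scoring a feature sequence.
models = {
    "kolom": lambda feats: sum(feats),
    "boi": lambda feats: -sum(feats),
}
signal = [0.0, 0.2, 0.5, 0.005, 0.3]
print(recognize(extract_features(preprocess(signal)), models))  # kolom
```

In the real system each entry in `models` would be a trained HMM scored with the forward or Viterbi algorithm rather than a toy lambda.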
3.1 Speech Data Extraction or Preprocessing

In this stage, the first step is to record the speech data with a microphone in a specified format (wav file, 16000 Hz and 16 bits). This wav data is then converted into a form suitable for further computer processing and analysis through a series of processes involving noise elimination and speech end point detection.

3.1.1 Noise Elimination

We used an adaptive filter to eliminate the noise from the recorded speech signal. A sample of the surrounding environment is used as input to the adaptive filter (M. R. Hassan, 2003). The extracted voice data is simply the subtraction of the predicted noise from the speech signal.

3.1.2 End Point Detection

We used the generalized end point detection algorithm presented in (K. Roy, 2002; K. J. Rahman, 2003; M. R. Islam, 2005) to identify the starting and end points of the speech signal. The goal of this process is to detect the presence of voice and to remove pauses and silences in the background noise. For continuous speech we use this algorithm only for start and end point detection; for isolated speech, intermediate noisy and unwanted signal within the speech is also eliminated. A background unvoiced sample of the current recording environment was taken. The sample was split into frames of known size and the energy was calculated for each frame. The frame with the maximum energy was taken as the threshold for end point detection. The recorded voice sample was then split into frames in the same manner; each frame of the voice sample was compared with the maximum noise frame, and the point where the voice frame's energy fell below that of the maximum noise frame was taken as the end point.

3.2 Feature Extraction

In this stage we extract meaningful, unique features from the preprocessed speech data. From the comparison among the features described in (M. F. Khan, 2002) we decided to use MFCC features. In addition to the MFCC features we calculate the energy, the delta coefficients and the acceleration coefficients. The total number of features is 39: 12 MFCC, 1 energy, 13 first order derivatives and 13 second order derivatives. Feature vectors are extracted with a Hamming window function of window length 25 ms. For the elimination of unwanted frequency levels a 26-channel filter bank is used, with a pre-emphasis coefficient of 0.97.

3.3 Pattern Recognition

The tasks in this step are divided into two phases, training and recognition. We used a Hidden Markov Model based classifier; an elaborate discussion of the HMM for speech recognition is given in (L. Rabiner, 1989). In our training methodology we created word based HMM models for isolated speech and phoneme based HMM models for continuous speech recognition. We used the following training algorithm for creating the HMM models.

Step 1: Initialize λ = (π, A, B)
Step 2: Compute probabilities using λ
Step 3: Adjust λ = λ'
Step 4: Repeat steps 2-3 until convergence

3.4 Dictionary and Language Model

Our approach to isolated speech recognition is quite simple: we use just a simple dictionary containing only input-output-HMM-model-name entries, and no language model is necessary. For continuous speech recognition, however, we created a pronunciation dictionary that contains an input-output-pronunciation entry for each word, where the pronunciation describes the sequence of HMMs that constitute the word. For each word the output is provided as a Unicode sequence and the pronunciation is given with the phoneme as the unit. Since in continuous speech recognition we recognize a sequence of words, it is necessary to incorporate a language model. We used the regular grammar modeling technique as our language model, which has the properties of a finite state model, a small vocabulary and a restricted grammar. We create a word level network that represents the grammar, which defines all the legal word sequences explicitly. The regular grammar model outputs only two values of probability: P(W) = 1 where W is valid in the word network, and P(W) = 0 otherwise. The grammar can be a task grammar, which defines all of the legal word sequences explicitly, or a word loop, which simply puts all words of the vocabulary in a loop and therefore allows any word to follow any other word. Word-loop networks are often augmented by a stochastic language model (S. Young, 2001-2005).
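The binary regular-grammar language model, P(W) = 1 for word sequences accepted by the word network and P(W) = 0 otherwise, can be sketched as follows. The vocabulary and the allowed sentences here are illustrative stand-ins, not the paper's actual task grammar.

```python
# Sketch of the binary regular-grammar language model:
# P(W) = 1 if the word sequence W is accepted by the word network, else 0.
# VOCAB and `allowed` are illustrative, not the paper's actual grammar.

VOCAB = {"ami", "tumi", "boi", "pori"}  # hypothetical vocabulary

def p_word_loop(words):
    """Word-loop network: any vocabulary word may follow any other."""
    return 1 if words and all(w in VOCAB for w in words) else 0

def p_task_grammar(words, allowed_sentences):
    """Task grammar: only explicitly listed word sequences are legal."""
    return 1 if tuple(words) in allowed_sentences else 0

allowed = {("ami", "boi", "pori")}
print(p_word_loop(["ami", "boi", "pori"]))              # 1
print(p_task_grammar(["ami", "boi", "pori"], allowed))  # 1
print(p_task_grammar(["boi", "ami"], allowed))          # 0
```

A word loop accepts any in-vocabulary sequence, which is why word-loop networks are usually augmented with a stochastic language model; a task grammar trades coverage for precision by enumerating the legal sentences.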
4. Implementation

We used HTK as the core engine of our speech recognizer because of its availability, portability and sophisticated facilities for speech analysis, HMM training, testing and results analysis (S. Young, 2001-2005). We used our own implementations of the algorithms discussed earlier for the preprocessing task and used HTK, with some specified parameters, for the rest. The preprocessing task is similar for both types of recognition and outputs the valid speech data from the recorded speech signal. Here we briefly discuss the implementation differences between isolated and continuous speech recognition from the pattern recognition point of view. We followed (A. K. M. Mahmudul Hoque, 2006) for the implementation of isolated speech recognition.

4.1 Isolated Speech Recognition

4.1.1 Preparing the Training Data

The first step is to label the speech data for each word of the dictionary. The label is the same as the text that represents the spoken word. We take 5 to 10 samples for each word and save each labeled file as the word followed by the sample number (e.g. "kolom_2.lab" is the labeled file of the word "kolom" for sample number 2). We use the HSLab tool (S. Young, 2001-2005) for this task. The second step is to extract the feature vectors from the sample files. All the specifications for feature extraction (described in the previous section) are written into a configuration file. The HCopy tool (S. Young, 2001-2005) is used, which automatically extracts the features according to the configuration and saves them into a file.

The next step is model training. For each word V in the vocabulary, we must build an HMM model, i.e. we must estimate the model parameters (A, B, π) that optimize the likelihood of the training set observation vectors of the Vth word (L. Rabiner, 1989). To create a model, we first have to choose a priori a topology for each HMM. We chose an HMM prototype with 4 active states and 2 non-emitting states; the prototype is depicted in figure 2. Before beginning the training we initialized the HMM model for each word with the HInit tool (S. Young, 2001-2005). After initialization, the models are trained with the feature data set. Training is an iterative process that continues until all the models have reached convergence; we used the HRest tool to re-estimate the model parameters iteratively. After the completion of all these tasks, we have a separate model for each word in the dictionary.

Figure 2: HMM prototype for each model

4.1.2 Recognition

The recognizer can recognize only the words defined in the dictionary. For each unknown word to be recognized, preprocessing and feature analysis must first be carried out, that is, measurement of the observation sequence via a feature analysis of the speech corresponding to the word; followed by calculation of the model likelihoods for all possible HMM models; followed by selection of the word whose model likelihood is highest. The probability computation step is generally performed using the Viterbi algorithm. We used the HVite tool (S. Young, 2001-2005) to perform the recognition task and provide the recognized output as text. In addition to the dictionary, the extracted features and the HMM list, HVite requires a word network, as is the HTK convention. So, to build a word network we include a very simple grammar (see figure 5) with a single state comprising just the words in the dictionary. We used the HParse tool (S. Young, 2001-2005) to create the word network.

4.2 Continuous Speech Recognition

4.2.1 Preparing the Training Data

Continuous speech recognition (CSR) involves connecting HMMs together in sequence, where each model in the sequence corresponds directly to the assumed underlying symbol. For CSR we moved to phoneme based HMM models. For this, we first designed the phoneme set for the Bangla language, taking the mono-phone as the phoneme unit. We selected 47 mono-phones from our IPA chart (M. A. Hai, 2004). Then we created a set of phonetically balanced (PB) sentences consisting of 52 sentences in total, transcribed those sentences with the appropriate phone labels, and saved them in the HTK-specified Master Label File (MLF) format (S. Young, 2001-2005). The recorded speech of the PB labeled sentences is used as the training data for CSR. Our PB sentences contain a total of 1814 phonemes. Table 1 illustrates the phoneme distribution: the 1st column presents the monophone in Bangla, the 2nd column the corresponding IPA symbol, the 3rd column the frequency of the monophone in the PB sentences, and the remaining columns repeat this pattern. Next, acoustic analysis is performed on these training data to extract features; the number of features, the configuration parameters and the tool are exactly the same as for isolated speech recognition.

- sil 101 | খ kʰ | da দ d̪ 33
a ɔ 161 | গ g | dah ধ d̪ʰ 12
আ a 217 | ঘ gʰ | p প p 39
i i 117 | ঙ ŋ | ph ফ pʰ 12
ঈ iː 3 | চ c | b ব b 67
u u 42 | ছ cʰ | bh ভ bʰ 14
ঊ uː 1 | জ ɟ | m ম m 36
eয্া ӕ 1 | ঝ ɟʰ | z য ʤ 13
e e 165 | ঞ No | r র r 135
o o 61 | ত t̪ | l ল l 52
আঁ ã 5 | থ t̪ʰ | s শ ʃ 30
iঁ ĩ 2 | দ d̪ | sh স s 23
uঁ ũ 2 | ধ d̪ʰ | h হ h 28
oঁ õ 2 | ন n | ra ঢ় ɾ 4
eঁ ẽ 4 | ট t | y য় j 22
ক k 102 | ঠ tʰ 2 |

Table 1: Phoneme Distribution

The next step is model training. To create a model, here we chose an HMM prototype with 3 active states and 2 non-emitting states. We initialize the HMM models using HInit. Then we create an HTK-specified Master Macro File (MMF) for all monophones using the prototype HMM file. Next we re-estimate the parameters using the HRest tool. After the completion of all these tasks, we have a separate model for each phoneme.

4.2.2 Recognition

For a continuous speech signal to be recognized, preprocessing and feature extraction (using the HCopy tool) are done first. Then the signal is recognized using the HVite tool with the assistance of the regular grammar based language modeling technique. We create a regular grammar and convert it to an intermediate decoding network form using the HParse tool; networks are specified using the HTK Standard Lattice Format (SLF). In the grammar we define the legal word sequences explicitly for constructing valid sentences of the language, and also a word loop which simply puts all words of the vocabulary.

5. Result Analysis

In this research work we give emphasis to the inclusion of the HMM technique for recognizing Bangla speech, as no such work has been reported, and also to evaluating the performance from several aspects. We have taken a vocabulary of 100 words and test samples from 5 different speakers to observe the performance. For isolated speech recognition we recorded the words for training in a normal office environment, taking several samples (5-10) with little variation for each word. The recognizer is capable of recognizing each spoken word in the dictionary only when the words are spoken by the same speaker and the mood of the speaker is the same; for a different speaker the performance decreases by almost 20%. For continuous speech recognition we used the same 100 words for building the regular grammar. Table 2 shows the performance of both recognition systems.

SR Type     | Speaker Dependent | Speaker Independent
Isolated    | 90%               | 70%
Continuous  | 80%               | 60%

Table 2: Performance analysis

Recognizing continuous speech with an ANN classifier has an average accuracy rate of 73.36% (K. J. Rahman, 2003); a three-layer back-propagation neural network has a maximum accuracy rate of 86.67% (M. R. Islam, 2005); and spoken letter recognition by the Euclidean distance measure, which can recognize only the vowels, has an 80% accuracy rate (A. H. M. Rezaul Karim, 2002). In comparison, the recognizer presented in this paper has an average accuracy rate of 85%. The performance analysis reveals the importance of improving recognition across different speakers. Several studies on SR systems emphasize training on data from a variety of speakers to increase performance, so our next effort should be to collect training data from different speakers and observe the performance.
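The rates in Table 2 are word-level accuracies, i.e. the fraction of test words that were recognized correctly. As an illustrative sketch (not the evaluation script actually used, and with made-up reference/hypothesis pairs), such a rate can be computed as:

```python
# Minimal word-accuracy scorer of the kind behind Table 2:
# accuracy = correctly recognized test words / total test words.
# The result pairs below are illustrative, not the actual test data.

def word_accuracy(pairs):
    """pairs: iterable of (reference_word, recognized_word) tuples.
    Returns the percentage of pairs where the two words match."""
    pairs = list(pairs)
    correct = sum(1 for ref, hyp in pairs if ref == hyp)
    return 100.0 * correct / len(pairs)

results = [("kolom", "kolom"), ("boi", "boi"), ("khata", "kolom"),
           ("ami", "ami"), ("tumi", "tumi")]
print(word_accuracy(results))  # 80.0
```

For continuous speech, scoring is normally done on word sequences with insertions and deletions counted as well (as HTK's HResults does), so this per-word sketch applies directly only to the isolated-word case.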
6. Applications

The domains where speech recognition technology can be applied include automatic translation, automotive speech recognition, dictation, hands-free computing (voice command recognition as a computer user interface), home automation, interactive voice response, medical transcription, mobile telephony, pronunciation evaluation in computer-aided language learning applications, and robotics. In our research work we are considering isolated speech recognition for command and control, data entry, mobile telephony and home automation tasks. Continuous speech recognition, on the other hand, can be used for speech-to-text conversion.

7. Conclusion

In this paper we concentrated on the research and development of a Bangla speech recognizer using the appropriate techniques and tools. We have studied the past works, and to the best of our knowledge this work is the first reported attempt to recognize Bangla speech using the HMM technique with the assistance of a stochastic language model. We observed that the language specification is not significant for ISR; however, it has great importance for CSR, especially the language-specific issues of constructing the phoneme set, the phonetically balanced sentences and the regular grammar for the Bangla language. This paper describes the theory and implementation details of our entire development task using the HTK tool. This work can be extended to further research on connected word recognition as an extension of isolated speech recognition, and to the performance measurement of diphone or triphone based HMM models as an extension of continuous speech recognition.

8. References

A. H. M. Rezaul Karim, Md. S. Rahman and Md. Zafar Iqbal, "Recognition of Spoken Letters in Bangla", Proc. 6th Int. Conf. on Computer and Information Technology (ICCIT), Dhaka, 2002.

A. K. M. Mahmudul Hoque, "Bengali Segmented Speech Recognition System", Undergraduate Thesis Report, Computer Science and Engineering, BRAC University, May 2006.

B. H. Juang and L. R. Rabiner, "Automatic Speech Recognition - A Brief History of the Technology", Elsevier Encyclopedia of Language and Linguistics, 2nd edition, 2005.

K. J. Rahman, M. A. Hossain, D. Das, A. Z. M. Touhidul Islam and M. G. Ali, "Continuous Bangla Speech Recognition System", Proc. 6th ICCIT, Dhaka, 2003.

K. Roy, D. Das and M. G. Ali, "Development of the Speech Recognition System Using Artificial Neural Network", Proc. 5th ICCIT, 2002.

L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

L. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition", 1st edition, Prentice Hall, New Jersey, 1993.

M. A. Hai, "Dhonibiggan O Bangla Dhonitotto", 8th edition, Mollik Brothers, Dhaka, 2004.

M. F. Khan and R. C. Debnath, "Comparative Study of Feature Extraction Methods for Bangla Phoneme Recognition", Proc. 6th ICCIT, Dhaka, 2002.

M. R. Hassan, B. Nath and M. Ala Uddin Bhuiyan, "Bengali Phoneme Recognition: A New Approach", Proc. 6th ICCIT, Dhaka, 2003.

M. R. Islam, A. Sayeed M. Sohail, M. W. H. Sadid and M. A. Mottalib, "Bangla Speech Recognition using Three Layer Back-Propagation Neural Network", Proc. of NCCPB, Dhaka, 2005.

S. A. Hossain, M. L. Rahman, M. F. Ahmed and M. Dewan, "Bangla Speech Analysis, Synthesis and Recognition: An Overview", Proc. of NCCPB, Dhaka, 2004.

S. Furui, "Digital Speech Processing, Synthesis and Recognition", 2nd edition, Marcel Dekker Inc., New York, 2001.

S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland, "The HTK Book", Cambridge University Engineering Department, 2001-2005. http://htk.eng.cam.ac.uk/docs/docs.shtml
