

KONKANI SPEECH TO TEXT RECOGNITION
USING HIDDEN MARKOV MODEL TOOLKIT

A Thesis

Submitted by

NILKANTH SHET SHIRODKAR

in partial fulfillment for the award of the degree


of

MASTER OF TECHNOLOGY
in

COMPUTER SCIENCE

Department of Computer Science and Technology

GOA UNIVERSITY
JUNE 2016
Dedicated to my Parents

GOA UNIVERSITY

CERTIFICATE

This is to certify that Nilkanth S. Shet Shirodkar has worked on
this thesis titled “KONKANI SPEECH TO TEXT RECOGNITION
USING HIDDEN MARKOV MODEL TOOLKIT” in the Department of
Computer Science and Technology under my supervision and guidance.
This thesis, being submitted to Goa University, Taleigao Plateau, Goa for
the award of the degree of M.Tech in Computer Science, is an original
record of work carried out by the candidate himself and has not been
submitted for the award of any other degree or diploma of this or any
other university in India or abroad.

SIGNATURE

Mr. Ramdas Karmali


Research Guide
Assistant Professor
Dept. of Computer Science and Technology
ABSTRACT

The purpose of this study is to develop an Isolated Word Speech Recogniser for the Konkani language, using a Hidden Markov Model based speech recognizer focusing specifically on Konkani digits. This is the first speech to text recognizer developed for the Konkani language using the Hidden Markov Model Toolkit (HTK). Speakers were asked to read numeric digits audibly in the Konkani language, and a corpus was collected in audio format. This collected corpus was then used for training and testing the Konkani Speech Recognition System.

The Konkani Automatic Speech Recognition (ASR) system was implemented using the HMM toolkit, building HMM models from training data. The trained HMM models were then used for recognising Konkani words; the results revealed an accuracy of 80.02% for the phoneme level acoustic model and 79.36% for the word level acoustic model.

The developed system can be used by developers and researchers who are interested in speech recognition for the Konkani language and other related Indian languages.


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my respected guide, Shri. Ramdas Karmali, and co-guide, Dr. Jyoti Pawar, for their valuable guidance, consistent encouragement, personal care, timely help and for providing me with an excellent atmosphere for doing research. Throughout the work, in spite of their busy schedules, they extended cheerful and cordial support to me for completing this research work.

I wish to express my deep sense of gratitude to Dr. V.V. Kamat, Professor and HOD, Goa University, for his constant feedback and support right from the beginning of the research. I am also thankful to all faculty and staff members of the Department of Computer Science and Technology, Goa University, for their support in the completion of this work.

It is an immense pleasure to express my gratitude towards Dr. Vinay Kumar Mittal, IIIT Chittoor, Sri City. It was a pleasure working with him; he helped me from time to time with proper guidance and valuable feedback that helped me find the right path in this research work. I would also like to thank him for his motivating behaviour and for enlightening me with his innovative ideas.

I would like to thank Rashmi Shet for linguistic support, and also the students who helped me in creating the speech database. Special thanks also to the NLP group, Goa University, who were always present for help and support whenever I needed it.

TABLE OF CONTENTS

ABSTRACT iii

ACKNOWLEDGEMENTS iv

LIST OF FIGURES viii

LIST OF TABLES ix

ABBREVIATIONS x

1 Introduction 1
1.1 Background Study . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Speech Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Fundamentals of Speech Recognition . . . . . . . . . . . . . . . . 4
1.4.1 Overview of Speech Recognition System . . . . . . . . . . 4
1.4.2 Components of Speech Recognition . . . . . . . . . . . . . 5
1.5 Objective of Study . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Contribution and Significance . . . . . . . . . . . . . . . . . . . . 7
1.7 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Literature Survey 9
2.1 Types of Speech Recognition Tasks . . . . . . . . . . . . . . . . . 9
2.1.1 Speaker Dependent . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Speaker Independent . . . . . . . . . . . . . . . . . . . 9
2.2 Speech Recognition Techniques . . . . . . . . . . . . . . . . . . . 10
2.2.1 Template-based approach: . . . . . . . . . . . . . . . . . 10
2.2.2 Knowledge-based approaches: . . . . . . . . . . . . . . 10
2.2.3 Statistical based approaches: . . . . . . . . . . . . . . . 11

2.2.4 Learning-based approaches: . . . . . . . . . . . . . . . . 11
2.2.5 Artificial intelligence . . . . . . . . . . . . . . . . . . . . 11
2.3 Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Problem in Designing Speech Recognition System . . . . . . . . . 12
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Characteristics of Konkani Language 14


3.1 Introduction to Konkani . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Konkani Phonology . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Introduction to Speech Production Systems . . . . . . . . . . . . . 15
3.2.1 Voiced and Unvoiced Speech . . . . . . . . . . . . . . . . . 17
3.2.2 Vowels and Consonants . . . . . . . . . . . . . . . . . . . 18
3.2.3 Konkani Language Speech Sound Label Set . . . . . . . . . 19

4 Methodology 23
4.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.1 Task Grammar . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.2 Word Network . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.3 Pronunciation Dictionary . . . . . . . . . . . . . . . . . . . 26
4.1.4 Recording . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.5 Phonetic Transcription . . . . . . . . . . . . . . . . . . . . 26
4.2 Feature Extraction and Optimization . . . . . . . . . . . . . . . . . 27
4.3 Parameter Estimation (Training) . . . . . . . . . . . . . . . . . . 29
4.3.1 Training Strategies . . . . . . . . . . . . . . . . . . . . . . 31
4.3.2 Acoustic Analysis . . . . . . . . . . . . . . . . . . . . . . 31
4.3.3 Acoustic Model Generation . . . . . . . . . . . . . . . . . 32
4.3.4 Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5 Results 35
5.1 Performance Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1.1 Experiment No. 1 . . . . . . . . . . . . . . . . . . . . . . . 36
5.1.2 Experiment No. 2 . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.3 Experiment No. 3 . . . . . . . . . . . . . . . . . . . . . . . 42

5.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 45

6 Conclusion and Recommendation 46


6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2 Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3 Areas for Further Study . . . . . . . . . . . . . . . . . . . . . . . . 47

A Speaker Profile 48

B Toolkit for Developing Speech Recognizers 49

C Result Analysis 50
LIST OF FIGURES

1.1 Principal components of Speech Recognition . . . . . . . . . . . 4


1.2 Overview of Speech Recognition System . . . . . . . . . . . . . . 5
1.3 Overview of the speech recognition problem . . . . . . . . . . . 5

3.1 Schematic view of human speech production . . . . . . . . . . . 15


3.2 Block model of human speech production . . . . . . . . . . . . . 16
3.3 Classification of sound units . . . . . . . . . . . . . . . . . . . . 18
3.4 Characters representing Vowel Sound . . . . . . . . . . . . . . 18
3.5 Character representing Consonant-Vowel (CV) . . . . . . . . . . 19
3.6 Konkani Language Speech Sound Label Set . . . . . . . . . . . . 20

4.1 Grammar for Digit Recognition . . . . . . . . . . . . . . . . . . 25


4.2 Process of creating a word lattice . . . . . . . . . . . . . . . . . . 25
4.3 Block diagram for obtaining the MFCC . . . . . . . . . . . . . . 28
4.4 Architecture of Speech Recognition System . . . . . . . . . . . . 30
4.5 Conversion of the training data . . . . . . . . . . . . . . . . . . . 31
4.6 Recognition process of an unknown input signal . . . . . . . . . 33
4.7 Recognition output (recognised transcription) . . . . . . . . . . 34

LIST OF TABLES

4.1 Summary of Speech Recording . . . . . . . . . . . . . . . . . . . 27

A.1 Speaker Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

B.1 Toolkit for building acoustic models and Speech recognizer . . . . . 49

ABBREVIATIONS

ASR Automatic Speech Recognition

MFCC Mel Frequency Cepstral Coefficients

HMM Hidden Markov Models

GMM Gaussian Mixture Model

STT Speech to Text

TTS Text to Speech

HTK Hidden Markov Model Toolkit

DFT Discrete Fourier transform

MLF Master Label Files

SLF Standard Lattice Format

WER Word Error Rate

CHAPTER 1

INTRODUCTION

1.1 Background Study

Speech is the verbal means of knowledge exchange between people. We humans communicate by speaking and listening to other humans. Several researchers are working towards developing interactive systems which allow a computer to recognise and understand human speech and give appropriate output, specifically as part of Human-Computer Interaction. Researchers are working on Speech Recognition by trying to develop systems which will understand human speech and which will produce speech as humans do. Output from such a system can be termed Artificial Speech[7].

Speech Recognition is one of the fastest developing fields in Machine Learning. The current generation of computational technology enables human-computer interaction wherein a human speaks to the computer, the speech is recognised and understood by the computer, and suitable output in text form is generated. This text can then be read out by the computer using Text to Speech (TTS), thus providing easy hands-free interaction between human and computer.

Speech Recognition is a field of Computer Science and Electronics wherein we deal with signals and systems, signal processing, signal enhancement etc. In Speech Recognition, the user speaks into a microphone, and the speech is taken as input by the machine (speech waveform). Processing of this speech waveform is then done by framing the speech signal, converting the signal from the time domain to the frequency domain, speech enhancement etc. The resulting output is then produced in text format; the machine has to identify the word which the user has spoken.

There is rapid advancement in the area of speech recognition research, with better performing algorithms and systems. There are many applications of speech recognition: non-English speakers can now use such systems in their mother tongue, as can physically disabled persons. It can further be used in natural language understanding and several other applications in the area of Computer Science.

Speech Technology is a developing technology with a variety of tools and methods for better and faster implementation.

To discuss speech recognition technology, we use the following terminology in this report.

• Utterance :- An utterance is the vocalization of a word; it can be a single word or a few words.

• Vocabularies :- Vocabularies, also known as dictionaries, are the lists of words that can be recognized by the speech recognition system. Generally, vocabularies are of two kinds: small vocabularies and large vocabularies. Small vocabularies are easier to build compared to large vocabularies.

• Training :- Training is the process of learning the characteristics of a sound unit: developing the model from the training data in order to learn the parameters of the model.

• Hidden Markov Model :- A Hidden Markov Model (HMM) produces a sequence of observable outputs in order to model an unknown process at discrete time intervals. A Hidden Markov Model is a finite set of states, each of which is associated with a probability distribution. Transition probabilities govern transitions among the states. For a particular state, the outcome generated depends on its probability distribution, but these outcomes are not visible to an external observer; therefore the states are hidden, as formalised below.
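As a compact summary of the above (this is standard textbook HMM notation, not something specific to this thesis), an HMM can be written as

λ = (A, B, π), where a(i,j) = P(q(t+1) = j | q(t) = i), b_j(o_t) = P(o_t | q(t) = j), π_i = P(q(1) = i)

Here q(t) is the hidden state at time t, o_t is the observation emitted at time t, A holds the state transition probabilities, B holds the emission distributions (Gaussian mixtures in HTK), and π is the initial state distribution.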

1.2 Problem Statement

The ultimate goal of this study is to build a model for a Konkani Speech Recognition system. In our study, we propose to use an HMM model trained using audio data. Our main focus is on minimizing the information loss while extracting features from the audio signal. Appropriate techniques for efficient feature extraction will be investigated. The study also includes building word level and phoneme level acoustic models and finding out which model gives better recognition accuracy with respect to the feature extraction technique. We propose to develop a Konkani Speech Corpus and a Konkani Speech Sound Label Set which can be used for various NLP projects. However, the current study is restricted to an isolated word speech recognizer which focuses only on numeric Konkani digits.

1.3 Speech Processing

Signal processing is the process of extracting relevant information from the speech signal in an efficient, robust manner. Included in signal processing is the form of spectral analysis used to characterize the time-varying properties of the speech signal, as well as various types of signal pre-processing and post-processing that make the speech signal robust to the recording environment for signal enhancement.

• Speech Recognition:

It is the process of extracting the acoustic features from the signal to recognize the various words present in it. The goal is to convert spoken speech into text.

• Speech Synthesis:

It is the process of producing artificial human speech, which can be used to convert Text to Speech (TTS). The goal is to read a given text in an artificial voice.
1.4 Fundamentals of Speech Recognition

Figure 1.1: Principal components of Speech Recognition

Speech Analyzer : Analyzes the speech signal and removes the background noise, thus focusing only on the speech signal.

Acoustic Model : Identifies phonemes from the speech using a probability based mathematical model.

Language Model : Identifies the words uttered by the speaker from the phonemes, by making use of a dictionary file and a grammar file.

1.4.1 Overview of Speech Recognition System

Training data is collected and given to the Feature Extraction module. Features are extracted from the training data and given to the Acoustic Model. In feature extraction, we have used three types of techniques: MFCC 13, MFCC 26 and MFCC 39. For the Acoustic Model, we have two types: the Word Level Acoustic Model and the Phoneme Level Acoustic Model. The Language Model is constructed using text data. In the testing phase, the same procedure is repeated: features are extracted from the testing data and given to the decoder. The decoder, with the help of the Acoustic Model and the Language Model, recognizes the text words.
Figure 1.2: Overview of Speech Recognition System

1.4.2 Components of Speech Recognition

Speech Recognition Problem: The speaker produces some speech, and we have to develop a system that automatically converts that speech into a written transcription; this is known as Speech to Text (STT) Recognition.

Figure 1.3: Overview of the speech recognition problem
It is possible, after some pre-processing, to represent the speech signal as a sequence of observation symbols O = o1, o2, o3, ..., oT, that is, a string composed of elements of an alphabet of symbols. In addition, we have a vocabulary V of all the words Wi, 1 ≤ i ≤ |V|, that can be uttered. Mathematically, the speech recognition problem then comes down to finding the word sequence W having the highest probability of being spoken, given the acoustic evidence O; thus we want to solve

W* = argmax over W of P(W | O)

Unfortunately, this quantity is not directly computable, since the number of possible observation sequences is practically inexhaustible unless there is some limit on the duration of the utterances and a limited number of observation symbols. But Bayes' formula gives:

P(W | O) = P(O | W) P(W) / P(O)

where P(W), called the language model, is the probability that the word string W will be uttered, and P(O | W) is the probability that when the word string W is uttered the acoustic evidence O will be observed; the latter is called the acoustic model. The probability P(O) is usually not known, but for a given utterance it is just a normalizing constant and can be ignored.

Consequently, a speech recognizer consists of three components: a pre-processing part that translates the speech signal into a sequence of observation symbols, a language model that tells us how likely a certain word string is to occur, and an acoustic model that tells us how a given word string is likely to be pronounced.
1.5 Objective of Study

The objective of the study is to develop an Isolated Word Speech to Text (STT) recogniser for the Konkani language.

The specific objectives of the study are:

i. To survey the literature related to Speech Recognition.

ii. To develop a Konkani language speech corpus.

iii. To implement an isolated word Speech to Text (STT) recognizer.

iv. To make the developed system speaker-independent.

v. To validate the developed Konkani Speech to Text (STT) system.

1.6 Contribution and Significance

The main contributions in developing the Konkani speech to text recognizer can be listed as follows:

1. Development of a Speech Database

This involves the creation of a speech database from various speakers, both male and female. It also involves the creation of a dictionary and alignment.

2. Development of a Konkani Speech Label Set

A label set was developed for the Konkani language: we have taken the Devanagari script and given labels to all the Devanagari characters. This will help researchers working in the speech processing area for the Konkani language in the future.

3. Building a Speech-to-Text Recognizer for Konkani

The Hidden Markov Model Toolkit (HTK) has been used to build the speech recognizer for the Konkani language.
1.7 Scope

This project was limited to the numeric digits from 1 to 10 in Konkani. The system developed implements isolated word speech to text recognition, and was trained and tested on isolated words representing the Konkani digits.

The study thus focuses on the development of two acoustic models, a word level acoustic model and a phoneme level acoustic model, with the MFCC 13, MFCC 26, and MFCC 39 feature extraction techniques.

The collected speech corpus and the speech sound label set developed will be useful for any researcher who wishes to push Konkani speech recognition further, with enhanced vocabulary and recognition accuracy.
CHAPTER 2

LITERATURE SURVEY

Speech Recognition has been one of the important research areas since the 1930s. Improvements and the development of new approaches by researchers have led to major advancements in speech recognition systems. Speech recognition concerns Human-Computer interaction wherein a human gives commands through speech to a computer, which processes them and returns the desired output. Earlier speech recognition systems had very low accuracy, but current speech recognition systems have achieved higher recognition accuracy and low word error rates with improved algorithms. We can use a speech recognition system as an input method for the computer. Speech recognition can be characterized by language model, acoustic model, vocabulary, speaking style, environment etc. This chapter presents a review of speech recognition techniques, problems in designing speech recognition systems, and the different types of methods used to develop speech recognition systems.

2.1 Types of Speech Recognition Tasks

Speech recognition tasks are based on the speaker, which are classified as follows:-

2.1.1 Speaker Dependent

A speech recognition system designed to recognize the speech of a single speaker or a fixed group of speakers is known as a Speaker Dependent system. These types of systems are usually easier to develop and are found to work more accurately.

2.1.2 Speaker Independent

A speech recognition system designed to recognize the speech of any new speaker is known as a Speaker Independent system. These types of systems are very difficult to develop and have lower recognition accuracy; however, they are more flexible.

2.2 Speech Recognition Techniques

Speech recognition techniques are as follows:

2.2.1 Template-based approach:

In this approach we have a collection of recorded words in the training set, known as templates. There will be many templates for each word in the training set, since there will be variations in speech with respect to time. Whenever an unknown speech is encountered, it is compared with all the predefined templates in the training set, and the most optimal match is found. The main disadvantage of this approach is that it takes a long time to give output, because the new template has to be matched against all the predefined templates. Dynamic time warping (DTW) is one of the approaches for template matching, in which templates are represented as a sequence of feature vectors for a given word. We need to align the new template with each predefined template and find the word with the lowest score. To compute the score, we find the distance between the template and the observed feature vectors using a distance measure and a local distance for every alignment; the lowest score is the best alignment, and the word whose template gives the lowest score is given as the output.

2.2.2 Knowledge-based approaches:

A knowledge based approach requires an expert in speech processing, whose expertise generates and stores the very complex structured and unstructured data used by the speech recognition system. But such expert knowledge in speech recognition is very difficult to obtain. This method proved impractical; hence researchers moved towards automatic learning, wherein the system learns by itself from training.
2.2.3 Statistical based approaches:

This approach is the most widely used by researchers in the speech recognition area. In this approach, we use an automatic statistical learning procedure, mostly HMMs. There are hidden states and observable states: the underlying spoken units form the hidden states, and the features extracted from the speech signal are the observable states. In this approach, we need to find the probability of the hidden states given the observable states, using Gaussian mixtures to model the observable states, which carry the acoustic information of the speech. These HMM based statistical modeling approaches are the state-of-the-art approaches for speech recognition.

2.2.4 Learning-based approaches:

Statistical approaches such as HMMs require supervised training and knowledge from domain experts. To overcome this disadvantage, learning based approaches such as neural networks and genetic programming are used, in which knowledge is acquired automatically through an evolutionary or learning process.

2.2.5 Artificial intelligence

The Artificial Intelligence approach is a hybrid of the pattern recognition and acoustic-phonetic approaches, utilizing the concepts and ideas of both. The artificial intelligence approach attempts to recognise speech in a similar way to humans: visualizing, analyzing and making decisions based on acoustic features.

2.3 Corpora

Building a Konkani speech to text recognizer requires a speech database for training as well as for testing the system. The speech database was created using the Audacity tool to train and test the Konkani Speech to Text (STT) recogniser. The speech corpus was collected in .wav format with a mono channel, and each speech file was tagged with the appropriate text.

2.4 Problem in Designing Speech Recognition System

Speech recognition is a challenging task, with many difficulties in implementation, since speech has many parameters which need to be accurately captured[16]. Many researchers are working on finding the optimal algorithm for speech recognition. Speech recognition falls under both computer science and electronics, drawing on signal processing, signals and systems, pattern recognition, acoustics, phonetics and linguistics.

Below are the problems faced while building a speech recognition system:

i. Number of speakers: The more speakers, the more training data, and hence the higher the accuracy of the system. To build a robust speaker independent system with high accuracy, we need a large speech database as training data.

ii. Vocabulary size: Most Indian languages have a large vocabulary, and as the vocabulary size increases the recognition accuracy decreases. So the vocabulary should be limited in order to get an optimal accuracy rate.

iii. Utterance: Each utterance should be properly recorded: starting silence, the uttered word, and ending silence. Also, each utterance should be properly labeled with tags.

iv. Speaker: There can be variations in speech in terms of the age, sex, accent etc. of the speaker.

v. Environment: Noise negatively affects a speech recognition system; a distorted, noisy signal will decrease the system's performance.
2.5 Summary

Speech to text recognition was implemented using the Hidden Markov Model Toolkit (HTK). The Hidden Markov Model Toolkit was built at Cambridge University. HTK can be used for OCR, DNA sequencing, and speech recognition. There are around 39 tools in HTK, which can be easily customized as per requirements. HTK is open source software with a speech recognition engine. For Indian languages, various institutions are using the HTK framework to build speech recognizers because of its customizability[19]. The development of speech to text recognition involves advanced techniques of HMM training and evaluation. As of today, the majority of research labs in universities and industry use HTK as a standard toolkit for their research work.
CHAPTER 3

CHARACTERISTICS OF KONKANI LANGUAGE

Our major requirement for developing a speech to text recognizer for the Konkani language is to define a phoneme based speech sound label set, which was developed with the help of Konkani linguists. Sound labels are very important in developing a Konkani speech recognizer.

This chapter focuses on the study of the Konkani language, covering its phonology, vowels, and consonants, in order to define the speech sound label set for the Konkani speech to text recognizer.

3.1 Introduction to Konkani

Konkani is an Indo-Aryan language from the family of Indo-European languages and is spoken on the western coast of India. It is one among the twenty-two scheduled languages mentioned in the eighth schedule of the Indian Constitution, and it is the official language of the Indian state of Goa.

It is a minority language in Karnataka (Karwar, Mangalore), Maharashtra, Kerala, and Daman and Diu.

3.1.1 Konkani Phonology

The Konkani language has 16 basic vowels, 36 consonants, 5 semi-vowels, 3 sibilants, 1 aspirate, and many diphthongs, similar to other Indo-Aryan languages. It has both long and short vowels, as well as syllables with long vowels. Different types of nasal vowels are a special feature of the Konkani language.
3.2 Introduction to Speech Production Systems

The main components of the human speech system are the lungs, throat, mouth (oral cavity), larynx, trachea and nose (nasal cavity). Normally the throat and mouth are grouped together to form one unit called the oral tract, and the nasal cavity is called the nasal tract.

Figure 3.1: Schematic view of human speech production

Air is pressed from the lungs through the larynx using muscle force. The vocal cords vibrate and produce a pressure wave. The frequency of this pressure signal is known as the fundamental frequency (f0). The fundamental frequency defines the melody of speech. Humans cannot speak with a constant fundamental frequency; hence there is a continuous change in the fundamental frequency. The frequency of the vocal cords depends on several factors such as sex, age etc.
Figure 3.2: Block model of human speech production

As the air flows within the oral tract and through the nasal tract, the cavities resonate and sound waves are produced; this is known as the speech signal. Both the oral tract and the nasal tract act as resonators, producing resonance frequencies known as formant frequencies. By varying the cavities of the mouth, moving the tongue, lips, velum, and jaw, we are able to pronounce very many different sounds.

• The vibration frequency of the vocal cords determines the pitch.

• The positions of the lips, tongue, and nose determine the timbre.

• The compression of the lungs determines the loudness.
3.2.1 Voiced and Unvoiced Speech

Speech can be divided into two parts: voiced and unvoiced. There is more energy in the voiced part than in the unvoiced part. Both the vocal cords and the vocal tract are used to produce voiced sounds, but the vocal cords are not used to produce unvoiced sounds; hence it is not possible to find the fundamental frequency of an unvoiced sound. Examples of unvoiced sounds are /s/, /sh/ and /p/.

A sound which produces vibration in the vocal cords is called voiced, and a sound which does not produce any vibration in the vocal cords is called unvoiced.

1. Manner of Articulation

• It refers to how the airflow is constricted in the vocal tract.

2. Place of Articulation

• It refers to where in the vocal tract the constriction of airflow takes place.

 Velar :- Sound which is produced keeping the tongue as far back as possible.

 Palatal :- Sound which is produced at the front hard palate, the roof of the mouth.

 Retroflex :- Sound which is produced with the tongue curled back above the gums (behind the upper teeth).

 Dental :- Sound which is produced with the tongue between the upper and lower teeth.

 Bilabial :- Sound which is produced by the lips.

Nasal sounds are produced when the air passes through the nasal cavity.

Prepositioning of articulators : Even if we cut out only a small part of the sound signal, we can still recognize which word the sound signal came from.
3.2.2 Vowels and Consonants

The sound units of the Konkani language are broadly classified into two categories: vowels and consonants.

Figure 3.3: Classification of sound units

The vowel sound units are further classified into three categories: short vowels, long vowels, and diphthongs. From the production process point of view, there is no distinction between short and long vowels, except that the duration of production is longer, typically nearly double that of short vowels. For instance, the short vowel /a/ may be almost half the duration of the long vowel /A/ or /a:/. In the case of diphthongs, two vowel sounds are produced in succession without any pause. The production process is such that the vocal tract shape initially produces the first vowel, and midway during the production of the first vowel it changes shape to produce the other vowel.

Figure 3.4: Characters representing Vowel Sound

These two broad categories are mainly based on the shape of the vocal tract. In the case of vowels, the vocal tract is wide open, without any constriction along its length starting from the glottis up to the lips, and is excited by a voiced excitation source. Alternatively, in the case of consonants, there may be a constriction in the vocal tract shape somewhere along its length, excited by voiced, unvoiced or both types of excitation.

18
Figure 3.5: Character representing Consonant-Vowel (CV)

A sound which is made by either breathing in or breathing out is known as an aspirate sound. The aspirate/un-aspirate contrast is seen in all affricates and stops except the voiceless labial stop in Standard Konkani. Some dialects have /pha/, which can be used instead of /f/. The aspirate/un-aspirate contrast is also found in nasals and laterals. The starting syllable vowel is reduced after aspirates and also after fricatives. Aspirates are mostly found in non-initial position in the Goan standard dialect. Unaspirated consonants replace aspirates in many dialects, but the difference is maintained in shorter vowels in the initial syllable.

3.2.3 Konkani Language Speech Sound Label Set

A standard set of labels for the speech sounds commonly used in the Konkani language is presented. All labels are in lower case even though the labels are case-insensitive. Since the number of speech sounds is larger than the alphabet, a system of suffixes, as well as letter combinations, is used for the labels.

Figure 3.6: Konkani Language Speech Sound Label Set
CHAPTER 4

METHODOLOGY

This chapter focuses on how the speech-to-text (STT) recognition system for the Konkani language was developed. The HTK toolkit was used for building the Hidden Markov Models. The HTK toolkit was originally developed for researchers in speech recognition to build HMM based speech processing tools. This project mostly focuses on the issues in building HMMs, on using different types of feature extraction techniques, and on different types of acoustic model generation. Due to time constraints this research was not able to use a large vocabulary, which can easily be increased later. Hence, the language model was limited; future research can focus on developing a fuller language model and integrating it into this system. The goal of the research was to build a speaker-independent Konkani speech-to-text recognition system with a robust word recognizer.

In speech recognition, we need to extract features from the speech signal, known as the acoustic information of the speech. This information is given to the HMM model, processing is done, and the output given by the model is the hypothesis transcription of the isolated word. Speech recognition is a complex, state-of-the-art task. Developing a speech-to-text recognition system with HTK has 4 main stages[20]:

1. Data Preparation

2. Training

3. Recognition on testing data

4. Analysis
To develop the system, the following sub-processes are carried out:

i. Building the task grammar

ii. Constructing a dictionary

iii. Recording the data.

iv. Creating transcription files for training data

v. Extracting features from training data

vi. (Re)training the acoustic models

vii. Evaluating the recognisers against the test data

viii. Analysis on recognition results

4.1 Data Preparation

To build the speech recogniser we need to create speech data, recorded from speakers with a microphone. It is a good idea to build the speech data from scratch so that it meets the system requirements. This speech data is used for training as well as for testing the recognizer system. The training speech data is used to develop the model, and the testing speech data is used to evaluate the performance of the recognisers.

4.1.1 Task Grammar

For any language there should be a properly defined grammar, and for the speech recogniser we need to define a task grammar for the Konkani language. For this research the focus is on a limited language model, only the digits from 1 to 10 in Konkani, and the grammar is created for the Konkani digits. The grammar was defined in BNF form: it starts with sil, which defines the starting silence of the speech signal, then $syl, which defines the word, and then sil again, which defines the ending silence of the speech signal. Here | stands for logical or, so out of the 10 digits only the one digit having the highest probability will be selected and given as the output.
$syl=( eka | don | tin | char | pach | sa | sat | ath | nav | dha );
(sil $syl sil)

The above grammar, as a network, is shown below:

Figure 4.1: Grammar for Digit Recognition

4.1.2 Word Network

Once we have created a task grammar we need to make it recognizable by hidden


Markov model toolkit HTK. Grammar is created just to provide user convenience but
for speech recognition system we need to provide it in such as way that a system can
understand it. HTK support Lattice format in Standard Lattice Format (SLF) which will
be having a word to word transition and also the word instances. In HTK word network
will be created automatically from the task grammar using HParse tool.
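As an illustration, compiling the grammar into a word network is a single HTK command; the file names gram and wdnet here are assumptions for this sketch:

HParse gram wdnet

HParse reads the BNF-style task grammar from gram and writes the equivalent word network in Standard Lattice Format to wdnet, which is later passed to the recogniser.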

Figure 4.2: Process of creating a word lattice

4.1.3 Pronunciation Dictionary

In the dictionary, we have a mapping between each word and its acoustic models: sub-word models in the phoneme level acoustic model and whole-word models in the word level acoustic model. The words are taken from the task grammar, as illustrated below.
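As a hedged illustration of the two dictionary styles (the phone decompositions shown are hypothetical; the actual ones are defined by the Konkani speech sound label set of Chapter 3), each HTK dictionary line gives a word followed by the model names it expands to:

Word level:             Phoneme level (hypothetical phones):
eka   eka               eka   e k a
don   don               don   d o n
sil   sil               sil   sil

In the word level dictionary every word maps to a single whole-word HMM, while in the phoneme level dictionary it maps to a sequence of phone HMMs shared across words.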

4.1.4 Recording

A collection of voice samples was taken from people in order to train and test the model. To build an isolated word speech recognition system we need a collection of utterances from different speakers. This work is limited to a recognition system built to recognise only 10 words. The recording was done with a microphone using the Audacity tool. Each speaker was told to read one digit at a time, with some gap in between. The distance between the speaker and the microphone was kept at approximately 6-8 cm. The recording was done in a closed room environment, on a mono channel, with a sampling rate of 16 kHz and 16 bits per sample. In total 20 people's voices were recorded, comprising 10 male and 10 female speakers. Each speaker was asked to utter each word 3 times, so each speaker has 30 (10*3) speech files. In total, with 20 speakers, there are 600 (30*20) speech files. All speech files were saved in .wav format.

• Sampling rate of audio: 16 kHz

• Bit rate (bits per sample): 16

• Channel: mono (single channel)

Once the recording is complete we need to label all the speech files with transcriptions. As per the requirements of each experiment, this recorded data is split into a training corpus and a testing corpus, used to build the model and to test the model respectively.

4.1.5 Phonetic Transcription

For speech recognition we need to provide training and testing corpora, and we need to tell the system which file corresponds to which digit in the corpora.
Table 4.1: Summary of Speech Recording

Attributes Values
Total # of Files 600
Total # of Speakers 20
Total # of Session 600
Total Duration of all Sessions 1156s
Average Duration of each Session 1.916s

The Hidden Markov Model Toolkit (HTK) uses Master Label Files (MLF), which store all the information regarding the speech data. These Master Label Files are created separately for the word level acoustic model and the phoneme level acoustic model; in them, each recorded and labeled speech file is associated with a word, to build the isolated word speech recognizer. A sketch of the format follows.
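A minimal sketch of a word level MLF entry (the file name s01_eka_1.lab is an assumption for illustration):

#!MLF!#
"*/s01_eka_1.lab"
sil
eka
sil
.

The header #!MLF!# marks the master label file, the quoted pattern matches the corresponding speech/feature file, each label occupies its own line, and a lone period terminates the entry.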

4.2 Feature Extraction and Optimization

In feature extraction, we extract from the speech signal a set of parameters which carry the maximum information relevant to building the HMM model. Hence, the extracted features should be robust to acoustic variation but sensitive to linguistic content. The extracted features should also be robust against noise present in the speech and against other factors which would reduce the accuracy of speech recognition. Feature extraction is a method of reducing the dimensionality of the given input data; this reduction may lead to some information loss. Mel Frequency Cepstral Coefficients (MFCC) are mostly used in speech recognition because of their computational efficiency and robustness. The basic first step of MFCC extraction is filtering, which includes a pre-emphasis filter and removal of surrounding noise.

• Pre-processing
To enhance the recognition accuracy, speech signals are pre-processed before features are extracted. There are two main steps in pre-processing:
1. Pre-emphasization
The speech waveform has a high dynamic range and suffers from additive noise, so pre-emphasis is applied to compensate for the attenuation of the higher frequency components.
2. Voice Activation Detection
Voice Activation Detection is used to separate the portions of the signal containing speech from the portions containing only silence.

• Frame Blocking
The speech signal cannot be used directly; it has to be split into frames, which are then analysed. Normally a frame size in the range of 20-25 ms is taken, overlapping is applied to the frames, and a Hamming window is applied.

• Windowing
To avoid unnatural discontinuities in the speech segment we use a window shape such as the Hamming window.

• Discrete Fourier Transform
The purpose of performing the Discrete Fourier Transform (DFT) is to convert the vocal tract impulse response from the time domain into the frequency domain. The DFT is applied to obtain the magnitude frequency response of each frame.

• Mel filter-bank
The Mel filter-bank is based on human ear perception: the perceived frequency content of speech sounds does not follow a linear scale. There are more filters in the low frequency region and fewer filters in the high frequency region, and each filter in the bank has a triangular band pass frequency response.

• Cepstrum
We convert back from the frequency domain to the time domain using the Inverse Discrete Fourier Transform; the result of this conversion is known as the Mel Frequency Cepstral Coefficients (MFCC).

Figure 4.3: Block diagram for obtaining the MFCC

The speech signal is first divided into time frames consisting of a fixed number of samples. Next, overlapping of the frames is used for a smooth transition from frame to frame. To each frame we apply a Hamming window to eliminate discontinuities at the edges. Then we apply the DFT to convert each frame from the time domain into the frequency domain, followed by the Mel filter-bank, which consists of a set of band pass filters: as the frequency increases the number of filters decreases, because the human ear resolves low frequencies more finely than high frequencies. Then we convert the log Mel spectrum back into the time domain using the IDFT. The result of this output is called the Mel Frequency Cepstral Coefficients. This set of coefficients is also known as the acoustic vector. The first 13 coefficients are the static features (including energy); adding their first derivatives (deltas) gives 26 coefficients, and adding the second derivatives (accelerations, or double deltas) gives 39. We have taken all three MFCC configurations, tested them on speech recognition, and analysed how the number of coefficients affects the results.
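For reference, the Mel scale that underlies the filter-bank spacing is the standard mapping (a textbook formula, not specific to this thesis):

m = 2595 log10(1 + f / 700)

where f is the frequency in Hz and m is the corresponding value in mels. The triangular filters are spaced uniformly on the m axis, which automatically places more filters at low frequencies and fewer at high frequencies, as described above.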

4.3 Parameter Estimation (Training)

To build the speech recognizer we need to define the structure of the Hidden Markov Models and then estimate the model parameters. We define a prototype that fixes the topology of each Hidden Markov Model; this process of parameter estimation is called training the HMM. Later, when the actual speech recognizer is built, the parameter values in this prototype definition are ignored, and the transition probabilities are re-estimated using the training tools.
Figure 4.4: Architecture of Speech Recognition System

The architecture has 5 main modules, which are used to develop the Konkani speech recognition system:

1. Training Corpus Preparation


2. Acoustic Analysis
3. Acoustic Model Generation
4. Testing Corpus
5. Decoding

In Training Corpus Preparation, speech was recorded using the Audacity tool and saved in .wav format. All the speech files had to be labeled, using the WaveSurfer tool, and the training labels were stored. The next step was Acoustic Analysis, in which the training corpus was given to the HCopy tool, which was used to extract the features from the training corpus. This step also uses configuration parameters such as MFCC 13, MFCC 26 and MFCC 39. A mapping script contains the mapping between the training speech files and the feature files. Next, these files were given to Acoustic Model Generation, which created the model used to decode the testing files and convert them into text with the help of the decoder.
4.3.1 Training Strategies

There is a need to create an initial model from the speech data and labeled transcriptions with word boundaries. The HInit tool is used to initialize a model, and the HRest tool is used to re-estimate the model parameters. One HMM is generated for each word. The training data is segmented uniformly, and for each model the means and variances are calculated from the corresponding data.

4.3.2 Acoustic Analysis

In speech recognition, speech waveforms cannot be given to the models directly. First, we need to do acoustic analysis on the speech waveform to make it more compact. We take the speech waveform and frame it into segments of length 25 ms which overlap each other; each frame is then multiplied by a windowing function such as the Hamming window. From each windowed frame a vector of acoustic coefficients is extracted. The Hidden Markov Model Toolkit (HTK) must be given information about the audio data, such as the sampling rate and the format of the wave files, and about the feature extraction parameters, such as MFCC 13, MFCC 26 and MFCC 39. The window length and pre-emphasis settings are saved in a configuration file.

Figure 4.5: Conversion of the training data

This configuration file (.conf) is a text file that specifies the various configuration parameters, such as the format of the speech files, the technique for feature extraction (MFCC), the length of the time frame (25 ms), the frame periodicity (10 ms), and the number of MFCC coefficients (13, 26 or 39). The acoustic vector (.mfcc) files are used in both the training and decoding phases of the system. The HCopy tool of HTK is used for this purpose.
The training corpus is taken first, followed by the configuration file generated as per the requirements of the system. It is also necessary to create a mapping script file which tells the HCopy tool to convert the training speech data into MFCC feature files, as sketched below.
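A minimal sketch of this step, with file names and parameter values assumed to match the settings described above (25 ms Hamming windows every 10 ms over 16 kHz .wav input):

# analysis configuration file (config); HTK times are in units of 100 ns
# 13 static MFCCs including energy (c0); 25 ms Hamming window every 10 ms
SOURCEFORMAT = WAV
TARGETKIND = MFCC_0
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
NUMCEPS = 12

The mapping script (say codetrain.scp) lists one "source target" pair per line, e.g. wav/s01_eka_1.wav mfcc/s01_eka_1.mfcc, and feature extraction is then a single command:

HCopy -T 1 -C config -S codetrain.scp

Setting TARGETKIND to MFCC_0_D or MFCC_0_D_A yields the 26 and 39 dimensional variants (statics plus deltas, plus accelerations).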

4.3.3 Acoustic Model Generation

An acoustic model is defined as a reference model against which comparisons are made to recognize unknown utterances. There are 2 kinds of acoustic models, viz. the word model and the phoneme model. The word model has been used because it is appropriate for a small vocabulary, together with the statistical approach of Hidden Markov Modeling (HMM) for system training. During this phase of implementation, HMM initialization is first completed using a prototype. This prototype needs to be generated for every word in the dictionary. The same topology is used for all the HMMs; the defined topology consists of four active states (with observation functions) and two non-emitting states (the initial and the last state, with no observation function). Gaussian mixtures are used as observation functions, represented by a mean vector and a variance vector in a text description file known as a prototype. This pre-defined prototype, along with the acoustic vectors (.mfcc files) and training labels (.lab files), is used by the HTK tool HInit for initialization, as sketched below.
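A minimal sketch of such a prototype for 13-dimensional MFCC_0 features with four emitting states in a left-to-right topology (a single Gaussian per state is shown for brevity, whereas the experiments in Chapter 5 use 24-component mixtures; all numeric values are conventional placeholders that training re-estimates, and the model name is an assumption):

~o <VecSize> 13 <MFCC_0>
~h "eka"
<BeginHMM>
  <NumStates> 6
  <State> 2
    <Mean> 13
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 13
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 3
    <Mean> 13
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 13
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 4
    <Mean> 13
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 13
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 5
    <Mean> 13
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <Variance> 13
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <TransP> 6
    0.0 1.0 0.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0 0.0
    0.0 0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>

Initialization is then run once per word, for example (file and directory names are assumptions):

HInit -S trainlist.txt -M model/hmm0 -H proto -l eka -L label/train eka

where trainlist.txt lists the training feature files, -l selects the segments labeled eka, and the bootstrapped model is written under model/hmm0.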

In the second step of this phase of implementation, the HTK tool HRest is used for estimating the optimal values of the HMM parameters, such as the transition probabilities and the mean and variance vectors of every observation function. This iterative step is known as re-estimation and is repeated several times for each HMM being trained. These embedded re-estimations indicate convergence through a change measure (the convergence factor). The final step of the acoustic model generation phase, known as the convergence test, repeats re-estimation until the absolute value of the convergence factor no longer decreases from one HRest iteration to the next. In our system implementation, the re-estimation iteration was repeated 5 times; thus five successive versions of the HMM for each word in the vocabulary are generated, as in the example below.
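A hedged example of one re-estimation pass (directories and the word name are assumptions, continuing the HInit sketch above):

HRest -S trainlist.txt -M model/hmm1 -H model/hmm0/eka -l eka -L label/train eka

Each pass reads the model produced by the previous step (here from model/hmm0) and writes the re-estimated version to the next directory (model/hmm1); five such passes yield the final models in model/hmm5.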

4.3.4 Recognition

In recognition, the aim is to convert the speech wave file into the text which was spoken by the user. First, the input speech signal is taken and converted into acoustic vectors (MFCC) using the HCopy tool from HTK; this is done for the testing corpus just as for the training corpus. Feature extraction is a critical step in speech processing: if there is any loss of information while extracting the features, accuracy may decrease. The loss of information during feature extraction must therefore be minimized, which increases speech recognition accuracy, by optimizing the number of cepstral coefficient parameters, which play an important role in converting a speech signal into a sequence of acoustic vectors. The Viterbi algorithm, via the HVite tool, is then used to find the optimal path matching the input against the recognizer, as sketched below.
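A hedged sketch of the decoding command (file names are assumptions, following the earlier steps):

HVite -H model/hmm5/models -i recout.mlf -w wdnet dict hmmlist -S test.scp

Here wdnet is the word network produced by HParse, dict is the pronunciation dictionary, hmmlist lists the model names, test.scp lists the test feature files, and the recognised transcriptions are written to recout.mlf.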

Figure 4.6: Recognition process of an unknown input signal

The input signal (.mfcc) is the feature file extracted from the given input signal that needs to be recognized. The model names are listed in the HMM list, and the dictionary and task network have already been defined. The output file contains the transcription of the input signal. If we use the file 004-10b, i.e. speaker 4 with the spoken digit "dha", as input data, for instance, we get the output:

Figure 4.7: Recognition output (recognised transcription)

In order to allow direct extraction of the acoustic coefficients from the input signal, the same acoustic analysis configuration parameters previously used with the training data are needed.
CHAPTER 5

RESULTS

The performance of the system is tested for speaker independence by using a test database which is different from the training corpus. Several experiments were carried out, and the training and testing corpora were divided depending on the experiment. In order to evaluate the performance of the speaker independent system, speakers who were present in the testing corpus were not present in the training corpus.

The performance of speech recognition depends on various techniques, such as feature extraction, modeling and testing. The speech recognition system mainly uses segmental analysis, Mel Frequency Cepstral Coefficients (MFCC) and Gaussian Mixture Models (GMM).

The following equations give the formulas for evaluating the performance of the speech system, where N is the total number of words in the test set, D is the number of deletions (words which are present in the test set but are deleted by the recognizer and are not present in the recognizer transcription), S is the number of substitutions (words in the test set which are substituted by other words in the recognizer transcription) and I is the number of insertions (words which are present in the recognizer transcription but not in the reference set).

Percentage Accuracy = ((N − D − S − I) / N) × 100

Word Error Rate (WER) = 100 − Percentage Accuracy

The Word Error Rate (WER) in the above equation is used as one of the criteria to evaluate the performance of the system; HTK computes it as sketched below.
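In practice these figures are reported by HTK's scoring tool; a hedged example invocation (file names are assumptions, continuing the decoding sketch in Chapter 4) is:

HResults -I testref.mlf hmmlist recout.mlf

which aligns the recognised transcriptions in recout.mlf against the reference labels in testref.mlf and reports the percentage correct and accuracy together with the D, S and I counts used in the formulas above.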
5.1 Performance Test

5.1.1 Experiment No. 1

In Experiment No. 1, we took 5 male speakers and 5 female speakers in the training database, and for testing we used 5 male speakers and 5 female speakers different from the training database. First, we took the 13 MFCC values with 24 Gaussian mixtures. The accuracy improved by around 14% for the phoneme level acoustic model and around 9% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the word level acoustic model has better accuracy, by 2.89%.

In this experiment the word level acoustic model was better than the phoneme level acoustic model.
Taking the 26 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 12% for the phoneme level acoustic model and around 6.67% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the phoneme level acoustic model has higher accuracy, by 3.319%.

In this experiment the phoneme level acoustic model was better than the word level acoustic model.
Taking the 39 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 10% for the phoneme level acoustic model and around 5.34% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the phoneme level acoustic model has higher accuracy, by 1.7%.

In this experiment the phoneme level acoustic model was better than the word level acoustic model.
5.1.2 Experiment No. 2

In Experiment No. 2 we took 7 male speakers and 7 female speakers in the training database, and for testing we used 3 male speakers and 3 female speakers different from the training database.

Taking the 13 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 6.67% for the phoneme level acoustic model and around 16.12% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the word level acoustic model has higher accuracy, by 2.89%.

In this experiment the word level acoustic model was better than the phoneme level acoustic model.
Taking the 26 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 8.89% for the phoneme level acoustic model and around 8.33% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the phoneme level acoustic model has higher accuracy, by 2.29%.

In this experiment the phoneme level acoustic model was better than the word level acoustic model.
Taking the 39 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 8.89% for the phoneme level acoustic model and around 2.22% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the phoneme level acoustic model has higher accuracy, by 1.43%.

In this experiment the phoneme level acoustic model was better than the word level acoustic model.
5.1.3 Experiment No. 3

In Experiment No. 3 we took 9 male speakers and 9 female speakers in the training database, and for testing we used 1 male speaker and 1 female speaker, alternately exchanged with the training data.

Taking the 13 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 7.5% for the phoneme level acoustic model and around 11.84% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the word level model has higher accuracy, by 4.42%.

In this experiment the word level acoustic model was better than the phoneme level acoustic model.
Taking the 26 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 9% for the phoneme level acoustic model and around 7.5% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the word level model has higher accuracy, by 2.79%.

In this experiment the word level acoustic model was better than the phoneme level acoustic model.
Taking the 39 MFCC values with 24 Gaussian mixtures, the accuracy improved by around 9% for the phoneme level acoustic model and around 7.5% for the word level acoustic model. If the total average accuracy for both the phoneme level and word level acoustic models is taken and compared, the word level model has higher accuracy, by 0.639%.

In this experiment the word level acoustic model was better than the phoneme level acoustic model.
5.2 Performance Analysis

All experiments were conducted and analysis on results were carried out. Feature Ex-
traction Techniques MFCC 13, MFCC 26 and MFCC 39 are used and compared with
Word Level Acoustic Model and Phoneme level Acoustic Model. Experiments were
carried out to check which Feature Extraction Technique was giving higher accuracy
for word level acoustic model and phoneme level acoustic model.

When MFCC 13 was used as the feature extraction technique, the word level acoustic
model gave higher accuracy than the phoneme level acoustic model; its accuracy was
higher by 3.28%.

When MFCC 26 was used as the feature extraction technique, the phoneme level
acoustic model gave higher accuracy than the word level acoustic model; its accuracy
was higher by 0.4%.

When MFCC 39 was used as the feature extraction technique, the phoneme level
acoustic model again gave higher accuracy than the word level acoustic model; its
accuracy was higher by 2.37%. These results indicate that the training of the system
was successful and that the developed system is speaker independent.
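
The accuracy figures quoted throughout this section presumably follow the standard
word accuracy measure reported by HTK's HResults tool:

    Accuracy = ((N - D - S - I) / N) x 100

where N is the total number of reference labels and D, S and I are the deletion,
substitution and insertion errors counted after aligning the recognised output
against the reference transcription.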

CHAPTER 6

CONCLUSION AND RECOMMENDATION

6.1 Conclusion

The main aim of this research was to build a Konkani speech to text recognition
system. The focus was on developing a recogniser that minimises the loss of
information during feature extraction, thereby increasing recognition accuracy, and
on building word level and phoneme level acoustic models in order to observe which
model gives better recognition accuracy for each feature extraction technique. To
achieve this objective, a limited language model was developed with a limited word
grammar and dictionary.
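
As an illustration of such a limited grammar, a digit task of this kind is normally
described to HTK with an HParse grammar and compiled into a word network. The
sketch below is indicative only; the romanised Konkani digit names are illustrative
placeholders, not the exact dictionary entries used in this work.

    # gram -- illustrative HParse grammar for a Konkani digit task
    $digit = SHUNYA | EK | DON | TEEN | CHAR | PANCH |
             SO | SAT | ATH | NOV ;
    ( SENT-START $digit SENT-END )

    # Compile the grammar into the word lattice used by the decoder:
    HParse gram wdnet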

The system was tested using the testing corpus, and it scored 80.02% accuracy for
the phoneme level acoustic model and 79.36% for the word level acoustic model. The
developed system can be used by researchers as well as by developers who wish to
build applications in the native language. This work is an initial speech
recognition system for the Konkani language; the study can be extended to larger
vocabularies and to large-scale language models.

6.2 Recommendation

Our work has shown that the word level acoustic model gives better accuracy than
the phoneme level acoustic model when used with the MFCC 13 feature extraction
technique, while the phoneme level acoustic model gives more accurate results than
the word level acoustic model when used with MFCC 26 and MFCC 39. The recognition
results are close overall, with only slight variation observed between MFCC 26 and
MFCC 39.
6.3 Areas for Further Study

The Konkani speech recognition system needs to be enhanced to make it more robust,
with higher performance and accuracy, as follows:

• The speech database needs to be increased to at least a thousand speech samples
collected from several individuals belonging to different age groups.

• The isolated word speech to text recognition system can be extended to continuous
speech recognition of the Konkani language.

APPENDIX A

SPEAKER PROFILE

Table A.1: Speaker Details

Speaker ID Gender District Environment


S1(M) Male South Goa Lab
S2(M) Male South Goa Lab
S3(M) Male North Goa Lab
S4(M) Male South Goa Lab
S5(M) Male North Goa Lab
S6(M) Male North Goa Lab
S7(M) Male North Goa Lab
S8(M) Male North Goa Lab
S9(M) Male South Goa Lab
S10(M) Male North Goa Lab
S11(F) Female North Goa Lab
S12(F) Female North Goa Lab
S13(F) Female South Goa Lab
S14(F) Female South Goa Lab
S15(F) Female North Goa Lab
S16(F) Female North Goa Lab
S17(F) Female South Goa Lab
S18(F) Female North Goa Lab
S19(F) Female North Goa Lab
S20(F) Female North Goa Lab

Speakers were in the age group of 15-28 years.


APPENDIX B

TOOLKIT FOR DEVELOPING SPEECH RECOGNIZERS

Table B.1: Toolkits for building acoustic models and speech recognizers

Toolkit  License                     Language  Latest release           Platforms                  Support
HTK      Prohibits redistribution,   C         Version 3.5              Linux, Mac OS X, Windows   HTK Book,
         but R&D use is allowed                (released Dec 2015)                                 active mailing list
Sphinx   BSD                         Java      sphinxbase-5prealpha,    Linux, Mac OS X, Windows   Tutorial and forum
                                               pocketsphinx-5prealpha,
                                               sphinxtrain-5prealpha,
                                               sphinx4-5prealpha
APPENDIX C

RESULT ANALYSIS

We conducted 3 experiments using MFCC 13, MFCC 26 and MFCC 39 as feature extraction
techniques. Separate experiments were conducted using the phoneme level and the word
level acoustic models.

The following tables show the experiments conducted using MFCC 13, MFCC 26 and
MFCC 39 as feature extraction techniques with the phoneme level acoustic model. The
same experiments were conducted with the word level acoustic model using the above
feature extraction techniques; their results follow.
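
For reference, each per-experiment figure is obtained by decoding the test set and
scoring the output against the reference transcriptions. In HTK this is done with
HVite and HResults along the following lines; all file names here are assumptions,
not the exact ones used in this work.

    # Decode the test set with the trained models (file names hypothetical):
    HVite -H hmm15/macros -H hmm15/hmmdefs -C config \
          -w wdnet -i recout.mlf -S test.scp dict monophones

    # Score the recognised output against the reference labels:
    HResults -I testref.mlf wordlist recout.mlf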
