
Transcription of Telugu TV News using ASR

M. Ram Reddy*, P. Laxminarayana*, A. V. Ramana#, Markandeya J L*, J N V S Indu Bhaskar*, B. Harish*,
S. Jagadheesh*, E. Sumalatha*

*Research and Training Unit for Navigational Electronics, Osmania University, Hyderabad, India
ramreddy.malle@gmail.com, plaxminarayana@yahoo.com, markandeya.janaswamy@gmail.com, jupudibhaskar967@gmail.com,
biruduganti.harish@gmail.com, jagadeeshsamala@gmail.com, latha.emmela@gmail.com

#Ikanos Communications Pvt. Ltd., Bangalore, India
avramana9@yahoo.com

Abstract—Automatic Speech Recognition (ASR) is the process of converting human speech, in the form of an acoustic waveform, into text. In this paper we discuss building an automatic speech recognition system for Telugu news. A Telugu speech database is prepared, along with its transcription and dictionary. The Telugu speech files are collected from Telugu TV news channels. Most of the selected sentences were recorded in a studio environment, while some of the speech files contain unpredictable background noise. WaveSurfer is used to segment the speech into short sentences. The CMU SPHINX toolkit is used to develop the ASR system. The recognized text file is finally displayed using the Baraha software (demo version).

Keywords—Automatic Speech Recognition, Telugu news, database, HMM, acoustic models, language models, Sphinx, MFCC

I. INTRODUCTION

An automatic speech recognition (ASR) system converts the acoustic speech signal into text. It mimics the human ear and extracts the information present in the speech. ASR systems need to be developed for most existing languages and for many applications, so that they can be used by the widest population. Towards this end, in the past decade, some work has been done in the field of speech recognition for Indian languages [1-5].

ASR makes human-to-machine communication user-friendly: if a machine accepts speech input, interacting with it becomes very easy. Speaker independence is a typical requirement of a speech recognition application. A system is said to be speaker independent if the recognition success rate does not depend on the speaker. Generating phone models that cover many speakers and their variations in a language requires a large amount of speech data spoken by multiple speakers, and this is what leads to a speaker-independent system.

Generally, finding a significant event in a year is difficult when searching through the speech/audio files of news bulletins; if the speech is converted into text, searching becomes easy. It is therefore proposed to develop an ASR system for transcribing Telugu TV news. Usually the number of news readers in a channel for a particular program is limited to 3-4 people, so if the ASR system is trained with the speech data of those 3-4 people, it can recognize their speech easily. It is therefore not a completely speaker-independent ASR system. Since collecting speech from many speakers for a speaker-independent system is not an easy task, the proposed system, built with a speech database of a limited number of news readers, will still be very useful for finding any event, with its details, in the archives of news speech files. In this paper, the building of an automatic speech recognition system for Telugu news and its performance are reported.

The next section gives a brief explanation of the automatic speech recognition system. The third section describes data preparation for the Telugu ASR system. The fourth section provides the testing results. Conclusions are drawn in the fifth section.

II. AUTOMATIC SPEECH RECOGNITION SYSTEM

This paper describes an automatic speech recognition system for Telugu news using the CMU SPHINX recognition system [6]. Figure 1 shows a detailed block diagram of the CMU SPHINX speech recognition system. Once models are generated using the training database, they can be used in speech recognition. Model generation is done using the Sphinx training tools; recognition is done by the speech decoder (either sphinx3 or pocketsphinx).

A speech database, transcriptions, a phonetic dictionary and a phone list for the particular language are required to train the system. The acoustic models are obtained using the speech data together with its transcription and the phonetic dictionary. Both acoustic models and language models are required for recognition. However, language models can be obtained using a large amount of transcription data alone, i.e., speech data is not required for language models.

The following sections describe the details of the procedure adopted for building the database, the acoustic models, the language models and the recognition setup.

III. DATABASE PREPARATION

Database preparation is an important and laborious task. To train the system we need the speech files, the corresponding transcriptions, a dictionary covering all the words present in the transcription, and a phone list for the language.
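Because the transcription and the dictionary use the same phone symbols (figure 2), each word's phone sequence can be recovered mechanically by a greedy longest-match split over the phone inventory. The following is a minimal sketch of this idea, not the authors' tooling; the phone symbols and example words are taken from this paper, while the function names are illustrative:

```python
# Phone symbols from the Telugu phone list (figure 2), plus the "^" separator.
PHONES = [
    "a", "aa", "i", "ii", "u", "uu", "rx", "rx~",
    "e", "ei", "ai", "o", "oo", "au", "n'", "h'",
    "k", "kh", "g", "gh", "~g",
    "ch", "chh", "j", "jh", "~j",
    "t'", "th'", "d'", "dh'", "nd~",
    "t", "th", "d", "dh", "n",
    "p", "ph", "b", "bh", "m",
    "y", "r", "l", "v", "sh", "shh", "s", "h", "l'", "r'",
    "^",
]
# Try longer symbols first, so "shh" wins over "sh" and "sh" over "s".
PHONES.sort(key=len, reverse=True)

def word_to_phones(word):
    """Split a transliterated word into its phone-symbol sequence."""
    phones = []
    i = 0
    while i < len(word):
        for p in PHONES:
            if word.startswith(p, i):
                phones.append(p)
                i += len(p)
                break
        else:
            raise ValueError("no phone matches %r at position %d" % (word, i))
    return phones

def dictionary_entry(word):
    """Format one dictionary line in the /g/ /u/ /d'/ ... style of the paper."""
    return word + " " + " ".join("/%s/" % p for p in word_to_phones(word))
```

For example, `dictionary_entry("t'u")` yields `t'u /t'/ /u/`, matching the dictionary example given later in this section. Greedy longest match suffices for this inventory because whenever a longer symbol matches, it is the intended reading; a full system would still need manual review of any ambiguous splits.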
[Figure 1 shows the training path — speech files, transcription, phone list, filler dictionary and dictionary feeding feature extraction, the Sphinx training tools and the CMU LM toolkit to produce the acoustic and language models — and the testing path, in which pre-processing and feature extraction feed the speech decoder (sphinx3 or pocketsphinx), which produces the output text for the testing speech file.]

Fig.1. Automatic Speech Recognition (sphinx-3 or pocket sphinx decoder) setup with the Model Generation (Training) block diagram

A. Speech files collection

TV news files archived on YouTube were downloaded, and the audio was separated from the video. News bulletins (in video or mp4 format) were also downloaded directly from the news channels' websites, and these were converted to WAV format using the ffmpeg tool [7]. The generated WAV files have a 16 kHz sampling frequency and 16 bits per sample, with durations of 30 to 60 minutes. These long speech files were segmented into small files of one or two sentences each. In the downloaded news bulletins, some parts were recorded in a studio environment and some parts were captured in noisy environments; here we considered the sentences recorded in the studio and some sentences with slight background noise.

The long speech files were manually segmented using WaveSurfer [8]. Labels are marked in WaveSurfer while listening to the speech, so that each segment forms a complete, meaningful sentence. After marking the labels, the entire speech file is divided into short WAV files that can be used by the SPHINX ASR system.

B. Phone list preparation

Phonemes are the basic sound units of any language, so we need a phone list for the Telugu language. In this list, all the basic sound units of the Telugu varnamala (alphabet) are represented with distinct symbols. The list is shown in figure 2.

a aa i ii u uu rx rx~ : అ ఆ ఇ ఈ ఉ ఊ ఋ ౠ
e ei ai o oo au n' h' : ఎ ఏ ఐ ఒ ఓ ఔ ం ః
k kh g gh ~g : క్ ఖ్ గ్ ఘ్ ఙ్
ch chh j jh ~j : చ్ ఛ్ జ్ ఝ్ ఞ్
t' th' d' dh' nd~ : ట్ ఠ్ డ్ ఢ్ ణ్
t th d dh n : త్ థ్ ద్ ధ్ న్
p ph b bh m : ప్ ఫ్ బ్ భ్ మ్
y r l v sh shh s h l' r' : య్ ర్ ల్ వ్ శ్ ష్ స్ హ్ ళ్ ఱ్
^ : separator; SIL : silence

Fig.2. Phone representations for Telugu varnamala (alphabet) with different symbols

C. Preparation of Transcription file

The transcription file contains the text corresponding to what is spoken in the speech files. Here the transcription is written using the phone symbols for the basic Telugu sounds, as shown in figure 2. To check the transcription in Telugu Unicode format, the Baraha tool (demo version) [9] is used. Baraha works like a text editor: we can type Telugu script using the dedicated phone symbols and correct the transcription easily, if required. An example transcription is shown below:

ఎన్నికల సమరానికి రంగం సిద్ధమౌతోంది
Ennikala samarAniki raMgaM siddamautOMdi

D. Preparation of Dictionary

A phonetic dictionary is required to recognize the speech. The phonetic dictionary consists of all the unique words and their phoneme sequences present in the transcription (both training and testing). An example of the phonetic dictionary is shown below (Baraha format):

• gud'^maarnin'g /g/ /u/ /d'/ /^/ /m/ /aa/ /r/ /n/ /i/ /n'/ /g/
• t'u /t'/ /u/
• elakshhan /e/ /l/ /a/ /k/ /shh/ /a/ /n/

Here, 4064 unique words were found in the transcription.

IV. IMPLEMENTATION OF ASR FOR TELUGU NEWS

The automatic speech recognition system for Telugu news is implemented using CMU SPHINX. The Sphinx training tools are used to generate the acoustic models, and the CMU SLM tools [12] are used to generate the language models.

A. Feature Extraction

Feature extraction in ASR is the computation of a sequence of feature vectors which provides a compact
representation of the given speech signal. The ultimate task of feature extraction is to generate the sequence of vectors representing the spectral and temporal behavior of the signal. Acoustic models capture the basic characteristics of the speech units used in the recognition system, and Mel Frequency Cepstral Coefficient (MFCC) features are used to train these models.

Here the MFCCs are used as the features extracted from the speech files for both model generation and recognition. The Mel filter bank is designed in such a way that it mimics the sensitivity of the human ear: the frequency representation of the speech signal is converted to the Mel frequency by passing it through the Mel filter bank, designed according to the Mel scale in (1).

mel(f) = 2595 log10(1 + f/700)    (1)

The following specifications are used for extracting the MFCC features: sampling frequency = 16 kHz; frame length = 25 ms; frame overlap = 15 ms; pre-emphasis factor = 0.97; lower and upper cut-off frequencies = 133 Hz and 6855 Hz; FFT length = 512; number of Mel filters = 40; number of MFCC coefficients per frame (including the frame energy) = 13. The 13 delta and 13 double-delta coefficients are then appended to the feature vector.

B. Acoustic model generation

The task of an acoustic model in speech recognition is to estimate, at run time, the probabilities P(A|W) for any acoustic data string A = a1, a2, ..., am and hypothesized word string W = w1, w2, ..., wn. In other words, the task is to estimate the probability that, when the speaker utters W, the acoustic processor outputs A.

The CMU SPHINX training tools are used to generate the parameters of the acoustic models: means, variances, transition probabilities and mixture weights. Continuous and semi-continuous models are generated for the sphinx-3 and pocketsphinx decoders, respectively. Here we have used 800 speech files, each around 2-15 seconds long, to train the automatic speech recognition system. Context-dependent triphone, diphone and monophone models are generated using HMMs with 3 states per model and a GMM of 8 Gaussians per state, based on studies carried out to find the optimum number of Gaussians for building an ASR with sphinx-3 and the TIMIT database [10][11]. However, the optimum varies with the size of the database available for training.

C. Language model generation

Language models are used mainly to predict the word sequences that occur most frequently in a particular language; they hold the probability of one word following another (or the same) word. In this implementation, the language model is prepared using the CMU SLM toolkit [12]. Here the transcriptions of both the training and the testing data are used to generate the language model. Of course, if more transcriptions become available later, they can also be included to improve the accuracy of the language models. If there are no medical words in the database used for building a language model, but that language model is used for medical transcription, the ASR will fail in recognition; language models are therefore generally application dependent.

V. EXPERIMENTAL RESULTS

Speech recognition performance is evaluated on the testing files using the Sphinx-3 and PocketSphinx speech decoders separately, each with its respective acoustic models. First, MFCC features are extracted from the testing speech file and provided as input to the decoder, along with the dictionary, filler dictionary, acoustic models and language model for the Telugu language. The testing speech files are maintained at a 16 kHz sampling frequency and 16 bits/sample.

Recognition accuracy is computed using the word error rate (WER). The numbers of substitutions (S), deletions (D) and insertions (I) are computed by comparing the recognized test transcription with the original transcription. The word error rate is calculated using equation (2):

W.E.R. = ((S + D + I) / N) × 100    (2)

where N is the total number of words in the sentence.

The training speech data of 800 sentences contains 4064 unique words; the testing data of 52 sentences contains a total of 702 words. The recognition accuracy using Sphinx-3 and PocketSphinx is given in Table I. Here both the training and testing transcriptions are used in the language model, and all the unique words are used in the phonetic dictionary.

For real-time applications we need to see the recognized text directly on the screen. For this purpose we have used Baraha (demo version): after recognition, the recognized text file is displayed using Baraha, as shown in figure 3.

TABLE I. SPEECH RECOGNITION SYSTEM ACCURACY

Total testing words | ASR accuracy with Sphinx-3 | ASR accuracy with Pocket Sphinx
702 | 87.32 % | 92.59 %

VI. CONCLUSION

A Telugu news speech recognition system has been built using the open-source CMU SPHINX speech recognition tools. The speech database is collected from Telugu news channel bulletins, and the output of the speech recognition system is displayed using Baraha. The reported results are taken by running the experiments. For the limited training data of around one hour of speech with 4064 unique words, the ASR accuracy is 92.59% with PocketSphinx for testing data whose words are all present in the language model and dictionary. However, the ASR accuracy comes down drastically (results are not shown in this paper) when the words in the testing data are not available in the dictionary and language model.
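The accuracy drop for unseen words can be anticipated by checking a new transcription against the current phonetic dictionary before decoding. Below is a minimal sketch of such an out-of-vocabulary (OOV) check — illustrative only, not part of the authors' toolchain; the example words are taken from this paper's transcription example:

```python
from collections import Counter

def oov_report(dictionary_words, transcription_lines):
    """List words in a new transcription that are absent from the phonetic
    dictionary (and hence unknown to the decoder), most frequent first."""
    known = set(dictionary_words)
    counts = Counter(
        word
        for line in transcription_lines
        for word in line.split()
        if word not in known
    )
    return counts.most_common()
```

Adding the most frequent OOV words to the dictionary, and then regenerating the language model, typically gives the largest accuracy gain for the least transcription effort.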
So, as and when new words become available, the language model and dictionary should be updated. This automatic Telugu news transcription will be very useful for applications such as keyword spotting and finding the details of an event that occurred in the past by searching the transcribed news.

Fig.3. Recognized text for input speech from the news channel

VII. FUTURE SCOPE

The system should be trained with a large Telugu speech corpus, to make it suitable for real-world applications.

REFERENCES

[1]. Ganesh Shivaraman, K. Samudravijaya, "Hindi speech recognition and online speaker adaptation," in Int. Conf. Technology Systems and Management (ICTSM) Proc., published by International Journal of Computer Applications (IJCA), 2011, pp. 27-30.
[2]. R. L. K. Venkateswarlu, R. R. Teja, R. V. Kumari, "Developing efficient speech recognition system for Telugu letter recognition," in Int. Conf. Computing, Communication and Applications (ICCCA), 22-24 Feb. 2012, ISBN: 978-1-4673-0270-8, DOI: 10.1109/ICCCA.2012.6179184.
[3]. Anumanchipalli Gopalakrishna et al., "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems," in Proc. Int. Conf. Speech and Computer (SPECOM), Patras, Greece, October 2005.
[4]. G. Sreenu, P. N. Girija, M. Narendra Prasad and M. Nagamani, "A Human Machine Speaker Dependent Speech Interactive System," IIT Kharagpur, December 20-22, 2004, IEEE 2004.
[5]. A. Lakshmi, Hema A. Murthy, "A New Approach to Continuous Speech Recognition in Indian Languages," in National Conference on Communications (NCC), Indian Institute of Technology Bombay, 01-03 February 2008, pp. 277-280.
[6]. CMU Sphinx project, Carnegie Mellon University, Pittsburgh, USA, http://cmusphinx.sourceforge.net/
[7]. FFmpeg, https://www.ffmpeg.org/
[8]. WaveSurfer, http://www.speech.kth.se/wavesurfer/
[9]. Baraha, http://www.baraha.com/
[10]. A. V. Ramana, P. Laxminarayana, P. Mythilisharan, "Investigation of ASR recognition performance and mean opinion scores for different speech and audio codecs," IETE Journal of Research, March-April 2012, pp. 121-129.
[11]. A. V. Ramana, P. Laxminarayana, P. Mythilisharan, "Real time ASR with HMM word models for Telugu," in Proc. Int. Conf. on Recent Advances in Communication Engineering, Osmania University, December 20-23, 2008.
[12]. The Carnegie Mellon University Statistical Language Modeling (SLM) Toolkit, http://www.speech.cs.cmu.edu/SLM_info.html