
Isolated Word Recognition using MFCC and iterative Clustering Technique

Dr. A. Revathi (1), Pujitha Kothapalli (2), Arutla Sravya (2) and C. Lakshmi Sravya (2)
(1) Professor, ECE, School of EEE, SASTRA University
(2) III year ECE, School of EEE, SASTRA University

ABSTRACT:

The main goal of this paper is to develop a speaker-independent isolated word recognition system. Isolated word recognition is the ability to recognise individual words, spoken with a pause between them, correctly and with little effort. In this paper the recogniser consists of three phases: speech acquisition (collection of words), feature extraction using Mel Frequency Cepstral Coefficients (MFCC), and classification and recognition. First, the input analog signal is converted into a digital speech signal. From the digital signal the acoustic features are extracted and stored in the database. These features are used to compute the threshold values. In the recognition phase, the word uttered by an independent speaker is recognised or rejected. Finally, training and testing are carried out for the different isolated words spoken by the different speakers. This kind of speech recognition is used for speaker identification (speaker dependent), speech recognition, security purposes, etc.

Keywords: Speech recognition, frequency extraction, Mel frequency cepstral coefficient

INTRODUCTION:

Speech communication has evolved to be efficient and robust, and it is clear that the route to computer-based speech recognition is the modelling of the human system. Speech recognition is the ability of a program or machine to identify the words or voices spoken in any language and convert them into a machine-readable format. Speech recognition systems can be classified as isolated or continuous. Isolated word recognition requires a brief pause between each spoken word, whereas continuous speech recognition does not. Applications of isolated word recognition include driving controls for disabled people. Speech recognition can further be classified as speaker dependent and speaker independent. In a speaker-dependent system, the speech used for testing is taken from the same speakers whose speech was used for training. In a speaker-independent system, the sets of speakers used for training and testing are different. Speech recognition is embedded in voice-activated routing systems at call centres, voice dialling on mobile phones and many other day-to-day applications.

One of the important steps in a speech recognition system is feature extraction. Feature extraction deals with identifying the linguistic content and discarding background noise, emotion, etc. Its main aim is to find discriminative and robust features in the acoustic data. There are many techniques for extracting features, such as Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) and Linear Predictive Cepstral Coefficients (LPCC). We have chosen MFCC for the following reasons:
a) MFCCs are the most important features required across various kinds of speech applications.
b) They give high accuracy for clean speech.

In this paper a set of 7 words is collected from 16 speakers (8 male and 8 female) as input for both the training and testing phases. Mel frequency cepstral coefficients are extracted from the speech considered for training and testing. The speech model for each word is developed by the K-means clustering algorithm. In the clustering method, 256 clusters are formed and stored. A model is created for each word from the training data. For each model the mean of the minimum distances is computed. The word is classified by selecting the model that produces the minimum of these averages. Models with 32, 64 or 128 clusters can also be formed in such a way that they capture the characteristics of the training data distribution. The Euclidean distance is small for the most frequently occurring vectors and large for the least frequently occurring ones. As the number of clusters increases, execution time increases.

SPEECH IDENTIFICATION:

This paper includes samples of the words enter, erase, help, repeat, stop, start and yes from 16 speakers, 8 male and 8 female. Identification involves
two stages, namely training and testing. In the training stage, the database of words is given as input to the system and MFCC features are extracted. The iterative clustering algorithm is then used to model the extracted features.

MEL FREQUENCY CEPSTRAL COEFFICIENT

The speech input is typically recorded at a sampling rate of 16 kHz. This sampling frequency is chosen to minimise the effect of aliasing in analog-to-digital conversion. The sampled signal captures all frequencies up to 8 kHz, which covers most of the energy of the sounds generated by human beings. The feature extraction block diagram is shown in Figure 1.

Figure 1. MFCC extraction (block diagram: Input Speech -> Pre-emphasis -> Frame Blocking -> Hamming Windowing -> FFT -> Mel Frequency Wrapping -> Cepstrum -> Mel-Cepstrum)

A. Pre-Emphasis:

Pre-emphasis is a very simple signal processing method that increases the amplitude of the higher frequency bands and decreases the amplitude of the lower frequency bands. It greatly reduces the noise in the input signal. Here a fixed first-order high-pass filter is used to flatten the signal spectrum and make it less susceptible to finite-precision effects. A fixed or slowly adaptive digital system is used in the pre-emphasizer. The most widely used pre-emphasis network is the fixed first-order system given by (1):

$$ y[n] = x[n] - a\,x[n-1] \qquad (1) $$

where x[n] is the input speech signal, y[n] is the output of the first-order high-pass filter and 'a' is the filter coefficient, which generally lies in the interval [0.95, 0.98]. We used a pre-emphasis coefficient of a = 0.97.

B. Frame Blocking:

Frames are segments of the pre-emphasized speech signal with a duration of 20-30 ms (here we use 25 ms). The voice signal is divided into frames of N samples, with adjacent frames separated by M samples (M < N). Framing is done for short-time spectral analysis.

C. Windowing:

To keep the continuity of the signal, each of the above frames is multiplied by a Hamming window. The window minimises spectral distortion by tapering the voice samples towards zero at both the beginning and the end of each frame. Generally,

$$ Y[n] = X[n]\,W[n] $$

where X[n] is a frame of the signal, Y[n] is the windowed frame and W[n] is the window function. The widely used Hamming window function is given as

$$ W[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad n = 0, \ldots, N-1 $$

D. Fast Fourier Transform:

The Fast Fourier Transform (FFT) converts each frame from the time domain to the frequency domain. We perform the FFT to obtain the magnitude frequency response of each frame. The output is a spectrum or periodogram.

E. Mel Frequency Wrapping:

The Mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the Mel scale. The filter bank used has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant Mel frequency interval. The constant K, the number of Mel spectrum coefficients, is taken as 13.

F. Cepstrum:

The cepstrum is a useful way of separating the source and the filter. It is the result of taking the inverse Fourier transform of the logarithm of the estimated spectrum of the signal. The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis.
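The first three stages of the feature extraction (pre-emphasis, frame blocking and Hamming windowing) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the frame shift M = 10 ms is an assumption (the paper fixes only the 25 ms frame length and requires M < N), the first sample is passed through unfiltered, and the input is a synthetic 440 Hz tone standing in for recorded speech.

```python
import math

def pre_emphasis(x, a=0.97):
    """First-order high-pass filter y[n] = x[n] - a*x[n-1] (a = 0.97 as in the paper)."""
    if not x:
        return []
    # The first sample has no predecessor, so it is passed through unchanged (assumption).
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def frame_blocking(x, frame_len, hop):
    """Split x into frames of frame_len samples whose start points are hop samples apart."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def hamming(N):
    """Hamming window W[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_frames(frames):
    """Multiply every frame sample-by-sample with a Hamming window of the same length."""
    w = hamming(len(frames[0]))
    return [[s * wn for s, wn in zip(frame, w)] for frame in frames]

fs = 16000                 # sampling rate from the paper (16 kHz)
N = fs * 25 // 1000        # 25 ms frame  -> 400 samples
M = fs * 10 // 1000        # 10 ms shift  -> 160 samples (assumed, not stated in the paper)

# 100 ms synthetic test tone in place of a recorded word
signal = [math.sin(2 * math.pi * 440 * n / fs) for n in range(fs // 10)]

emphasized = pre_emphasis(signal)
frames = frame_blocking(emphasized, N, M)
windowed = window_frames(frames)
print(len(frames), len(frames[0]))   # 8 400
```

Because M < N, consecutive frames overlap (here by 240 samples), so the tapered edges of one frame are covered by the centre of its neighbours.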

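The remaining stages (spectrum, Mel filter bank and cepstrum) can be sketched for a single windowed frame as below. Several details are assumptions that the paper does not state: the Hz-to-Mel mapping 2595*log10(1 + f/700), the use of 26 triangular filters from which 13 coefficients are kept, and a DCT-II in place of the inverse Fourier transform of the log spectrum, which is the usual practical choice for MFCC. A direct DFT is used instead of an FFT only to keep the sketch free of external libraries; all function names are hypothetical.

```python
import cmath
import math

def power_spectrum(frame):
    """Periodogram |X[k]|^2 / N for k = 0..N/2, computed with a direct (slow) DFT."""
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        X = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spec.append(abs(X) ** 2 / N)
    return spec

def hz_to_mel(f):
    # Common Mel mapping (assumed): close to linear below 1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular band-pass filters whose edges are uniformly spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(fs / 2)
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(n_fft * f / fs) for f in edges_hz]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * n_bins
        for k in range(lo, mid):          # rising edge of the triangle
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge, peak value 1 at k = mid
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def dct2(v, n_out):
    """First n_out DCT-II coefficients of v (unnormalised)."""
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(n_out)]

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), fs)
    energies = [sum(w * p for w, p in zip(filt, spec)) for filt in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]   # small floor avoids log(0)
    return dct2(log_energies, n_ceps)

fs, N = 16000, 400
# Hamming-windowed 1 kHz tone as a stand-in for one 25 ms speech frame
frame = [(0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
         * math.sin(2 * math.pi * 1000 * n / fs) for n in range(N)]
coeffs = mfcc_frame(frame, fs)
print(len(coeffs))   # 13
```

Each frame of an utterance yields one 13-dimensional vector; the set of such vectors for a word is what the clustering stage models.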

K-means Clustering Algorithm:

Clustering is the process of partitioning a group of data points into a small number of clusters. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters. The main idea is to define K centres, one for each cluster. The steps involved in the K-means algorithm are as follows:
a) K points are placed in the space represented by the objects being clustered. These points represent the initial group centroids.
b) Each object is assigned to the group that has the closest centroid.
c) Once all the objects have been assigned, the positions of the K centroids are recalculated.
d) Steps b) and c) are repeated until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimised can be calculated.

RESULTS AND DISCUSSION

The database used includes the words enter, erase, help, repeat, start, stop and yes from 8 male speakers and 8 female speakers. Each word was uttered 25 times by each speaker, of which 16 utterances were taken for training and 9 for testing. For the output obtained from testing the 7 words, the accuracy for each word is measured using

$$ \text{Accuracy} = \frac{\text{number of correctly recognised test samples}}{\text{total number of test samples}} \times 100\% $$

The confusion matrix for the seven words is shown in TABLE 1. It shows that for the word enter, 146 out of 155 samples are correctly identified, giving an accuracy of 94.19%. Of all the words, 'yes' has the highest accuracy, 100%, and 'repeat' the lowest, 80.64%. The accuracy of all the words is shown in Figure 2.

Figure 2. Accuracy (bar graph: words on the abscissa, recognition accuracy in % on the ordinate)

TABLE 1. CONFUSION MATRIX

(rows: spoken word; columns: recognised word)

          Enter  Erase  Help  Repeat  Start  Stop  Yes  Accuracy
Enter       146      5     0       0      3     1    0    94.19%
Erase         0    154     0       1      0     0    0    99.35%
Help          4      1   146       0      0     4    0    94.19%
Repeat        1     25     1     126      0     2    0    80.64%
Start         0      0     0       1    147     8    0    94.23%
Stop          0      1     2       0     13   140    0    89.74%
Yes           0      0     0       0      0     0  156      100%
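As a quick check of how the accuracy column relates to the counts, the sketch below recomputes per-word accuracy from the confusion matrix. It assumes that each row holds the test utterances of one spoken word and that accuracy is the diagonal count divided by the row total; the variable and function names are illustrative only.

```python
words = ["Enter", "Erase", "Help", "Repeat", "Start", "Stop", "Yes"]

# Counts copied from TABLE 1 (row = spoken word, column = recognised word)
confusion = [
    [146,   5,   0,   0,   3,   1,   0],
    [  0, 154,   0,   1,   0,   0,   0],
    [  4,   1, 146,   0,   0,   4,   0],
    [  1,  25,   1, 126,   0,   2,   0],
    [  0,   0,   0,   1, 147,   8,   0],
    [  0,   1,   2,   0,  13, 140,   0],
    [  0,   0,   0,   0,   0,   0, 156],
]

def per_word_accuracy(matrix):
    """Percentage of each row's samples that fall on the diagonal (correct word)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]

accuracy = dict(zip(words, per_word_accuracy(confusion)))
for word in words:
    print(f"{word:<7}{accuracy[word]:7.2f}%")
```

With the counts as printed, every row reproduces the listed accuracy except Repeat, where 126 out of 155 evaluates to 81.29% rather than the listed 80.64%.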


CONCLUSION:

This paper addresses MFCC feature extraction for isolated word recognition using an iterative clustering technique. A speech model with 256 clusters is created for each word. In the testing phase, the mean of the minimum distances is computed for each model, and the word is classified by selecting the model that produces the minimum average. The average accuracy over all the words is 93.14%.
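The modelling and decision rule summarised above can be illustrated with a small sketch: a K-means model is trained per word, and a test utterance is assigned to the word whose model gives the smallest mean of minimum Euclidean distances. This is a toy illustration under stated assumptions, not the authors' code: it uses 2-dimensional synthetic vectors and K = 4 instead of 13-dimensional MFCC vectors and K = 256, initial centroids are drawn at random from the training data, and all names are hypothetical.

```python
import math
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, max_iter=100, seed=0):
    """Plain K-means: returns k centroids for the given feature vectors."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]      # step a: initial centroids
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for v in vectors:                                      # step b: nearest centroid
            j = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            groups[j].append(v)
        new = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
               for j, g in enumerate(groups)]                  # step c: recompute centroids
        if new == centroids:                                   # step d: stop when unchanged
            break
        centroids = new
    return centroids

def mean_min_distance(vectors, centroids):
    """Average, over all vectors, of the distance to the closest centroid."""
    return sum(min(math.sqrt(sq_dist(v, c)) for c in centroids)
               for v in vectors) / len(vectors)

def classify(vectors, models):
    """Pick the word whose model yields the minimum of the averages."""
    return min(models, key=lambda word: mean_min_distance(vectors, models[word]))

rng = random.Random(1)

def cloud(cx, cy, n=60):
    """Synthetic 2-D 'feature vectors' scattered around (cx, cy)."""
    return [[cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)] for _ in range(n)]

train = {"yes": cloud(0, 0) + cloud(1, 1), "stop": cloud(5, 5) + cloud(6, 4)}
models = {word: kmeans(vectors, k=4) for word, vectors in train.items()}

test_utterance = cloud(5, 5, n=20)
print(classify(test_utterance, models))   # stop
```

Larger K lets a model follow the training distribution more closely, at the cost of more distance computations per test vector, which matches the remark in the paper that execution time grows with the number of clusters.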

