
Text-Dependent Speaker Verification

Pratik Maheshwari 2
Mihir Kulkarni 3
Amol Madane 4

Pratik Maheshwari is a third-year student of Electronics Engg., S.P. College of Engg., Mumbai, India (email: pratik19@gmail.com).
Mihir Kulkarni is a third-year student of Electronics Engg., S.P. College of Engg., Mumbai, India (email: kmihiras@gmail.com).
Amol Madane is a faculty member in the Electronics Dept. of S.P. College of Engg., Mumbai, India (email: madane_a_r@yahoo.co.in).
K.T. Talele is a faculty member in the Electronics Dept. of S.P. College of Engg., Mumbai, India (email: kttalele@yahoo.co.uk).

Abstract-- The aim of this project is to implement a text-dependent speaker verification system for authentication. The idea is to grant access only to a particular person, whose speech was pre-recorded. The proposed verification system is based on 7 tests. Access is given to the person who passes 5 out of the 7 authentication tests. We have tested the prototype system in different environments. The results show that the prototype system is 100% efficient in a noise-free environment. The efficiency degrades if the surrounding noise level is high.

CURRENTLY, computers can only understand human speech in a very limited capacity. It would be much easier for anyone to use a computer if the computer could understand the person's natural communication method. Speech recognition could also be used to provide access for anybody who has a handicap that prevents use of a keyboard. There is an entire class of people that cannot use a computer at all because they are disabled. Speech recognition could potentially make their lives easier.

Computers also need a way to identify who is trying to use them. The most common method of user identification is through the use of passwords. Passwords are not always effective, for several reasons. The first reason is that the computer identifies the user purely based on a sequence of characters input by the user. It is easy to see that anyone knowing this sequence can gain access, even if they are not the intended user. Passwords can also be guessed or broken. There are several characteristics of a person's voice that are unique to the individual. Because of this uniqueness, a person's voice could be a very accurate way to authenticate a user. Voice recognition has the benefits of being very user friendly and secure. Human speech can be very complex, so the scope of this project is very small. We intend to design and implement a speech recognition system in which the computer will be able to understand a few simple commands and identify specific users.

There are different speech modalities:

A. Text-dependent recognition
• Recognition system knows text spoken by person
• Examples: fixed phrase, prompted phrase
• Used for applications with strong control over user input
• Knowledge of spoken text can improve system performance

B. Text-independent recognition
• Recognition system does not know text spoken by person
• Examples: user-selected phrase, conversational
• Used for applications with less control over user input
• More flexible system but also more difficult
• Speech recognition can provide knowledge of spoken text

Different speaker recognition systems, tools and methods have been built on techniques such as:
• Neural network learning
• The Bayesian Maximum A Posteriori (MAP) adaptation method
• Statistical analysis and vector quantization
• Gaussian mixture models (GMM)
• Hidden Markov models (HMM)

II. TECHNICAL WORK PREPARATION

The Speech Recognition Algorithm
Our speech recognition algorithm contains 7 tests. In the following sections, each of these tests is explained in detail. Each of the individual tests has two flags associated with it: the result is 1 if the condition is satisfied and 0 if it is not. These results are derived from certain conditional probabilities obtained from experimental conclusions. For a test to be effective, its conditions should be satisfied.

Speech Period Detection
When a machine is continuously listening to speech, a difficulty arises when it tries to figure out where a word starts and stops. We solved this problem by examining the magnitude of several consecutive samples of sound. If the magnitude of these samples is great enough, then those samples are kept and examined later.

The example speech signal contains a lot of empty space where nothing is being said. We don't want the computer to waste time analyzing this empty space, so we simply remove it. In Matlab, this is done with the 'clean' function. An example speech signal is shown in Figure 1 before cleaning. The cleaned signal is shown in Figure 2.
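The silence-removal step described under Speech Period Detection could look roughly like the following. This is only a sketch in plain Python, not the authors' Matlab 'clean' function: the window size of 4 samples, the threshold of 0.05, and the use of the mean magnitude per window are all assumptions made for illustration.

```python
def clean(signal, threshold=0.05, window=4):
    """Keep only the windows whose mean magnitude reaches `threshold`.

    The signal is scanned in non-overlapping windows of `window`
    consecutive samples (hypothetical parameter values).
    """
    kept = []
    for start in range(0, len(signal), window):
        frame = signal[start:start + window]
        mean_magnitude = sum(abs(s) for s in frame) / len(frame)
        if mean_magnitude >= threshold:
            kept.extend(frame)
    return kept


# Silence, a short burst of "speech", then low-level noise.
recording = [0.0] * 8 + [0.5, -0.6, 0.4, -0.3] + [0.01] * 8
cleaned = clean(recording)
```

With these example values only the four loud samples survive, which corresponds to the difference between Figure 1 and Figure 2.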
Figure 1 Example speech signal.

Figure 2 Example speech signal after it has been cleaned.

Speaker Recognition
The speaker recognition process relies heavily on frequency analysis. This can be done because each person has some very unique characteristics to their voice that can be isolated in the frequency domain.

1. The first test measures the length of speech.
This test is conducted mainly to eliminate speech segments that may contain too much data or too little data. If the sample is found to be too short, it was most likely recorded because background noise near the microphone tricked the computer into starting to record. If this is the case the results are ignored. If the sample is too long, then two scenarios are possible: either there is too much background noise that is causing it to be constantly recorded, or the user is issuing too many commands too quickly. In both cases the other tests will still attempt to identify the speech sample. The sample only passes the first test if the length of the sample is within a percent threshold of the length of the template. This test is mainly designed to prevent false positives.

2. The second test used is max of time cross-correlation.

$R_{xy}(l) = \sum_{n} y(n)\, x(n - l)$
Equation 1: Cross-Correlation

This is a very effective test, but because of the large amount of computation required to perform it, we use an estimate instead. The values obtained from cross-correlation give a good indication of how similar the two signals are. Figure 3 illustrates the cross-correlation of a signal with itself (autocorrelation). After computing the correlation, there will be a large peak in the middle of the resulting signal if the two signals being correlated are similar. In this case the peak is a value close to 200, which means they are very similar. To find out if the sample passes this test, we compare this peak to a threshold. If it is greater than the threshold, then it passes the second test.

Figure 3 Autocorrelation of the cleaned signal.

3. MEAN SQUARE ERROR
The formula for MSE is given by

$Y = \frac{1}{N} \sum_{i=1}^{N} x_i^{2}$
Equation 2

In this test we first find the autocorrelation of the original signal and then we find the cross-correlation of the original voice and the test signal. We then find the MSE, using the above formula, of the difference of the two resultant outputs. If the MSE is above a certain threshold then it passes the test.

4. Discrete-time Fourier transform

$X(e^{j\omega}) = \sum_{n} x[n]\, e^{-j\omega n}$
Equation 3: Discrete-time Fourier transform

The rest of the tests all examine the signal in the frequency domain. All four remaining tests examine the power spectrum of the signals. First the Fourier transform is computed using Equation 3. It is convenient to have a standard length for the power spectrum computation when comparing various signals. The length of the power spectrum is equal to the length of the signal in the time domain when computed with the method described in Equation 3. A plot of the power spectrum obtained with this function can be seen in Figure 4.
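Before moving on to the frequency-domain tests, the three time-domain tests described above (length, cross-correlation peak, and MSE of the correlation difference) can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the authors' implementation: the 25% length tolerance is a made-up value, the full cross-correlation is computed because the paper does not describe the estimate it uses, and the MSE function returns the raw value so that the caller applies whatever threshold rule is chosen.

```python
def length_test(sample, template, tolerance=0.25):
    """Test 1: sample length within a fraction `tolerance` of the template's."""
    return abs(len(sample) - len(template)) <= tolerance * len(template)


def cross_correlation(x, y):
    """R_xy(l) = sum_n y(n) * x(n - l) for every lag l where the signals overlap."""
    result = []
    for lag in range(-(len(x) - 1), len(y)):
        total = 0.0
        for n in range(len(y)):
            k = n - lag
            if 0 <= k < len(x):
                total += y[n] * x[k]
        result.append(total)
    return result


def peak_test(template, sample, threshold):
    """Test 2: the largest cross-correlation value must exceed `threshold`."""
    return max(cross_correlation(template, sample)) > threshold


def correlation_mse(template, sample):
    """Test 3 quantity: Equation 2 applied to (autocorrelation - cross-correlation)."""
    auto = cross_correlation(template, template)
    cross = cross_correlation(template, sample)
    # zip stops at the shorter list; both lists start at lag -(len(template) - 1),
    # so the lags stay aligned (assumption: only the common lags are compared).
    diff = [a - c for a, c in zip(auto, cross)]
    return sum(d * d for d in diff) / len(diff)


template = [0.0, 1.0, 2.0, 1.0, 0.0]
identical = list(template)
silent = [0.0] * 5

same_peak_ok = peak_test(template, identical, threshold=5.0)
silent_peak_ok = peak_test(template, silent, threshold=5.0)
```

For the identical signal the correlation peak sits at zero lag, in the middle of the output, as in Figure 3.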

Figure 4 Power spectrum of example signal.

The fourth test calculates the MSE, using Equation 2, of the difference of
the two power spectral densities. Here also a threshold is selected,
and it decides the result of the test.
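A minimal sketch of the fourth test, assuming the power spectrum is taken as the squared magnitude of an N-point DFT (a sampled form of Equation 3) and that both signals have already been brought to the same length. Normalizing each spectrum to unit sum is an assumption made here so that the comparison does not depend on recording level; the function names are hypothetical.

```python
import cmath
import math


def power_spectrum(signal):
    """|X[k]|^2 from an N-point DFT, so the output length equals len(signal)."""
    n_points = len(signal)
    spectrum = []
    for k in range(n_points):
        x_k = sum(
            signal[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def normalize(spectrum):
    """Scale a spectrum to unit sum (an all-zero spectrum is returned unchanged)."""
    total = sum(spectrum)
    return [p / total for p in spectrum] if total else list(spectrum)


def spectral_mse(template, sample):
    """Equation 2 applied to the difference of the two normalized power spectra."""
    if len(template) != len(sample):
        raise ValueError("signals must have the same length")
    p_template = normalize(power_spectrum(template))
    p_sample = normalize(power_spectrum(sample))
    diff = [a - b for a, b in zip(p_template, p_sample)]
    return sum(d * d for d in diff) / len(diff)


# A 16-sample sine with two full cycles: its energy lands in bins 2 and 14.
tone = [math.sin(2 * math.pi * 2 * n / 16) for n in range(16)]
louder_tone = [2 * v for v in tone]
spectrum = power_spectrum(tone)
```

Because of the normalization, `spectral_mse(tone, louder_tone)` is zero up to rounding error. The direct DFT loop is quadratic in the signal length, so a real implementation would normally use an FFT routine instead.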

The fifth test is one of the most important parts of speaker
recognition. We named this test frequency multiplication. In this
test we calculate the normalized power spectrum of the sample and
each template. Then we multiply each element of the power
spectrum of the original signal by the corresponding element in the
power spectrum of the test signal and we store this value. This is
illustrated in Figure 5.

Figure 5 Illustration of frequency multiplication.

We are trying to identify locations in the power spectrum
where both signals have peaks. This test works because only
peaks that show up in both signals will be passed on to the
resulting calculation. We then sum up all the values of
the resulting spectrum and compare the sum with a threshold. If the
sum is greater than the threshold, then the sample passes the
fifth test.
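The frequency multiplication described above amounts to an element-wise product of two normalized power spectra followed by a sum. The sketch below assumes the spectra are already available as equal-length lists and normalizes each to unit sum; the normalization choice, the function names, and the toy spectra are illustrative assumptions.

```python
def normalize(spectrum):
    """Scale a spectrum to unit sum (an all-zero spectrum is returned unchanged)."""
    total = sum(spectrum)
    return [p / total for p in spectrum] if total else list(spectrum)


def frequency_multiplication_score(template_spectrum, sample_spectrum):
    """Sum of the element-wise products of the two normalized power spectra."""
    if len(template_spectrum) != len(sample_spectrum):
        raise ValueError("spectra must have the same length")
    a = normalize(template_spectrum)
    b = normalize(sample_spectrum)
    return sum(p * q for p, q in zip(a, b))


template = [0.0, 8.0, 0.0, 2.0]      # peaks in bins 1 and 3
same_voice = [0.0, 4.0, 0.0, 1.0]    # same peaks, lower amplitude
other_voice = [5.0, 0.0, 5.0, 0.0]   # peaks in different bins

match_score = frequency_multiplication_score(template, same_voice)
mismatch_score = frequency_multiplication_score(template, other_voice)
```

Here the matching spectra give a clearly positive score, while the spectra with peaks in different bins give zero, because a product is non-zero only where both spectra have energy. The score would then be compared with a threshold to decide the fifth test.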
The sixth test is similar to the fifth test, except that instead
of doing cross multiplication, we do cross-correlation. We have
already described how the correlation process works in the
time domain, and it can be used in the same way in the
frequency domain. We simply cross-correlate the power
spectrum of one sample with that of a test signal. The highest
value in the cross-correlation is compared against a threshold,
and if it is greater than the threshold, then it passes the sixth
test.

The seventh test compares the number of peaks in the
power spectrum of the original signal and the test signal. It
does this by counting the number of times the power
spectrum is above a certain threshold. If this value is close
enough to the template value, then it passes the seventh test.
This test, like the first test, is primarily designed to prevent
false positives.
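The sixth and seventh tests, together with the five-out-of-seven rule stated in the abstract, can be sketched as below. The inputs are assumed to be power spectra given as plain lists; the peak level, the allowed difference in peak counts, and all function names are hypothetical choices for illustration.

```python
def max_cross_correlation(a, b):
    """Largest value of sum_n b[n] * a[n - lag] over all overlapping lags."""
    best = float("-inf")
    for lag in range(-(len(a) - 1), len(b)):
        total = sum(
            b[n] * a[n - lag]
            for n in range(len(b))
            if 0 <= n - lag < len(a)
        )
        best = max(best, total)
    return best


def count_above(spectrum, level):
    """Number of bins whose power exceeds `level` (used as a peak count)."""
    return sum(1 for p in spectrum if p > level)


def peak_count_test(template_spectrum, sample_spectrum, level, max_difference=1):
    """Test 7: the two peak counts may differ by at most `max_difference`."""
    difference = abs(
        count_above(template_spectrum, level) - count_above(sample_spectrum, level)
    )
    return difference <= max_difference


def verify(flags, required=5):
    """Grant access when at least `required` of the 0/1 test flags are set."""
    return sum(flags) >= required


template_spectrum = [0, 8, 0, 2]
sample_spectrum = [0, 4, 0, 1]

sixth_value = max_cross_correlation(template_spectrum, sample_spectrum)
seventh_ok = peak_count_test(template_spectrum, sample_spectrum, level=1)
granted = verify([1, 1, 1, 0, 1, 1, 0])
```

In this toy case the correlation of the two spectra peaks at zero lag, and five passed tests out of seven are enough for `verify` to grant access.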

We successfully implemented a speaker verification system.
We have tested this prototype system under different
environmental conditions. We observed that efficiency is 100%
in a noise-free environment. As normalized coefficients are
used for comparison, the system works irrespective of the
input signal amplitude. When the signal level is weak,
system performance degrades. Authentication based on the Mean
Square Error of the PSD and on the correlation tests was found to be the
most accurate. We have considered additional tests to
improve the efficiency and to avoid the possibility of false

IV. REFERENCES

[1] Jastiar, A. H. Abdullah and D. Mohamad, "Application of speaker identification using discrete Fourier transform".
[2] Thomas J. Plummer, "Investigation of Text Independent Speaker Verification Process", November 30, 2004.
[3] K. K. Chin, Darwin College, "Support vector machines applied to speech pattern classification".
[4] Tomi Kinnunen, "Spectral Features for Automatic Text-Independent Speaker Recognition".
[5] Harsh Gupta, Ville Hautamäki, Tomi Kinnunen and Pasi Fränti, "Field Evaluation of Text-Dependent Speaker Recognition in an Access Control Application".
[6] Jonathan Terleski, "Voice recognition using MATLAB".
[7] John G. Proakis, Dimitris G. Manolakis, "Digital Signal Processing".

Pratik Maheshwari is a third-year student of Electronics Engg., S.P. College of Engg., Mumbai. He is a member of IEEE. He won 1st prize in electronic workshop. His areas of interest include Robotics, Machine vision and Embedded systems. (pratik19@gmail.com)

Mihir Kulkarni is a third-year student of Electronics Engg., S.P. College of Engg., Mumbai. His areas of interest include Programming and Embedded systems. (kmihiras@gmail.com)

Amol R Madane is Lecturer in the Electronics Engg Dept, S. P. College of Engineering, Mumbai. His areas of interest include DSP, Image Processing and Multimedia Comm. He has published nine papers in National Conferences. (madane_a_r@yahoo.co.in)

K.T. Talele is Assistant Professor in the Electronics Engg Dept, S. P. College of Engineering, Mumbai. He is a member of IEEE. His areas of interest include DSP, Image Processing and Multimedia Comm. He has published twenty papers in National Conferences and two papers in International conferences. (kttalele@yahoo.co.uk)