
DSP Lab Project

Speech Recognition using MFCC


Project Report

Deepak Chandran - B110116EC


Hashin Jithu - B110704EC
Hemanth P - B110147EC

1 Problem Statement

There has been a dramatic increase in the adoption of biometric verification in
our daily lives, e.g., laptop fingerprint scanners, Siri, etc. Among these, voice
verification occupies a large share of biometric verification due to its ease of
use: systems that use the human voice for verification do not require the user
to be anywhere near the verification system.
Depending on the problem specification, the task is either Automatic Speaker
Identification (determining who is speaking) or Automatic Speaker Verification
(validating whether the speaker is the person they claim to be). The aim of this
project is to implement a speaker identification system using MFCC concepts.

2 Theory

2.1 Feature Extraction

The recognition performance is dictated by the extraction of the best parametric
representation of the speech signal. Several methods are commonly used for
feature extraction, such as MFCC, LPC, and PLP; in this project we focus our
efforts on MFCC.
The Mel-Frequency Cepstral Coefficient (MFCC) technique is based on human
hearing perception: the perceived frequency content of speech sounds does not
follow a linear scale. The mel frequency scale is accordingly approximately
linear below 1000 Hz and logarithmic above 1000 Hz.
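A commonly used analytic form of this scale (the report does not state which
variant was used) maps a frequency f in Hz to

    mel(f) = 2595 * log10(1 + f / 700),

which is roughly linear for small f and logarithmic for large f.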

2.2 MFCC

The steps involved in calculating the MFCC coefficients are shown in Fig. 1.
Continuous speech coming from a source such as a microphone is processed over
short periods of time: the signal is divided into frames, each overlapping the
previous one so that the transitions between frames are smooth. In the second
step, a Hamming window is applied to each frame to reduce the distortion
introduced at the frame boundaries. After windowing, each frame undergoes an
FFT and is converted from the time domain to the frequency domain. In
mel-frequency wrapping, the spectrum of each frame is passed through a
mel-scale band-pass filter bank to mimic the human ear. In the final stage, the
signal is converted back using the Discrete Cosine Transform (DCT); the DCT is
used instead of an inverse FFT because it is more appropriate for the
real-valued log filter-bank outputs.
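As a minimal MATLAB sketch of these steps, assuming a simplified triangular
filter bank spaced uniformly on the mel scale (the function name, parameters,
and filter-bank details below are illustrative, not taken from the project's
actual code; hamming and dct require the Signal Processing Toolbox):

    function c = mfcc_frame(frame, fs, numFilters, numCoeffs)
    % Compute MFCCs for one speech frame (column vector of N samples).
    N = length(frame);
    spec = abs(fft(frame .* hamming(N))).^2;  % window, FFT, power spectrum
    spec = spec(1:floor(N/2)+1);              % keep non-negative frequencies

    mel  = @(f) 2595*log10(1 + f/700);        % Hz -> mel
    imel = @(m) 700*(10.^(m/2595) - 1);       % mel -> Hz
    edges = imel(linspace(0, mel(fs/2), numFilters + 2)); % filter edges in Hz
    freqs = (0:floor(N/2)) * fs / N;          % FFT bin frequencies

    logE = zeros(numFilters, 1);
    for k = 1:numFilters
        lo = edges(k); mid = edges(k+1); hi = edges(k+2);
        w = max(0, min((freqs-lo)/(mid-lo), (hi-freqs)/(hi-mid))); % triangle
        logE(k) = log(w * spec + eps);        % log filter-bank energy
    end

    c = dct(logE);                            % DCT instead of inverse FFT
    c = c(1:numCoeffs);                       % keep the first few coefficients
    end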

2.3 Feature Matching

A speaker recognition system should be able to determine with what probability
the speech of an unknown speaker matches speech present in the database. It
would be a tedious task to store all the feature vectors generated during the
training phase. Using the process of vector quantization, each feature vector
can be quantized to one of several template vectors, so that a small number of
representative vectors (a codebook) is created from the dataset. In the
recognition stage, the unknown speaker's speech is compared to the codebook of
each speaker and the difference is measured.

Figure 1: Steps involved in calculating the MFCC coefficients

Figure 2:
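The report does not state which codebook-generation algorithm was used; the
LBG algorithm is the classic choice here. As a minimal sketch, the fragment
below uses plain k-means clustering as a stand-in (kmeans and pdist2 are from
MATLAB's Statistics and Machine Learning Toolbox; variable names and the
codebook size K = 16 are our assumptions):

    % Build a codebook of K representative vectors for one speaker.
    % trainVecs: T-by-D matrix of MFCC vectors from that speaker's training speech.
    K = 16;                               % codebook size (assumed value)
    [~, codebook] = kmeans(trainVecs, K); % codebook: K-by-D centroid matrix

    % Distortion of an unknown utterance against this codebook:
    % testVecs: U-by-D matrix of MFCC vectors from the unknown speech.
    d = pdist2(testVecs, codebook);       % U-by-K Euclidean distances
    distortion = mean(min(d, [], 2));     % average nearest-codeword distance

The identified speaker is then the one whose codebook yields the smallest
distortion.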

3 Implementation

In the framing section, the speech signal is converted into frames of N
samples, with consecutive frames separated by M samples; in our implementation,
M = 100 and N = 256, so adjacent frames overlap by N - M = 156 samples. In the
windowing section we used the Hamming window.
The acoustic vectors created by the MFCC process capture the characteristics of
a speaker's voice. When an unknown speaker records his/her voice into MATLAB, a
fingerprint of that voice is created in the same way, and a suitable match is
determined using the Euclidean distance technique.
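A sketch of the framing and windowing steps with these parameters (x is the
recorded speech as a column vector; the variable names are ours, not the
report's):

    % Slice the signal into overlapping frames: N = 256 samples per frame,
    % consecutive frames M = 100 samples apart (overlap of N - M = 156).
    N = 256;  M = 100;
    numFrames = floor((length(x) - N) / M) + 1;
    frames = zeros(N, numFrames);
    for i = 1:numFrames
        first = (i-1)*M + 1;
        frames(:, i) = x(first : first + N - 1);
    end
    frames = frames .* hamming(N);  % window each column (implicit expansion, R2016b+)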

4 Observations

To implement the speaker recognition system, a simple voice command like
"Hello" was used.

Figure 3: Speech Signal

Figure 4: Framed Signal

Figure 5: Signal after windowing

Figure 6: Autocorrelation

5 Results

The aim of this project was to implement a speaker recognition system that
could, at a high level, differentiate between genders. The features extracted
from the unknown speech were compared to the stored feature set, and the gender
of the unknown speakers was identified successfully.

The Euclidean distance was used to compare the test recording to the database,
and the speech was recognized correctly 9 times out of 10. The crude speaker
recognition code was written in MATLAB; it compares the average pitch of the
recorded wav file as well as the vector differences between the formant peaks
in the PSD of each file.
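The pitch-comparison code itself is not reproduced in the report; a minimal
autocorrelation-based pitch estimate, consistent with the autocorrelation plot
in Fig. 6, might look like the following (the 50-400 Hz search range is our
assumption of a typical human pitch span; xcorr is from the Signal Processing
Toolbox):

    % Estimate pitch from the autocorrelation of a voiced segment.
    % x: speech samples (column vector), fs: sampling rate in Hz.
    r = xcorr(x, 'coeff');             % normalized autocorrelation
    r = r(length(x):end);              % keep lags 0, 1, 2, ...
    lags = floor(fs/400):ceil(fs/50);  % candidate pitch periods in samples
    [~, k] = max(r(lags + 1));         % +1 because r(1) holds lag 0
    pitchHz = fs / lags(k);            % lag of the strongest peak -> pitch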

