
Audio Signal Feature Extraction and Classification by CLARENCE CHEONG WEIHAN 1081106210 Session 2011/2012

This project report is prepared for the Faculty of Engineering, Multimedia University, in partial fulfilment of the requirements for the degree of Bachelor of Engineering.

FACULTY OF ENGINEERING MULTIMEDIA UNIVERSITY May 2012

The copyright of this report belongs to the author under the terms of the Copyright Act 1987 as qualified by Regulation 4(1) of the Multimedia University Intellectual Property Regulations. Due acknowledgement shall always be made of the use of any material contained in, or derived from, this report.


Declaration
I hereby declare that this work has been done by myself and no portion of the work contained in this report has been submitted in support of any application for any other degree or qualification of this or any other university or institute of learning. I also declare that pursuant to the provisions of the Copyright Act 1987, I have not engaged in any unauthorised act of copying or reproducing or attempt to copy / reproduce or cause to copy / reproduce or permit the copying / reproducing or the sharing and / or downloading of any copyrighted material or an attempt to do so whether by use of the University's facilities or outside networks / facilities whether in hard copy or soft copy format, of any material protected under the provisions of sections 3 and 7 of the Act whether for payment or otherwise save as specifically provided for therein. This shall include but not be limited to any lecture notes, course packs, thesis, textbooks, exam questions, any works of authorship fixed in any tangible medium of expression whether provided by the University or otherwise. I hereby further declare that in the event of any infringement of the provisions of the Act whether knowingly or unknowingly the University shall not be liable for the same in any manner whatsoever, and I undertake to indemnify and keep indemnified the University against all such claims and actions.

Signature: ________________________ Name: Student ID: Date:


Acknowledgements
I am very grateful that this project was completed successfully; however it would have been impossible to do so without the help of many. Hereby, I would like to extend my gratitude to them all. First and foremost, I would like to thank my Lord Jesus Christ for being with me and guiding me throughout the entire journey here in MMU and especially so in the completion of this Final Year Project. I would also like to express gratitude to my supervisor, Associate Professor Dr Mohammad Faizal for being kind and approachable, guiding and assisting me in birthing forth new ideas during the project. Along with that, I would also like to thank Multimedia University for giving me this chance to explore, cultivate and deepen my understanding in this area of research. I would also love to thank my family members for their moral and mental support and not forgetting my fellow friends who have been with me through the thick and thin. I am very grateful for the Multimedia University Christian Society and fellow church members who have been upholding me in prayer.


Abstract
The aim of this project was to study how audio signal feature extraction and classification can be carried out in the areas of speech and speaker analysis. With the advancement of technology, speech recognition has become one of the key research areas for improving human-machine interaction. This project implemented feature extraction using the Mel-frequency Cepstral Coefficient (MFCC) technique and classification using the K-means vector quantization method, in an attempt to study and understand how well these algorithms work. With MATLAB as the system backbone, various experiments were carried out by tweaking different variables and conditions. In classifying speakers based on given speech audio samples, the experiments performed were able to achieve an overall accuracy level of 66.67%. The system was able to perform classification despite being given different Training Data such as same speech, variable speech and even variable sentence level.

Table of Contents
Declaration
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
List of Equations
List of Abbreviations
CHAPTER 1: INTRODUCTION
CHAPTER 2: LITERATURE REVIEW
CHAPTER 3: METHODOLOGY
CHAPTER 4: RESULTS AND DISCUSSIONS
CHAPTER 5: CONCLUSIONS
References
Appendix A: MATLAB Code General Layout
Appendix B: MATLAB Resampling of Sampling Frequency
Appendix C: Multiple speaker sentences vs. 72 word length speech samples

List of Figures

Figure 1.1: Types of speech processing [3]
Figure 2.2: Representation of sound waves: (a) Sine wave in time domain; (b) Corresponding spectrum
Figure 2.3: Resonant frequencies along the coiled length of the basilar membrane [5]
Figure 2.4: Sound producing system [6]
Figure 2.5: Sampling process [7]
Figure 2.6: Characteristic line of a quantizer [7]
Figure 2.7: Reconstructed signal after sampling and quantization [7]
Figure 2.8: Spectrogram
Figure 2.9: General speech recognition block diagram [8]
Figure 2.10: MFCC block diagram
Figure 2.11: Feature vectors (a) Before and (b) After vector quantization
Figure 3.12: Downloading from AMI Corpus, screenshot
Figure 3.13: Training data of 4 different speakers saying "Zero"
Figure 3.14: Comparison of training and testing data (Speaker 1)
Figure 3.15: Comparison of MFCC and feature vector after VQ
Figure 3.16: Comparison of training and testing data (Speaker 5)
Figure 3.17: Comparison of MFCC and feature vector after VQ
Figure 3.18: Audacity with an audio stream of 37 minutes
Figure 3.19: Word extraction process
Figure 3.20: Screenshot of MATLAB
Figure 3.21: MATLAB Graphical Plot
Figure 3.22: Raw results of Speaker 1 sentences in terms of Euclidean Distance
Figure 3.23: Results of Speaker 1 sentences in terms of Euclidean Distance after data tabulation
Figure 4.24: Plot of testing data 1 at different sampling frequency, fs
Figure 4.25: Plot of training data 1's Feature Vector at different sampling frequency, fs
Figure 4.26: Speaker 1 MFCC & Feature Vector
Figure 4.27: Speaker 1 & test samples


List of Tables
Table 1.1: Project timeline
Table 2.2: Identification rate (in %) for different windows [using Mel scale]
Table 3.3: List of extracted words for each speaker
Table 4.4: Results of training data vs. testing data using 4 Cepstral Coefficients
Table 4.5: Results of training data vs. testing data using 12 Cepstral Coefficients
Table 4.6: Results of training data vs. testing data using 20 Cepstral Coefficients
Table 4.7: Results of training data vs. testing data using 28 Cepstral Coefficients
Table 4.8: Number of Cepstral Coefficients vs. accuracy
Table 4.9: Sampling frequency: 0.5*fs
Table 4.10: Sampling frequency: 1*fs
Table 4.11: Sampling frequency: 1.5*fs
Table 4.12: Speakers' training data feature vector vs. 12 testing data
Table 4.13: Speaker feature vector vs. 36 test data
Table 4.14 (a, b & c): Speaker 1's speech retrieval, precision and recall rate
Table 4.15 (a, b & c): Speaker 2's speech retrieval, precision and recall rate
Table 4.16 (a, b & c): Speaker 3's speech retrieval, precision and recall rate
Table 4.17 (a, b & c): Speaker 4's speech retrieval, precision and recall rate
Table 4.18: Speaker 1 vs. 20 test sentences
Table 4.19: Speaker 2 vs. 20 test sentences
Table 4.20: Speaker 3 vs. 20 test sentences
Table 4.21: Speaker 4 vs. 20 test sentences
Table 4.22: Speaker 2 & 4, feature vector 1 vs. 20 test sentence samples
Table 4.23: Speaker 2 & 4, feature vector 2 vs. 20 test sentence samples
Table 4.24: Speaker 1 & 4, feature vector 1 vs. 20 test sentence samples
Table 4.25: Speaker 1 & 4, feature vector 2 vs. 20 test sentence samples
Table 4.26: Speaker 2 & 4, feature vector vs. 72 test word length samples
Table 4.27: Speaker 1 & 4, feature vector vs. 72 test word length samples

List of Equations
Equation 2.1

List of Abbreviations
MFCC - Mel-frequency Cepstral Coefficient
VQ   - Vector Quantization
FV   - Feature Vector
fs   - Sampling Frequency

CHAPTER 1: INTRODUCTION

This chapter presents the importance and overview of the project in relation to audio signal feature extraction and classification along with the motivation and objectives behind it. The timeline of the project is also included to show how the project is conducted.

1.1 Project Overview
Audio signal processing and classification play an important role in the everyday life of mankind. This project focuses on the extraction of feature vectors from a speech audio signal as well as the classification of the signals based on the extracted features.

1.2 Motivation
With the advancement of technology, humans are constantly coming up with cutting-edge inventions and ideas to incorporate conventional communication methods into everyday technology, with the ultimate aim of bridging the gap between humans and machines. Unlike the eye, the ear has no eyelid to block or reduce the input it receives, aside from turning away from the source or covering the ear. Audio signal processing and classification is a process that occurs so often and so naturally that most human beings do not realise its importance until it is lost. According to research done by Serene J. Gondek, part of the brain still processes signals even while a person is asleep [1]; hence, humans are constantly processing received signals by classifying them into noise or information. Communication plays a major role in many parts of life. From the time a baby is born until the day he leaves the earth, one cannot escape from communicating with others. While this is fairly simple and natural between two or more individuals, computers in general find it very difficult to understand contextual information. Hence, speech processing has risen to be one of the most important processes among the various digital signal processing functions [2]. Information extracted from a speech signal may be used for biometric purposes such as speaker identification and speaker recognition, voice command recognition, language identification and even speaker diarization [17]. Speech recognition is defined as the capability of a system to identify words or sentences in spoken language, allowing the system to understand what a user has said and act upon it. Figure 1.1 below shows the breakdown of speech processing into smaller categories.

Figure 1.1: Types of speech processing [3]

1.3 Project Objectives
The aim of this project is to study and implement feature extraction methods which are suitable for speaker classification purposes. This project focuses on Speaker Identification with the following scenarios:
- Text dependent, Cooperative speakers, High Quality Speech
- Text independent, Cooperative speakers, High Quality Speech
- Sentence level independent text, Cooperative speakers, High Quality Speech
- Multiple speakers level, Cooperative speakers, High Quality Speech

1.4 Project Gantt chart


Research on the project title was carried out in the second week, where the requirements and fundamental theories were examined. This was followed by background studies and literature reviews of previous work done by various authors; these studies have shown that the mel-frequency cepstral coefficient method is one of the best speech feature extraction methods. Next, the MATLAB environment was familiarised by reading tutorials and executing simple programs, and the design of the audio signal feature extraction and classification system was then planned. The speech database was developed from the seventh week until the tenth week of the project. While the data collection was ongoing, the first stage of the system development was carried out. This was followed by improvisation for better results and enhancements. A presentation was given during the thirteenth week regarding the work done and the further work to be developed in Part 2 of this project. Part 2 of this project focuses mainly on experimenting with various speech data to determine how the system performs when given different parameters. The thesis of this project was also written within this period. The following table shows the timeline of how the project was implemented.

Table 1.1: Project timeline

Part 1 (Weeks 1-13):
- Research on Project Title
- Background & literature review
- Familiarize with MATLAB
- System design & speech data collection
- Implementation & improvisation
- Presentation

Part 2:
- Experiment on various speech data
- Thesis writing
- Presentation

1.5 Structure of Thesis
Chapter 2 presents background information on audio signal processing especially in the area of speech analysis using MFCC and VQ methods. Background information regarding the Human Auditory and Vocal system along with digital signal processing is also included. Chapter 3 proposes methods to study and implement feature extraction techniques which are suitable for speech audio classification. This includes the experiments designed and its procedures, development of speech database, and selection of system backbone. Chapter 4 reveals the results of experiments proposed in Chapter 3 and provides discussion on the results obtained. Chapter 5 presents the overall conclusion of this project and offers some suggestions for future work.

CHAPTER 2: LITERATURE REVIEW

This chapter presents basic concepts of audio signals, the human hearing system and voice production system, analogue to digital conversion as well as the methods and idea behind a speech recognition system.

2.1 Audio Signals
Audio signals are made up of longitudinal waves, in which the particles of the medium move parallel to the direction of wave propagation. These mechanical waves travel through media such as solids, liquids and air, propagating from one location to another, and are heard as sound when they fall within the hearing range. Figure 2.1(a) shows a basic sound wave representation using a sine wave.

Figure 2.2: Representation of sound waves: (a) Sine wave in time domain; (b) Corresponding spectrum

The amplitude of the wave determines the loudness of the sound, while the frequency determines the pitch of the sound produced. A sound wave can be represented in two main domains: the time domain and the frequency domain. Figure 2.1(a) shows a representation of a sound wave in the time domain and Figure 2.1(b) in the frequency domain. While a sine wave can be used to illustrate the sinusoidal characteristic of a sound wave, natural sound waves are usually more complex and are made up of multiple harmonics.
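As a simple illustration of the two representations, the MATLAB sketch below generates a pure tone and plots it in the time domain and, via the FFT, in the frequency domain. The 440 Hz tone, the 8 kHz sampling rate and the 50 ms duration are arbitrary values chosen only for this example.

fs = 8000;                          % sampling frequency (Hz), assumed for the example
t  = 0:1/fs:0.05;                   % 50 ms time axis
x  = sin(2*pi*440*t);               % 440 Hz sine wave in the time domain

X  = abs(fft(x));                   % magnitude spectrum
f  = (0:length(X)-1)*fs/length(X);  % frequency axis in Hz

subplot(2,1,1); plot(t, x);
xlabel('Time (s)'); ylabel('Amplitude'); title('Time domain');
subplot(2,1,2); plot(f(1:floor(end/2)), X(1:floor(end/2)));
xlabel('Frequency (Hz)'); ylabel('|X(f)|'); title('Frequency domain');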

2.2 Human Auditory and Vocal System


In order to further study and create a system that is able to recognise spoken words or sentences, the human auditory system was first analysed and studied. Typically, the human hearing range is taken to be 20 Hz to 20 kHz; however, this is not a precise limit of the human hearing system. For a system where speech is concerned, this implies that signal components outside these audible frequency ranges can be removed prior to any processing. The human ear structure plays an important role in the way audio signals are captured and processed. While humans naturally perceive events in chronological, time-based order, the ear picks up signals and information in the frequency domain. A portion of the ear known as the Basilar Membrane, located within the cochlea of the inner ear, contains approximately 30,000 hair cells [3], each responsible for a particular frequency and for sending neural signals to the brain for further processing. Figure 2.2 shows a rough estimation of the positioning of the hair cells in response to frequency.

Figure 2.3: Resonant frequencies along the coiled length of the basilar membrane [5]

Next, the vocal system was studied. The human voice is produced when a stream of air expelled from the lungs passes through the vibrating vocal folds; the resulting sound then travels through the vocal tract, which shapes it accordingly. Since the physical make-up of each human being varies from person to person, the anatomical structure of the vocal tract is unique for every person. This characteristic is the fundamental principle of why speech recognition and speaker identification can be performed [2]. Figure 2.3 shows the sound producing system of a human being.

Figure 2.4: Sound producing system [6]

2.3 Analogue to Digital Conversion & Signal Processing


Since humans produce voice in a continuous manner, it is impractical for machines to take in an infinite stream of audio signals. Hence, some pre-processing has to be performed in order to make the audio stream suitable for computation in the later processes. The process begins with obtaining audio streams through recording devices such as microphones connected to a computer or an MP3 recorder. When an audio sample has been obtained, the continuous input is sampled at regular, closely spaced time intervals. Figure 2.4 shows an example of this. Figure 2.5 shows the characteristic line of a quantizer; this characteristic line determines the number of levels an audio signal can be quantized into. Figure 2.6 demonstrates how a signal can be reconstructed after the sampling process.

Figure 2.5: Sampling process [7]

Following that, the sampled data are then quantized to a finite number of levels, N, according to the amplitude of the signal.

Figure 2.6: Characteristic line of a quantizer [7]

Figure 2.7: Reconstructed signal after sampling and quantization [7]
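A minimal sketch of these two steps is given below: a sine wave is sampled at regular intervals and each sample is then mapped onto one of N uniformly spaced quantization levels. The 200 Hz tone, the 1 kHz sampling rate and the choice of N = 16 levels are assumptions made only for illustration.

fs = 1000;                        % sampling frequency (Hz)
t  = 0:1/fs:0.01;                 % sample instants over 10 ms
x  = sin(2*pi*200*t);             % the "analogue" signal evaluated at the sample instants

N    = 16;                        % number of quantization levels
xmin = -1;  xmax = 1;             % assumed input range of the quantizer
q    = (xmax - xmin)/(N - 1);     % quantization step size
xq   = round((x - xmin)/q)*q + xmin;   % uniformly quantized samples

stem(t, xq);                      % sampled and quantized signal
xlabel('Time (s)'); ylabel('Quantized amplitude');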

Next, for audio signal processing, the reconstructed signal is then transformed into the frequency domain. With reference to how the ear functions, a frequency domain plot would be able to provide more details for computational purposes. Another example of a frequency domain plot is the spectrogram as shown in Figure 2.7. A spectrogram is able to provide useful information by plotting the frequency response with respect to time. The amplitude of each frequency is represented by a different shade of colour.

Figure 2.8: Spectrogram
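A spectrogram such as the one above can be produced directly in MATLAB. The sketch below is only an illustration: it assumes the Signal Processing Toolbox is available and that speech.wav is some speech recording (newer MATLAB versions use audioread instead of wavread).

[y, fs] = wavread('speech.wav');   % hypothetical input file
y = y(:,1);                        % keep a single channel
spectrogram(y, hamming(256), 128, 512, fs, 'yaxis');   % 256-sample Hamming frames, 50% overlap
title('Spectrogram of the speech signal');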

2.4 Speech Recognition System & Processes


A typical speech recognition system performs pattern comparison between the features of known waveforms or data and those of the test input stream. For this to happen, the system must first go through training by processing speech data from potential speakers. Depending on the applied approach, the training data set is usually obtained by having the speakers provide audio streams based on the vocabulary size. The speech recognition system and process block diagram is depicted in Figure 2.8, as illustrated by Kyaw Thu Ta [8]. With a larger set of training data, the system will be able to provide higher accuracy when the recognition process is performed. The feature extraction function plays a significant role in any speech recognition system. The algorithms used determine the type of data stored in the training sets and the storage size required by the system, as different features require different storage space.


Figure 2.9: General speech recognition block diagram [8]

2.4.1 Feature Extraction Methods


Feature extraction, in general, is a special case of dimension reduction [10]. A large set of data is reduced to a smaller data set by removing redundancy and information which is not important. This enables faster computation as less data needs to be processed. Various methods and algorithms have been developed over the years. Feature extraction techniques can be broadly grouped into temporal analysis and spectral analysis: in temporal analysis the time domain speech waveform itself is used for analysis, while in spectral analysis the spectral representation of the audio signal is used [10]. Some of the methods include Linear Prediction Coefficients (LPC), Cepstral Coefficients, Mel-frequency Cepstral Coefficients (MFCC), Relative Spectra Filtering of Log Domain Coefficients (RASTA) and many more. In this project, the Mel-frequency Cepstral Coefficients method is selected and further discussed.

The Mel-frequency Cepstral Coefficients (MFCC) method is one of the most popular and accurate methods as it extracts feature vectors based on the known variation of the human ear's critical bandwidth with frequency [11, 16], hence reducing the database into a smaller data set that keeps all the crucial information. Another reason why MFCC is popular is that it is useful for de-convolution. To understand how de-convolution is useful in feature extraction, the speech production model is studied. The model adopted in speech analysis is a source-filter model, where the source refers to the air expelled from the lungs and the filter refers to the shape of the vocal tract. Thus, in the time domain a convolution takes place, whereas in the frequency domain a multiplication takes place [12]:

Convolution (time domain): source * filter = speech, i.e. e(n) * h(n) = x(n)
Multiplication (frequency domain): source x filter = speech, i.e. E(z) x H(z) = X(z)

Taking the logarithm of the magnitude spectrum turns this multiplication into an addition, log|X| = log|E| + log|H|, which is what allows the cepstral processing in MFCC to separate the vocal tract (filter) characteristics from the excitation (source).

The MFCC algorithm utilises two types of filters to perform feature extraction, called linearly spaced filters and logarithmically spaced filters [13]. In order to pick up the phonetically key characteristics of the audio stream, the signal is expressed in the Mel frequency scale. The Mel frequency scale combines two scales: frequencies below 1000 Hz are linearly spaced while frequencies above 1000 Hz are logarithmically spaced; a commonly used approximation is mel(f) = 2595 * log10(1 + f/700). The extracted features may differ depending on the audio input stream given. Figure 2.9 shows the block diagram of an MFCC algorithm.

Figure 2.10: MFCC block diagram

Figure 2.9 shows how an MFCC processor works by converting a continuous audio speech stream into mel cepstrum coefficients. The goal of pre-emphasis is to compensate for the high-frequency part of the spectrum that is suppressed during production by the human vocal system [13]. Framing and windowing are done by splitting a long stream of speech data into smaller chunks (frames). This is because speech signal analysis should be performed over short, quasi-periodic segments, as the human voice is generated in a quasi-periodic manner where the airflow is chopped into pieces by the vocal folds. Since the vibration of the vocal folds produces the voice, a frequency domain analysis is optimal. In order to minimise and avoid loss of data through the windowing and framing processes, successive frame blocks are overlapped with each other by around 30 to 50 percent [14]. Next, the frames are multiplied by a window function; a short framing and windowing sketch in MATLAB is given after the table below. Based on a study done by Hassan et al. [12], an experiment was carried out to determine how the type of window function used affects the identification rate. The results of their experiment are shown in the table below:

Table 2.2: Identification rate (in %) for different windows [using Mel scale]

Code book size    Triangular    Rectangular    Hamming
1                 57.14         57.14          57.14
2                 85.7          66.67          85.7
4                 90.47         76.19          100
8                 95.24         80.95          100
16                100           85.7           100
32                100           90.47          100
64                100           95.24          100
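The pre-emphasis, framing and windowing steps described above can be sketched in MATLAB as follows. The 0.95 pre-emphasis factor, the 256-sample frame length and the 50 percent overlap are typical values assumed here for illustration only; the file name is a placeholder.

[y, fs] = wavread('speech.wav');     % hypothetical input file
y = y(:,1);                          % use a single channel

y = filter([1 -0.95], 1, y);         % pre-emphasis: boost the high-frequency part

frameLen = 256;                      % samples per frame
overlap  = frameLen/2;               % 50 percent overlap between successive frames
frames   = buffer(y, frameLen, overlap, 'nodelay');    % one frame per column

win    = hamming(frameLen);          % Hamming window function
frames = frames .* repmat(win, 1, size(frames, 2));    % window every frame before further processing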


2.4.2 Feature Classification Methods


After the feature vectors of the audio signal are obtained, the extracted data is collected, stored, and classified into N classes [18]. Some well-known classification methods include the Hidden Markov Model (HMM) and the Vector Quantization (VQ) method. The HMM method uses statistics and probability to determine the class of a particular input signal, whereas the VQ method aims to group training data that are similar within an acceptable range and tolerance based on the centroids of the feature vector data. This is done because the training data set may be a very large database, and it would be computationally very expensive if the system had to go through every single training sample for each query performed.


Figure 2.11: Feature vectors (a) Before and (b) After vector quantization

The advantage of performing Vector Quantization is that the database does not need to contain every single training sample along with its extracted features; rather, the feature vectors can be grouped, hence simplifying the database and increasing the accuracy of a speech recognition system. This is especially useful on systems where data storage is limited, such as mobile devices and handheld embedded systems.
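A sketch of this grouping step using MATLAB's kmeans function (Statistics Toolbox) is shown below. Here mfcc is assumed to be a matrix with one MFCC feature vector per row, and the codebook size of 16 is an arbitrary choice for the example.

% mfcc: N-by-12 matrix of MFCC feature vectors (one row per frame), obtained beforehand
k = 16;                                 % assumed codebook size
[idx, codebook] = kmeans(mfcc, k);      % idx: cluster index of each frame, codebook: k centroids

% The k-by-12 codebook is stored as the speaker's model instead of all N individual vectors.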

2.4.3 Feature Vectors Comparison Methods


With the data collected through the training phase of the speech analysis system, the system is ready for the recognition phase. In this phase, the system first performs the feature extraction process on the input signal. Next, the extracted feature vector is compared to the training data's feature vectors; the one with the smallest difference indicates the closest match to the training data. The difference between the training data and the testing data is calculated as the Euclidean Distance, which is given by the following formula:

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2 )        (Equation 2.1)

where p and q are the training and testing feature vectors respectively.
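For two feature vectors of equal length, Equation 2.1 can be evaluated directly. The sketch below compares a test vector against each row of a training codebook and keeps the smallest distance; the variable names are illustrative only.

% trainCB : k-by-d codebook of a speaker (k centroids with d coefficients each)
% testVec : 1-by-d feature vector extracted from the test utterance
diffs = trainCB - repmat(testVec, size(trainCB, 1), 1);   % p - q for every centroid
dists = sqrt(sum(diffs.^2, 2));                           % Euclidean distance to each centroid
score = min(dists);                                       % smallest distance = closest match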

CHAPTER 3: METHODOLOGY

3.1 Introduction
In order to study and implement feature extraction techniques suitable for the classification of audio signals into different speakers, experiments were designed to determine how different techniques would affect the recognition rate of the system.

3.2 Experiments
Five experiments were designed to determine how different feature extraction parameters and different speech files would affect the accuracy of the system. All the experiments were done using MFCC with the Hamming windowing function. The experiments performed are:
i. Accuracy (1): Same Speech, Variable Cepstral Coefficient
ii. Accuracy (2): Same Speech, Variable Sampling Frequency
iii. Accuracy (3a, b & c): Different speech, same sampling frequency
iv. Accuracy (4): Variable sentence, same sampling frequency, different length
v. Accuracy (5a & b): Multiple speakers in a sentence, same sampling frequency, different length

The size of the training data and testing data varied among each experiment.

3.3 Development of Speech Database


The speech database used in this project was obtained through the internet. However, before these audio signals were used for the experiments, the files were first checked to ensure they were suitable to be applied in this system. Using audio signals at their optimum quality level would best reflect the capability of the system created in this project.

The speech databases were obtained from:
- http://www.ifp.illinois.edu/~minhdo/teaching/speaker_recognition/data/data.zip
- https://corpus.amiproject.org/download/download

The speech data downloaded from the first source was easy and simple to use as it was already separated into 16 Microsoft WAV format files of 8 speakers and the loudness as well as sampling rate of the speech files were decent. The files are separated into 2 different folders, one of which is the training folder, while the other is the testing folder. Each speaker was asked to utter the digit ZERO and the audio signal was recorded. Figure 3.1 shows a screenshot of the downloading process from AMI Corpus.


Figure 3.12: Downloading from AMI Corpus, screenshot

In order to simulate voice variation over time, the training and testing data were recorded at least 6 months apart. Figure 3.2 shows the training data of 4 different speakers saying "Zero". The comparison between two recordings from the same speaker is shown in Figure 3.3 and Figure 3.5, while Figure 3.4 and Figure 3.6 show the comparison of the MFCC and Feature Vector after VQ.


Figure 3.13: Training data of 4 different speakers saying Zero.

Figure 3.14: Comparison of training and testing data (Speaker 1)


Figure 3.15: Comparison of MFCC and feature vector after VQ

Figure 3.16: Comparison of training and testing data (Speaker 5)


Figure 3.17: Comparison of MFCC and feature vector after VQ

The speech data downloaded from the second source required some amount of preprocessing as the amplitude of each signal was considerably low and the data were in sentence form. With that, a free audio editor called Audacity was employed to perform the required pre-processing. Figure 3.7 shows a screenshot of Audacity with an audio stream of 37 minutes.


Figure 3.18: Audacity with an audio stream of 37 minutes

The audio files were taken from 4 different speakers, and each speaker has 4 sample files. All the speech samples were recorded using the same device at the same sampling frequency. However, the amplitude of each speech file depends on the placement of the microphone on each speaker. In order to obtain different speech samples (as required for Experiment Accuracy (3a, b & c)) and sentences (as required for Experiment Accuracy (4)), the audio stream was imported into Audacity for processing. The audio stream was zoomed in for a better view of where each word starts and ends, which eases the process of extracting individual speech samples or sentences from the audio stream.

Figure 3.8 shows how the individual words are extracted from the audio stream:


Figure 3.19: Word extraction process

Table 3.1 lists out all the extracted words for each speaker:


Table 3.3: List of Extracted words for each speaker

Speaker 1: Briefly, Company, Consumer, Development, Ideas, Internal, Note, Okay, Original, Product, Recognition, Speech, This, Today, Trendy, Very, Welcome, Works
Speaker 2: Battery, Breakdown, Capacitor, Circuit, Control, Depend, Dependable, Don't, Horses, Much, Pick, Presenting, Press, Remote, Signal, The, Voice, Yourself
Speaker 3: And, Animal, Before, But, Complicated, Choices, Creative, Designer, Easy, Experience, Examples, Flyer, Interface, Marketed, Operation, Player, Right, Variety
Speaker 4: Ahead, And, Buttons, Contrast, Device, Easily, Enhance, Fashion, Fingers, Meeting, Nature, Person, Players, Practical, Project, Shape, Research, Underdog

3.4 System Backbone
In order to perform the experiments listed above, MATLAB, a software package by MathWorks, was selected. MATLAB is the abbreviation of Matrix Laboratory; it is a numerical programming environment that is integrated with many graphical and visualisation tools. MATLAB was chosen as the system backbone because it has a large library of functions that are ready to be used. Along with that, MATLAB also provides a very user-friendly interface and very comprehensive documentation. Furthermore, many custom functions have been written by the community, which is an extra advantage of choosing MATLAB. In terms of speech processing, this backbone system comes with many built-in functions such as wavread (a function that reads a Microsoft WAVE file into a matrix loaded with a stream of binary data) and resample (a function that is used to convert or change an audio signal's sampling frequency). As mentioned above, the community has written several voice processing toolboxes which are free to use, among them VOICEBOX: Speech Processing Toolbox and the ASR (Automatic Speech Recognition) Toolbox. These toolboxes provide much help in implementing this project's system. Figure 3.9 and Figure 3.10 show screenshots of the MATLAB environment as well as a graphical plot done using MATLAB.

Figure 3.20: Screenshot of MATLAB


Figure 3.21: MATLAB Graphical Plot

3.5 Experiment Procedures
With the experiments planned, the speech database ready and a backbone system selected, the experiment procedures can be laid out. The experiments should take place in chronological order, as this provides a better understanding as well as a good build-up of how each characteristic and variable affects the accuracy and overall performance of a speaker classification system. First, the training data is loaded into the backbone system, MATLAB, by calling the wavread function. In this experiment, the wavread function outputs 3 parameters: the speech signal stored in a vector matrix (Y), the sampling frequency (fs) and the number of bits per sample (NBits):

[Y, fs, NBits] = wavread(FILE)


Next, the Mel-frequency cepstrum coefficients are calculated using the melcepst function. The melcepst function is written by Mike Brookes, Department of Electrical & Electronic Engineering, Imperial College [15]. The VOICEBOX: Speech Processing Toolbox for MATLAB can be obtained from: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html. In the course of this project, the melcepst function was used with 4 parameters and outputs the coefficients as a result.

mfcc = melcepst (Y, fs, W, Nc)';

Parameters:
Y  - Speech signal
fs - Sampling frequency
W  - Windowing function (default 'M' for Hamming window)
Nc - Number of Cepstral Coefficients (default 12)

In this melcepst function, successive frame blocks are set to overlap by 50 percent to reduce data loss.


Another important function that is frequently used in this project is Vector Quantization using the K-means algorithm. The kmeans function can be found in MATLAB's Statistics Toolbox. The MFCC matrix obtained from melcepst is passed to the kmeans function as the parameter X:

IDX = kmeans(X, k);

Parameters:
X - n-by-p matrix of observations
k - Number of clusters
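Putting the pieces together, a minimal end-to-end sketch of the training and scoring steps is given below. The file names and the codebook size are placeholders, and forming a single averaged test vector is only one simple way of scoring; the exact comparison used in the experiments may differ. Note that the MFCC matrix is kept with one frame per row here, which is the orientation kmeans expects.

% ----- training: build one codebook per speaker -----
[Y, fs]   = wavread('train_speaker1.wav');      % hypothetical training file
trainMfcc = melcepst(Y, fs, 'M', 12);           % one 12-element MFCC vector per row (frame)
[idx, codebook] = kmeans(trainMfcc, 16);        % 16-centroid codebook for this speaker

% ----- testing: score an unknown utterance against the codebook -----
[Yt, fst] = wavread('test_unknown.wav');        % hypothetical test file
testMfcc  = melcepst(Yt, fst, 'M', 12);
testVec   = mean(testMfcc, 1);                  % a single averaged feature vector for the test file

d     = sqrt(sum((codebook - repmat(testVec, 16, 1)).^2, 2));
score = min(d);                                 % smaller score = closer match to this speaker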


3.6 Data Tabulation
The results obtained through the experiments were tabulated with the help of Microsoft Excel. The data were colour coded and then sorted numerically to help reduce the complexity of the result analysis. Figure 3.11 shows the raw results of Speaker 1's sentences against 20 sentences from all 5 speakers in terms of Euclidean Distance. An example of the data tabulation is shown below:

Figure 3.22: Raw results of Speaker 1 sentences in terms of Euclidean Distance


Figure 3.12 shows the results of Speaker 1's sentences against 20 sentences from all 5 speakers in terms of Euclidean Distance after the colour coding and numerical sorting were performed. The rank of each sentence is tabulated in Column A according to the sorted Euclidean Distance magnitude.

Figure 3.23: Results of Speaker 1 sentences in terms of Euclidean Distance after data tabulation
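The sorting and ranking done in Excel can equally be expressed in MATLAB. The sketch below sorts a vector of Euclidean distances and reports the rank of the correct training entry; the distances and the index of the correct entry are made-up numbers used only to show the idea.

dists   = [3.601 0.300 0.369 6.556 2.194 5.421 0.753 4.186];  % made-up distances to Train 1..8
correct = 5;                                % index of the correct (same-speaker) training data

[sortedD, order] = sort(dists, 'ascend');   % ascending Euclidean distance
rank = find(order == correct);              % position of the correct entry after sorting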


3.7 Experiment Setup Summary


There were 5 different types of experiments carried out during the project. The effectiveness of the mel-frequency cepstral coefficients coupled with the K-means algorithm was tested using different kinds of speech samples, ranging from words and sentences from a single person to sentences with 2 speakers. In Accuracy (1) and Accuracy (2), a total of 16 speech samples were used to determine how well the system works. There were 8 speakers involved, each with 2 speech samples taken 6 months apart to create time variation; these speech samples were then tested against one another. To study how the system works with variable speech word samples, a total of 72 speech samples from 4 different male and female speakers were used for Accuracy (3a, b & c). The list of words used can be found in Table 3.1. Furthermore, in Accuracy (4), 5 speech sentence samples of 30 seconds each were collected from each of the same 4 speakers as in Accuracy (3a, b & c), giving a total of 20 speech sentences. Lastly, in Accuracy (5a & b), 2 speech sentences with 2 different speakers each were tested against the words and sentences extracted for Accuracy (3a, b & c) and Accuracy (4) respectively.


CHAPTER 4: RESULTS AND DISCUSSIONS

The results of the following experiments are tabulated in table forms. An ideal result table would have only 1 coloured cell per column. The cells in yellow should contain the smallest value as it represents the Euclidean Distance between the Testing Data and the correct Training Data; while the cells in red indicate computational errors. The ranking of the correct Training Data is also listed.

4.1 Accuracy (1): Same Speech, Variable Cepstral Coefficient


In the first experiment, the effect of different numbers of Cepstral Coefficients in the MFCC feature extraction technique is examined to observe the change in the accuracy of the system. 8 training files from 8 different speakers were tested against 8 testing files; the training file number and the testing file number represent the same speaker. The results of using different numbers of Cepstral Coefficients in this experiment are shown in the following tables.
Table 4.4: Results of training data vs. testing data using 4 Cepstral Coefficients

No. Co. = 4   Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8
Train 1       0.292    2.663    1.599    1.525    0.715    3.971    3.205    1.269
Train 2       2.492    0.813    0.697    5.342    1.975    3.605    1.946    3.860
Train 3       3.887    1.948    0.862    7.235    3.729    2.590    2.124    5.608
Train 4       2.172    9.766    8.017    0.528    2.938    12.860   9.159    1.491
Train 5       3.601    0.300    0.369    6.556    2.194    5.421    0.753    4.186
Train 6       3.119    2.226    0.973    5.945    3.371    1.400    2.920    4.699
Train 7       5.809    0.729    0.605    9.074    3.916    5.778    0.832    5.843
Train 8       1.376    1.608    0.698    2.474    0.602    5.538    1.404    0.942
Rank          1        3        5        1        4        1        2        1

Table 4.5: Results of training data vs. testing data using 12 Cepstral Coefficients

No. Co. = 12  Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8
Train 1       0.529    4.538    2.907    1.888    1.733    4.998    3.993    3.264
Train 2       4.785    0.879    1.406    7.578    2.453    5.340    3.371    5.632
Train 3       5.626    2.907    1.708    8.395    4.257    3.354    2.836    6.463
Train 4       2.668    11.081   9.129    0.892    3.425    13.336   9.503    2.440
Train 5       5.089    0.718    1.446    8.078    2.312    6.208    1.379    5.109
Train 6       4.476    3.990    2.113    6.992    3.996    1.485    3.176    4.905
Train 7       6.535    2.003    1.330    9.965    4.408    6.070    1.137    6.334
Train 8       2.899    2.548    1.682    4.437    1.060    6.095    2.007    1.239
Rank          1        2        4        1        3        1        1        1

Table 4.6: Results of training data vs. testing data using 20 Cepstral Coefficients

No. Co. = 20  Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8
Train 1       0.545    5.852    3.490    1.955    2.077    5.461    5.384    3.817
Train 2       5.799    0.966    1.641    8.450    3.256    5.689    3.554    6.298
Train 3       5.647    3.950    2.113    8.442    4.568    3.703    3.881    6.905
Train 4       2.811    12.589   9.893    1.107    3.679    13.805   10.989   2.838
Train 5       5.419    2.471    2.292    8.542    2.479    6.739    2.985    5.439
Train 6       4.771    4.808    2.511    7.350    4.028    1.583    4.068    4.963
Train 7       7.060    2.490    1.519    10.482   4.712    6.226    1.476    6.554
Train 8       3.276    3.247    1.976    4.855    1.146    6.197    2.606    1.286
Rank          1        1        4        1        3        1        1        1

Table 4.7: Results of training data vs. testing data using 28 Cepstral Coefficients

No. Co. = 28  Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8
Train 1       0.600    5.960    3.587    1.981    2.409    5.539    5.648    3.848
Train 2       5.861    1.004    1.729    8.506    3.590    5.768    3.737    6.337
Train 3       5.713    4.034    2.185    8.458    4.741    3.733    4.081    6.940
Train 4       2.924    12.752   10.002   1.143    4.033    13.872   11.300   2.867
Train 5       5.706    2.643    2.351    8.689    2.529    6.792    3.134    5.538
Train 6       4.909    4.859    2.529    7.399    4.205    1.590    4.172    4.976
Train 7       7.444    2.720    1.607    10.702   4.784    6.341    1.622    6.724
Train 8       3.436    3.350    2.001    4.906    1.379    6.221    2.768    1.294
Rank          1        1        4        1        3        1        1        1

The table below presents the summary of this experiment in terms of accuracy percentage:
Table 4.8: Number of Cepstral Coefficients vs. accuracy

Number of Cepstral Coefficients    Accuracy
4                                  50%
12                                 62.5%
20                                 75.0%
28                                 75.0%


This experiment shows that as the number of Cepstral Coefficients used in a MFCC feature extraction algorithm increases, the accuracy of a speaker classification system also increases.

4.2 Accuracy (2): Same Speech, Variable Sampling Frequency

The goal of this second experiment is to see how the accuracy of the system is affected when the sampling frequency of the Training Data changes, given the same speech and the same number of Cepstral Coefficients. 8 training files from 8 different speakers were tested against 8 testing files; the training file number and the testing file number represent the same speaker. The original sampling frequency, fs, used is 44.1 kHz. The results of this experiment are shown in the following tables using sampling frequencies of 0.5*fs, fs and 1.5*fs. Figure 4.1 shows the plot of testing data 1 at the different sampling rates.
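The 0.5*fs and 1.5*fs versions of the data can be produced with MATLAB's resample function (Signal Processing Toolbox); the sketch below uses the rational factors 1/2 and 3/2 to express the two cases, and the file name is a placeholder.

[Y, fs] = wavread('train1.wav');   % hypothetical training file at the original fs (44.1 kHz)

Yhalf  = resample(Y, 1, 2);        % 0.5*fs version of the signal
fsHalf = 0.5*fs;

Y15    = resample(Y, 3, 2);        % 1.5*fs version of the signal
fs15   = 1.5*fs;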


Figure 4.24: Plot of testing data 1 at different sampling frequency, fs


Figure 4.25: Plot of training data 1's Feature Vector at different sampling frequency, fs

Table 4.9: Sampling frequency: 0.5*fs

fs = 0.5*fs   Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8
Train 1       1.762    11.310   8.395    1.955    6.607    5.826    8.808    5.857
Train 2       2.513    6.179    4.663    3.792    3.070    3.535    5.983    3.582
Train 3       4.209    7.309    4.264    5.406    5.695    3.878    5.925    6.954
Train 4       5.842    22.266   18.122   4.263    14.306   16.659   20.520   11.082
Train 5       3.551    6.688    4.569    3.204    5.374    5.621    6.457    4.728
Train 6       2.835    8.468    5.324    4.377    6.391    3.956    8.369    6.986
Train 7       3.567    5.529    3.925    4.758    4.591    4.570    5.338    5.214
Train 8       1.447    9.781    7.332    1.402    5.650    6.810    10.250   4.118
Rank          2        2        2        5        3        3        1        2


Table 4.10: Sampling frequency: 1*fs

fs = 1*fs     Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8
Train 1       0.600    5.960    3.587    1.981    2.409    5.539    5.648    3.848
Train 2       5.861    1.004    1.729    8.506    3.590    5.768    3.737    6.337
Train 3       5.713    4.034    2.185    8.458    4.741    3.733    4.081    6.940
Train 4       2.924    12.752   10.002   1.143    4.033    13.872   11.300   2.867
Train 5       5.706    2.643    2.351    8.689    2.529    6.792    3.134    5.538
Train 6       4.909    4.859    2.529    7.399    4.205    1.590    4.172    4.976
Train 7       7.444    2.720    1.607    10.702   4.784    6.341    1.622    6.724
Train 8       3.436    3.350    2.001    4.906    1.379    6.221    2.768    1.294
Rank          1        1        4        1        3        1        1        1

Table 4.11: Sampling frequency: 1.5*fs


fs = 1.5*fs   Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8
Train 1       19.385   19.555   19.841   20.962   20.759   21.736   22.208   20.770
Train 2       23.255   17.739   19.615   27.259   24.084   22.576   22.4     21.26
Train 3       28.123   23.760   25.399   33.062   29.282   27.146   25.01    29.5
Train 4       20.720   23.281   23.792   20.004   20.701   30.167   27.6     31.87
Train 5       20.509   14.976   16.093   24.782   20.355   20.118   31.05    22.3
Train 6       29.964   24.946   26.978   33.471   30.981   28.267   25.8     27.97
Train 7       26.377   18.796   20.569   30.520   26.659   25.854   19.25    22.8
Train 8       23.526   18.262   20.372   25.885   22.760   27.292   18.3     23.41
Rank          1        2        7        1        1        7        3        4

Based on the results, one can observe that the overall accuracy of the system has decreased. However, by inspecting the ranks of each test result, most of the returned results are still within the top 3 ranks.

4.3 Accuracy (3a, b & c): Variable Speech


The third experiment is divided into three parts. The speech database used in this experiment contains 4 different speakers, each with 18 different speech samples. In Part (a), 15 speech samples are used for training while the other 3 are used for testing. In Part (b), the training data is reduced to 9 samples and the other 9 are used for testing. Lastly, in contrast with Parts (a) and (b), Part (c) reduces the training data to a single speech sample, which is then tested against the other 71 samples to determine how well the system is able to classify a speaker based on only 1 training sample.


The goal of this experiment is to determine how the size of a training set would affect the overall accuracy of a speaker classification system when given variable speech samples.

4.3.1 Part A: 15 training and 3 test samples


15 speech samples are used for training while the other 3 are used for testing. The figures below show the MFCC of the 4 speakers along with the feature vector after VQ is performed. Figure 4.3 depicts an example of the compilation of 15 training samples' MFCC feature vectors into 1 feature vector for Speaker 1, while Figure 4.4 shows the differences between this feature vector and the 3 test samples in a graphical illustration.

Figure 4.26: Speaker 1 MFCC & Feature Vector

The following figure provides a graphical comparison of the speaker's training Feature Vector against speech samples from the same speaker.


Figure 4.27: Speaker 1 & test samples

The following tables show the results of the experiment in terms of the Euclidean Distance between each speaker's training feature vector and the 12 test speech samples.
Table 4.12: Speakers training data feature vector vs. 12 testing data

FV 1          Test 1    Test 2    Test 3
Speaker 1     20.115    1.026     2.890
Speaker 2     11.879    6.112     3.790
Speaker 3     10.411    9.096     7.603
Speaker 4     19.420    3.192     3.318

FV 2          Test 1    Test 2    Test 3
Speaker 1     18.758    6.133     5.464
Speaker 2     8.830     5.536     1.949
Speaker 3     8.581     6.893     3.378
Speaker 4     18.959    6.505     4.727

FV 3          Test 1    Test 2    Test 3
Speaker 1     4.478     17.618    16.631
Speaker 2     1.260     7.872     7.918
Speaker 3     2.189     6.907     4.368
Speaker 4     3.232     14.445    15.057

FV 4          Test 1    Test 2    Test 3
Speaker 1     4.432     2.477     3.372
Speaker 2     11.930    7.524     7.317
Speaker 3     16.140    10.893    9.086
Speaker 4     6.319     4.423     3.200

The results presented in the table above show that with 15 Training Data, the speaker recognition process can achieve up to 66.67% accuracy when given 3 Testing Data.


4.3.2 Part B: 9 training and 9 test samples


9 speech samples are used for training purpose while the other 9 are used for testing purposes.
Table 4.13: Speaker feature vector vs. 36 test data

FV 1          Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8   Test 9
Speaker 1     6.617    7.746    2.357    13.823   10.950   1.947    22.114   1.078    2.997
Speaker 2     8.438    7.548    5.971    13.800   14.432   6.161    12.089   6.177    3.782
Speaker 3     11.423   5.669    8.926    12.694   35.439   8.723    12.206   9.616    7.678
Speaker 4     9.331    6.546    3.227    28.642   15.792   3.714    22.822   2.490    3.408
Rank          1        4        1        3        1        1        3        1        1

FV 2          Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8   Test 9
Speaker 1     4.285    25.321   3.809    16.049   6.466    4.696    19.755   5.936    5.448
Speaker 2     6.145    14.656   5.832    5.979    1.507    3.143    7.460    5.800    2.162
Speaker 3     8.515    8.808    6.259    6.280    2.839    6.499    10.512   6.901    2.981
Speaker 4     4.287    22.576   2.766    15.330   5.726    4.816    19.705   5.946    4.710
Rank          3        2        3        1        1        1        1        1        1

FV 3          Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8   Test 9
Speaker 1     10.183   6.592    15.272   5.192    4.958    19.183   5.029    18.149   18.260
Speaker 2     3.368    3.488    11.012   2.948    3.331    7.946    1.837    9.050    8.381
Speaker 3     6.226    4.187    6.525    2.850    2.535    7.258    2.574    5.325    5.844
Speaker 4     11.077   6.103    14.378   10.669   4.618    18.724   4.336    16.631   17.582
Rank          2        2        1        1        1        1        2        1        1

FV 4          Test 1   Test 2   Test 3   Test 4   Test 5   Test 6   Test 7   Test 8   Test 9
Speaker 1     9.182    4.086    4.616    8.811    1.812    21.328   3.885    2.517    3.021
Speaker 2     15.285   7.858    3.966    6.585    9.082    13.103   11.754   7.075    8.067
Speaker 3     11.497   11.429   4.314    4.257    10.592   11.930   16.085   11.962   8.614
Speaker 4     7.225    2.745    3.557    6.629    1.396    19.007   4.491    3.447    2.636
Rank          1        1        1        3        1        3        2        2        1

The results returned from this experiment show a similar level of accuracy compared to the results obtained in Part (3a), which is up to 66.67%. The reduction of 6 Training Data did not have a significant impact on the system's ability to recognise speakers.


4.3.3 Part C: 1 training and 72 test samples


In Part C, the training data is reduced to a single speech sample. Each training sample is tested against the 71 other speech samples and the training sample itself. This experiment provides a good overview of how well the mel-frequency cepstrum coefficient feature extraction technique works when given variable speech samples. The results of this experiment are expressed in tables where only the top 10 test results are returned by the system. The results returned are the values with the lowest Euclidean Distance; the shaded cells represent speech samples from the same speaker while the rest are data from other speakers. The retrieved rate is the number of correctly retrieved samples. The precision is calculated by taking the retrieved rate, dividing it by 10 and multiplying by 100 to obtain a percentage. The recall rate is computed by dividing the retrieved rate by 18 (as there are 18 speech samples from the same speaker); therefore, the maximum achievable recall rate is 0.55 (10 divided by 18). Table 4.11 until Table 4.14 show the top 10 results of each speech word sample against the 72 other samples in terms of Euclidean Distance, arranged in ascending order. Speaker 1:
Table 4.14 (a, b & c): Speaker 1's speech retrieval, precision and recall rate

(a)

Speech 1

Speech 2

Speech 3

Speech 4

Speech 5

Speech 6


Retrieved Precision Recall

Retrieved Precision Recall

0.000 1.277 1.421 1.643 2.160 2.181 2.208 2.883 Speech 3.486 7 4.274 0.000 7 1.762 70% 2.604 0.389 2.674 2.756 2.986 3.036 3.087 3.088 3.231 6 60% 0.333

0.000 1.762 2.120 2.149 2.284 2.344 2.473 2.522 Speech 2.677 8 2.734 0.000 5 11.770 50% 18.813 0.278 21.321 30.804 31.825 35.746 36.725 38.103 38.145 3 30% 0.167

0.000 1.285 1.672 1.932 2.208 2.853 3.130 3.183 Speech 3.257 9 3.291 0.000 8 1.611 80% 2.160 0.444 2.180 3.036 3.137 3.257 3.738 3.960 4.141 9 90% 0.500

0.000 2.072 2.344 2.756 3.223 3.912 3.925 4.326 Speech 5.011 10 5.264 0.000 4 2.637 40% 3.282 0.222 4.141 4.908 5.791 5.854 6.030 6.188 6.336 8 80% 0.444

0.000 1.682 1.932 2.181 2.522 2.656 2.829 2.929 Speech 3.087 11 3.127 0.000 8 2.072 80% 3.043 0.444 3.131 3.987 4.051 4.308 4.865 5.013 5.143 4 40% 0.222

0.000 2.149 2.285 2.442 2.686 2.829 2.863 3.183 Speech 3.367 12 3.381 0.000 6 2.826 60% 3.287 0.333 3.367 3.633 3.730 3.745 3.814 3.858 4.019 4 40% 0.222

(b)

Retrieved Precision Recall

Speech 13 0.000 5.325 5.916 6.076 6.807 7.079 7.272 7.294 7.721 8.136 5 50% 0.278

Speech 14 0.000 3.057 7.863 10.544 10.831 11.520 11.729 11.770 12.264 12.652 4 40% 0.222

Speech 15 0.000 1.611 1.643 1.895 2.319 2.442 2.637 2.853 2.929 2.948 9 90% 0.500

Speech 16 0.000 4.546 6.810 7.629 7.738 8.778 9.178 9.232 9.823 10.091 1 10% 0.056

Speech 17 0.000 1.353 1.421 1.672 1.682 1.895 2.180 2.639 3.150 3.890 7 70% 0.389

Speech 18 0.000 3.231 3.386 3.672 3.733 4.127 4.424 4.492 4.528 4.629 6 60% 0.333

(c)

Speaker 2:

Table 4.15 (a, b & c): Speaker 2's speech retrieval, precision and recall rate


Retrieved Precision Recall Retrieved Precision Recall

Speech 1 0.000 6.783 Speech 7.384 13 8.199 0 9.651 3.063 9.731 4.075 9.838 4.156 10.335 4.314 10.389 4.427 11.432 4.836 4 5.484 40% 5.978 0.222 6.315 4 40% 0.222

Speech 2 0.000 2.852 Speech 3.290 14 3.482 0 3.511 1.134 3.712 2.071 3.809 2.631 3.881 3.088 4.049 3.309 4.272 3.532 4 3.549 40% 3.712 0.222 4.026 3 30% 0.167

Speech 3 0.000 2.573 Speech 2.741 15 3.046 0 3.201 1.998 3.517 2.522 3.677 3.046 4.047 3.775 4.136 3.905 4.404 4.046 5 4.602 50% 4.634 0.278 4.740 8 80% 0.444

Speech 4 0.000 1.134 Speech 2.210 16 2.222 0 2.573 4.257 3.326 5.157 3.385 6.874 3.482 7.887 3.629 7.974 3.685 8.199 4 8.523 40% 8.594 0.222 8.786 5 50% 0.278

Speech 5 0.000 3.726 Speech 4.427 17 5.067 0 5.743 3.091 5.760 3.767 5.996 3.905 6.353 4.606 6.766 4.773 6.937 4.824 5 5.373 50% 5.551 0.278 5.789 5 50% 0.278

(a) Speech 6 0.000 2.826 Speech 3.767 18 3.775 0 3.950 1.998 4.032 2.038 5.191 2.738 5.472 3.179 5.730 3.201 7.034 3.563 7 3.823 70% 3.878 0.389 3.989 6 60% 0.333

(b)

Retrieved Precision Recall

Speech 7 0 3.950 4.634 4.679 4.679 4.979 5.022 5.082 5.207 5.264 7 70% 0.389

Speech 8 0 2.220 2.924 4.085 4.271 4.303 4.368 4.404 4.740 4.895 5 50% 0.278

Speech 9 0 2.671 3.069 3.434 3.823 3.912 3.912 3.997 4.049 4.081 3 30% 0.167

Speech 10 0 4.019 4.032 4.324 4.551 4.602 4.876 5.022 5.316 5.586 6 60% 0.333

Speech 11 0 4.585 7.825 8.084 8.168 8.292 8.885 9.134 10.120 10.211 2 20% 0.111

Speech 12 0 3.179 3.730 3.828 3.864 3.915 4.274 4.311 4.324 4.559 4 40% 0.222

(c)


Speaker 3:

Table 4.16 (a, b & c): Speaker 3's speech retrieval, precision and recall rate

(a)

Retrieved Precision Recall

Speech 1 0 3.881 4.478 4.600 4.745 5.288 5.456 5.646 5.925 6.655 8 80% 0.444

Speech 2 0 2.852 4.443 4.468 4.745 4.896 4.995 5.119 5.437 5.519 6 60% 0.333

Speech 3 0 3.809 4.987 5.119 5.366 5.484 5.512 5.743 5.933 6.147 4 40% 0.222

Speech 4 0 3.063 3.240 5.079 5.760 6.810 7.024 7.200 7.988 8.416 6 60% 0.333

Speech 5 0 2.210 2.631 2.814 2.822 2.986 3.157 3.451 3.468 3.677 4 40% 0.222

Speech 6 0 1.115 2.220 2.234 2.734 3.069 3.142 3.232 3.563 3.777 6 60% 0.333


(b)

Retrieved Precision Recall

Speech 7 0 2.622 3.992 4.600 4.987 5.157 5.334 5.524 5.831 5.963 8 80% 0.444
Speech 8 0 9.839 10.884 11.755 12.412 12.676 12.863 12.962 13.103 13.782 5 50% 0.278
Speech 9 0 2.038 2.234 2.958 3.470 3.915 4.046 4.183 4.290 4.304 6 60% 0.333
Speech 10 0 2.814 3.820 4.029 4.047 5.189 5.245 5.910 5.996 6.023 4 40% 0.222
Speech 11 0 2.522 2.738 2.741 2.924 2.958 3.142 3.598 4.661 5.414 6 60% 0.333
Speech 12 0 2.620 3.488 4.156 4.578 4.861 5.079 5.137 5.877 6.110 7 70% 0.389

(c)

Retrieved Precision Recall

Speech 13 0 2.622 3.326 3.598 3.796 4.165 4.257 4.472 4.501 4.861 6 60% 0.333
Speech 14 0 1.115 1.952 2.284 2.674 2.822 3.072 3.309 3.470 3.571 5 50% 0.278
Speech 15 0 2.620 3.240 4.075 4.308 4.546 5.472 6.494 6.598 6.852 6 60% 0.333
Speech 16 0 1.952 2.071 2.222 2.604 2.671 2.677 3.157 3.223 3.232 4 40% 0.222
Speech 17 0 3.290 4.995 5.512 6.655 7.234 8.080 8.168 8.371 8.394 5 50% 0.278
Speech 18 0 3.488 3.796 3.992 4.308 6.315 6.753 6.921 7.200 7.843 8 80% 0.444

Speaker 4:
Table 4.17 (a, b & c): Speaker 4's speech retrieval, precision and recall rate

(a)

Retrieved Precision Recall

Speech 1 0 4.799 4.933 4.972 5.419 5.614 6.036 6.082 6.273 6.923 4 40% 0.222
Speech 2 0 2.285 2.518 2.895 3.608 4.069 4.714 5.188 5.399 5.630 4 40% 0.222
Speech 3 0 6.612 7.034 7.115 7.480 7.704 8.461 9.424 10.020 10.292 6 60% 0.333
Speech 4 0 0.822 2.120 2.895 3.131 3.245 3.381 3.451 3.532 3.571 3 30% 0.167
Speech 5 0 2.270 3.589 3.733 3.799 3.807 3.963 3.991 4.559 4.882 5 50% 0.278
Speech 6 0 1.635 3.211 3.291 3.807 3.895 4.229 4.524 4.535 5.176 6 60% 0.333

(b)

Retrieved Precision Recall

Speech 7 0 4.029 6.128 6.546 6.859 7.023 7.426 7.506 7.793 7.889 1 10% 0.056

Speech 8 0 5.583 5.591 5.818 5.829 6.379 6.387 6.513 7.878 8.436 5 50% 0.278

Speech 9 0 2.379 3.057 3.895 4.137 5.922 6.547 6.553 6.966 7.721 6 60% 0.333

Speech 10 0 5.283 5.839 5.891 6.379 7.130 8.075 8.161 8.316 8.645 6 60% 0.333

Speech 11 0 2.270 3.950 4.803 4.824 4.862 4.944 5.097 5.211 5.224 3 30% 0.167

Speech 12 0 0.822 2.473 2.518 2.863 3.469 3.549 3.685 3.745 3.789 3 30% 0.167

(c)

Retrieved Precision Recall

Speech 13 0 3.072 3.767 3.837 3.987 4.081 4.166 4.272 4.680 4.746 3 30% 0.167

Speech 14 0 1.285 1.635 2.383 2.571 2.639 2.656 2.883 3.589 3.671 6 60% 0.333

Speech 15 0 6.273 7.102 7.607 7.721 7.738 8.250 8.299 8.513 9.134 3 30% 0.167

Speech 16 0 2.379 2.383 2.939 3.150 3.211 3.409 3.486 5.496 5.664 7 70% 0.389

Speech 17 0 1.277 1.353 2.319 2.939 3.127 3.137 3.624 3.671 4.882 4 40% 0.222

Speech 18 0 2.571 2.686 3.091 3.356 3.633 3.748 4.069 4.600 4.689 3 30% 0.167


Based on the results returned for Speakers 1 to 4, the highest retrieved count is 9 while the lowest is 1. The average precision rates for Speakers 1 to 4 are 57.8%, 48.3%, 60.0% and 43.33% respectively. These precision rates show that, with only one training sample, the system is still able to retrieve and recognise speakers to an acceptable level.
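For clarity, the Retrieved, Precision and Recall columns in these tables appear to follow the usual retrieval definitions: out of the 10 nearest samples returned for each query speech, Retrieved counts how many belong to the correct speaker, Precision divides that count by 10, and Recall divides it by the 18 relevant speeches per speaker. A minimal MATLAB sketch of that scoring step is given below; the variable names (dist, labels, queryID) are illustrative assumptions, not part of the original scripts.

% Hedged sketch: score one query speech against a ranked retrieval list.
% dist    : 1 x N squared Euclidean distances from the query to all test samples
% labels  : 1 x N speaker ID of each test sample
% queryID : speaker ID of the query speech
topN = 10;                               % size of the retrieved list used in the tables
[~, order] = sort(dist, 'ascend');       % rank test samples by ascending distance
retrieved  = labels(order(1:topN));      % speaker IDs of the 10 nearest samples
hits       = sum(retrieved == queryID);  % "Retrieved" column
precision  = hits / topN;                % e.g. 5/10 = 50%
recall     = hits / 18;                  % e.g. 5/18 = 0.278, with 18 speeches per speaker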

1.20 Accuracy (4): Variable Sentence


The experiments before this point used word-level samples. Experiment Accuracy (4) is designed to see how the system functions when given variable sentences of 30 seconds each. Each speaker has 5 sets of sentences, out of which 1 is used as the training data; this training data is then compared against all the other sentences. Table 4.18 to Table 4.21 record the results of the system when a single sentence is tested against 20 sentences, arranged in ranks in ascending order. The results are classified using the following legend:

Speaker 1 (Male)
Speaker 2 (Female)
Speaker 3 (Female)
Speaker 4 (Male)

Speaker 1:
Table 4.18: Speaker 1 vs. 20 test sentences

Ranks 1 to 20, left to right, in ascending order of Euclidean distance:

Sentence 1 0 0.327 0.427 0.521 0.587 0.609 0.620 0.685 0.984 1.107 1.737 2.090 3.284 3.289 3.851 4.315 5.248 8.188 9.287 14.236
Sentence 2 0 0.427 0.562 0.668 0.880 0.888 0.888 1.012 1.069 1.138 1.482 1.792 2.342 3.074 4.146 4.271 4.502 6.632 8.169 14.177
Sentence 3 0 0.507 0.521 0.559 0.575 0.648 1.084 1.138 1.310 1.329 1.839 2.456 2.732 3.034 3.466 3.845 4.197 6.376 6.882 9.928
Sentence 4 0 0.327 0.398 0.562 0.619 0.744 0.970 1.119 1.329 1.379 1.669 2.436 2.927 3.495 4.933 5.339 6.435 9.266 10.786 17.090
Sentence 5 0 0.258 0.422 0.550 0.573 0.575 0.609 0.619 0.747 0.888 1.304 1.613 2.229 2.769 2.964 3.540 4.037 6.219 7.165 11.851

The results returned for Speaker 1's tests above show that the top 3 ranks belong to Speaker 1 and Speaker 4. One likely explanation is that both are male speakers with a similar range of fundamental frequencies. Sentence 1 shows excellent results while Sentence 5 shows an extremely poor result; this may be because Sentence 1 was able to capture the essential information that the MFCC feature extraction method needed while Sentence 5 did not.
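The rankings in Table 4.18 to Table 4.21 come from sorting the Euclidean distances between the training sentence's feature vector and the feature vectors of the 20 test sentences. A minimal sketch of that step is shown below, following the Appendix A conventions (VOICEBOX melcepst and kmeans); km_train and km_test are illustrative names, not variables from the original scripts.

% Hedged sketch: rank 20 test sentences against one training sentence.
% km_train : 1 x CoE feature vector (K-means centre of the training sentence's MFCCs)
% km_test  : 20 x CoE matrix, one K-means centre per test sentence
d = sum((km_test - repmat(km_train, size(km_test, 1), 1)).^2, 2);  % squared Euclidean distances
[d_sorted, order] = sort(d, 'ascend');     % rank 1 = closest test sentence
disp([ (1:numel(d))' d_sorted ]);          % ranked list, as tabulated above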


Speaker 2:
Table 4.19: Speaker 2 vs. 20 test sentences

Ranks 1 to 20, left to right, in ascending order of Euclidean distance:

Sentence 1 0 0.819 0.959 1.399 1.509 1.588 1.613 1.626 1.669 1.681 1.737 1.792 1.948 2.027 2.419 2.456 4.728 7.133 7.984 14.656

Sentence 2 0 0.139 0.770 1.948 2.027 2.329 2.666 2.732 2.805 2.843 2.849 2.964 3.146 3.636 3.851 4.053 4.120 4.146 4.933 6.923

Sentence 3 0 0.767 0.770 0.868 0.959 1.190 1.805 2.073 2.191 2.229 2.367 2.583 2.863 3.034 3.074 3.289 3.495 4.069 4.460 10.391

Sentence 4 0 0.139 0.767 1.948 2.253 2.359 2.728 3.466 3.478 3.507 3.540 3.561 3.626 4.118 4.315 4.502 4.535 4.751 5.339 8.070

Sentence 5 0 0.819 0.868 1.612 1.863 2.342 2.359 2.666 2.716 2.769 2.891 2.927 2.956 3.284 3.320 3.998 4.197 5.618 6.897 15.072

The table above shows that the top 3 returned results are all from Speaker 2. Sentence 3 displays an excellent result where all 5 sentences were retrieved as the top 5 results.

Speaker 3:
Table 4.20: Speaker 3 vs. 20 test sentences

Ranks 1 to 20, left to right, in ascending order of Euclidean distance:

Sentence 1 0 0.776 1.045 1.135 1.190 1.304 1.352 1.482 1.603 1.612 1.681 1.839 1.948 2.090 2.192 2.253 2.422 2.436 3.268 8.888
Sentence 2 0 2.804 4.420 4.682 6.923 8.070 8.888 9.835 9.928 10.391 11.851 11.966 12.263 12.291 14.177 14.236 14.656 15.072 15.299 17.090
Sentence 3 0 0.342 0.743 2.804 3.268 3.636 4.118 4.460 5.604 6.114 6.882 6.897 7.040 7.165 7.649 7.984 8.169 9.287 9.603 10.786
Sentence 4 0 0.342 0.482 2.422 4.053 4.069 4.420 4.535 4.746 4.760 5.618 6.029 6.219 6.376 6.632 6.734 7.133 8.188 8.258 9.266
Sentence 5 0 0.482 0.743 1.135 2.329 2.583 2.728 2.816 2.962 3.787 3.845 3.998 4.037 4.271 4.414 4.682 4.728 5.248 5.695 6.435

The results obtained for Speaker 3 show a good level of accuracy, where Sentences 2 to 4 were able to retrieve most of the correct data among the top 4 ranks. Sentence 1, however, is a poor example of this feature extraction and classification system.

Speaker 4:
Table 4.21: Speaker 4 vs. 20 test sentences

Ranks 1 to 20, left to right, in ascending order of Euclidean distance:

Sentence 1 0 0.219 0.288 0.398 0.550 0.587 0.657 0.836 0.880 1.084 1.588 2.192 2.863 2.956 4.120 4.751 5.695 8.258 9.603 15.299
Sentence 2 0 0.169 0.288 0.363 0.395 0.573 0.648 0.685 0.888 0.970 1.352 1.509 2.073 2.716 2.805 3.478 3.787 6.029 7.040 11.966
Sentence 3 0 0.169 0.219 0.258 0.465 0.559 0.620 0.659 0.744 1.012 1.399 1.603 2.191 2.849 2.891 3.507 4.414 6.734 7.649 12.263
Sentence 4 0 0.395 0.422 0.436 0.465 0.507 0.836 0.984 1.045 1.069 1.379 2.367 2.419 2.816 2.843 3.320 3.561 4.746 5.604 9.835
Sentence 5 0 0.363 0.436 0.657 0.659 0.668 0.747 0.776 1.107 1.119 1.310 1.626 1.805 1.863 2.962 3.146 3.626 4.760 6.114 12.291

The results returned for Speaker 4 are somewhat similar to the results for Speaker 1: the top 3 ranks belong to Speaker 4's sentences, and the ranks below are a mixture of the two speakers.

1.21 Accuracy (5a & b): Multiple Speakers


This experiment made use of sentences with multiple speakers as the training samples; these samples were then tested against sentence-length samples in Part A and word-length samples in Part B. The distinguishing factor that separates Accuracy (5) from the other experiments lies in the number of K-means clusters used to extract the feature vectors: since each multiple-speaker sentence contained a predefined number of two speakers, K was set to 2 for this experiment.
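A minimal sketch of this change is given below, assuming the VOICEBOX kmeans used in Appendix A (whose first output is the matrix of cluster centres) and an MFCC matrix mfcc_mix laid out as in Appendix A, with coefficients in rows and frames in columns; the variable names are illustrative.

% Hedged sketch: two feature vectors from one multiple-speaker sentence.
K = 2;                              % one cluster per expected speaker in the sentence
centres = kmeans(mfcc_mix', K);     % VOICEBOX kmeans: K x CoE matrix of cluster centres
fv1 = centres(1, :);                % feature vector 1
fv2 = centres(2, :);                % feature vector 2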


1.21.1 Part A: 20 speech sentences testing data


The audio signal feature extraction and classification system took in sentences with two speakers as the training data; these were compared against the 20 speech sentences used in Accuracy (4). Table 4.22 and Table 4.23 show the results of the two feature vectors from a sentence between Speaker 2 and Speaker 4 against the 20 speech sentences in terms of Euclidean distance. Speech samples of Speaker 2 are denoted with the blue highlight while samples of Speaker 4 are denoted with the yellow highlight; a cell filled with red indicates a false-positive error.
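The "Result" row in the tables that follow is the column-wise minimum of the distance matrix, i.e. for each test sentence it records which speaker's sample lies closest to the training feature vector. A compact sketch of that computation is given below, with illustrative variable names (fv for one training feature vector, km_test for the test-sentence centres arranged by speaker and sentence).

% Hedged sketch: distance matrix and "Result" row behind Table 4.22.
% fv      : 1 x CoE feature vector from the multiple-speaker training sentence
% km_test : 4 x 5 cell array, km_test{s, n} = 1 x CoE centre of Speaker s, Sentence n
D = zeros(4, 5);
for s = 1:4
    for n = 1:5
        D(s, n) = sum((fv - km_test{s, n}).^2);   % squared Euclidean distance
    end
end
[result, nearestSpeaker] = min(D, [], 1);         % "Result" row and the speaker it came from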

Table 4.22: Speaker 2 & 4, feature vector 1 vs. 20 test sentence samples

          Sentence 1  Sentence 2  Sentence 3  Sentence 4  Sentence 5
S1        1.9983      2.7698      0.9062      2.5251      1.0969
S2        3.708       3.6745      3.8766      4.9173      5.7507
S3        2.8483      9.5355      7.3772      6.8522      4.8644
S4        1.5297      1.1073      0.8542      1.0701      2.1179
Result:   1.5297      1.1073      0.8542      1.0701      1.0969

Table 4.23: Speaker 2 & 4, feature vector 2 vs. 20 test sentence samples

          Sentence 1  Sentence 2  Sentence 3  Sentence 4  Sentence 5
S1        17.2828     14.9276     21.502      14.942      17.5053
S2        11.0689     18.2939     12.8229     16.2994     8.1823
S3        15.0304     42.2468     25.2105     22.3842     20.3877
S4        16.5116     17.6414     17.8994     19.3882     14.8988
Result:   11.0689     14.9276     12.8229     14.942      8.1823


Table 4.24 and Table 4.25 show the results of the two feature vectors from a sentence between Speaker 1 and Speaker 4 against the 20 speech sentences in terms of Euclidean distance. Speech samples of Speaker 1 and Speaker 4 are denoted with the green and yellow highlights respectively.

Table 4.24: Speaker 1 & 4, feature vector 1 vs. 20 test sentence samples

          Sentence 1  Sentence 2  Sentence 3  Sentence 4  Sentence 5
S1        10.5905     12.708      6.9247      12.6431     8.9829
S2        14.8117     10.3057     13.4369     12.585      18.4649
S3        11.9295     7.9822      12.6104     13.5847     11.6796
S4        10.6217     9.1999      8.7101      8.0696      11.6066
Result:   10.5905     7.9822      6.9247      8.0696      8.9829

Table 4.25: Speaker 1 & 4, feature vector 2 vs. 20 test sentence samples

          Sentence 1  Sentence 2  Sentence 3  Sentence 4  Sentence 5
S1        10.3266     7.6518      13.6508     8.6889      10.878
S2        7.8215      14.1136     9.3329      12.9981     5.1258
S3        8.6219      32.2119     17.5356     14.3456     12.7325
S4        9.9203      10.5986     11.2788     11.4139     7.8653
Result:   7.8215      7.6518      9.3329      8.6889      5.1258

The results obtained from Speaker 2 and 4 show that the K-means algorithm was able to properly divide the mel-frequency cepstral coefficients into two good clusters of feature vectors when the speakers are of different genders. On the other hand, the results acquired from the sentence between Speaker 1 and 4 indicate poor clustering, as most of the results returned belong to Speaker 1. One possibility for this occurrence is that both speakers are male; hence, the algorithm was not able to properly distinguish between the two.


1.21.2 Part B: 72 speech word length testing data


In Part B, 72 word-length testing samples from the 4 speakers were used against the sentences with multiple speakers. Table 4.26 and Table 4.27 show summaries of the results obtained. The full results can be found in Appendix C and D.

Table 4.26: Speaker 2 & 4, feature vector vs. 72 test word length samples

Accuracy
Feature Vector 1: 10 out of 18 samples, 55.56%
Feature Vector 2: 8 out of 18 samples, 44.44%

Table 4.27: Speaker 1 & 4, feature vector vs. 72 test word length samples

Accuracy
Feature Vector 1: 15 out of 18 samples, 77.78%
Feature Vector 2: 11 out of 18 samples, 61.11%
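The accuracy figures in Table 4.26 and Table 4.27 can be tabulated automatically once each word sample has been assigned to its nearest feature vector. A hedged sketch of that count is shown below, assuming nearest holds the index of the closest training feature vector for each of the 18 word samples and truth holds the expected assignment; both names are illustrative, and the exact matching criterion used for the reported tables is not spelled out in the original text.

% Hedged sketch: summary accuracy as a fraction of correctly matched samples.
% nearest : 1 x 18 index of the closest training feature vector per word sample
% truth   : 1 x 18 expected index for each word sample
correct = sum(nearest == truth);
fprintf('%d out of %d samples, %.2f%%\n', correct, numel(truth), 100 * correct / numel(truth));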

1.22 Discussion

Through these experiments, it was discovered that many factors contribute to the success of each query. The general pattern observed shows a moderately acceptable level of accuracy for this system.

Initially, the experiment started by using the number of cepstral coefficients as the variable. The results show that as the number of coefficients increases, the accuracy of the system also increases; this, however, reaches a certain saturation level. Whether the accuracy would eventually drop after exceeding some number of cepstral coefficients could not be verified through these experiments. A possible reason why the system was only able to reach a certain saturation level is that the vector quantization K-means method used compresses all the mel-frequency cepstral coefficients into one single vector. The idea behind this implementation, obtaining the average level of energy information, was sound, but it could have been improved by increasing the number of clusters, k.
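The coefficient sweep described above can be scripted instead of being run by hand. The loop below is an illustrative sketch following the Appendix A layout (the same VOICEBOX melcepst and kmeans calls, with s1/st1 and fs1/fst1 being the samples and sampling rates loaded as in Appendix A); it is not one of the original scripts.

% Hedged sketch: repeat the first experiment over several coefficient counts.
for CoE = [4 8 12 16 20]                        % candidate numbers of cepstral coefficients
    c_train = melcepst(s1, fs1, 'M', CoE)';     % MFCCs of one training sample
    c_test  = melcepst(st1, fst1, 'M', CoE)';   % MFCCs of the matching test sample
    fv_train = kmeans(c_train', 1);             % single-cluster feature vectors
    fv_test  = kmeans(c_test', 1);
    fprintf('CoE = %2d, distance = %.4f\n', CoE, sum((fv_train - fv_test).^2));
end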

The experiment was followed by using the sampling frequency as the variable. The change of sampling frequency caused a marked drop in the accuracy of the system, but when the ranking of the correct match was taken into consideration, the speaker recognition rate of this system received some justification, as most of the correct matches were within the top 3 ranks. One of the main reasons for this is that when the sampling frequency, fs, of an audio speech signal is altered, the pitch and the speed at which the text is spoken differ. When the sampling frequency is increased, the output audio has the "chipmunk" effect, while lowering the sampling frequency causes the audio to have more bass. The change of pitch was probably the reason why the system was not able to provide a good output. Since the MFCC algorithm is primarily based on the human auditory system, the results obtained help explain the drop in accuracy.

Subsequently, variable speech samples were used to determine how the system fares. The system did not show a marked change in accuracy when the training data set changed; this was probably because the testing data set was not large enough for sufficient comparison. Based on the results obtained, the large difference between the highest and lowest retrieval rates shows that while the MFCC feature extraction method may be able to provide a good feature vector, a good training sample is essential for creating a system with high accuracy. Through this experiment, the MFCC algorithm was verified to be able to classify different speakers into different classes even when a different set of words was used. Just as human beings are able to classify and recognise the voices of different people when given a different set of speech samples, this feature extraction method was again shown to be able to do so.

The next experiment performed made use of sentences of 30 seconds instead of single-word speech samples. The experiment returned a considerably good result, showing that MFCC is able to capture important features at sentence level. This, however, could be improved if more training were done. A good set of training data would definitely increase the accuracy of the system; inconsistency in the quality of a training set (especially if there is noise), such as Speaker 3's Sentence 1, is a good example of this. Upon further inspection, a common pattern found in the results is that the Euclidean distance between speakers of the same gender is generally small. Such results are seen in Speaker 1's and Speaker 4's result tables, which supports the view that the MFCC feature extraction technique does model the human auditory system.

The last experiment carried out made use of sentences with multiple speakers. These files were then tested against speech samples in both sentence-length and word-length formats. The results obtained show that, while the system is better able to represent the two speakers with separate feature vectors when they are of different genders, it performs better at distinguishing the speakers present in the sentence from the rest of the test pool when both speakers are of the same gender. This is most likely because speakers of the same gender have a higher resemblance in their voices; hence the system shows a better result.

In essence, through these experiments, the MFCC algorithm has been shown to be robust in capturing important features when a good audio speech sample is given and a sufficient number of cepstral coefficients is used. Variable speech samples or sentences do not have a significant impact on the overall system accuracy.


CHAPTER 4: CONCLUSIONS
1.23 Overview
This thesis involves audio signal feature extraction and classification in the area of speech and speaker analysis. The effect on the accuracy of the system was studied under various variables and conditions.

The work in this thesis can be summarised in three main areas, as follows:
- Development of the Speech Database
- Speech analysis using the Mel-frequency Cepstral Coefficients method
- Speaker classification using the K-means method

1.24 Summary of the Work Done


The project started off with understanding the selected topic. The motivation for the project was laid out in order to better appreciate the need for this area of research. To accomplish the project, the human anatomy was first studied to understand how the human auditory and vocal systems work, since the project aims to replicate, or create, a system that would work best with humans. Research was also done on methods available in the market as well as proven methods in the area of speech analysis. While further studying the topic at hand, the Speech Database was developed. The internet provided a good pool of speech data, especially from speech database sites. Two kinds of speech data were used in this project: the first type was ready to be deployed as soon as it was obtained, while the second required some pre-processing. The pre-processing was done manually with the help of Audacity.


The speech samples were extracted one by one in order to perform the experiments with variable speech data. Next, the code in the VOICEBOX: Speech Processing Toolbox was studied to understand how each function works, especially the ones used in this project. This was done to leave room for improvement, particularly in the area of implementation. Various functions were studied; one of the problems faced during this phase was understanding how the obtained results could be processed. The experimental phase provided good insight into how different variables affect the accuracy of a system that uses MFCC and K-means. The results obtained indicate a moderately acceptable level of speech and speaker recognition accuracy (up to 80%) despite being given different training data, such as same-speech samples, variable speech samples and even variable sentence-level samples.

1.25 Future Work
A larger Speech Database should be developed in order to obtain a more comprehensive and accurate picture of the ability of the system. Along with that, the Speech Database could also be developed manually so that specific terms or phrases are spoken. In terms of algorithms and methods, other variables such as the number of filters used and even the shape of each filter can be tweaked to see how each change affects the accuracy of the speech analysis and speaker recognition system. Speaker diarization can almost be implemented, since the MFCC and K-means method is able to detect multiple speakers in a single sentence. Lastly, an automated system that can tabulate the results more quickly could be implemented instead of utilising tools such as Microsoft Excel.
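As a starting point for that automated tabulation, the whole distance matrix could be written straight to a CSV file rather than being assembled in Microsoft Excel. The sketch below is only an illustration of the idea: the folder path reuses the one from Appendix A, the choice of 12 coefficients is arbitrary, and the VOICEBOX melcepst and kmeans conventions from Appendix A are assumed.

% Hedged sketch: batch the feature extraction and export the distance matrix.
folder = 'D:\FYP\AudioExtract';                    % assumed data folder (as in Appendix A)
files  = dir(fullfile(folder, '*.wav'));
n   = numel(files);
CoE = 12;                                          % arbitrary choice for illustration
fv  = zeros(n, CoE);
for i = 1:n
    [x, fs]  = wavread(fullfile(folder, files(i).name));
    c        = melcepst(x, fs, 'M', CoE)';         % MFCCs, coefficients in rows
    fv(i, :) = kmeans(c', 1);                      % single-cluster feature vector (VOICEBOX)
end
D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i, j) = sum((fv(i, :) - fv(j, :)).^2);   % squared Euclidean distance
    end
end
csvwrite('distance_matrix.csv', D);                % ready to open in Excel if needed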


References
[1] Johns Hopkins University (1998, April 30). How Do We Hear While We Sleep?. ScienceDaily. Retrieved April 14, 2012, from http://www.sciencedaily.com/releases/1998/04/980430044534.htm
[2] Vibha Tiwari (2010, February). MFCC and its applications in speaker recognition. International Journal on Emerging Technologies, 1.
[3] Joseph P. Campbell, Jr. Speaker Recognition: A Tutorial. Proceedings of the IEEE, vol. 85, no. 9, 1997.
[4] Peter W. Alberti (n.d.). The Anatomy and Physiology of the Ear and Hearing. [Online]. Available: http://www.who.int/occupational_health/publications/noise2.pdf
[5] David J. M. Robinson (1999, September). The Human Auditory System. [Online]. Available: http://www.mp3-tech.org/programmer/docs/human_auditory_system.pdf
[6] Azlifa (2006, July 18). The Sound Producing System. Azus Notes. Retrieved April 16, 2012, from http://www.azlifa.com/phonetics-phonology-lecture-2-notes/
[7] Robin Schmidt (n.d.). Digital Signals Sampling and Quantization. RS-MET, Berlin, Germany. [Online]. Available: http://www.rs-met.com/documents/tutorials/DigitalSignals.pdf
[8] Ta, K. T. (2009, November). Robust Speaker Identification/Verification for Telephony Applications. (Electronic Engineering), Sim University. Retrieved from http://sst.unisim.edu.sg:8080/dspace/bitstream/123456789/281/1/09_Kyaw%20Thuta.doc
[9] Amod Gupta (2009). Audio Processing on Constrained Devices. (Master of Mathematics in Computer Science), University of Waterloo. Retrieved from http://uwspace.uwaterloo.ca/bitstream/10012/4830/1/Thesis.pdf


[10] Manish P. Kesarkar (2003, November). Feature Extraction for Speech Recognition. M.Tech. Credit Seminar Report, Electronic Systems Group, EE Dept., IIT Bombay. [Online]. Available: http://www.ee.iitb.ac.in/~esgroup/es_mtech03_sem/sem03_paper_03307003.pdf
[11] Aldebaro Klautau (2005, November 22). The MFCC. [Online]. Available: http://www.cic.unb.br/~lamar/te073/Aulas/mfcc.pdf
[12] Md. Rashidul Hasan, Mustafa Zamil, Mohd Bolam Khabsani, Mohd Saifur Rehman (2004, December 28-30). Speaker Identification using Mel Frequency Cepstral Coefficients. 3rd International Conference on Electrical and Computer Engineering (ICECE). [Online]. Available: www.assembla.com/spaces/strojno_ucenje_lab/documents/bikvce70wr3r6zeje5avnr/download/p141omfccu.pdf
[13] Jyh-Shing Roger Jang (n.d.). 12-2 MFCC. Retrieved April 16, 2012 from http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/speechfeaturemfcc.asp?title=12-2%20mfcc
[14] Htin Aung, M., Ki-Seung, Asif Hirani, Ke Ye (2006). Speaker Recognition. [Online]. Available: http://www.softwarepractice.org/mediawiki/images/5/5f/Finally_version1.pdf
[15] VOICEBOX: Speech Processing Toolbox for MATLAB (n.d.). Retrieved April 17, 2012 from http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[16] Reynolds, D.A., Quatieri, T.F., Dunn, R. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3), 19-41, 2000.
[17] S. Tranter and D. Reynolds. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1557-1565, 2006.
[18] D. Liu and F. Kubala. Online speaker clustering. In Proc. 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 333-336, 2004.


Appendix A MATLAB Code General Layout


% Last Modified Date: 21 April 2012
% Author: Clarence Cheong
% Student ID: 1081106210
% Topic: Speaker Classification

clear all
clc

% include path & subpaths
addpath(genpath('D:\FYP\minip'))
addpath(genpath('D:\FYP\AudioExtract'))
addpath(genpath('D:\FYP\MATLAB\voicebox'))

% EXPERIMENTS:
% Same Speech, Same Sampling Frequency, Variable Cepstral Coefficients

% import wav training data into MATLAB
[s1, fs1, nbits1] = wavread('s1.wav');
[s2, fs2, nbits2] = wavread('s2.wav');
[s3, fs3, nbits3] = wavread('s3.wav');
[s4, fs4, nbits4] = wavread('s4.wav');
[s5, fs5, nbits5] = wavread('s5.wav');
[s6, fs6, nbits6] = wavread('s6.wav');
[s7, fs7, nbits7] = wavread('s7.wav');
[s8, fs8, nbits8] = wavread('s8.wav');

% import wav testing data into MATLAB
[st1, fst1, nbitst1] = wavread('st1.wav');
[st2, fst2, nbitst2] = wavread('st2.wav');
[st3, fst3, nbitst3] = wavread('st3.wav');
[st4, fst4, nbitst4] = wavread('st4.wav');
[st5, fst5, nbitst5] = wavread('st5.wav');
[st6, fst6, nbitst6] = wavread('st6.wav');
[st7, fst7, nbitst7] = wavread('st7.wav');
[st8, fst8, nbitst8] = wavread('st8.wav');

% PLOT OF TRAINING or TESTING DATA if required (Example)
figure(1);
time = (1:length(s1))/fs1;
plot(time, s1);
title('Training Data 1');
xlabel('Time, S');
ylabel('Amplitude, A');

% FEATURE EXTRACTION, 4 coefficients
% Training set, CoE = Number of Cepstral Coefficients, M = Hamming
CoE = 4;
mfcc_s1 = melcepst(s1,fs1,'M', CoE)';
mfcc_s2 = melcepst(s2,fs2,'M', CoE)';
mfcc_s3 = melcepst(s3,fs3,'M', CoE)';
mfcc_s4 = melcepst(s4,fs4,'M', CoE)';
mfcc_s5 = melcepst(s5,fs5,'M', CoE)';
mfcc_s6 = melcepst(s6,fs6,'M', CoE)';
mfcc_s7 = melcepst(s7,fs7,'M', CoE)';
mfcc_s8 = melcepst(s8,fs8,'M', CoE)';

% Testing Set
mfcc_st1 = melcepst(st1,fst1,'M', CoE)';
mfcc_st2 = melcepst(st2,fst2,'M', CoE)';
mfcc_st3 = melcepst(st3,fst3,'M', CoE)';
mfcc_st4 = melcepst(st4,fst4,'M', CoE)';
mfcc_st5 = melcepst(st5,fst5,'M', CoE)';
mfcc_st6 = melcepst(st6,fst6,'M', CoE)';
mfcc_st7 = melcepst(st7,fst7,'M', CoE)';
mfcc_st8 = melcepst(st8,fst8,'M', CoE)';

% Obtaining the Feature Vector using K-means
% Grouping the MFCC into 1 cluster.

% MFCC K-Means clustering (Training Data)
km_s1 = kmeans(mfcc_s1', 1);
km_s2 = kmeans(mfcc_s2', 1);
km_s3 = kmeans(mfcc_s3', 1);
km_s4 = kmeans(mfcc_s4', 1);
km_s5 = kmeans(mfcc_s5', 1);
km_s6 = kmeans(mfcc_s6', 1);
km_s7 = kmeans(mfcc_s7', 1);
km_s8 = kmeans(mfcc_s8', 1);

% MFCC K-Means clustering (Testing Data)
km_st1 = kmeans(mfcc_st1', 1);
km_st2 = kmeans(mfcc_st2', 1);
km_st3 = kmeans(mfcc_st3', 1);
km_st4 = kmeans(mfcc_st4', 1);
km_st5 = kmeans(mfcc_st5', 1);
km_st6 = kmeans(mfcc_st6', 1);
km_st7 = kmeans(mfcc_st7', 1);
km_st8 = kmeans(mfcc_st8', 1);

% --------------------------- Testing Phase ---------------------------
disp('---------------------- Test Results ----------------------');
disp(' ');

% Calculating Euclidean Distance using a for loop, displaying results in a matrix
for i = 1:8
    for j = 1:8
        eu_d(i,j) = eval(['sum((km_s' int2str(i) ' - km_st' int2str(j) ').^2)']);
    end
end
disp('Compute: Training Data(y) vs Testing Data(x)')
Eucleadean_Distance4 = eu_d

Appendix B MATLAB Resampling of Sampling Frequency


% import wav training data into MATLAB
[s1, fs1, nbits1] = wavread('s1.wav');
[s2, fs2, nbits2] = wavread('s2.wav');
[s3, fs3, nbits3] = wavread('s3.wav');
[s4, fs4, nbits4] = wavread('s4.wav');
[s5, fs5, nbits5] = wavread('s5.wav');
[s6, fs6, nbits6] = wavread('s6.wav');
[s7, fs7, nbits7] = wavread('s7.wav');
[s8, fs8, nbits8] = wavread('s8.wav');

% import wav testing data into MATLAB
[st1, fst1, nbitst1] = wavread('st1.wav');
[st2, fst2, nbitst2] = wavread('st2.wav');
[st3, fst3, nbitst3] = wavread('st3.wav');
[st4, fst4, nbitst4] = wavread('st4.wav');
[st5, fst5, nbitst5] = wavread('st5.wav');
[st6, fst6, nbitst6] = wavread('st6.wav');
[st7, fst7, nbitst7] = wavread('st7.wav');
[st8, fst8, nbitst8] = wavread('st8.wav');

% Resampling of data
p = 1;
q = 2;
st1 = resample(st1, p, q);  fst1 = fst1*(p/q);
st2 = resample(st2, p, q);  fst2 = fst2*(p/q);
st3 = resample(st3, p, q);  fst3 = fst3*(p/q);
st4 = resample(st4, p, q);  fst4 = fst4*(p/q);
st5 = resample(st5, p, q);  fst5 = fst5*(p/q);
st6 = resample(st6, p, q);  fst6 = fst6*(p/q);
st7 = resample(st7, p, q);  fst7 = fst7*(p/q);
st8 = resample(st8, p, q);  fst8 = fst8*(p/q);


Appendix C Multiple speaker sentences vs. 72 word length speech samples

S1 S2 S3 S4
Sample 1 10.51 40.43 14.93 25.26

Sample 2 9.95 18.33 20.81 6.33

Sample 3 8.73 17.83 18.58 7.00

Sample 4 17.06 15.16 33.08 8.34

Sample 5 8.13 26.30 14.07 13.68

Sample 6 5.43 9.08 10.33 11.57

Sample 7 16.53 15.01 19.60 20.69

Sample 8 40.87 12.73 20.63 14.31

Sample 9 15.38 15.83 9.27 15.36

Sample 10 19.38 12.90 23.52 5.27

Sample 11 14.49 30.67 12.76 16.21

Sample 12 6.79 8.28 23.32 7.55

Sample 13 26.79 23.69 16.90 14.15

Sample 14 20.85 15.78 12.78 7.29

Sample 15 10.86 11.21 30.70 35.10

Sample 16 32.83 28.51 14.58 11.41

Sample 17 11.15 5.78 20.95 10.94

Sample 18 14.80 10.92 23.27 2.38

Result (Sample 1-18): 10.51 6.33 7.00 8.34 8.13 5.43 15.01 12.73 9.27 5.27 12.76 6.79 14.15 7.29 10.86 11.41 5.78 2.38

S1 S2 S3 S4
Sample 1 16.52 8.11 10.51 4.39
Sample 2 8.68 5.20 8.51 13.51
Sample 3 18.46 6.57 8.04 39.27
Sample 4 8.30 3.27 5.96 10.29
Sample 5 12.05 5.10 6.50 15.70
Sample 6 13.45 21.33 9.67 22.36
Sample 7 5.19 14.02 7.59 12.59
Sample 8 93.05 10.63 27.33 19.40
Sample 9 12.04 8.18 12.26 35.75
Sample 10 10.23 18.66 6.29 25.69
Sample 11 9.29 10.22 10.26 11.80
Sample 12 16.41 16.87 5.69 8.57
Sample 13 8.25 4.19 4.08 6.96
Sample 14 50.36 4.03 6.71 20.76
Sample 15 12.50 12.15 2.70 5.10
Sample 16 5.44 8.44 4.99 25.50
Sample 17 14.73 15.82 11.93 15.43
Sample 18 9.72 10.47 6.07 19.32

Result (Sample 1-18): 4.39 5.20 6.57 3.27 5.10 9.67 5.19 10.63 8.18 6.29 9.29 5.69 4.08 4.03 2.70 4.99 11.93 6.07


S1 S2 S3 S4
Sample 1 24.1 65.0 29.2 49.3

Sample 2 25.9 37.5 39.4 18.1

Sample 3 21.0 37.0 35.9 11.2

Sample 4 37.2 33.9 56.7 22.8

Sample 5 22.4 48.4 31.1 30.2

Sample 6 17.5 19.7 25.4 24.5

Sample 7 36.4 30.0 36.5 38.6

Sample 8 38.9 28.9 33.1 28.2

Sample 9 32.4 34.6 22.2 24.6

Sample 10 38.2 25.7 43.9 11.7

Sample 11 32.7 51.6 28.3 35.2

Sample 12 18.8 20.6 43.0 22.0

Sample 13 49.5 44.1 35.0 31.4

Sample 14 26.6 34.8 29.7 18.0

Sample 15 25.7 26.2 54.7 62.2

Sample 16 57.4 48.7 33.6 22.4

Sample 17 25.9 15.6 38.9 25.0

Sample 18 33.0 26.0 42.1 10.4

Result (Sample 1-18): 24.1 18.1 11.2 22.8 22.4 17.5 30.0 28.2 22.2 11.7 28.3 18.8 31.4 18.0 25.7 22.4 15.6 10.4

S1 S2 S3 S4
Sample 1 11.6 15.7 8.1 3.2
Sample 2 3.2 4.4 6.9 7.1
Sample 3 10.5 8.1 7.9 28.9
Sample 4 2.9 4.3 10.4 4.5
Sample 5 6.5 10.5 5.2 10.2
Sample 6 7.6 18.7 5.3 13.5
Sample 7 2.3 14.5 10.0 13.0
Sample 8 81.6 8.1 21.6 10.5
Sample 9 8.1 4.7 8.7 27.0
Sample 10 7.5 14.5 9.2 14.5
Sample 11 3.0 8.7 10.0 6.3
Sample 12 11.6 10.6 6.3 4.5
Sample 13 3.7 6.6 6.5 2.0
Sample 14 41.3 3.3 3.2 12.4
Sample 15 7.7 11.8 5.7 4.4
Sample 16 7.8 16.3 2.9 18.5
Sample 17 9.3 13.5 11.5 11.6
Sample 18 7.0 9.2 8.1 12.6

Result (Sample 1-18): 3.2 3.2 7.9 2.9 5.2 5.3 2.3 8.1 4.7 7.5 3.0 4.5 2.0 3.2 4.4 2.9 9.3 7.0

