Академический Документы
Профессиональный Документы
Культура Документы
The project report is prepared for Faculty of Engineering Multimedia University in partial fulfilment for Bachelor of Engineering
The copyright of this report belongs to the author under the terms of the Copyright Act 1987 as qualified by Regulation 4(1) of the Multimedia University Intellectual Property Regulations. Due acknowledgement shall always be made of the use of any material contained in, or derived from, this report.
ii
Declaration
I hereby declare that this work has been done by myself and no portion of the work contained in this report has been submitted in support of any application for any other degree or qualification of this or any other university or institute of learning. I also declare that pursuant to the provisions of the Copyright Act 1987, I have not engaged in any unauthorised act of copying or reproducing or attempt to copy / reproduce or cause to copy / reproduce or permit the copying / reproducing or the sharing and / or downloading of any copyrighted material or an attempt to do so whether by use of the Universitys facilities or outside networks / facilities whether in hard copy or soft copy format, of any material protected under the provisions of sections 3 and 7 of the Act whether for payment or otherwise save as specifically provided for therein. This shall include but not be limited to any lecture notes, course packs, thesis, text books, exam questions, any works of authorship fixed in any tangible medium of expression whether provided by the University or otherwise. I hereby further declare that in the event of any infringement of the provisions of the Act whether knowingly or unknowingly the University shall not be liable for the same in any manner whatsoever and undertakes to indemnify and keep indemnified the University against all such claims and actions.
iii
Acknowledgements
I am very grateful that this project was completed successfully; however it would have been impossible to do so without the help of many. Hereby, I would like to extend my gratitude to them all. First and foremost, I would like to thank my Lord Jesus Christ for being with me and guiding me throughout the entire journey here in MMU and especially so in the completion of this Final Year Project. I would also like to express gratitude to my supervisor, Associate Professor Dr Mohammad Faizal for being kind and approachable, guiding and assisting me in birthing forth new ideas during the project. Along with that, I would also like to thank Multimedia University for giving me this chance to explore, cultivate and deepen my understanding in this area of research. I would also love to thank my family members for their moral and mental support and not forgetting my fellow friends who have been with me through the thick and thin. I am very grateful for the Multimedia University Christian Society and fellow church members who have been upholding me in prayer.
iv
Abstract
The aim of this project was to study how audio signal feature extraction and classification can be done in the areas of speech and speaker analysis. With the advancement of technology, speech recognition has become one of the key research areas to improve human-machine interaction; this project implemented algorithms such as the feature extraction, Mel-frequency Cepstral Coefficient technique and the vector quantization K-means method in an attempt to study and understand how well these algorithms work. With the help of MATLAB as the system backbone, various experiments were carried out by tweaking different variables and conditions. In view of classifying speakers based on given speech audio samples, the experiments performed were able to achieve an overall accuracy level of 66.67%. The system was able to perform classification despite given different Training Data such as same speech, variable speech and even variable sentence level.
Table of Contents
Declaration.....................................................................................................................iii Acknowledgements........................................................................................................iv Abstract...........................................................................................................................v Table of Contents..........................................................................................................vi List of Figures................................................................................................................vi List of Tables.................................................................................................................ix List of Equations............................................................................................................x .........................................................................................................................................x List of Abbreviations ....................................................................................................x CHAPTER 1: CHAPTER 1: CHAPTER 2: CHAPTER 3: INTRODUCTION...........................................................................1 LITERATURE REVIEW..............................................................6 METHODOLOGY........................................................................16 RESULTS AND DISCUSSIONS ................................................32
CHAPTER 4:CONCLUSIONS .................................................................................62 References.....................................................................................................................64 Appendix A MATLAB Code General Layout.......................................................66 Appendix B MATLAB Resampling of Sampling Frequency...............................68 Appendix C Multiple speaker sentences vs. 72 word length speech samples.....70
List of Figures
vi
Figure 1.1: Types of speech processing [3]..................................................................2 Figure 2.2: Representation of sound waves (a) Sine wave in time domain (b) Corresponding spectrum......................................6 Figure 2.3: Resonant frequencies along the coiled length of the basilar membrane [5].....................................................................................................................................7 Figure 2.4: Sound producing system [6]......................................................................8 Figure 2.5: Sampling process [7]..................................................................................9 Figure 2.6: Characteristic line of a quantizer [7].......................................................9 Figure 2.7: Reconstructed signal after sampling and quantization [7]....................9 Figure 2.8: Spectrogram..............................................................................................10 Figure 2.9: General speech recognition block diagram [8].....................................11 Figure 2.10: MFCC block diagram............................................................................12 Figure 2.11: Feature vectors (a) Before and (b) After vector quantization..........15 Figure 3.12: Downloading from AMI Corpus, screenshot......................................18 Figure 3.13: Training data of 4 different speakers saying Zero.........................19 Figure 3.14: Comparison of training and testing data (Speaker 1)........................19 Figure 3.15: Comparison of MFCC and feature vector after VQ..........................20 Figure 3.16: Comparison of training and testing data (Speaker 5) .......................20 Figure 3.17: Comparison of MFCC and feature vector after VQ..........................21 Figure 3.18: Audacity with an audio stream of 37 minutes.....................................22 Figure 3.19: Word extraction process........................................................................23 Figure 3.20: Screenshot of MATLAB........................................................................25 Figure 3.21: MATLAB Graphical Plot......................................................................26 Figure 3.22: Raw results of Speaker 1 sentences in terms of Euclidean Distance ........................................................................................................................................29
vii
Figure 3.23: Results of Speaker 1 sentences in terms of Euclidean Distance after data tabulation..............................................................................................................30 Figure 4.24: Plot of testing data 1 at different sampling frequency, fs..................36 Figure 4.25: Plot of training data 1's Feature Vector at different sampling frequency, fs .................................................................................................................37 Figure 4.26: Speaker 1 MFCC & Feature Vector....................................................40 Figure 4.27: Speaker 1 & test samples.......................................................................41
viii
List of Tables
Table 1.1: Project timeline............................................................................................4 Table 2.2: Identification rate (in %) for different windows [using Mel scale]......13 Table 3.3: List of Extracted words for each speaker...............................................24 Table 4.4: Results of training data vs. testing data using 4 Cepstral Coefficients32 Table 4.5: Results of training data vs. testing data using 12 Cepstral Coefficients ........................................................................................................................................33 Table 4.6: Results of training data vs. testing data using 20 Cepstral Coefficients ........................................................................................................................................33 Table 4.7: Results of training data vs. testing data using 28 Cepstral Coefficients ........................................................................................................................................34 Table 4.8: Number of Cepstral Coefficients vs. accuracy.......................................34 Table 4.9: Sampling frequency: 0.5*fs.......................................................................37 Table 4.10: Sampling frequency: 1*fs........................................................................37 Table 4.11: Sampling frequency: 1.5*fs.....................................................................38 Table 4.12: Speakers training data feature vector vs. 12 testing data...................41 Table 4.13: Speaker feature vector vs. 36 test data..................................................42 Table 4.14 (a, b & c): Speaker 1's speech retrieval, precision and recall rate ....44 Table 4.15 (a, b & c): Speaker 2's speech retrieval, precision and recall rate......46 Table 4.16 (a, b & c): Speaker 3's speech retrieval, precision and recall rate......48 Table 4.17 (a, b & c): Speaker 4's speech retrieval, precision and recall rate......50 Table 4.18: Speaker 1 vs. 20 test sentences................................................................53 Table 4.19: Speaker 2 vs. 20 test sentences................................................................54 Table 4.20: Speaker 3 vs. 20 test sentences................................................................54
Table 4.21: Speaker 4 vs. 20 test sentences................................................................55 Table 4.22: Speaker 2 & 4, feature vector 1 vs. 20 test sentence samples..............57 Table 4.23: Speaker 2 & 4, feature vector 2 vs. 20 test sentence samples..............57 Table 4.24: Speaker 1 & 4, feature vector 1 vs. 20 test sentence samples..............58 Table 4.25: Speaker 1 & 4, feature vector 2 vs. 20 test sentence samples..............58 Table 4.26: Speaker 2 & 4, feature vector vs. 72 test word length samples...........59 Table 4.27: Speaker 1 & 4, feature vector vs. 72 test word length samples...........59
List of Equations
(Equation 2.1)...............................................................................................................16
List of Abbreviations
MFCC VQ FV fs Mel-frequency Cepstral Coefficient Vector Quantization Feature Vector Sampling Frequency
CHAPTER 1:
INTRODUCTION
This chapter presents the importance and overview of the project in relation to audio signal feature extraction and classification along with the motivation and objectives behind it. The timeline of the project is also included to show how the project is conducted.
1.1
Project Overview
Audio signal processing and classification plays an important role in the everyday living of mankind. The topic of this project focuses on the extraction of vector features from a speech audio signal as well as the classification of the signals based on the extracted features.
1.2
Motivation
With the advancement of technology, humans are constantly trying to come up with cutting edge inventions and ideas to incorporate conventional communication methods into our everyday technology in view of ultimately bridging the gap between humans and machines. Unlike the eyes, the ear does not have any eyelid to block or reduce the input received asides from turning away from the source or by covering the ear. Audio signal processing and classification is a process that occurs so often and so naturally that most human beings do not realize the importance of it until one losses it. According to a research done by Serene J. Gondek, part of the brain still processes signals even while a person is asleep [1], hence humans are constantly processing received signals by classifying them into noise or information. Communication plays a major role in various parts and parcel of life. Beginning from the time a baby is born until the day he leaves the earth, one cannot escape from communicating with each other. While it is fairly simple and normal when it is done
between two or more individuals, computers in general find it very difficult to understand contextual information. Hence, speech processing has risen to be one of the most important processes among the various digital signal processing functions [2]. Information extracted from a speech signal may be used for biometric purpose such as speaker identification and speaker recognition, voice command recognition, language identification and even speaker diarization [17]. Speech recognition is defined as the capability of a system to identify words, or sentences in spoken language and allow system to understand what a user has said and act upon it. The figure below shows the breakdown of speech processing into smaller categories
1.3
Project Objectives
The aim of this project is to study and implement feature extraction methods which are suitable for speaker classification purposes. This project focuses on Speaker Identification with the following scenarios: Text dependent, Cooperative speakers, High Quality Speech Text independent, Cooperative speakers, High Quality Speech Sentence level independent text, Cooperative speakers, High Quality
1.4
Part 1 Task \ Week Research on Project Title Background & literature review Familiarize with MATLAB System design & speech data collection Implementation & improvisation Presentation Part 2 Task \ Week Experiment on various speech data Thesis writing Presentation
1 -
10
11
12
13
1.5
Structure of Thesis
Chapter 2 presents background information on audio signal processing especially in the area of speech analysis using MFCC and VQ methods. Background information regarding the Human Auditory and Vocal system along with digital signal processing is also included. Chapter 3 proposes methods to study and implement feature extraction techniques which are suitable for speech audio classification. This includes the experiments designed and its procedures, development of speech database, and selection of system backbone. Chapter 4 reveals the results of experiments proposed in Chapter 3 and provides discussion on the results obtained. Chapter 5 presents the overall conclusion of this project and offers some suggestions for future work.
CHAPTER 1:
LITERATURE REVIEW
This chapter presents basic concepts of audio signals, the human hearing system and voice production system, analogue to digital conversion as well as the methods and idea behind a speech recognition system.
1.6
Audio Signals
Audio signals are made out of a series of longitudinal waves travelling in a direction parallel to the wave motion. These mechanical waves travel through mediums such as solids, liquid and air within the hearing range, propagating from one location to another. Figure 2.1.1(a) shows a basic sound wave representation using a sine wave.
(a)
Figure 2.2: Representation of sound waves
(b)
The amplitude of the wave signals determine the loudness of the sound, while the frequency determines the pitch of which the wave produces. A sound wave can be represented in mainly two domains; the Time domain and the Frequency domain. Figure 2.1(a) shows a representation of a sound wave in the time domain and Figure 2.1(b) in the frequency domain. While a sine wave can be used to illustrate the sinusoidal characteristic of a sound wave, natural sound waves are usually more complex and are made out of multiple harmonics.
1.7
Figure 2.3: Resonant frequencies along the coiled length of the basilar membrane [5]
Next, the vocal system was studied. The human vocal tract produces a steam of sound made from the vibration of vocal folds. The vibration from the vocal folds releases streams of air from the lungs which will go through the vocal tracts and eventually
shape the sound accordingly. Since the make-up of each human being varies from person to person, the anatomical structure of the vocal tract is unique for every person. This characteristic is the fundamental principle of why speech recognition or speaker identification can be performed [2]. Figure 2.3 shows the sound producing system of a human being.
1.8
Following that, the sampled data are then quantized to a finite amount of levels, N, depending on the amplitude of the signal.
Next, for audio signal processing, the reconstructed signal is then transformed into the frequency domain. With reference to how the ear functions, a frequency domain plot would be able to provide more details for computational purposes. Another example of a frequency domain plot is the spectrogram as shown in Figure 2.7. A spectrogram is able to provide useful information by plotting the frequency response with respect to time. The amplitude of each frequency is represented by a different shade of colour.
1.9
10
11
source-filter model, where the source refers to the air expelled from the lungs and the filter refers to the shape of the vocal tract. Thus, in the time domain, convolution takes place; whereas in the frequency domain, a multiplication takes place [12]. Convolution: source * filter = speech e (n) * h (n) = x (n)
Multiplication:
The MFCC algorithm utilises 2 types of filter to perform feature extraction, these filters are called Linearly Spaced Filter and Logarithmically Spaced Filter [13]. In order to pick up phonetically key characteristics of the audio stream, the signal is expressed in the Mel frequency scale. The Mel frequency scale has 2 different scales in which frequencies below 1000 Hz are linearly spaced while frequencies above 1000 Hz are logarithmically spaced. The extracted features may differ depending on the audio input stream given. Figure 2.9 shows the block diagram of a MFCC algorithm.
Figure 2.9 shows how a MFCC processor works by extracting features from a continuous audio speech stream into the mel cepstrum coefficients. The goal of preemphasis is to compensate for the high frequency part that was suppressed when
12
produced by the human vocal system [13]. Framing and windowing are done by splitting a long stream of speech data into smaller chunk (frames). This is because speech signal analysis should be done in a quasiperiodic period as the human voice is generated in a quasiperiodic manner where the airflow is chopped into pieces via the vocal folds. Since, the vibration of the vocal folds produces the voice; a frequency domain analysis would be optimum. In order to minimize and avoid loss of data through the windowing and framing processes, successive frame blocks are slightly overlapped with each other around 30 to 50 percent [14]. Next, the frames are then multiplied with a window function. Based on a study done by Hassan et. al [12], an experiment was done to determine how the type of Window Function used affects the identification rate. The results of their experiments are shown in the table below:
Table 2.2: Identification rate (in %) for different windows [using Mel scale]
13
14
(a)
(b)
Figure 2.11: Feature vectors (a) Before and (b) After vector quantization The advantage of performing Vector Quantization is that the database does not need to
contain every single training data along with its extracted features; rather, the feature vectors can be grouped and hence, simplifying and increasing the accuracy of a speech recognition system. This is proven to be especially useful when used on a system whereby data storage size is limited such as mobile devices and handhelds embedded systems.
15
training data. The difference between the training data and the testing data is calculated by obtaining the Euclidean Distance. The Euclidean Distance can be calculated by using the following formula:
(Equation 2.1)
, Whereby p and q are the training and testing feature vectors respectively.
CHAPTER 2:
1.10
METHODOLOGY
Introduction
In order to study and implement feature extraction techniques suitable for the classification of audio signals into different speakers, experiments were designed to determine how different techniques would affect the recognition rate of the system.
1.11
Experiments
Four experiments were designed to determine how different feature extraction method and different speech files would affect the accuracy of the system. All the experiments were done using MFCC with the Hamming Windowing function. The experiments performed are: i. ii. iii. Accuracy (1): Same Speech, Variable Cepstral Coefficient Accuracy (2): Same Speech, Variable Sampling Frequency Accuracy (3a, b & c): Different speech, same sampling frequency
16
iv.
v.
Accuracy (5a & b): Multiple speakers in a sentence, same sampling frequency, different length
The size of the training data and testing data varied among each experiment.
1.12
The speech data downloaded from the first source was easy and simple to use as it was already separated into 16 Microsoft WAV format files of 8 speakers and the loudness as well as sampling rate of the speech files were decent. The files are separated into 2 different folders, one of which is the training folder, while the other is the testing folder. Each speaker was asked to utter the digit ZERO and the audio signal was recorded. Figure 3.1 shows a screenshot of the downloading process from AMI Corpus.
17
In order to simulate voice variation over time, the training and testing data were recorded at least 6 months apart. Figure 3.2 shows the training data of 4 different speakers saying Zero. The comparison between two recordings from the same speaker is shown in Figure 3.3 and Figure 3.5 while Figure 3.4 and Figure 3.6 shows the comparison of the MFCC and Feature Vector after VQ.
18
19
20
The speech data downloaded from the second source required some amount of preprocessing as the amplitude of each signal was considerably low and the data were in sentence form. With that, a free audio editor called Audacity was employed to perform the required pre-processing. Figure 3.7 shows a screenshot of Audacity with an audio stream of 37 minutes.
21
The audio files were taken from 4 different speakers; and each speaker has 4 sample files. All the speech samples were recorded using the same device at the same sampling frequency. However, the amplitude of each speech file is dependent of the placement of the microphones on each speaker. In order to obtain different speech samples (as required from Experiment: Accuracy (3a, b &c)) and sentences (as required from Experiment: Accuracy (4)), the audio stream was imported into Audacity for processing. The audio stream was zoomed in for a better resolution of where each word starts and ends. This enables and eases the process of extracting different speech samples or sentences from the audio stream.
Figure 3.8 shows how the individual words are extracted from the audio stream:
22
Table 3.1 lists out all the extracted words for each speaker:
23
Speaker 1 Briefly Company Consumer Development Ideas Internal Note Okay Original Product Recognition Speech This Today Trendy Very Welcome works
Speaker 2 Battery Breakdown Capacitor Circuit Control Depend Dependable Dont Horses Much Pick Presenting Press Remote Signal The Voice Yourself
Speaker 3 And Animal Before But Complicated Choices Creative Designer Easy Experience Examples Flyer Interface Marketed Operation Player Right Variety
Speaker 4 Ahead And Buttons Contrast Device Easily Enhance Fashion Fingers Meeting Nature Person Players Practical Project Shape Research Underdog
1.13
System Backbone
In order to perform the listed experiments above, MATLAB - a software by MathWorks was selected. MATLAB is the abbreviation for Matrix Laboratory, it is a numerical programming environment that is integrated with many graphical and visualisation tools. MATLAB was chosen as the system backbone because it has a large library of functions that are ready to be used. Along with that, MATLAB also provides a very user friendly interface and a very comprehensive documentation. Furthermore, many
24
custom functions have been written by the community and this is an extra advantage for choosing MATLAB. In terms of speech processing, this backbone system comes with many build-in functions such as wavread (a function that reads in a Microsoft WAVE file into matrix loaded with a stream of binary data), resample (a function that is used to convert or change an audio signals sampling frequency. As mentioned above, the community has written some voice processing toolbox which is free for use. Among them are VOICEBOX: Speech Recognition Toolbox and ASR (Automatic Speech Recognition) Toolbox. These toolboxes would be able to provide much help in implementing this projects system. Figure 3.9 and Figure 3.10 show screenshots of the MATLAB environment as well as a graphical plot done using MATLAB.
25
1.14
Experiment Procedures
With the experiments planned, the ready speech database and a backbone system, the experiment procedures can be laid out. The experiments planned should take place in a chronological order as it would provide a better understanding as well as a good build up to how each characteristic and variable affect the accuracy and overall performance of a speaker classification system. First, the training data is loaded into the backbone system, MATLAB by calling the wavread function. In this experiment, the wavread function outputs 3 parameters which include the speech signal stored in a vector matrix (Y), the sampling frequency (fs) and numbers of bits per sample (NBits). [ Y, fs, NBits ] = wavread (FILE)
26
Next, the Mel-frequency cepstrum coefficient is calculated by using the melcepst function. The melcepst function is written by Mike Brookes, Department of Electrical & Electronic Engineering, Imperial College [15]. The VOICEBOX: Speech Processing Toolbox for MATLAB can be obtained from: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html. In the course of this project, the melcepst function that was used required 4 parameters and outputs the coefficients as a result.
Parameters: Y fs W Speech Signal Sampling Frequency Windowing Function Default M for Hamming Window Nc Number of Cepstral Coefficients, default 12
In this melcepst function, each successive frame blocks are set to overlap by 50 percent to reduce data loss.
27
Another important function that would be frequently used in this project would be Vector Quantization using K-Means Algorithm. The kmeans function can be found through the statistical toolbox in MATLAB. The mfcc found using melcepst will be inputted into the kmeans function parameter, X. IDX = kmeans (X,k);
28
1.15
Data Tabulation
The results obtained through the experiments were tabulated with the help of Microsoft Excel. The data were colour coded and then sorted numerically to help reduce the complexity result analysis. Figure 3.11 shows raw results of Speaker 1s sentences against 20 sentences from all 5 speakers in terms of Euclidean Distance. An example of data tabulation is as below:
29
Figure 3.12 shows the results of Speaker 1s sentences against 20 sentences from all 5 speakers in terms of Euclidean Distance after the colour coding and numerical sorting is performed. The ranks of each sentence is tabulated on Column A according to the sorted Euclidean Distance magnitude.
Figure 3.23: Results of Speaker 1 sentences in terms of Euclidean Distance after data tabulation
30
1.16
31
CHAPTER 3:
The results of the following experiments are tabulated in table forms. An ideal result table would have only 1 coloured cell per column. The cells in yellow should contain the smallest value as it represents the Euclidean Distance between the Testing Data and the correct Training Data; while the cells in red indicate computational errors. The ranking of the correct Training Data is also listed.
1.17
No. Co. = 4 Train 1 Train 2 Train 3 Train 4 Train 5 Train 6 Train 7 Train 8
Test 1
Test 2
Test 3
Test 4
Test 5
Test 6
Test 7
Test 8
0.292 2.663 1.599 1.525 0.715 3.971 3.205 1.269 2.492 0.813 0.697 5.342 1.975 3.605 1.946 3.860 3.887 1.948 0.862 7.235 3.729 2.590 2.124 5.608 2.172 9.766 8.017 0.528 2.938 12.86 0 9.159 1.491
3.601 0.300 0.369 6.556 2.194 5.421 0.753 4.186 3.119 2.226 0.973 5.945 3.371 1.400 2.920 4.699 5.809 0.729 0.605 9.074 3.916 5.778 0.832 5.843 1.376 1.608 0.698 2.474 0.602 5.538 1.404 0.942
32
Rank
Table 4.5: Results of training data vs. testing data using 12 Cepstral Coefficients
No. Co. = 12 Train 1 Train 2 Train 3 Train 4 Train 5 Train 6 Train 7 Train 8 Rank
Test 1
Test 2
Test 3
Test 4
Test 5
Test 6
Test 7 3.993
4.785 0.879 1.406 7.578 2.453 5.340 3.371 5.626 2.907 1.708 8.395 4.257 3.354 2.836 2.668 11.08 1 9.129 0.892 3.425 13.33 6 9.503
5.089 0.718 1.446 8.078 2.312 6.208 1.379 4.476 3.990 2.113 6.992 3.996 1.485 3.176
6.535 2.003 1.330 9.965 4.408 6.070 1.137 2.899 2.548 1.682 4.437 1.060 6.095 2.007 1 2 4 1 3 1 1
Table 4.6: Results of training data vs. testing data using 20 Cepstral Coefficients
33
3 6.22 6 6.19 7 1
Table 4.7: Results of training data vs. testing data using 28 Cepstral Coefficients
No. Co. = 28 Train 1 Train 2 Train 3 Train 4 Train 5 Train 6 Train 7 Train 8 Rank
The table below presents the summary of Part (a) in terms of accuracy percentage:
Table 4.8: Number of Cepstral Coefficients vs. accuracy
34
This experiment shows that as the number of Cepstral Coefficients used in a MFCC feature extraction algorithm increases, the accuracy of a speaker classification system also increases.
1.18
Accuracy Frequency
(2):
Same
Speech,
Variable
Sampling
The goal of this second experiment is to see how the accuracy of the system will be affected when the sampling frequency of the Training Data changes, given the same speech and the same number of Cepstral Coefficients. 8 training files from 8 different speakers were tested against 8 testing files. The training file number and testing file number represents the same speaker. The original sampling frequency, fs, used is 44.1 kHz. The results of this experiment are shown in the following tables using Sampling Frequency at 0.5fs, fs, and 1.5fs. Figure 4.1 shows the plot of testing data 1 at different sampling rate.
35
36
Figure 4.25: Plot of training data 1's Feature Vector at different sampling frequency, fs Table 4.9: Sampling frequency: 0.5*fs Table 4.10: Sampling frequency: 1*fs
fs = 0.5*fs Test 1 Train 1 Train 2 Train 3 Train 4 Train 5 Train 6 Train 7 Train 8 Rank 1.762 2.513 4.209 5.842 3.551 2.835 3.567 1.447 2
22.266 18.122
37
Test 2
Test 3
Test 4
Test 6
Test 7
5.960 3.587 1.981 1.004 1.729 8.506 4.034 2.185 8.458 12.75 10.00 2 2 1.143
Rank
38
Based on the results, one can observe that the overall accuracy of the system has fs = Test 1 Test 2 Test 3 Test 4 Test 5 Test 6 Test 7 Test 8 1.5*fs 19.3 19.5 19.8 20.9 20.7 21.7 22.2 20.77 Train 1 85 55 41 62 59 36 08 0 23.2 Train 2 Train 3 Train 4 Train 5 Train 6 Train 7 Train 8 55 28.1 23 20.7 20 20.5 09 29.9 64 26.3 77 23.5 26 17.7 39 23.7 60 23.2 81 14.9 76 24.9 46 18.7 96 18.2 62 19.6 15 25.3 99 23.7 92 16.0 93 26.9 78 20.5 69 20.3 72 27.2 59 33.0 62 20.0 04 24.7 82 33.4 71 30.5 20 25.8 85 24.0 84 29.2 82 20.7 01 20.3 55 30.9 81 26.6 59 22.7 60 22.5 76 27.1 46 30.1 67 20.1 18 28.2 67 25.8 54 27.2 92 22.4 25.01 14 76 24 99 81 58 32 4 9 5 7 8 0 4 27.6 31.05 25.8 19.25 18.3 21.26 29.5 31.87 22.3 27.97 22.8 23.41
Rank 1 2 7 1 1 7 3 4 decreased. However, by inspecting the ranks of each test results, most of the returned results are still within the top 3 ranks.
1.19
39
The goal of this experiment is to determine how the size of a training set would affect the overall accuracy of a speaker classification system when given variable speech samples.
The following figure provide graphical comparisons of the speakers training Feature Vector against speech samples from the same speaker.
40
The following tables shows the results of the experiment in terms of Euclidean Distance between each speakers training feature vector against 12 test speech samples.
Table 4.12: Speakers training data feature vector vs. 12 testing data
Test 1 20.115 11.879 10.411 19.420 Test 1 18.758 8.830 8.581 18.959
Test 2 1.026 6.112 9.096 3.192 Test 2 6.133 5.536 6.893 6.505
Test 3 2.890 3.790 7.603 3.318 Test 3 5.464 1.949 3.378 4.727
Test 1 4.478 1.260 2.189 3.232 Test 1 4.432 11.930 16.140 6.319
Test 2 17.618 7.872 6.907 14.445 Test 2 2.477 7.524 10.893 4.423
Test 3 16.631 7.918 4.368 15.057 Test 3 3.372 7.317 9.086 3.200
The results presented in Table 4.9 shows that with 15 Training Data, the speaker recognition process can achieve up to 66.67% of accuracy when given 3 Testing Data.
41
Test 4
Test 5
42
4 Rank 3 2 3 1 1 1 1 1 1
Test 3
Test 4
Test 6 21.328
Test 7 3.885
The results returned from this experiment show a similar level of accuracy as compared to the results obtained in Part (3a), which is up to an accuracy of 66.67%. The reduction of 6 Training Data did not show a significant impact towards the systems ability to recognise speaker.
43
(a)
Speech 1
Speech 2
Speech 3
Speech 4
Speech 5
Speech 6
44
0.000 1.277 1.421 1.643 2.160 2.181 2.208 2.883 Speech 3.486 7 4.274 0.000 7 1.762 70% 2.604 0.389 2.674 2.756 2.986 3.036 3.087 3.088 3.231 6 60% 0.333
0.000 1.762 2.120 2.149 2.284 2.344 2.473 2.522 Speech 2.677 8 2.734 0.000 5 11.770 50% 18.813 0.278 21.321 30.804 31.825 35.746 36.725 38.103 38.145 3 30% 0.167
0.000 1.285 1.672 1.932 2.208 2.853 3.130 3.183 Speech 3.257 9 3.291 0.000 8 1.611 80% 2.160 0.444 2.180 3.036 3.137 3.257 3.738 3.960 4.141 9 90% 0.500
0.000 2.072 2.344 2.756 3.223 3.912 3.925 4.326 Speech 5.011 10 5.264 0.000 4 2.637 40% 3.282 0.222 4.141 4.908 5.791 5.854 6.030 6.188 6.336 8 80% 0.444
0.000 1.682 1.932 2.181 2.522 2.656 2.829 2.929 Speech 3.087 11 3.127 0.000 8 2.072 80% 3.043 0.444 3.131 3.987 4.051 4.308 4.865 5.013 5.143 4 40% 0.222
0.000 2.149 2.285 2.442 2.686 2.829 2.863 3.183 Speech 3.367 12 3.381 0.000 6 2.826 60% 3.287 0.333 3.367 3.633 3.730 3.745 3.814 3.858 4.019 4 40% 0.222
(b)
Speech 13 0.000 5.325 5.916 6.076 6.807 7.079 7.272 7.294 7.721 8.136 5 50% 0.278
Speech 14 0.000 3.057 7.863 10.544 10.831 11.520 11.729 11.770 12.264 12.652 4 40% 0.222
Speech 15 0.000 1.611 1.643 1.895 2.319 2.442 2.637 2.853 2.929 2.948 9 90% 0.500
Speech 16 0.000 4.546 6.810 7.629 7.738 8.778 9.178 9.232 9.823 10.091 1 10% 0.056
Speech 17 0.000 1.353 1.421 1.672 1.682 1.895 2.180 2.639 3.150 3.890 7 70% 0.389
Speech 18 0.000 3.231 3.386 3.672 3.733 4.127 4.424 4.492 4.528 4.629 6 45 60% 0.333
(c)
Speaker 2:
Table 4.15 (a, b & c): Speaker 2's speech retrieval, precision and recall rate
46
Speech 1 0.000 6.783 Speech 7.384 13 8.199 0 9.651 3.063 9.731 4.075 9.838 4.156 10.335 4.314 10.389 4.427 11.432 4.836 4 5.484 40% 5.978 0.222 6.315 4 40% 0.222
Speech 2 0.000 2.852 Speech 3.290 14 3.482 0 3.511 1.134 3.712 2.071 3.809 2.631 3.881 3.088 4.049 3.309 4.272 3.532 4 3.549 40% 3.712 0.222 4.026 3 30% 0.167
Speech 3 0.000 2.573 Speech 2.741 15 3.046 0 3.201 1.998 3.517 2.522 3.677 3.046 4.047 3.775 4.136 3.905 4.404 4.046 5 4.602 50% 4.634 0.278 4.740 8 80% 0.444
Speech 4 0.000 1.134 Speech 2.210 16 2.222 0 2.573 4.257 3.326 5.157 3.385 6.874 3.482 7.887 3.629 7.974 3.685 8.199 4 8.523 40% 8.594 0.222 8.786 5 50% 0.278
Speech 5 0.000 3.726 Speech 4.427 17 5.067 0 5.743 3.091 5.760 3.767 5.996 3.905 6.353 4.606 6.766 4.773 6.937 4.824 5 5.373 50% 5.551 0.278 5.789 5 50% 0.278
(a) Speech 6 0.000 2.826 Speech 3.767 18 3.775 0 3.950 1.998 4.032 2.038 5.191 2.738 5.472 3.179 5.730 3.201 7.034 3.563 7 3.823 70% 3.878 0.389 3.989 6 60% 0.333
(b)
Speech 7 0 3.950 4.634 4.679 4.679 4.979 5.022 5.082 5.207 5.264 7 70% 0.389
Speech 8 0 2.220 2.924 4.085 4.271 4.303 4.368 4.404 4.740 4.895 5 50% 0.278
Speech 9 0 2.671 3.069 3.434 3.823 3.912 3.912 3.997 4.049 4.081 3 30% 0.167
Speech 10 0 4.019 4.032 4.324 4.551 4.602 4.876 5.022 5.316 5.586 6 60% 0.333
Speech 11 0 4.585 7.825 8.084 8.168 8.292 8.885 9.134 10.120 10.211 2 20% 0.111
Speech 12 0 3.179 3.730 3.828 3.864 3.915 4.274 4.311 4.324 4.559 4 40% 0.222
(c)
47
Speaker 3:
Table 4.16 (a, b & c): Speaker 3's speech retrieval, precision and recall rate
(a)
Speech 1 0 3.881 4.478 4.600 4.745 5.288 5.456 5.646 5.925 6.655 8 80% 0.444
Speech 2 0 2.852 4.443 4.468 4.745 4.896 4.995 5.119 5.437 5.519 6 60% 0.333
Speech 3 0 3.809 4.987 5.119 5.366 5.484 5.512 5.743 5.933 6.147 4 40% 0.222
Speech 4 0 3.063 3.240 5.079 5.760 6.810 7.024 7.200 7.988 8.416 6 60% 0.333
Speech 5 0 2.210 2.631 2.814 2.822 2.986 3.157 3.451 3.468 3.677 4 40% 0.222
Speech 6 0 1.115 2.220 2.234 2.734 3.069 3.142 3.232 3.563 3.777 6 60% 0.333
48
(b) Speech 7 0 2.622 3.992 4.600 4.987 5.157 5.334 5.524 5.831 5.963 8 80% 0.444 Speech 8 0 9.839 10.884 11.755 12.412 12.676 12.863 12.962 13.103 13.782 5 50% 0.278 Speech 9 0 2.038 2.234 2.958 3.470 3.915 4.046 4.183 4.290 4.304 6 60% 0.333 Speech 10 0 2.814 3.820 4.029 4.047 5.189 5.245 5.910 5.996 6.023 4 40% 0.222 Speech 11 0 2.522 2.738 2.741 2.924 2.958 3.142 3.598 4.661 5.414 6 60% 0.333 Speech 12 0 2.620 3.488 4.156 4.578 4.861 5.079 5.137 5.877 6.110 7 70% 0.389
49
(c)
Speaker 4:
Table 4.17 (a, b & c): Speaker 4's speech retrieval, precision and recall rate
(a)
Speech Speech 1 2 0 0 4.799 Speech 2.285 Speech 4.933 2.518 13 14 4.972 2.895 0 0 5.419 3.608 2.622 1.115 5.614 4.069 3.326 1.952 6.036 4.714 3.598 2.284 6.082 5.188 3.796 2.674 6.273 5.399 4.165 2.822 6.923 5.630 4.257 3.072 4 4 4.472 3.309 40% 40% 4.501 3.470 0.222 0.222 4.861 3.571 6 5 60% 50% 0.333 0.278
Speech 3 0 6.612 Speech 7.034 15 7.115 0 7.480 2.620 7.704 3.240 8.461 4.075 9.424 4.308 10.020 4.546 10.292 5.472 6 6.494 40% 6.598 0.333 6.852 6 60% 0.333
Speech 4 0 0.822 Speech 2.120 16 2.895 0 3.131 1.952 3.245 2.071 3.381 2.222 3.451 2.604 3.532 2.671 3.571 2.677 3 3.157 30% 3.223 0.167 3.232 4 40% 0.222
Speech 5 0 2.270 Speech 3.589 17 3.733 0 3.799 3.290 3.807 4.995 3.963 5.512 3.991 6.655 4.559 7.234 4.882 8.080 5 8.168 50% 8.371 0.278 8.394 5 50% 0.278
Speech 6 0 1.635 Speech 3.211 18 3.291 0 3.807 3.488 3.895 3.796 4.229 3.992 4.524 4.308 4.535 6.315 5.176 6.753 6 6.921 60% 7.200 0.333 7.843 8 80% 0.444
50
(b)
Speech 7 0 4.029 6.128 6.546 6.859 7.023 7.426 7.506 7.793 7.889 1 10% 0.056
Speech 8 0 5.583 5.591 5.818 5.829 6.379 6.387 6.513 7.878 8.436 5 50% 0.278
Speech 9 0 2.379 3.057 3.895 4.137 5.922 6.547 6.553 6.966 7.721 6 60% 0.333
Speech 10 0 5.283 5.839 5.891 6.379 7.130 8.075 8.161 8.316 8.645 6 60% 0.333
Speech 11 0 2.270 3.950 4.803 4.824 4.862 4.944 5.097 5.211 5.224 3 30% 0.167
Speech 12 0 0.822 2.473 2.518 2.863 3.469 3.549 3.685 3.745 3.789 3 30% 0.167
(c)
Speech 13 0 3.072 3.767 3.837 3.987 4.081 4.166 4.272 4.680 4.746 3 30% 0.167
Speech 14 0 1.285 1.635 2.383 2.571 2.639 2.656 2.883 3.589 3.671 6 60% 0.333
Speech 15 0 6.273 7.102 7.607 7.721 7.738 8.250 8.299 8.513 9.134 3 30% 0.167
Speech 16 0 2.379 2.383 2.939 3.150 3.211 3.409 3.486 5.496 5.664 7 70% 0.389
Speech 17 0 1.277 1.353 2.319 2.939 3.127 3.137 3.624 3.671 4.882 4 40% 0.222
Speech 18 0 2.571 2.686 3.091 3.356 3.633 3.748 4.069 4.600 4.689 3 30% 0.167
51
Based on the results returned from Speaker 1 to 4, the highest Retrieved data is 9 while the lowest is 1. The average Precision rate for Speaker 1 to 4 is 57.8%, 48.3%, 60.0% and 43.33% respectively. This precision rate shows that with only 1 training data, the system is still able to retrieve and recognise speakers to an acceptable level.
1.20
52
sentence is tested against 20 sentences and arranged in ranks, in an ascending order. The results will be classified using the following legends. Speaker 1 (Male) Speaker 2 (Female) Speaker 3 (Female) Speaker 4 (Male) Speaker 1:
Table 4.18: Speaker 1 vs. 20 test sentences
Sentence Sentence Sentence Sentence Sentence 1 2 3 4 5 1 0 0 0 0 0 2 0.327 0.427 0.507 0.327 0.258 3 0.427 0.562 0.521 0.398 0.422 4 0.521 0.668 0.559 0.562 0.550 5 0.587 0.880 0.575 0.619 0.573 6 0.609 0.888 0.648 0.744 0.575 7 0.620 0.888 1.084 0.970 0.609 8 0.685 1.012 1.138 1.119 0.619 9 0.984 1.069 1.310 1.329 0.747 10 1.107 1.138 1.329 1.379 0.888 11 1.737 1.482 1.839 1.669 1.304 12 2.090 1.792 2.456 2.436 1.613 13 3.284 2.342 2.732 2.927 2.229 14 3.289 3.074 3.034 3.495 2.769 15 3.851 4.146 3.466 4.933 2.964 16 4.315 4.271 3.845 5.339 3.540 17 5.248 4.502 4.197 6.435 4.037 18 8.188 6.632 6.376 9.266 6.219 19 9.287 8.169 6.882 10.786 7.165 20 14.236 14.177 9.928 17.090 11.851 The results returned for Speaker 1s tests above shows that the top 3 ranks belong to Speaker 1 and Speaker 4. One main possibility of this occurrence is because both speakers are male speakers with similar range of fundamental frequencies. Sentence 1 shows excellent results while Sentence 5 shows an extremely poor result, this may be because Sentence 1 was able to capture essential information that the MFCC feature extraction methods needed while Sentence 5 did not.
53
Speaker 2:
Table 4.19: Speaker 2 vs. 20 test sentences
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Sentence 1 0 0.819 0.959 1.399 1.509 1.588 1.613 1.626 1.669 1.681 1.737 1.792 1.948 2.027 2.419 2.456 4.728 7.133 7.984 14.656
Sentence 2 0 0.139 0.770 1.948 2.027 2.329 2.666 2.732 2.805 2.843 2.849 2.964 3.146 3.636 3.851 4.053 4.120 4.146 4.933 6.923
Sentence 3 0 0.767 0.770 0.868 0.959 1.190 1.805 2.073 2.191 2.229 2.367 2.583 2.863 3.034 3.074 3.289 3.495 4.069 4.460 10.391
Sentence 4 0 0.139 0.767 1.948 2.253 2.359 2.728 3.466 3.478 3.507 3.540 3.561 3.626 4.118 4.315 4.502 4.535 4.751 5.339 8.070
Sentence 5 0 0.819 0.868 1.612 1.863 2.342 2.359 2.666 2.716 2.769 2.891 2.927 2.956 3.284 3.320 3.998 4.197 5.618 6.897 15.072
The table above shows that the top 3 returned results are all from Speaker 2. Sentence 3 displays an excellent result where all 5 sentences were retrieved as the top 5 results.
Speaker 3:
Table 4.20: Speaker 3 vs. 20 test sentences
1 2 3 4 5
54
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1.304 1.352 1.482 1.603 1.612 1.681 1.839 1.948 2.090 2.192 2.253 2.422 2.436 3.268 8.888
8.070 8.888 9.835 9.928 10.391 11.851 11.966 12.263 12.291 14.177 14.236 14.656 15.072 15.299 17.090
3.636 4.118 4.460 5.604 6.114 6.882 6.897 7.040 7.165 7.649 7.984 8.169 9.287 9.603 10.786
4.069 4.420 4.535 4.746 4.760 5.618 6.029 6.219 6.376 6.632 6.734 7.133 8.188 8.258 9.266
2.583 2.728 2.816 2.962 3.787 3.845 3.998 4.037 4.271 4.414 4.682 4.728 5.248 5.695 6.435
The results obtained for Speaker 3 shows a good level of accuracy where sentences 2 to 4 were able to retrieve most of the correct data among the top 4 ranks. However, Sentence 1 shows a poor example of this feature extraction and classification system.
Speaker 4:
Table 4.21: Speaker 4 vs. 20 test sentences
1 2 3 4 5 6 7
55
8 9 10 11 12 13 14 15 16 17 18 19 20
0.836 0.880 1.084 1.588 2.192 2.863 2.956 4.120 4.751 5.695 8.258 9.603 15.299
0.685 0.888 0.970 1.352 1.509 2.073 2.716 2.805 3.478 3.787 6.029 7.040 11.966
0.659 0.744 1.012 1.399 1.603 2.191 2.849 2.891 3.507 4.414 6.734 7.649 12.263
0.984 1.045 1.069 1.379 2.367 2.419 2.816 2.843 3.320 3.561 4.746 5.604 9.835
0.776 1.107 1.119 1.310 1.626 1.805 1.863 2.962 3.146 3.626 4.760 6.114 12.291
The results returned for Speaker 4 is somewhat similar to the results from Speaker 1 where the top 3 ranks belong to Speaker 4s sentences and the ranks below it is a mixture between the two.
1.21
56
Table 4.22: Speaker 2 & 4, feature vector 1 vs. 20 test sentence samples
S1 S2 S3 S4 Result:
Sentence Sentence Sentence Sentence 1 2 3 4 1.9983 2.7698 0.9062 2.5251 3.708 3.6745 3.8766 4.9173 2.8483 9.5355 7.3772 6.8522 1.5297 1.1073 0.8542 1.0701 1.5297 1.1073 0.8542 1.0701
Table 4.23: Speaker 2 & 4, feature vector 2 vs. 20 test sentence samples
S1 S2 S3 S4 Result:
Sentence Sentence Sentence Sentence 1 2 3 4 17.2828 14.9276 21.502 14.942 11.0689 18.2939 12.8229 16.2994 15.0304 42.2468 25.2105 22.3842 16.5116 17.6414 17.8994 19.3882 11.0689 14.9276 12.8229 14.942
57
Table 4.21 and Table 4.22 show results of two feature vectors from a sentence between Speaker 1 and Speaker 4 against 20 speech sentences in terms of Euclidean Distance. Speech samples of Speaker 1 and Speaker 4 are denoted with green and yellow highlights respectively.
Table 4.24: Speaker 1 & 4, feature vector 1 vs. 20 test sentence samples
S1 S2 S3 S4 Result:
Sentence Sentence Sentence Sentence 1 2 3 4 10.5905 12.708 6.9247 12.6431 14.8117 10.3057 13.4369 12.585 11.9295 7.9822 12.6104 13.5847 10.6217 9.1999 8.7101 8.0696 10.5905 7.9822 6.9247 8.0696
Table 4.25: Speaker 1 & 4, feature vector 2 vs. 20 test sentence samples
S1 S2 S3 S4
Sentence Sentence Sentence Sentence 1 2 3 4 10.3266 7.6518 13.6508 8.6889 7.8215 14.1136 9.3329 12.9981 8.6219 32.2119 17.5356 14.3456 9.9203 10.5986 11.2788 11.4139
Result: 7.8215 7.6518 9.3329 8.6889 5.1258 The results obtained from Speaker 2 and 4 shows that the K-means algorithm was able to properly divide the mel-frequency cepstral coefficients into two good clusters of feature vectors when the speakers are of different gender, on the other hand, the results acquired from the sentence between Speaker 1 and 4 indicates poor clustering as most of the results returned belongs to Speaker 1. One possibility for this occurrence is because both speakers are male, hence, the algorithm was not able to properly distinguish between the two.
58
Table 4.26: Speaker 2 & 4, feature vector vs. 72 test word length samples
Accuracy Feature Vector 1 Feature Vector 2 10 out of 18 samples, 55.56% 8 out of 18 samples, 44.44%
Table 4.27: Speaker 1 & 4, feature vector vs. 72 test word length samples
Accuracy Feature Vector 1 Feature Vector 2 15 out of 18 samples, 77.78% 11 out of 18 samples, 61.11%
1.22
Discussion
Through these experiments, it is discovered that there are many factors that would contributes to the success of each query. The general pattern observed shows a moderately acceptable level of accuracy using this system.
Initially, the experiment started with using the Number of Cepstral Coefficients as the variable, the results shows that as the number of coefficients increases, the accuracy of
59
the system also increases; this however would reach a certain saturation level. The possibilities of having the accuracy drop after exceeding a number of Cepstral Coefficients could not be verified through these experiments. A possible reason why the system was only able to reach a certain saturation level is because the Vector Quantization k-means method used compresses all the mel frequency cepstral coefficients into one single vector. The idea behind this implementation is to obtain the average level of energy information was good, but could have been improved by increasing the number of clusters, k.
The experiment was followed by using the Sampling Frequency as the variable. The change of sampling frequency displayed a tremendous drop in the accuracy of the system, but when the ranking of the correct match was taken into consideration, the speaker recognition rate of this system received some justification as most of were within the top 3 ranks. One of the main reasons of such occurrence is because when the sampling frequency, fs, of an audio speech signal is altered, the pitch and the speed of which the text is spoken differ. When the sampling frequency is increased, the output audio would have the chipmunk effect while lowering the sampling frequency would cause the audio to have more bass. The change of pitch was probably the reason why the system was not able to provide a good output. Since the MFCC algorithm is primarily based on the human auditory system, the result obtained was able to validate reason for the drop of accuracy. Subsequently, variable speech samples were used to determine how the system fairs. The system did not show a vivid change of accuracy when the Training Data set changed, this was probably because the Testing Data set was not large enough for sufficient comparison. Based on the results obtained, the huge difference between the highest and the lowest retrieve rate shows that while the MFCC feature extraction method may be able to provide a good feature vector, a good training data is essential in creating a system with high accuracy. Through this experiment, the MFCC algorithm was verified to be able to classify different speakers into different class even when a different set of word were used. Just as how any human beings are able
60
to classify and recognise voices of different people when given a different set of speech samples, this feature extraction method was again proven to be able to do so.
The next experiment performed made use of sentences of 30 seconds instead of a single word speech sample. The experiment returned a considerably good result, thus proving that MFCC is able to capture important features in sentence level. This however, could be improved if more training was done. A good set of training data would definitely increase the accuracy of the system, inconsistency in the quality of a training set (especially if there is noise) such as Speaker 3 Speech 1 displayed a good example of such. Upon further inspection, a common pattern found in the results shows that the Euclidean Distance between speakers of the same gender is generally small. Such results are seen in Speaker 1 and Speaker 4s result table, this proves that the feature extraction technique, MFCC does model after the human auditory system.
The last experiment that was carried out made use of sentences with multiple speakers. These files were then tested against speech samples both in sentence form and word length formats. The results obtained shows that while the system is able to better represent the speakers with a feature vector when both the speakers are off different genders; the system would perform better when it is required to distinguish the speakers available in the sentence from the speakers given that both speakers are off the same gender. This is most likely due to the fact that speakers from the same gender would have higher resemblance in their voices, hence the system is able to show a better result.
In essence, through this experiment, the MFCC algorithm has been proven to be robust in capturing important features in cases where a good audio speech sample is given and a sufficient number of Cepstral Coefficient is used. Variable speech samples or sentences do not have a significant impact on the overall system accuracy.
61
CHAPTER 4:CONCLUSIONS
1.23 Overview
This thesis involves audio signal feature extraction and classification in the area of speech and speaker analysis. The effect on the accuracy was the system was studied given various variables and condition.
The works in this thesis can be concluded in three main areas as follows: Development of Speech Database Speech analysis using Mel-frequency Cepstral Coefficients method Speaker classification using k-means method
1.24
62
The speech samples were extracted on by one in order to perform experiments of variable speech data. Next, the codes found in the VOICEBOX: Speech Processing Toolbox was studied to understand how each function works especially the ones used in this project. This is done in view of making rooms for improvement and especially in the area of implementation. Various codes were studied; one of the problems faced during this phase was to understand how the obtained results can be processed. The experimental phase provided a good insight on how different variables would affect the accuracy of a system which uses MFCC and K-means. The results obtained indicate a moderately acceptable level of speech and speaker recognition accuracy (up to 80%) despite given different Training Data such as same speech samples, variable speech samples and even variable sentence level samples.
1.25
Future Work
A larger set of Speech Database should be developed in order to obtain a more comprehensive and accurate result on the ability of the system. Along with that, the Speech Database could also be developed manually in view of having specific terms or phrase spoken. In terms of algorithm and methods, other variables such as numbers of filters used and even the shapes of each filter can also be tweaked to see how each of the changes affect the accuracy of the speech analysis and speaker recognition system. Speaker diarization can almost be implemented since the MFCC and K-means method is able to detect multiple speakers in a single sentence. Lastly, an automated system that is able to tabulate the system quicker can also be implemented instead of utilizing tools such as Microsoft Excel.
63
References
[1] Johns Hopkins University (1998, April 30). How Do We Hear While We Sleep?. ScienceDaily. Retrieved April 14, 2012, from http://www.sciencedaily.com/releases/1998/04/980430044534.htm [2] Vibha Tiwari, (2010, February). MFCC and its applications in speaker recognition, International Journal on Emerging Technologies 1. [3] Joseph P. Campbell, Jr. Speaker Recognition: A Tutorial. Proceedings of the IEEE, vol. 85, no. 9, 1997 [4] Peter W. Alberti, (n.d.). The Anatomy and Physiology of the Ear and Hearing, [Online]. Available: http://www.who.int/occupational_health/publications/noise2.pdf [5] David J. M. Robinson, (1999, September). The Human Auditory System, [Online]. Available: http://www.mp3-tech.org/programmer/docs/human_auditory_system.pdf [6] Azlifa, (2006, July 18). The Sound Producing System. Azus Notes. Retrieved April 16, 2012, from http://www.azlifa.com/phonetics-phonology-lecture-2-notes/ [7] Robin Schmidt., (n.d.). Digital Signals Sampling and Quantization. RS-MET, Berlin, Germany [Online]. Available: http://www.rs-met.com/documents/tutorials/DigitalSignals.pdf [8] Ta, K. T. (2009, November). Robust Speaker Identification/Verification for Telephony Applications. (Electronic Engineering), Sim University. Retrieved from http://sst.unisim.edu.sg:8080/dspace/bitstream/123456789/281/1/09_Kyaw %20Thuta.doc [9] Amod Gupta, (2009). Audio Processing on Constrained Devices, (Master of Mathematics in Computer Science), University of Waterloo. Retrieved from http://uwspace.uwaterloo.ca/bitstream/10012/4830/1/Thesis.pdf
64
[10] Manish P. Kesarkar, (2003, November). Feature Extraction for Speech Recognition. M.Tech. Credit Seminar Report, Electronic Systems Group, EE. Dept, IIT Bombay. [Online]. Available: http://www.ee.iitb.ac.in/~esgroup/es_mtech03_sem/sem03_paper_03307003.pdf [11] Aldebaro Klautau, (2005, November 22). The MFCC, [Online]. Available: http://www.cic.unb.br/~lamar/te073/Aulas/mfcc.pdf [12] Md. Rashidul Hasan, Mustafa Zamil, Mohd Bolam Khabsani,Mohd Saifur Rehman. (2004, December 28-30). Speaker Identification using Mel Frequency Cepstral Coefficients. 3rd international conference on electrical and computer engineering (ICECE). [Online]. Available: www.assembla.com/spaces/strojno_ucenje_lab/documents/bikvce70wr3r6zeje5av nr/download/p141omfccu.pdf [13] Jyh-Shing Roger Jang. (n.d.). 12-2 MFCC, Retrieved April 16, 2012 from http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/speechfeaturemfcc. asp?title=12-2%20mfcc [14] Htin Aung, M., Ki-Seung, Asif Hirani, Ke Ye., (2006) Speaker Recognition, [Online]. Available: http://www.softwarepractice.org/mediawiki/images/5/5f/Finally_version1.pdf [15] VOICEBOX: Speech Processing Toolbox for MATLAB. (n.d.), Retrieved April 17, 2012 from http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html [16] Reynolds, D.A., Quatieri, T.F., Dunn, R. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1-3), 1941, 2000 [17] S. Tranter and D. Reynolds. An overview of automatic speaker diarization systems. IEEE Transactions on Audio Speech and Language Processing, vol. 14, no. 5, pp. 15571565, 2006 [18] D. Liu and F. Kubala. Online speaker clustering, in Proc. 2004 IEEE International Conference Acoustics, Speech, and Signal Processing, vol. 1, pp. 333-336, 2006
65
% PLOT OF TRAINING or TESTING DATA if required (Example) figure(1); time=(1:length(s1))/fs1; plot(time, s1); title('Training Data 1'); xlabel('Time, S'); ylabel('Amplitude, A');
66
% FEATURE EXTRACTION, 4 coefficients % Training set, CoE = Number of Cepstral Coefficients, M = Hamming CoE = 4; mfcc_s1 mfcc_s2 mfcc_s3 mfcc_s4 mfcc_s5 mfcc_s6 mfcc_s7 mfcc_s8 = = = = = = = = melcepst(s1,fs1,'M', melcepst(s2,fs2,'M', melcepst(s3,fs3,'M', melcepst(s4,fs4,'M', melcepst(s5,fs5,'M', melcepst(s6,fs6,'M', melcepst(s7,fs7,'M', melcepst(s8,fs8,'M', CoE)'; CoE)'; CoE)'; CoE)'; CoE)'; CoE)'; CoE)'; CoE)';
% Testing Set mfcc_st1 = melcepst(st1,fst1,'M', mfcc_st2 = melcepst(st2,fst2,'M', mfcc_st3 = melcepst(st3,fst3,'M', mfcc_st4 = melcepst(st4,fst4,'M', mfcc_st5 = melcepst(st5,fst5,'M', mfcc_st6 = melcepst(st6,fst6,'M', mfcc_st7 = melcepst(st7,fst7,'M', mfcc_st8 = melcepst(st8,fst8,'M',
% Obtaining the Feature Vector using K-means % Grouping the MFCC into 1 cluster. % MFCC K-Means clustering (Training Data) km_s1 = kmeans(mfcc_s1', 1); km_s2 = kmeans(mfcc_s2', 1); km_s3 = kmeans(mfcc_s3', 1); km_s4 = kmeans(mfcc_s4', 1); km_s5 = kmeans(mfcc_s5', 1); km_s6 = kmeans(mfcc_s6', 1); km_s7 = kmeans(mfcc_s7', 1); km_s8 = kmeans(mfcc_s8', 1); % MFCC km_st1 km_st2 km_st3 km_st4 km_st5 km_st6 km_st7 km_st8 K-Means clustering (Testing Data) = kmeans(mfcc_st1', 1); = kmeans(mfcc_st2', 1); = kmeans(mfcc_st3', 1); = kmeans(mfcc_st4', 1); = kmeans(mfcc_st5', 1); = kmeans(mfcc_st6', 1); = kmeans(mfcc_st7', 1); = kmeans(mfcc_st8', 1);
% --------------------------- Testing Phase -------------------disp ('---------------------- Test Results----------------------'); disp (' ');
67
% Calculating Euclidean Distance using for loop, displaying results in a % Matrix for i=1:8 for j=1:8 eu_d(i,j) = eval(['sum((km_s' int2str(i) '- km_st' int2str(j) ').^2)']); end end disp(['Compute: Training Data(y) vs Testing Data(x)']) Eucleadean_Distance4 = eu_d
68
% import wav training data into MATLAB [s1, fs1, nbits1]=wavread('s1.wav'); [s2, fs2, nbits2]=wavread('s2.wav'); [s3, fs3, nbits3]=wavread('s3.wav'); [s4, fs4, nbits4]=wavread('s4.wav'); [s5, fs5, nbits5]=wavread('s5.wav'); [s6, fs6, nbits6]=wavread('s6.wav'); [s7, fs7, nbits7]=wavread('s7.wav'); [s8, fs8, nbits8]=wavread('s8.wav'); % import wav testing data into MATLAB [st1, fst1, nbitst1]=wavread('st1.wav'); [st2, fst2, nbitst2]=wavread('st2.wav'); [st3, fst3, nbitst3]=wavread('st3.wav'); [st4, fst4, nbitst4]=wavread('st4.wav'); [st5, fst5, nbitst5]=wavread('st5.wav'); [st6, fst6, nbitst6]=wavread('st6.wav'); [st7, fst7, nbitst7]=wavread('st7.wav'); [st8, fst8, nbitst8]=wavread('st8.wav'); % Resampling of data p = 1; q = 2; st1 = resample(st1, p,q); fst1 = fst1*(p/q); st2 = resample(st2, p,q); fst2 = fst2*(p/q); st3 = resample(st3, p,q); fst3 = fst3*(p/q); st4 = resample(st4, p,q); fst4 = fst4*(p/q); st5 = resample(st5, p,q); fst5 = fst5*(p/q); st6 = resample(st6, p,q); fst6 = fst6*(p/q); st7 = resample(st7, p,q); fst7 = fst7*(p/q); st8 = resample(st8, p,q); fst8 = fst8*(p/q);
69
Result:
10.51 Sample 1
S1 S2 S3 S4
Result:
4.39
5.20
6.57
3.27
5.10
9.67
5.19
10.63
8.18
6.29
9.29
5.69
4.08
4.03
2.70
4.99
11.93
6.07
70
Result:
24.1 Sample 1
S1 S2 S3 S4
Result:
3.2
3.2
7.9
2.9
5.2
5.3
2.3
8.1
4.7
7.5
3.0
4.5
2.0
3.2
4.4
2.9
9.3
7.0
71