Ambavi Patel, Darshan Menon, Dhwanil Savla. Under the guidance of Prof. J. H. Nirmal
BIOMETRICS
Biometrics is an emerging field of technology that uses unique, measurable physical, biological, or behavioral characteristics to identify a person. Biometric properties of humans include the fingerprint, iris, face, and voice. A concise definition of biometrics is the automatic recognition of a person using distinguishing traits.
Some biometric systems are:
- Fingerprints
- Eye patterns
- Signature dynamics
- Keystroke dynamics
- Facial features
- Speaker recognition
Speech is the most natural way of communication between humans. It is a one-dimensional signal. It carries information about:
- the message
- the language
- the speaker
- the emotion
Characteristics of speech
- The bandwidth of the signal is 4 kHz.
- The signal is periodic with a fundamental frequency between 50 Hz and 400 Hz.
- There are peaks in the spectral distribution of energy at (2n − 1) · 500 Hz; n = 1, 2, 3, …
- The envelope of the power spectrum of the signal shows a decrease with increasing frequency (−6 dB per octave).
[Figure: the speech chain from speaker to listener; information rate at the physiological level is about 2 kBps.]
The source of most speech is the larynx. It contains two folds of tissue called the vocal folds or vocal cords, which can open and shut like a pair of fans. The gap between the vocal cords is called the glottis; as air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow. This process is known as phonation. The frequency of vibration determines the pitch of the voice, typically in the range 50-400 Hz.
[Figure: the glottal pulse. Left: amplitude vs. time (ms), showing the opening phase, closing phase, and closure. Right: intensity vs. frequency (Hz) of the glottal-pulse spectrum.]
The vocal tract consists of the air passages from the vocal cords to the lips, including the nasal cavity. The vocal tract behaves like a resonance chamber, amplifying and attenuating certain frequencies. The shape and size of the vocal tract can be modified by moving the lips, tongue, velum, etc., which results in continually changing resonant frequencies. Consequently, a spectrum of frequencies is produced which contains peaks at certain frequencies. The nasal cavity can be either included or excluded by opening or closing the soft palate, or velum. Sounds are classified as nasal if they are produced by the passage of air through the nasal cavity.
[Figure: vocal tract response, intensity vs. frequency (Hz).]
The spectrum of the vocal tract response consists of a number of resonant frequencies called formants.
Sound classes
Based on the role of the vocal cords there are two main categories, voiced and unvoiced; other classes are plosive and nasal.
Voiced speech
In voiced speech the vocal cords vibrate at a particular rate called the fundamental frequency, so air flows through the glottis in discrete puffs. Examples: /a/, /e/, /i/.
Unvoiced speech
In this case the vocal cords do not vibrate but are narrowed, resulting in a turbulent flow of air through the tract. Examples: /s/, /f/.
Nasal sounds: generated when the velum is lowered and the nasal cavity is coupled with the vocal tract. Examples: /m/, /n/.
Plosive sounds: characterized by a complete closure or constriction towards the front of the vocal tract; pressure builds up behind the closure, and its sudden release generates the plosive sound. Examples: /p/, /t/, /k/.
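The voiced/unvoiced distinction can be illustrated with a simple short-time heuristic (a sketch for illustration only, not part of the system described here): voiced frames tend to have high energy and a low zero-crossing rate, unvoiced frames the opposite. The threshold values below are illustrative assumptions, not calibrated constants.

```python
import math

def classify_frame(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Crude voiced/unvoiced decision from short-time energy and
    zero-crossing rate. Thresholds are illustrative assumptions."""
    n = len(frame)
    energy = sum(s * s for s in frame) / n
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (n - 1)
    if energy < energy_thresh / 10:
        return "silence"
    return "voiced" if energy >= energy_thresh and zcr < zcr_thresh else "unvoiced"

# A 100 Hz sine sampled at 8 kHz behaves like a voiced frame;
# low-amplitude, sign-alternating noise behaves like an unvoiced one.
fs = 8000
voiced = [math.sin(2 * math.pi * 100 * t / fs) for t in range(240)]
unvoiced = [0.05 * (-1) ** t for t in range(240)]
```

Real systems combine several such cues (and pitch detection); this two-feature rule is only meant to make the sound classes concrete.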
SPEAKER RECOGNITION
Flow chart
Start -> Read the speech sample -> Amplify the signal -> Remove DC offset -> Calculate MFCC -> Find centroid -> Load the stored database -> Find Euclidean distance -> Store the result -> Find the minimum result -> is the minimum below the threshold?
- yes: Speaker recognized
- no: Speaker not recognized
FLOW CHART
INPUT VOICE SIGNAL -> AMPLIFICATION -> REMOVAL OF D.C. OFFSET -> OVERLAPPED FRAMING -> WINDOWING -> DFT -> MEL FILTER BANK -> LOG -> DCT -> QUEFRENCY DOMAIN -> NORMALISATION
Preprocessing
Amplification:
The speech signal is amplified to obtain:
- better pitch variations
- a good envelope
- better returned energy
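The slides do not specify the amplification method; one common choice (an assumption here) is peak normalization, scaling the signal so its largest absolute sample reaches a target level. A minimal Python sketch:

```python
def amplify(x, target_peak=0.9):
    """Peak normalization: scale the signal so its maximum absolute
    amplitude equals target_peak. A silent signal is returned unchanged."""
    peak = max(abs(s) for s in x)
    if peak == 0:
        return list(x)
    g = target_peak / peak
    return [s * g for s in x]

y = amplify([0.1, -0.3, 0.2])   # peak scaled from 0.3 up to 0.9
```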
DC elimination method 1
xf = fft(x);    % forward FFT
xf(1) = 0;      % clear the DC (first) component of the FFT
xc = ifft(xf);  % inverse FFT
y = real(xc);   % keep only the real part
DC elimination method 2
ofs = mean(x);  % calculate the mean (DC offset)
y = x - ofs;    % subtract it from each sample
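The two MATLAB snippets above are equivalent in effect; method 2 (mean subtraction) is the cheaper one. For reference, a plain-Python version of it:

```python
def remove_dc(x):
    """DC elimination method 2: subtract the sample mean so the
    signal averages to zero."""
    ofs = sum(x) / len(x)
    return [s - ofs for s in x]

y = remove_dc([1.0, 2.0, 3.0])   # mean 2.0 removed -> [-1.0, 0.0, 1.0]
```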
Spectrum
Each Fourier-transformed speech segment is binned by correlating it with each triangular filter of the mel filter bank.
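A minimal sketch of such a triangular filter bank, assuming uniform spacing on the mel scale (filter count, FFT size, and sampling rate below are example values, not taken from the slides):

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the mel scale; each row is
    one filter over the n_fft//2 + 1 magnitude-spectrum bins."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(fs / 2)
    mel_pts = [i * mel_max / (n_filters + 1) for i in range(n_filters + 2)]
    bin_pts = [int(round(mel_to_hz(m) * n_fft / fs)) for m in mel_pts]
    fb = [[0.0] * n_bins for _ in range(n_filters)]
    for j in range(n_filters):
        lo, ctr, hi = bin_pts[j], bin_pts[j + 1], bin_pts[j + 2]
        for k in range(lo, ctr):          # rising edge of the triangle
            if ctr > lo:
                fb[j][k] = (k - lo) / (ctr - lo)
        for k in range(ctr, hi):          # falling edge of the triangle
            if hi > ctr:
                fb[j][k] = (hi - k) / (hi - ctr)
    return fb

fb = mel_filterbank(40, 512, 16000)   # 40 filters over 257 spectrum bins
```

The binning described above is then the dot product of each filter row with the magnitude (or power) spectrum of a frame.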
Post Processing
Normalization: the enhancement applied is a normalization, meaning that the feature vectors are normalized over time to zero mean and unit variance. With feature vectors x_t, t = 1, …, T, the mean vector is μ = (1/T) Σ x_t, and each vector is replaced by (x_t − μ)/σ, where σ is the per-dimension standard deviation.
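A sketch of this zero-mean, unit-variance normalization, assuming each coefficient (each row of the feature matrix) is normalized independently over the frames:

```python
import math

def cmvn(features):
    """Normalize each feature dimension (row) over time to zero mean
    and unit variance. `features` is a list of rows, one per coefficient."""
    out = []
    for row in features:
        t = len(row)
        mu = sum(row) / t
        var = sum((v - mu) ** 2 for v in row) / t
        sd = math.sqrt(var) if var > 0 else 1.0   # guard constant rows
        out.append([(v - mu) / sd for v in row])
    return out

norm = cmvn([[1.0, 2.0, 3.0, 4.0]])
```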
Ex.: sampling frequency = 16000 Hz, recording duration = 3 s, time slice = 15 ms.
- Input signal => 48000 × 1
- After frame blocking (frame size = 240 samples, overlap = 120 samples) => 240 × 399 (a hop of 120 samples gives 399 frames)
- Mel filter bank => 40 × 120 (40 triangular filters over 120 spectral bins)
- Filter-bank output => 40 × 399
- MFCC (after DCT, keeping coefficients 3 to 15) => 13 × 399
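The frame count in the example follows directly from the frame size and overlap: with N samples, frame length L, and hop H = L − overlap, the number of full frames is floor((N − L)/H) + 1. A quick check:

```python
def frame_count(n_samples, frame_len, overlap):
    """Number of fully filled frames with the given overlap."""
    hop = frame_len - overlap
    return (n_samples - frame_len) // hop + 1

fs, duration = 16000, 3
frame_len = int(0.015 * fs)                 # 15 ms slice -> 240 samples
n = frame_count(fs * duration, frame_len, overlap=120)   # -> 399 frames
```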
Speaker coding
Here we use a method analogous to the cross-parity method used for error detection and correction in digital communication systems. In our case, instead of computing parity bits, the root mean square (RMS) is calculated for each row and each column of the feature matrix. The RMS values of all rows and all columns together form one codebook for one speaker; N such codebooks are created for N speakers.
- Feature matrix => 13 × 399
- Row RMS => 13 × 1
- Column RMS => 1 × 399
- Codebook (row RMS and column RMS concatenated) => 1 × 412
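The codebook construction described above can be sketched directly: the RMS of each of the 13 rows and each of the 399 columns of the MFCC matrix, concatenated into 13 + 399 = 412 entries.

```python
import math

def rms(values):
    """Root mean square of a sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def codebook(mfcc):
    """Speaker codebook: RMS of every row (coefficient) followed by the
    RMS of every column (frame) of the MFCC matrix."""
    row_rms = [rms(row) for row in mfcc]
    col_rms = [rms(col) for col in zip(*mfcc)]
    return row_rms + col_rms

mfcc = [[float(i + j) for j in range(399)] for i in range(13)]
cb = codebook(mfcc)      # 13 + 399 = 412 entries
```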
Feature matching
This codebook is compared with the stored database values by means of the Euclidean distance, the standard distance measure between two vectors in feature space. Access is granted to the speaker only if the minimum distortion falls below a threshold value.
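A minimal sketch of this matching step, with the threshold left as a free parameter (its value depends on the input device, as noted in the analysis below):

```python
import math

def euclidean(a, b):
    """Standard Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match(test_cb, database, threshold):
    """Return the index of the closest stored codebook, or None if even
    the smallest distortion is above the threshold (access denied)."""
    dists = [euclidean(test_cb, cb) for cb in database]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best if dists[best] < threshold else None
```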
Result
[Figure: minimum Euclidean distance for test speakers 1-5, plotted against the threshold line.]
Analysis
This method is simple; since there are no heavy computations, it produces results quickly. The threshold value needs to be changed depending on the input device, i.e. it is sensitive to device variation. Its accuracy ranges from 60% to 80%, depending on the physical and emotional condition of the speaker and the surrounding environment. Because of this low performance, we also classified the features extracted from the speakers using a back-propagation neural network.
Neural network
An artificial neural network is an information-processing model inspired by the way the human biological nervous system processes information. The correspondence between biological and artificial neurons is:
Biological neuron | Artificial neuron
Cell              | Neuron
Dendrites         | Synaptic weights
Soma              | Net input
Axon              | Output
[Figure: a biological neuron (nucleus, cell body, axon, dendrites) and an artificial neural network with input, hidden, and output layers.]
Characteristics of network
- The input layer contains 20 neurons.
- The hidden layer contains 20 neurons.
- The output layer contains as many neurons as there are speakers in the database.
- The back-propagation weight/bias learning function is gradient descent with momentum.
- The network's performance is measured by the mean squared error.
- The activation function for the hidden layer is the hyperbolic tangent sigmoid transfer function; for the output layer it is the linear transfer function.
Train network -> is regression > threshold?
- yes: Speaker recognized
- no: train the network again
Input to network
20 centroids are generated from the MFCC matrix and used as input to the network (input vector of size 1 × 20).
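The slides do not say how the 20 centroids are computed; as one illustrative reading (an assumption, not the original method), the 399 frames can be split into 20 roughly equal chunks and each chunk reduced to the mean of its MFCC values, giving a 1 × 20 input vector:

```python
def centroids_20(mfcc, n_out=20):
    """Reduce an (n_coeff x n_frames) MFCC matrix to n_out scalars by
    chunking the frames and averaging each chunk. Illustrative only;
    the reduction used in the original system is not specified."""
    n_frames = len(mfcc[0])
    bounds = [round(i * n_frames / n_out) for i in range(n_out + 1)]
    out = []
    for lo, hi in zip(bounds, bounds[1:]):
        vals = [row[t] for row in mfcc for t in range(lo, hi)]
        out.append(sum(vals) / len(vals))
    return out

c20 = centroids_20([[1.0] * 399] * 13)   # constant matrix -> 20 equal values
```

A clustering method such as k-means over the frame vectors would be another plausible reading of "centroids".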
Training
The training algorithm involves four stages.

Initialization of weights:
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3-10.
Step 3: For each training pair, do steps 4-9.

Feed forward:
Step 4: Each input unit receives the input signal Xi (i = 1…n), where n is the number of input neurons.
Step 5: Each hidden unit sums its weighted input signals,
Z_in,j = V0j + Σ(i=1…n) Xi·Vij
and applies the activation function, Zj = f(Z_in,j), j = 1…p, where p is the number of hidden neurons. V0 and V are the biases and weights of the hidden units.
Step 6: Each output unit sums its weighted input signals,
Y_in,k = W0k + Σ(j=1…p) Zj·Wjk
and applies the activation function, Yk = f(Y_in,k), k = 1…m, where m is the number of output neurons = N. W0 and W are the biases and weights of the output units.

Back-propagation of error:
Step 7: Each output unit Yk receives the target Tk corresponding to the input pattern. The error-information term is
δk = (Tk − Yk)·f′(Y_in,k)
Step 8: Each hidden unit Zj sums its delta inputs from the layer above,
δ_in,j = Σ(k=1…m) δk·Wjk
and its error-information term is
δj = δ_in,j·f′(Z_in,j)

Update of weights and biases:
Step 9: Each output unit Yk updates its bias and weights with the correction terms
ΔWjk = η·δk·Zj,  Wjk(new) = Wjk(old) + ΔWjk
ΔW0k = η·δk,  W0k(new) = W0k(old) + ΔW0k
Each hidden unit Zj updates its bias and weights with
ΔVij = η·δj·Xi,  Vij(new) = Vij(old) + ΔVij
ΔV0j = η·δj,  V0j(new) = V0j(old) + ΔV0j
where η is the learning rate.
Step 10: Test the stopping condition: maximum epochs = 100, or mean squared error ≈ 0. The final updated weights are saved and used in the testing phase.
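The steps above can be sketched as one backpropagation update for a single training pair, using a tanh hidden layer and linear outputs as in the network described earlier (the layer sizes and learning rate below are reduced for illustration; the momentum term is omitted):

```python
import math, random

def train_step(x, t, V, V0, W, W0, lr=0.1):
    """One backpropagation update (steps 4-9) for a one-hidden-layer
    network with tanh hidden units and linear outputs.
    V, V0: hidden weights/biases; W, W0: output weights/biases."""
    # Feed forward (steps 4-6)
    z_in = [V0[j] + sum(x[i] * V[i][j] for i in range(len(x)))
            for j in range(len(V0))]
    z = [math.tanh(v) for v in z_in]
    y = [W0[k] + sum(z[j] * W[j][k] for j in range(len(z)))
         for k in range(len(W0))]
    # Output deltas (step 7); linear activation, so f'(Y_in) = 1
    dk = [t[k] - y[k] for k in range(len(y))]
    # Hidden deltas (step 8); f'(Z_in) = 1 - tanh(Z_in)^2
    dj = [(1 - z[j] ** 2) * sum(dk[k] * W[j][k] for k in range(len(dk)))
          for j in range(len(z))]
    # Weight and bias updates (step 9)
    for j in range(len(z)):
        for k in range(len(dk)):
            W[j][k] += lr * dk[k] * z[j]
    for k in range(len(dk)):
        W0[k] += lr * dk[k]
    for i in range(len(x)):
        for j in range(len(z)):
            V[i][j] += lr * dj[j] * x[i]
    for j in range(len(z)):
        V0[j] += lr * dj[j]
    return sum(d * d for d in dk) / len(dk)   # mean squared error

random.seed(0)
x, t = [0.5, -0.2], [1.0]                     # one toy training pair
V = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
V0 = [0.0] * 4
W = [[random.uniform(-0.5, 0.5)] for _ in range(4)]
W0 = [0.0]
errors = [train_step(x, t, V, V0, W, W0) for _ in range(50)]
```

Repeating the step drives the mean squared error down, which is exactly the stopping criterion of step 10.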
Testing
After successful training to the chosen minimum error, the network's environment (its weights and biases) is saved. For testing, the voice sample of a speaker is applied to the network and the generated output is correlated with the desired data:
Regression = T/Y
where T is the target (desired) data and Y is the output of the network. If the regression value is greater than the threshold, the speaker is said to be recognized.
Results
Number of speakers: 5. Number of samples per speaker: 1. Test sample: speaker 1. The network was trained twice; the results produced each time are shown below.
[Figure: regression plots for the two training runs.]
Analysis
This method is complex: as the number of speakers increases, the neuron count and the amount of computation increase, producing a large delay in training. Since each training run gives different results, the network must be trained well to obtain consistent output. It also requires a large amount of memory. Its accuracy ranges from 70% to 95%, depending on the initial random weights and the training of the network.
Performance
[Figure: performance comparison of (1) Euclidean distance and (2) neural network.]
Applications
Voice-based biometric systems are used in many applications where identification of the speaker, along with a password or command, is required, such as:
- Attendance systems
- Access control systems
- Biometric login to telephone-aided shopping systems
- Information and reservation services
- Security control for confidential information
- Forensic purposes
- Voice command and control
- Voice dialing in hands-free devices
and so on.