
BIOMETRIC SYSTEM

Ambavi Patel, Darshan Menon, Dhwanil Savla. Under the guidance of Prof. J. H. Nirmal

BIOMETRIC
Biometrics is an emerging field of technology that uses unique, measurable physical, biological, or behavioral characteristics to identify a person. Biometric properties of humans include the fingerprint, iris, face, and voice. A concise definition of biometrics is the automatic recognition of a person using distinguishing traits.

Various biometric systems

Some biometric systems are:
- Fingerprints
- Eye patterns
- Signature dynamics
- Keystroke dynamics
- Facial features
- Speaker recognition

The speech signal

Speech is the most desirable way of communication between humans. It is a one-dimensional signal. It carries:
- the message
- information about the language
- information about the speaker
- information about the emotion

Characteristics of speech
The bandwidth of the signal is 4 kHz.
The signal is periodic with a fundamental frequency between 50 Hz and 400 Hz.
There are peaks in the spectral distribution of energy at (2n − 1) · 500 Hz, n = 1, 2, 3, ...
The envelope of the power spectrum of the signal decreases with increasing frequency (−6 dB per octave).

Human vocal system

Human speech production and recognition system

Figure: The speech chain from speaker to listener. The message passes from the linguistic level (50-200 bps) through the physiological level (about 2 kbps) to the acoustic level (30-64 kbps), and back through the physiological and linguistic levels on the listener's side.
Speech production model

The source of most speech sounds is the larynx. It contains two folds of tissue called the vocal folds or vocal cords, which can open and shut like a pair of fans. The gap between the vocal cords is called the glottis; as air is forced through the glottis, the vocal cords start to vibrate and modulate the air flow. This process is known as phonation. The frequency of vibration determines the pitch of the voice, typically in the range 50-400 Hz.

Figure: The glottal pulse (amplitude vs. time in ms), showing the opening phase, closing phase, and closure, together with the spectrum of the glottal pulse (intensity vs. frequency in Hz).

The vocal tract consists of the air passages from the vocal cords to the lips, including the nasal cavity. The vocal tract behaves like a resonance chamber, amplifying and attenuating certain frequencies. The shape and size of the vocal tract can be modified by moving the lips, tongue, velum, etc. which results in continually changing resonant frequencies. Consequently a spectrum of frequencies is produced which contains peaks at certain frequencies. The nasal cavity can be either included or excluded by opening or closing the soft palate or velum. Sounds are classified as nasal if they are produced by the passage of air through the nasal cavity.

Figure: Spectrum of the glottal pulse filtered by the vocal tract (intensity vs. frequency in Hz).

The spectrum of the vocal tract response consists of a number of resonant frequencies called formants.

Some vocal tract positions

Sound classes

Based on the role of the vocal cords there are two categories: voiced and unvoiced. Other classes are plosive and nasal.

Voiced speech
In voiced speech the vocal cords vibrate at a particular frequency called the fundamental frequency, so the air flows through the glottis in discrete pulses. Examples: /a/, /e/, /i/.

Unvoiced speech
In unvoiced speech the vocal cords do not vibrate but are narrowed, resulting in a turbulent flow of air through the tract. Examples: /s/, /f/.

Nasal sounds: generated when the velum is lowered and the nasal cavity is coupled with the vocal tract. Examples: /m/, /n/. Plosive sounds: characterized by a complete closure or constriction towards the front of the vocal tract; pressure builds up behind the closure and its sudden release generates the plosive sound. Examples: /p/, /t/, /k/.

SPEAKER RECOGNITION

Flow chart
Start → Read the speech sample → Amplify the signal → Remove DC offset → Calculate MFCC → Find centroid → Load the stored database → Find Euclidean distance → Store the result → Find the minimum result → Is the minimum result > threshold? → yes: speaker not recognized; no: speaker recognized

Feature extraction: MFCC


Mel-frequency cepstral coefficients (MFCCs) are the most frequently used features in speech processing and are considered the standard method for feature extraction in speech recognition tasks. They are obtained by first performing a standard Fourier analysis and then converting the power spectrum to a mel-frequency spectrum. Taking the logarithm of that spectrum and computing its inverse Fourier transform (in practice a DCT) then yields the MFCCs.

FLOW CHART
Input voice signal → Amplification → Removal of DC offset → Overlap framing → Windowing → DFT → Mel frequency warping → Log → DCT → Quefrency domain (liftering) → Mel frequency cepstral coefficients → Normalisation

Reading speech signal


Two ways:
- Reading a wav file (wavread), or
- Recording live speech through a microphone (wavrecord).
Sampling frequency = 16000 Hz, duration = 2 sec.
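A minimal MATLAB sketch of both options, assuming the classic (pre-R2012) audio functions wavread and wavrecord mentioned above; the file name is only an example, and newer releases use audioread/audiorecorder instead:

fs = 16000;           % sampling frequency in Hz
duration = 2;         % recording length in seconds

% Option 1: read an existing wav file
[x, fs] = wavread('speaker1.wav');

% Option 2: record live speech from the default microphone
x = wavrecord(duration * fs, fs, 1, 'double');   % mono, double precision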

Preprocessing
Amplification:
The speech signal is amplified to obtain:
- better pitch variations,
- a good envelope,
- better returned energy.

D.C. offset elimination


The speech signal may contain a DC component which is not required; it can be eliminated by either of the two methods below. Removal of the DC component is also very useful for spectrum visualization.

DC elimination method 1
xf = fft(x);     % forward FFT
xf(1) = 0;       % clear the first (DC) component
xc = ifft(xf);   % inverse FFT
y = real(xc);    % keep only the real part

DC elimination method 2
ofs = mean(x);   % calculate the mean (DC offset)
y = x - ofs;     % subtract it from each sample

Framing and windowing

Discrete Fourier transform


The Fourier transform enables a non-periodic function to be represented as a sum of sinusoids and converts a speech signal from the time domain to the frequency domain: Y(w) = FT[ h(t) * x(t) ]. The discrete Fourier transform (DFT) is normally computed via the Fast Fourier Transform (FFT) algorithm, which is a widely used technique for evaluating the frequency spectrum of speech. The spectrum is symmetric, so only half of it needs to be considered.
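A MATLAB sketch of the framing, windowing and FFT steps, assuming the signal x from the reading step, a Hamming window (Signal Processing Toolbox), and the frame size (240 samples), overlap (120 samples) and half-spectrum convention used in the worked example later in this deck; the variable names are illustrative only:

frameLen = 240;                % 15 ms at 16 kHz
hop      = 120;                % 50% overlap
nFrames  = floor((length(x) - frameLen) / hop) + 1;
win      = hamming(frameLen);  % Hamming window

spec = zeros(frameLen/2, nFrames);
for k = 1:nFrames
    idx   = (k-1)*hop + (1:frameLen);   % sample indices of this frame
    frame = x(idx) .* win;              % apply the window
    X     = fft(frame);                 % 240-point FFT
    spec(:, k) = abs(X(1:frameLen/2));  % keep only half of the symmetric spectrum
end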

Spectrum

Mel Filter Bank


Since human perception of the frequency content of a speech signal is non-linear, the mel scale is used. Warping is the technique of converting frequency in hertz to frequency on the mel scale. The mapping of a measured frequency to the corresponding mel-scale frequency is described by the following equation:
Frequency (mel) = 2595 · log10(1 + f(Hz) / 700)

Each Fourier-transformed speech segment is binned by correlating it with each triangular filter of the mel filter bank.
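A sketch of how such a triangular mel filter bank can be built and applied in MATLAB, assuming 40 channels over the 120-bin half spectrum (spec) from the previous sketch; the exact bin placement and names are assumptions, not the project's original code:

nFilters = 40;  nBins = 120;  fs = 16000;
hz2mel = @(f) 2595 * log10(1 + f/700);        % warping to the mel scale
mel2hz = @(m) 700 * (10.^(m/2595) - 1);       % inverse mapping

% filter centre frequencies equally spaced on the mel scale
melPts = linspace(0, hz2mel(fs/2), nFilters + 2);
binPts = floor((nBins*2) * mel2hz(melPts) / fs) + 1;  % map back to FFT bins
binPts = min(max(binPts, 1), nBins);

H = zeros(nFilters, nBins);                   % triangular filter bank (40 x 120)
for i = 1:nFilters
    l = binPts(i); c = binPts(i+1); r = binPts(i+2);
    if c > l, H(i, l:c) = (0:(c-l)) / (c-l); end     % rising edge
    if r > c, H(i, c:r) = ((r-c):-1:0) / (r-c); end  % falling edge
end

melSpec = H * spec;    % (40 x 120) * (120 x nFrames) -> mel spectrum, 40 x nFrames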

Mel Cepstrum Coefficients


The mel cepstrum is obtained by taking the discrete cosine transform of the log mel-filter-bank energies:

ceps(n; m) = Σ_{k=0}^{N−1} log( fmel_k(m) ) · cos( π (2k + 1) n / (2N) ),  n = 0, 1, 2, ..., N − 1

The process of filtering in the cepstral domain is also called liftering. The restriction to the first Q (quefrency) coefficients reflects the liftering process. Liftering removes the slowly varying low-quefrency stationary component and the high-quefrency noise part.
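A MATLAB sketch of the DCT and liftering steps, assuming the Signal Processing Toolbox dct function, the mel spectrum melSpec from the previous sketch, and the coefficient range (3 to 15) used in the worked example below:

logMel = log(melSpec + eps);   % log mel-filter-bank energies (eps avoids log(0))
ceps   = dct(logMel);          % DCT of each column -> cepstral coefficients
mfcc   = ceps(3:15, :);        % liftering: keep coefficients 3..15 (13 x nFrames)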

Post Processing
Normalization: the enhancement applied here is a normalization of the feature vectors over time so that each coefficient has zero mean and unit variance. The mean vector is:

f̄(n) = (1/M) · Σ_{m=1}^{M} x_mfcc(n, m)

To normalize, the following operation is applied:

f(n; m) = x_mfcc(n, m) − f̄(n)

Dividing each coefficient by its standard deviation over time then gives unit variance.
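A minimal MATLAB sketch of this mean/variance normalization over time, assuming mfcc is the coefficients-by-frames matrix from the previous step:

mu    = mean(mfcc, 2);   % mean of each coefficient over all frames
sigma = std(mfcc, 0, 2); % standard deviation of each coefficient
mfccNorm = (mfcc - repmat(mu, 1, size(mfcc, 2))) ...
           ./ repmat(sigma, 1, size(mfcc, 2));   % zero mean, unit variance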


Ex. Consider: sampling frequency = 16000 Hz, recording duration = 3 sec, time slice = 15 msec.

Input signal: 48000 × 1
After frame blocking (frame size = 240, overlap size = 120, no. of windows = 399): 240 × 399
After FFT (half spectrum): 120 × 399
Mel filter bank (no. of channels = 40): 40 × 120
MFCC: 40 × 399
Considering the quefrency domain (coefficients 3 to 15): 13 × 399

Speaker coding
Here we used a method analogous to the cross-parity method used for error detection and correction in digital communication systems. In our case, instead of finding a parity bit, the root mean square (referred to below as the centroid) is computed for each row and each column. The root mean square of each column together with the root mean square of each row constitutes one codebook for one speaker; N such codebooks are created for N speakers.

Feature vectors:
(1) Centroid of each row: 13 × 1
(2) Centroid of each column: 1 × 399
Resulting codebook: 1 × 412
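A MATLAB sketch of the codebook construction under the interpretation above (a 13 × 399 normalized MFCC matrix, RMS over rows and over columns, concatenated into one 1 × 412 vector); it assumes mfccNorm from the normalization step and is illustrative rather than the original code:

% mfccNorm: 13 x 399 normalized MFCC matrix of one speaker
rowCode = sqrt(mean(mfccNorm.^2, 2));   % RMS of each row    -> 13 x 1
colCode = sqrt(mean(mfccNorm.^2, 1));   % RMS of each column -> 1 x 399
codebook = [rowCode' colCode];          % one 1 x 412 codebook per speaker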

Feature matching
This codebook is compared with the stored database values by means of the Euclidean distance. The Euclidean distance is the standard distance measure between two vectors in feature space.

Only if this distortion falls below the threshold value is access allowed to the speaker.
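A sketch of the matching step in MATLAB: the test codebook is compared against every stored codebook and the nearest speaker is accepted only if the minimum distance is below the threshold. The database layout and the threshold value are assumptions for illustration:

db = rand(5, 412);       % placeholder database: one stored codebook per row
dists = zeros(size(db, 1), 1);
for s = 1:size(db, 1)
    dists(s) = norm(codebook - db(s, :));   % Euclidean distance to speaker s
end
[dmin, best] = min(dists);

threshold = 10;          % device-dependent, set empirically
if dmin < threshold
    fprintf('Speaker %d recognized\n', best);
else
    fprintf('Speaker not recognized\n');
end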

Result
Figure: Bar chart of the Euclidean distance (0 to 20) of each of the five test speakers against each of the five enrolled speakers, with the decision threshold marked.

Analysis
This method is simple. Since there are no complexities, it takes little time to produce a result. The threshold value needs to be changed depending on the input device, i.e. the method is sensitive to device variation. Its accuracy ranges from 60% to 80%, depending on the physical and emotional condition of the speaker and the surrounding environment. Because of this low performance, we also classified the features extracted from the speakers using a back-propagation neural network.

Neural network
An artificial neural network is an information processing model inspired by the way the human biological nervous system processes information. The correspondence between biological and artificial neurons is:

Cell -> Neuron
Dendrites -> Synaptic weights
Soma -> Net input
Axon -> Output

Figure: A biological neuron (nucleus, cell body, dendrites, axon) and an artificial neuron (inputs, weights, neuron as processing element, output).

Neural network architecture


There are many classes of neural networks. They are mainly classified into three fundamentally different classes:
1) Feed-forward networks
   a) Single-layer artificial neural network
   b) Multi-layer artificial neural network
2) Recurrent networks
3) Competitive networks

Back propagation network


It is a multi-layer feed-forward network using the extended gradient-descent-based delta learning rule, commonly known as the back-propagation (of errors) rule. Back propagation provides a computationally efficient method for changing the weights in a feed-forward network with differentiable activation function units, in order to learn a training set of input-output examples.

Figure: Input layer, hidden layer and output layer. Activation propagates from the input layer towards the output layer, while the error propagates in the opposite direction.

Network of our project

Characteristics of network
The input layer contains 20 neurons. The hidden layer contains 20 neurons. The output layer contains as many neurons as there are speakers in the database. The back-propagation weight/bias learning function is gradient descent with momentum. The network's performance is measured by the mean of squared errors.

The activation function of the hidden layer is the hyperbolic tangent sigmoid transfer function; the activation function of the output layer is the linear transfer function.
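A sketch of how such a network can be set up with the MATLAB Neural Network Toolbox; the exact function names depend on the toolbox version, and the training data here are placeholders, so treat this as an assumption rather than the project's original code:

numSpeakers = 5;                       % output neurons = number of speakers
net = feedforwardnet(20, 'traingdm');  % 20 hidden neurons, gradient descent with momentum
net.layers{1}.transferFcn = 'tansig';  % hyperbolic tangent sigmoid in the hidden layer
net.layers{2}.transferFcn = 'purelin'; % linear transfer function in the output layer
net.performFcn = 'mse';                % performance measured by mean squared error
net.trainParam.epochs = 100;           % maximum number of epochs

% placeholder training data: 20-element feature vectors and one-hot speaker targets
X = rand(20, 25);                      % 20 features x 25 training samples (illustrative)
T = repmat(eye(numSpeakers), 1, 5);    % 5 x 25 one-hot target matrix (illustrative)
net = train(net, X, T);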

MFCC → Create neural network → Train network → Save network environment → Test network → Is regression > threshold? → yes: speaker recognized; no: speaker not recognized

Input to network

20 centroids are generated from the MFCC matrix and used as the input to the network, giving a 1 × 20 input vector.

Training

The training algorithm involves four stages.
Initialization of weights:
Step 1: Initialize the weights to small random values.
Step 2: While the stopping condition is false, do steps 3-10.
Step 3: For each training pair, do steps 4-9.

Feed forward:
Step 4: Each input unit receives the input signal X_i (i = 1..n), where n = total number of input neurons.
Step 5: Each hidden unit sums its weighted input signals,
Z_in,j = V_0j + Σ_{i=1}^{n} X_i · V_ij
and applies the activation function,
Z_j = f(Z_in,j), j = 1..p, where p = total number of hidden neurons.
Here V_0 and V are the biases and weights of the hidden units, respectively.
Step 6: Each output unit sums its weighted input signals,
Y_in,k = W_0k + Σ_{j=1}^{p} Z_j · W_jk
and applies the activation function,
Y_k = f(Y_in,k), k = 1..m, where m = total number of output neurons = N.
Here W_0 and W are the biases and weights of the output units, respectively.

Back propagation of error:
Step 7: Each output unit Y_k receives the target pattern corresponding to the input pattern. The error information term is calculated as
δ_k = (T_k − Y_k) · f′(Y_in,k)
Step 8: Each hidden unit Z_j sums its delta inputs from the layer above,
δ_in,j = Σ_{k=1}^{m} δ_k · W_jk
and the error information term is calculated as
δ_j = δ_in,j · f′(Z_in,j)

Updating of weights and biases:
Step 9: Each output unit Y_k updates its bias and weights. The weight correction term is
ΔW_jk = α · δ_k · Z_j
W_jk(new) = W_jk(old) + ΔW_jk
W_0k(new) = W_0k(old) + ΔW_0k, with ΔW_0k = α · δ_k
Each hidden unit Z_j updates its bias and weights. The weight correction term is
ΔV_ij = α · δ_j · X_i
V_ij(new) = V_ij(old) + ΔV_ij
V_0j(new) = V_0j(old) + ΔV_0j, with ΔV_0j = α · δ_j
where α is the learning rate.
Step 10: Test the stopping condition. The stopping conditions are: maximum epochs = 100 and mean square error ≈ 0. The final updated weights are saved and used in the testing phase.
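A compact MATLAB sketch of one training iteration of these steps for a single input/target pair, written from scratch with tansig hidden units and linear output units purely to illustrate the update rules; the sizes and initial weights are placeholders, and this is not the toolbox code used in the project:

n = 20; p = 20; m = 5;                   % input, hidden, output sizes (example)
x = rand(n,1); t = zeros(m,1); t(1) = 1; % one sample and its one-hot target
V = 0.1*randn(p,n); v0 = zeros(p,1);     % hidden-layer weights and biases
W = 0.1*randn(m,p); w0 = zeros(m,1);     % output-layer weights and biases
alpha = 0.1;                             % learning rate

% feed forward (steps 4-6)
zin = v0 + V * x;   z = tanh(zin);       % hidden layer, tansig activation
yin = w0 + W * z;   y = yin;             % output layer, linear activation

% back propagation of error (steps 7-8)
dk  = (t - y);                           % output deltas (f' = 1 for linear units)
din = W' * dk;                           % delta inputs of the hidden units
dj  = din .* (1 - z.^2);                 % hidden deltas (f' of tanh)

% weight and bias update (step 9)
W = W + alpha * dk * z';   w0 = w0 + alpha * dk;
V = V + alpha * dj * x';   v0 = v0 + alpha * dj;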

Testing
After successful training down to the decided minimum error, the network environment is saved. For testing, the voice sample of a speaker is applied to the network and the generated output is compared with the desired data by computing the regression value R between T (the target or desired data) and Y (the output of the network).

If the value of the regression is more than the threshold value, then the speaker is said to be recognized.
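A sketch of the test step assuming the Neural Network Toolbox regression function, which returns the correlation-based R value between targets and network outputs; net, xTest and tTest are assumed to come from the training step, and the threshold value is an assumption:

y = net(xTest);              % network output for the test feature vector
r = regression(tTest, y);    % R value between target tTest and output y

regThreshold = 0.9;          % example threshold, tuned on the training data
if r > regThreshold
    disp('Speaker recognized');
else
    disp('Speaker not recognized');
end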

Results

No. of speakers: 5. No. of samples per speaker: 1. Test sample: speaker 1. The network was trained twice; the results produced each time are shown below.

Figures: training performance and regression plots for the first training run, and training performance and regression plots for the second training run.

Analysis
This method is complex. As the number of speakers increases, the number of neurons and the amount of computation increase, which produces a large delay during training. Since each training run gives different results, the network must be trained well to obtain a proper result. It requires a large amount of memory. Its accuracy ranges from 70% to 95%, depending on the initial random weights and the training of the network.

Performance
1. Euclidean distance
2. Neural network

Applications
Voice-based biometric systems are being used in many applications where identification of the speaker, along with a password or command, is required, such as:
- Attendance systems
- Access control systems
- Biometric login to telephone-aided shopping systems
- Information and reservation services
- Security control for confidential information
- Forensic purposes
- Voice command and control
- Voice dialing in hands-free devices
and so on.

