KHARAGPUR
Report on AV Biometrics
Overview
Introduction and Need of AV Biometrics
Why is it Required?
How does an AV Biometric System Work
Extraction of Visual Features
How do we Recognize a Speaker?
Various Types of Speaker Recognition Systems
Working of a Speaker Recognition System
Calculation of the Performance of a Speaker Recognition System
How will we Fuse Audio and Video?
Pre-Mapping Fusion
Midst-Mapping Fusion
Post-Mapping Fusion
Conclusions and Challenges
Overview
In this seminar we went through a special type of biometrics: Audio-Visual (AV)
Biometrics.
This report covers the concepts discussed in the seminar, along with explanations
of the key ideas of the project.
We will go through the introduction and the need for AV biometrics, then move on to their
potential in current security systems. We will then look at how audio-visual biometrics
work, see how the different kinds of information (i.e., audio and visual data) are
extracted, and examine the various fusion methods that aim to combine both audio and
visual data for final processing.
In the end we will look at the various challenges currently faced while developing an
anti-spoofing audio-visual biometric system.
Introduction and Need of AV Biometrics
As we all know, security is very important nowadays for almost every organisation, and
choosing the most secure security measure is a crucial, ongoing debate in each of them.
A security system typically performs two tasks:
Identification of a person from the set of people who can demand access to that
particular security system.
Verification of a person by taking some information and classifying the person as either
an impostor or an authentic user.
The information used for verification can be knowledge-based, such as tokens, OTPs,
passwords, PINs, etc.
The problem with such a knowledge-based system is that the verification information can
be stolen and duplicated, so it is not advisable for a strong security system.
Another type of information widely used for verifying individuals is biometrics.
In such a system, various sensors are used to obtain a better and less error-prone design.
There are thousands of biometric sensors available on the market, a list of which was
shown in the seminar.
As we can see, there are quite a lot of sensors to choose from, and deciding on a
particular one depends on the characteristics a particular company is interested in.
The most widely used sensors are speech and face recognition systems.
But, as we saw in the seminar, neither of them alone can get the job done: speech
recognition systems are sensitive to microphone type, acoustic environment, channel
noise, and the complexity of the scenario.
Both systems are also susceptible to impostor attacks, since an attacker can obtain both
a photograph and an audio recording of the person.
To solve this problem, researchers have come up with the advanced concept of AV
Biometrics.
It requires us to obtain both audio and visual data using our sensors and combine the two
in order to verify the authenticity of the person.
Why is it Required?
This is an important question. One reason has already been discussed: the inability of
speech and face sensors to resist impostor attacks, along with their inability to cope
with noisy environments.
An AV biometric system utilizes both sensors, giving us an added layer of security and
thus making impostor attacks far harder.
Moreover, an environment that is noisy for both sensors at the same time is a
realistically impractical scenario.
We also know that although great progress has been achieved over the past decades in
computer processing of speech, it still lags significantly behind human performance,
especially in noisy environments. Humans, on the other hand, easily accomplish
complex communication tasks by utilizing additional sources of information whenever
required, especially visual information.
Many researchers have shown experimentally that articulatory (facial) and vocal tract
data are closely related.
For example, Yehia et al. [55] investigated the degree of this correlation. They measured
the motion of markers, which were placed on the face and in the vocal tract. Their results
show that 91% of the total variance observed in the facial motion could be determined
from the vocal tract motion, using simple linear estimators.
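This kind of analysis can be illustrated with a small sketch: fit a one-dimensional linear estimator that predicts a facial-marker trajectory from a vocal-tract trajectory and report the fraction of facial-motion variance it explains. The data here are synthetic, not the measurements of Yehia et al., and the function names are illustrative.

```python
# Fit y = a*x + b by least squares and measure explained variance (R^2).
# Synthetic stand-ins for vocal-tract and facial-marker trajectories.

def fit_linear(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx              # slope of the linear estimator
    b = my - a * mx            # intercept
    return a, b

def explained_variance(x, y):
    a, b = fit_linear(x, y)
    n = len(y)
    my = sum(y) / n
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot   # fraction of facial-motion variance explained

# toy trajectories: facial motion is roughly a scaled copy of tract motion
tract = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
face  = [0.1, 0.9, 2.1, 3.0, 4.1, 4.9, 6.0]
r2 = explained_variance(tract, face)
```

On data this strongly correlated, r2 comes out close to 1, which is the same kind of figure (91%) the study reports for real facial and vocal tract motion.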
Extraction of Visual Features
While extracting visual features we need to look for three types of visual features:
i) Appearance-based features, such as transformed vectors of the face or mouth region
pixel intensities using, for example, image compression techniques.
ii) Shape-based features, such as geometric or model-based representations of the face or
lip contours.
iii) Features that are a combination of the appearance and shape features in (i) and (ii).
In the seminar we saw how to extract these features; let us briefly go over them again.
To extract shape-based features we focus on the lips and their contour region, extracting
geometrical features, parametric model features, and statistical model features of the
region.
An example of model-based visual features is the facial animation parameters (FAPs) of
the outer- and inner-lip contours. FAPs describe facial movement and are used in the
MPEG-4 AV object-based video representation standard to control facial animation,
together with the so-called facial definition parameters (FDPs) that describe the shape
of the face.
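As a toy illustration of geometric shape-based features (not the actual MPEG-4 FAP definitions), the sketch below derives mouth width, opening height, and their ratio from a set of (x, y) lip-contour points; all names and values are hypothetical.

```python
# Derive simple geometric lip-shape features from an outer-lip contour,
# given as a list of (x, y) points. Real systems track such contours
# frame by frame; here the points are hand-picked synthetic values.

def lip_shape_features(contour):
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    width = max(xs) - min(xs)       # horizontal mouth extent
    height = max(ys) - min(ys)      # vertical mouth opening
    return {"width": width, "height": height, "aspect": height / width}

# eight synthetic contour points around an open mouth
contour = [(0, 5), (2, 8), (5, 10), (8, 8), (10, 5), (8, 2), (5, 0), (2, 2)]
feats = lip_shape_features(contour)
```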
How do we Recognize a Speaker?
i) Training phase: In the training phase, the speaker is asked to utter certain phrases in
order to acquire the data used for training. In general, the larger the amount of
training data, the better the performance of a speaker recognition system.
ii) Testing phase: In the testing phase, the speaker utters a certain phrase, and the
system accepts or rejects his/her claim (speaker authentication) or makes a decision on
the speaker's identity (speaker identification). The testing phase can be followed by an
adaptation phase, during which the recognized speaker's data is used to update the models
corresponding to him/her.
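The two phases can be sketched with a deliberately simplified model: each speaker is represented by a single Gaussian over one scalar feature per utterance. Real systems model sequences of feature vectors (e.g., MFCCs) with far richer models such as GMMs or HMMs; the feature values and decision threshold below are purely illustrative.

```python
import math
import statistics

def train(utterance_features):
    """Training phase: fit a per-speaker Gaussian from enrollment data."""
    mu = statistics.fmean(utterance_features)
    sigma = statistics.stdev(utterance_features)
    return mu, sigma

def log_likelihood(x, model):
    mu, sigma = model
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - (x - mu) ** 2 / (2 * sigma ** 2)

def verify(x, model, threshold=-4.0):
    """Testing phase: accept the claim if the feature is likely enough."""
    return log_likelihood(x, model) >= threshold

speaker_model = train([1.0, 1.2, 0.9, 1.1, 1.0])   # enrollment utterances
accepted = verify(1.05, speaker_model)              # genuine-like test sample
rejected = verify(5.0, speaker_model)               # impostor-like test sample
```

The adaptation phase mentioned above would correspond to re-running `train` with the accepted test sample appended to the enrollment data.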
Various Types of Speaker Recognition Systems:
As discussed in the seminar, there are various speaker recognition systems, such as
text-dependent and text-independent systems.
Text-Dependent Speaker Recognition: Text-dependent speaker recognition
characterizes a speaker recognition task, such as verification or identification, in which
the set of words (or lexicon) used during the testing phase is a subset of the ones present
during training.
Text-Independent Speaker Recognition: Text-independent systems use words apart from
those used in the training phase, making them more secure than text-dependent systems.
Let C denote the set of all classes. In speaker identification systems, C typically
consists of the enrolled subject population, possibly augmented by a class denoting an
unknown subject.
We compute the posterior probability of the input stream O belonging to each class c in C
and choose the class c* that maximizes it:
c* = argmax over c in C of P(c | O)
In speaker authentication systems, on the other hand, C reduces to a two-member set,
consisting of the class corresponding to the user and the general population (impostor
class), and the same maximization of the posterior probability decides between the two.
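The maximization over classes can be sketched as follows; the posterior values are hypothetical and would in practice come from the trained models.

```python
# Identification: pick the class c in C with the highest posterior P(c | O).
# Authentication: C has two members, so it reduces to a single comparison.

def identify(posteriors):
    """posteriors: dict mapping each class c in C to P(c | input)."""
    return max(posteriors, key=posteriors.get)

def authenticate(p_client, p_impostor):
    """Two-class case: accept iff the client class wins."""
    return p_client > p_impostor

# made-up posteriors over an enrolled population plus an "unknown" class
posteriors = {"alice": 0.62, "bob": 0.30, "unknown": 0.08}
best = identify(posteriors)
```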
Calculation of the Performance of an AV Biometric System
As discussed in the seminar, two commonly used error measures for verification
performance are:
i) FAR (False Acceptance Rate): This error rate reflects how often our biometric
system fails to detect impostors, i.e., the fraction of impostor claims that are accepted.
It is given by:
FAR = (Ia / I) x 100%
where Ia = number of impostor claims accepted and I = total number of impostor claims.
ii) FRR (False Rejection Rate): This error rate denotes how often our biometric
system rejects a valid client claim.
It is given by:
FRR = (Cr / C) x 100%
where Cr = total number of valid client claims rejected and C = total number of client
claims.
We can see that for high-security applications, such as the military, our main aim is to
minimize the false acceptance rate, while for low-security applications, such as offices,
we need to reduce the FRR.
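The two error rates follow directly from the counts defined above; the numbers in the example are made up.

```python
# FAR and FRR as percentages, from the counts defined in the text:
# Ia impostor claims accepted out of I; Cr client claims rejected out of C.

def far(Ia, I):
    return 100.0 * Ia / I

def frr(Cr, C):
    return 100.0 * Cr / C

# e.g. 3 of 200 impostor attempts accepted, 5 of 400 client attempts rejected
example_far = far(3, 200)
example_frr = frr(5, 400)
```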
How will we Fuse Audio and Video?
Various fusion techniques were discussed in the seminar, which I will talk about briefly
in this report.
Pre-Mapping Fusion
Pre-mapping fusion can be divided into sensor-data-level and feature-level fusion. In
sensor-data-level fusion, the sensor data obtained from different sensors of the same
modality are combined. Weighted summation and mosaic construction are typically utilized
to enable sensor-data-level fusion. In the weighted summation approach the data are first
normalized, usually by mapping them to a common interval, and then combined utilizing
weights.
For example, weighted summation can be utilized to combine multiple visible and infrared
images, or to combine acoustic data obtained from several microphones of different types
and quality. Mosaic construction can be utilized to create one image from images of parts
of the face obtained by several different cameras.
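Weighted-summation fusion of sensor data can be sketched as: min-max normalize each stream to the common interval [0, 1], then combine with weights. The streams, weights, and function names here are illustrative.

```python
# Weighted-summation sensor-data fusion: map each stream to [0, 1]
# (min-max normalization), then combine samples with fixed weights.

def minmax(stream):
    lo, hi = min(stream), max(stream)
    return [(v - lo) / (hi - lo) for v in stream]

def fuse(stream_a, stream_b, w_a=0.6, w_b=0.4):
    a, b = minmax(stream_a), minmax(stream_b)
    return [w_a * x + w_b * y for x, y in zip(a, b)]

# two same-modality streams on very different scales, fused sample by sample
fused = fuse([10, 20, 30], [1, 5, 9])
```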
Feature-level fusion represents the combination of the features obtained from different
sensors. Joint feature vectors are obtained either by weighted summation (after
normalization) or by concatenation (e.g., by appending a visual feature vector to an
acoustic one, with normalization, in order to obtain a joint feature vector). The features
obtained by the concatenation approach are usually of high dimensionality, which can
affect reliable training of a classification system (the curse of dimensionality) and
ultimately recognition performance. In addition, concatenation does not allow for
modeling the reliability of the individual feature streams. For example, in the case of
audio and visual streams, it cannot take advantage of information that might be available
about the acoustic or visual noise in the environment. Furthermore, the audio and visual
feature streams must be synchronized before the concatenation is performed.
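Concatenative feature-level fusion might look like the following sketch: z-score normalize each modality's (already synchronized) feature vector and append one to the other. Note how the joint dimensionality is the sum of the two, which is the curse-of-dimensionality concern raised above. The feature values are synthetic.

```python
# Feature-level fusion by concatenation: z-score normalize each modality's
# feature vector, then append the visual vector to the audio one.

def zscore(vec):
    n = len(vec)
    mu = sum(vec) / n
    sd = (sum((v - mu) ** 2 for v in vec) / n) ** 0.5
    return [(v - mu) / sd for v in vec]

def concat_fusion(audio_feats, visual_feats):
    # joint dimensionality = len(audio_feats) + len(visual_feats)
    return zscore(audio_feats) + zscore(visual_feats)

joint = concat_fusion([2.0, 4.0, 6.0], [0.5, 1.0])
```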
Midst-Mapping Fusion
With midst-mapping fusion, the information streams are processed during the procedure of
mapping the feature space into the opinion or decision space.
It exploits the temporal dynamics contained in the different streams, thus avoiding
problems resulting from vector concatenation, such as the curse of dimensionality and the
requirement of matching rates.
Furthermore, stream weights can be utilized in midst-mapping fusion to account for the
reliability of the different streams of information.
Post-Mapping Fusion
Post-mapping fusion approaches are grouped into decision fusion and opinion fusion (also
referred to as score-level fusion). With decision fusion, classifier decisions are
combined in order to obtain the final decision by utilizing majority voting, combination
of ranked lists, or the AND and OR operators. In majority voting, the final decision is
made when the majority of the classifiers reach the same decision.
The number of classifiers should be chosen carefully in order to prevent ties (e.g., for
a two-class problem, such as speaker verification, the number of classifiers should be
odd).
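Majority voting over an odd number of accept/reject decisions can be sketched in a few lines:

```python
# Decision-level majority voting: accept iff more than half the
# classifiers accept. An odd number of classifiers prevents ties.

def majority_vote(decisions):
    """decisions: list of booleans (accept/reject) from each classifier."""
    return sum(decisions) > len(decisions) / 2

vote = majority_vote([True, True, False])   # 2 of 3 classifiers accept
```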
In ranked-list combination fusion, the ranked lists provided by each classifier are
combined in order to obtain the final ranked list. There exist various approaches for
combining ranked lists.
In AND fusion, the final decision is made only if all classifiers reach the same decision.
This type of fusion is typically used for high-security applications where we want to
achieve very low FA rates, at the cost of higher FR rates.
On the other hand, when the OR fusion method is utilized, the final decision is made as
soon as one of the classifiers reaches a decision. This type of fusion is utilized for
low-security applications where we want to achieve lower FR rates and avoid
inconveniencing registered users, at the cost of higher FA rates.
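The AND/OR trade-off described above maps directly onto Python's built-in `all` and `any`:

```python
# AND fusion: accept only if every classifier accepts (low FAR, higher FRR).
# OR fusion: accept as soon as any classifier accepts (low FRR, higher FAR).

def and_fusion(decisions):
    return all(decisions)

def or_fusion(decisions):
    return any(decisions)
```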
In opinion fusion, the opinions (scores) are usually first normalized by mapping them to
a common interval and then combined utilizing weights (e.g., by either weighted summation
or weighted product fusion).
The weights are determined based on the discriminating ability of each classifier and the
quality of the utilized features (usually affected by the feature extraction method
and/or the presence of different types of noise).
For example, when audio and visual information are employed, the acoustic SNR and/or
the quality of the visual feature extraction algorithms are considered in determining the
weights. After the opinions are combined, the class that corresponds to the highest
opinion is chosen.
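Weighted-summation opinion fusion can be sketched as below. The scores are assumed already normalized to [0, 1] per modality, and the weights are illustrative; in a real system the audio weight might be lowered when the acoustic SNR is poor.

```python
# Opinion (score-level) fusion by weighted summation: combine each class's
# audio and visual scores with reliability weights, then pick the class
# with the highest fused opinion.

def fuse_opinions(audio_scores, visual_scores, w_audio=0.4, w_visual=0.6):
    fused = {c: w_audio * audio_scores[c] + w_visual * visual_scores[c]
             for c in audio_scores}
    return max(fused, key=fused.get)

# hypothetical per-class scores, already normalized to [0, 1] per modality
audio  = {"alice": 0.55, "bob": 0.45}
visual = {"alice": 0.80, "bob": 0.20}
winner = fuse_opinions(audio, visual)
```

Here the visual classifier is weighted more heavily, so its strong preference for "alice" dominates the fused opinion.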