
INDIAN INSTITUTE OF TECHNOLOGY

KHARAGPUR

Report on AV Biometrics

Under The Guidance of Prof. KS Rao

Seminar and Report By: Shirish Kumar Shukla

Roll No: 19CS60R54
Table of Contents

Overview
Introduction and Need of AV Biometrics
Why Is It Required?
How Does an AV Biometric System Work?
Extraction of Visual Features
How Do We Recognize a Speaker?
Various Types of Speaker Recognition Systems
Working of a Speaker Recognition System
Calculation of Performance of an AV Biometric System
How Will We Fuse Audio and Video?
Pre-Mapping Fusion
Midst-Mapping Fusion
Post-Mapping Fusion
Conclusions and Challenges

 
 
 
 

 
 
 

Audio Visual Biometrics 


 

Overview 

Biometric systems are among the most widely used security systems today.

In this seminar we went through a special type of biometrics: Audio-Visual (AV) Biometrics.

This report covers the various concepts discussed in the seminar, along with explanations of the key ideas of the project.

We will go through the introduction and the need for AV biometrics, then move on to their potential in current security systems. We will then describe how an audio-visual biometric system works, see how the various kinds of information (i.e., audio and visual data) are extracted, and look at the fusion methods that aim to combine audio and visual data for final processing.

 

 
 

In the end we will look at the various challenges we currently face while developing an anti-spoofing audio-visual biometric system.

Introduction and Need of AV Biometrics 


 

As we all know, security is very important nowadays for almost every organisation, and there is a crucial ongoing debate in every organisation about choosing the most secure security measure.

Every security system has two tasks :-

Identification of the person from the set of persons who can demand access to that particular security system.

Verification of the person by taking some information and classifying the person as an impostor or an authentic user.

The information used for verification can be knowledge-based information such as tokens, OTPs, passwords, PINs, etc.

The problem with such a knowledge-based system is that the verification information can be stolen and duplicated, so it is not advisable for a stronger security system.

Another type of information that is widely used for verifying individuals is biometrics.

In such a system, various sensors are used to get a better and less error-prone design.

 

 
 

There are thousands of sensors available in the market, a list of which was shown in the seminar; a similar image is shown here.

As we can see, there are quite a lot of sensors to choose from, and deciding on a particular one depends on the characteristics a particular company is interested in.

The most widely used sensors are speech and face recognition systems.

BUT, as we saw in the seminar, neither of them alone can get the job done: speech recognition systems are sensitive to microphone type, the acoustic environment, channel noise, and the complexity of the scenario.

Visual data, meanwhile, is affected by extreme lighting changes, shadowing, changing background, etc.

Moreover, both systems are susceptible to impostor attacks, as one can obtain both a photograph and an audio recording of the person.

To solve this problem, researchers have come up with an advanced concept: AV biometrics.

It requires us to obtain both visual and audio data using our sensors and combine the two in order to verify the authenticity of the person.

A schematic diagram of an audio-visual biometric system is as follows :-

 

 
 

Why is it Required? 

This is an important question. One reason, already discussed, is the inability of speech and face sensors to resist impostor attacks, along with their inability to adjust to noisy environments.

An AV biometric system utilizes both sensors, giving us an added layer of security and greatly reducing the chance that an impostor succeeds.

Also, an environment that is noisy for both sensors at the same time is a realistically impractical scenario.

We also know that, although great progress has been achieved over the past decades in computer processing of speech, it still lags significantly behind human performance levels, especially in noisy environments. Humans, on the other hand, easily accomplish complex communication tasks by utilizing additional sources of information whenever required, especially visual information.

Many researchers have shown experimentally that articulatory (facial) and vocal tract data are related.

 

 
 

For example, Yehia et al. [55] investigated the degree of this correlation. They measured the motion of markers which were placed on the face and in the vocal tract. Their results show that 91% of the total variance observed in the facial motion could be determined from the vocal tract motion using simple linear estimators.

This shows that AV biometrics can form a better person recognition model.

How does an AV Biometric System Work ? 


 
Preprocessing and feature extraction are performed in parallel for the two modalities.  
 
The preprocessing of the audio signal under noisy conditions includes signal enhancement, tracking environmental and channel noise, feature estimation, and smoothing.
 
The preprocessing of the video signal typically consists of the challenging problems of 
detecting and tracking of the face and the important facial features. 
 
Fusion requires synchronization of the audio and video streams of sensor data (using interpolation).
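As an illustration of this synchronization step, the sketch below linearly interpolates a slower visual feature stream onto the audio feature timeline. The 100 Hz audio feature rate, 25 fps video rate, and feature dimensions are illustrative assumptions, not values from the report:

```python
import numpy as np

def sync_streams(audio_feats, video_feats, audio_rate=100.0, video_rate=25.0):
    """Upsample video features (e.g., 25 fps) to the audio feature rate
    (e.g., 100 vectors per second) by linear interpolation, so that the
    two streams can be fused frame by frame."""
    t_audio = np.arange(audio_feats.shape[0]) / audio_rate  # frame times (s)
    t_video = np.arange(video_feats.shape[0]) / video_rate
    # Interpolate every visual feature dimension onto the audio timeline.
    synced = np.stack(
        [np.interp(t_audio, t_video, video_feats[:, d])
         for d in range(video_feats.shape[1])], axis=1)
    return audio_feats, synced

audio = np.random.rand(100, 13)   # 1 s of MFCC-like features at 100 Hz
video = np.random.rand(25, 6)     # 1 s of visual features at 25 fps
a, v = sync_streams(audio, video)
print(a.shape, v.shape)           # both streams now have 100 frames
```

Linear interpolation is only one simple choice here; any resampling method that aligns the two frame rates would serve the same purpose.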
  
Diagrammatically :- 

 

 
 

Extraction of Visual Features :-

Visual characteristics can be either Static (like an image of a person), used for face recognition, or Dynamic, containing additional information such as changes in the mouth region (also known as visual speech).

While extracting visual features we need to look for three types of visual features:

i) Appearance-based features, such as transformed vectors of the face or mouth region pixel intensities using, for example, image compression techniques.

ii) Shape-based features, such as geometric or model-based representations of the face or lip contours.

iii) Features that are a combination of both the appearance and shape features in i) and ii).

In the seminar we saw how to extract the following features; let us briefly go over them again.

To extract appearance-based features, we first detect faces using traditional face detection algorithms such as edge detection, thresholding, or color segmentation. Then we use hierarchical techniques to detect special features such as the mouth corners, eyes, nostrils, chin, etc. These features are used to extract a normalized region of interest (ROI) containing the visual speech information.

To extract shape-based features, we focus more on the lips and contour region, extracting geometrical features, parametric model features, and statistical model features of the region.

 
An example of model-based visual features is represented by the facial animation 
parameters (FAPs) of the outer- and inner-lip contours. FAPs describe facial movement and 
are used in the MPEG-4 AV object-based video representation standard to control facial 
animation, together with the so-called facial definition parameters (FDPs) that describe the 
shape of the face. 

 

 
 

The following procedure, explained in the seminar, is shown diagrammatically here.

How do we Recognize a Speaker ? 

Speaker recognition goes through two phases :-

 
i) ​Training Phase :- ​In the training phase, the speaker is asked to utter certain phrases in 
order to acquire the data to be used for training. In general, the larger the amount of 
training data, the better the performance of a speaker recognition system. 
 
ii) ​Testing Phase :-​ In the testing phase, the speaker utters a certain phrase, and the 
system accepts or rejects his/her claim (speaker authentication) or makes a decision on the 
speaker’s identity (speaker identification). The testing phase can be followed by the 
adaptation phase during which the recognized speaker’s data is used to update the models 
corresponding to 
him/her. 
 
 
Various types of Speaker Recognition Systems :- 
 
As discussed in Seminar there are various Speaker recognition systems such as Text 
Dependent and Text independent Systems 

Text-Dependent Speaker Recognition. Text-dependent speaker recognition characterizes a speaker recognition task, such as verification or identification, in which the set of words (or lexicon) used during the testing phase is a subset of the ones present during the enrollment phase.

Text-dependent systems can either use fixed phrases (called Fixed-Phrase Text-Dependent Systems) or prompt the user with a phrase to be said (called Prompted-Phrase Text-Dependent Systems).

Text-independent systems, by contrast, place no such restriction: the words used in the testing phase need not appear in the enrollment phase, making them more secure than text-dependent systems.

Working Of a Speaker Recognition System 


The speaker recognition problem is a classification problem. A set of classes needs to be 
defined first, and then based on the observations one of these classes is chosen. 

 
 
 
Let C denote the set of all classes. In speaker identification systems, C typically consists of 
the enrolled subject population possibly augmented by a class denoting the unknown 
subject. 
 
We calculate the posterior probability of the input stream belonging to each class c in the set of classes C, and choose the class that maximizes it.
 
  
On the other hand, in speaker authentication systems, C reduces to a two-member set, consisting of the class corresponding to the claimed user and the general population (impostor class).

Here, too, we choose the class (client or impostor) that maximizes the posterior probability of the input stream belonging to it.
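The decision rule described above can be sketched as a simple maximum-a-posteriori choice. The class labels and posterior values below are purely illustrative:

```python
def map_decision(posteriors):
    """Pick the class c in C with the highest posterior P(c | observation).
    `posteriors` maps class label -> posterior probability."""
    return max(posteriors, key=posteriors.get)

# Speaker identification: one class per enrolled speaker (plus "unknown").
ident = {"alice": 0.62, "bob": 0.30, "unknown": 0.08}
print(map_decision(ident))        # -> alice

# Speaker authentication: two classes, client vs. impostor.
auth = {"client": 0.41, "impostor": 0.59}
print(map_decision(auth))         # -> impostor (claim rejected)
```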
 
Calculation of Performance of an AV Biometric System

As discussed in the seminar, two commonly used error measures for verification performance are :-
 
i) FAR (False Acceptance Rate) :- This error rate denotes how often our biometric system fails to detect impostors. It is given by:

FAR = Ia / I

where Ia = number of impostor claims accepted and I = total number of impostor claims.
 
 
ii) FRR (False Rejection Rate) :- This error rate denotes how often our biometric system rejects a valid client claim. It is given by:

FRR = Cr / C

where Cr = total number of valid client claims rejected and C = total number of client claims.
 
 
We can see that for high-security applications, such as the military, our main aim is to minimize the false acceptance rate, while for low-security applications, such as offices, we need to reduce the FRR.
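A minimal sketch of these two error measures, using the definitions above (the claim counts are made-up illustrative numbers):

```python
def far(impostor_accepted, impostor_total):
    """False Acceptance Rate: FAR = Ia / I."""
    return impostor_accepted / impostor_total

def frr(client_rejected, client_total):
    """False Rejection Rate: FRR = Cr / C."""
    return client_rejected / client_total

# Illustrative counts: 1000 impostor claims with 3 wrongly accepted;
# 500 client claims with 10 wrongly rejected.
print(far(3, 1000))   # 0.003
print(frr(10, 500))   # 0.02
```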
 
 

 
 
 

How will we FUSE Audio and Video ?

Various techniques were discussed in the seminar, which I will briefly cover in this report.

Pre-Mapping Fusion (Early Integration) :-

Pre Mapping fusion can be divided into sensor data level and feature level fusion. In ​sensor 
data level fusion​ , the sensor data obtained from different sensors of the same modality is 
combined. Weighted summation and mosaic construction are typically utilized in order to 
enable sensor data level fusion. In the weighted summation approach the data is first 
normalized, usually by mapping them to a common interval, and then combined utilizing 
weights. 

For example, weighted summation can be utilized to combine multiple visible and infrared images, or to combine acoustic data obtained from several microphones of different types and quality. Mosaic construction can be utilized to create one image from images of parts of the face obtained by several different cameras.
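A toy sketch of sensor-data-level fusion by weighted summation: the 2×2 "images", the [0, 1] normalization interval, and the 0.7/0.3 weights are all illustrative assumptions, not values from the report:

```python
import numpy as np

def weighted_sum_fusion(frames, weights):
    """Combine sensor data from the same modality: normalize each
    frame to a common interval [0, 1], then take a weighted sum."""
    fused = np.zeros_like(frames[0], dtype=float)
    for frame, w in zip(frames, weights):
        lo, hi = frame.min(), frame.max()
        normalized = (frame - lo) / (hi - lo)   # map to [0, 1]
        fused += w * normalized
    return fused

visible = np.array([[10., 200.], [50., 120.]])   # toy visible-light image
infrared = np.array([[0., 90.], [30., 60.]])     # toy infrared image
fused = weighted_sum_fusion([visible, infrared], [0.7, 0.3])
print(fused)
```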

Feature-level fusion represents the combination of the features obtained from different sensors. Joint feature vectors are obtained either by weighted summation (after normalization) or concatenation (e.g., by appending a visual feature vector to an acoustic one, with normalization, in order to obtain a joint feature vector). The features obtained by the concatenation approach are usually of high dimensionality, which can affect reliable training of a classification system (the curse of dimensionality) and ultimately recognition performance. In addition, concatenation does not allow for modeling the reliability of the individual feature streams. For example, in the case of audio and visual streams, it cannot take advantage of information that might be available about the acoustic or visual noise in the environment. Furthermore, the audio and visual feature streams should be synchronized before the concatenation is performed.
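Feature-level fusion by concatenation can be sketched as follows; the feature dimensions and the z-score normalization choice are illustrative assumptions:

```python
import numpy as np

def feature_level_fusion(audio_vec, visual_vec):
    """Concatenate per-frame audio and visual feature vectors into one
    joint vector (the two streams must already be synchronized)."""
    # Z-score normalize each stream so neither dominates by scale.
    a = (audio_vec - audio_vec.mean()) / audio_vec.std()
    v = (visual_vec - visual_vec.mean()) / visual_vec.std()
    return np.concatenate([a, v])

audio = np.random.rand(13)    # e.g., 13 MFCC-like coefficients
visual = np.random.rand(6)    # e.g., 6 lip-shape parameters
joint = feature_level_fusion(audio, visual)
print(joint.shape)            # 19-dimensional joint vector
```

Note how the joint vector's dimensionality is the sum of the two streams' dimensions, which is exactly why concatenation can run into the curse of dimensionality.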

Midst-Mapping Fusion (Intermediate Integration) :- 

With midst-mapping fusion, information streams are processed during the procedure of 
mapping the feature space into the opinion or decision space. 

It exploits the temporal dynamics contained in the different streams, thus avoiding problems resulting from vector concatenation, such as the curse of dimensionality and the requirement of matching frame rates.

Furthermore, stream weights can be utilized in midst-mapping fusion to account for the 
reliability of different streams of information. 

Post-Mapping Fusion (Late Integration) :-

Post-mapping fusion approaches are grouped into decision and opinion fusion (the latter also referred to as score-level fusion). With decision fusion, classifier decisions are combined in order to obtain the final decision by utilizing majority voting, combination of ranked lists, or AND and OR operators. In majority voting the final decision is made when the majority of the classifiers reach the same decision.

The number of classifiers should be chosen carefully in order to prevent ties (e.g., for a two 
class problem, such as speaker verification, the number of classifiers should be odd). 

In ranked-list combination fusion, the ranked lists provided by each classifier are combined in order to obtain the final ranked list. There exist various approaches for combining ranked lists.

In AND fusion the final decision is made only if all classifiers reach the same decision. This type of fusion is typically used for high-security applications where we want to achieve very low FA rates, by allowing higher FR rates.

On the other hand, when the OR fusion method is utilized, the final decision is made as soon as one of the classifiers reaches a decision. This type of fusion is utilized for low-security applications where we want to achieve lower FR rates and prevent causing inconvenience to the registered users, by allowing higher FA rates.
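The three decision-fusion rules above can be sketched as follows (the three classifier votes are illustrative):

```python
def majority_vote(decisions):
    """Accept when more than half of the classifiers accept.
    Use an odd number of classifiers to prevent ties."""
    return sum(decisions) > len(decisions) / 2

def and_fusion(decisions):
    """Accept only if every classifier accepts (low FAR, higher FRR)."""
    return all(decisions)

def or_fusion(decisions):
    """Accept if any classifier accepts (low FRR, higher FAR)."""
    return any(decisions)

votes = [True, False, True]   # e.g., face, voice, lip-dynamics classifiers
print(majority_vote(votes))   # True
print(and_fusion(votes))      # False
print(or_fusion(votes))       # True
```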

Unlike decision fusion, in opinion (score-level) fusion methods the experts do not provide final decisions but only opinions (scores) on each possible decision.

The opinions are usually first normalized by mapping them to a common interval and then combined utilizing weights (e.g., either weighted summation or weighted product fusion). The weights are determined based on the discriminating ability of the classifier and the quality of the utilized features (usually affected by the feature extraction method and/or the presence of different types of noise).

For example, when audio and visual information are employed, the acoustic SNR and/or 
the quality of the visual feature extraction algorithms are considered in determining 
weights. After the opinions are combined, the class that corresponds to the highest opinion 
is chosen. 
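A minimal sketch of weighted-summation opinion fusion: the class names, scores, and the 0.6/0.4 weights are illustrative; in practice the weights would reflect quantities such as the acoustic SNR and visual feature quality:

```python
def score_fusion(audio_scores, visual_scores, w_audio=0.6, w_visual=0.4):
    """Normalize each expert's scores to [0, 1], combine them with
    reliability weights, and pick the class with the highest opinion."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {c: (s - lo) / (hi - lo) for c, s in scores.items()}

    a, v = normalize(audio_scores), normalize(visual_scores)
    fused = {c: w_audio * a[c] + w_visual * v[c] for c in a}
    return max(fused, key=fused.get)

audio = {"alice": 2.1, "bob": 0.4, "carol": 1.0}    # audio expert scores
visual = {"alice": 0.9, "bob": 0.2, "carol": 0.5}   # visual expert scores
print(score_fusion(audio, visual))                  # alice wins overall
```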

 
 
 

Conclusions And Challenges 


In contrast to the abundance of audio-only databases, there exist only a few databases 
suitable for AV biometric research. This is because the field is relatively young, but also due 
to the fact that AV corpora pose additional challenges concerning database collection, 
storage, distribution, and privacy. 
 
Most of the commonly used databases in the literature were collected by a few university groups or individual researchers with limited resources; as a result, they usually contain a small number of subjects and have relatively short duration.
 
AV databases usually vary greatly in the number of speakers, vocabulary size, number of sessions, nonideal acoustic and visual conditions, and evaluation measures. This makes the comparison of different visual features and fusion methods, with respect to the overall performance of an AV biometric system, difficult.
 
There is, therefore, a great need for new, standardized databases and evaluation measures that would enable fair comparison of different systems and represent realistic nonideal conditions. Experiment protocols should also be defined in a way that avoids biased results and allows for fair comparison of different person recognition systems.
 
 
 
 
 
------------------------------- END of REPORT ------------------------------------------ 

 
 
 

 
