
INDIAN INSTITUTE OF TECHNOLOGY

KHARAGPUR

Report on AV Biometrics

Under The Guidance of Prof. KS Rao

Seminar and Report By: Shirish Kumar Shukla

Roll No: 19CS60R54
Table of Contents

Overview
Introduction and Need of AV Biometrics
Why Is It Required?
How Does an AV Biometric System Work?
Extraction of Visual Features
How Do We Recognize a Speaker?
Various Types of Speaker Recognition Systems
Working of a Speaker Recognition System
Calculation of Performance of an AV Biometric System
How Will We Fuse Audio and Video?
Pre-Mapping Fusion
Midst-Mapping Fusion
Post-Mapping Fusion
Conclusions and Challenges

 
 
 
 

 
 
 

Audio Visual Biometrics 


 

Overview 

Biometric systems are among the most widely used security systems today.

In this seminar we went through a special type of biometrics: Audio-Visual (AV) Biometrics.

This report covers the various concepts discussed in the seminar, along with explanations of the key ideas of the project.

We will go through the introduction and the need for AV biometrics, then move on to their potential in current security systems. We will then describe how an audio-visual biometric system works, see how the various kinds of information (i.e., audio and visual data) are extracted, and look at the fusion methods that aim to combine audio and visual data for final processing.

 

 
 

In the end we will look at the various challenges we currently face while developing an anti-spoofing audio-visual biometric system.

Introduction and Need of AV Biometrics 


 

As we all know, security is very important nowadays for almost every organisation, and there is a crucial ongoing debate in every organisation about choosing the most secure security measure.

Every security system has two tasks :-

Identification of the person from the set of persons who can demand access to that particular security system.

Verification of the person by taking some information and classifying the person as an impostor or an authentic user.

The information used for verification can be knowledge-based information such as tokens, OTPs, passwords, PINs, etc.

The problem with such a knowledge-based system is that the verification information can be stolen and duplicated, so it is not advisable for a stronger security system.

Another type of information that is widely used for verifying individuals is biometrics.

In such a system, various sensors are used to get a better and less error-prone design.

 

 
 

There are thousands of sensors available in the market, a list of which was shown in the seminar; a similar image is shown here.

As we can see, there are quite a lot of sensors to choose from, and deciding on a particular one depends on the characteristics a particular company is interested in.

The most widely used sensors are speech and face recognition systems.

BUT, as we saw in the seminar, neither of them alone can get the job done: speech recognition systems are sensitive to microphone type, the acoustic environment, channel noise, and the complexity of the scenario.

Visual data, meanwhile, is affected by extreme lighting changes, shadowing, changing background, etc.

Moreover, both systems are susceptible to impostor attacks, as one can obtain both a photograph and an audio recording of the person.

To solve this problem, researchers have come up with an advanced concept: AV biometrics.

It requires us to obtain both visual and audio data using our sensors and combine the two in order to verify the authenticity of the person.

A schematic diagram of an audio-visual biometric system is as follows :-

 

 
 

Why is it Required? 

This is an important question. One reason, already discussed, is the inability of speech and face sensors to resist impostor attacks, along with their inability to adjust to noisy environments.

An AV biometric system utilizes both sensors, giving us an added layer of security and greatly reducing the chance that an impostor succeeds.

Also, an environment that is noisy for both sensors at the same time is a realistically impractical scenario.

We also know that, although great progress has been achieved over the past decades in computer processing of speech, it still lags significantly behind human performance levels, especially in noisy environments. Humans, on the other hand, easily accomplish complex communication tasks by utilizing additional sources of information whenever required, especially visual information.

Many researchers have shown experimentally that articulatory (facial) and vocal tract data are related.

 

 
 

For example, Yehia et al. [55] investigated the degree of this correlation. They measured the motion of markers which were placed on the face and in the vocal tract. Their results show that 91% of the total variance observed in the facial motion could be determined from the vocal tract motion using simple linear estimators.

This shows that AV biometrics can form a better person recognition model.

How does an AV Biometric System Work ? 


 
Preprocessing and feature extraction are performed in parallel for the two modalities.  
 
The preprocessing of the audio signal under noisy conditions includes signal enhancement, tracking environmental and channel noise, feature estimation, and smoothing.
 
The preprocessing of the video signal typically consists of the challenging problems of 
detecting and tracking of the face and the important facial features. 
 
Fusion requires synchronization of the audio and video streams of sensor data (using interpolation).
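As an illustration of this synchronization step, the sketch below linearly interpolates a slower visual feature stream onto the audio feature timeline. The 100 Hz audio feature rate, 25 fps video rate, and feature dimensions are illustrative assumptions, not values from the report:

```python
import numpy as np

def sync_streams(audio_feats, video_feats, audio_rate=100.0, video_rate=25.0):
    """Upsample video features (e.g., 25 fps) to the audio feature rate
    (e.g., 100 vectors per second) by linear interpolation, so that the
    two streams can be fused frame by frame."""
    t_audio = np.arange(audio_feats.shape[0]) / audio_rate  # frame times (s)
    t_video = np.arange(video_feats.shape[0]) / video_rate
    # Interpolate every visual feature dimension onto the audio timeline.
    synced = np.stack(
        [np.interp(t_audio, t_video, video_feats[:, d])
         for d in range(video_feats.shape[1])], axis=1)
    return audio_feats, synced

audio = np.random.rand(100, 13)   # 1 s of MFCC-like features at 100 Hz
video = np.random.rand(25, 6)     # 1 s of visual features at 25 fps
a, v = sync_streams(audio, video)
print(a.shape, v.shape)           # both streams now have 100 frames
```

Linear interpolation is only one simple choice here; any resampling method that aligns the two frame rates would serve the same purpose.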
  
Diagrammatically :- 

 

 
 

Extraction of Visual Features :-

Visual characteristics can be either Static (like an image of a person), used for face recognition, or Dynamic, containing additional information such as changes in the mouth region (also known as visual speech).

While extracting visual features we need to look for three types of visual features:

i) Appearance-based features, such as transformed vectors of the face or mouth region pixel intensities using, for example, image compression techniques.

ii) Shape-based features, such as geometric or model-based representations of the face or lip contours.

iii) Features that are a combination of both the appearance and shape features in i) and ii).

In the seminar we saw how to extract the following features; let us briefly go over them again.

To extract appearance-based features, we first detect faces using traditional face detection algorithms such as edge detection, thresholding, or color segmentation. Then we use hierarchical techniques to detect special features such as the mouth corners, eyes, nostrils, chin, etc. These features are used to extract a normalized region of interest (ROI) containing the visual speech information.

To extract shape-based features, we focus more on the lips and contour region, extracting geometrical features, parametric model features, and statistical model features of the region.

 
An example of model-based visual features is represented by the facial animation 
parameters (FAPs) of the outer- and inner-lip contours. FAPs describe facial movement and 
are used in the MPEG-4 AV object-based video representation standard to control facial 
animation, together with the so-called facial definition parameters (FDPs) that describe the 
shape of the face. 

 

 
 

The following procedure, explained in the seminar, is shown diagrammatically here.

How do we Recognize a Speaker ? 

Speaker recognition goes through two phases :-

 
i) ​Training Phase :- ​In the training phase, the speaker is asked to utter certain phrases in 
order to acquire the data to be used for training. In general, the larger the amount of 
training data, the better the performance of a speaker recognition system. 
 
ii) ​Testing Phase :-​ In the testing phase, the speaker utters a certain phrase, and the 
system accepts or rejects his/her claim (speaker authentication) or makes a decision on the 
speaker’s identity (speaker identification). The testing phase can be followed by the 
adaptation phase during which the recognized speaker’s data is used to update the models 
corresponding to 
him/her. 
 
 
Various types of Speaker Recognition Systems :- 
 
As discussed in Seminar there are various Speaker recognition systems such as Text 
Dependent and Text independent Systems 

Text-Dependent Speaker Recognition. Text-dependent speaker recognition characterizes a speaker recognition task, such as verification or identification, in which the set of words (or lexicon) used during the testing phase is a subset of the ones present during the enrollment phase.

Text-dependent systems can either use fixed phrases (called Fixed-Phrase Text-Dependent Systems) or prompt the user with a phrase to be said (called Prompted-Phrase Text-Dependent Systems).

Text-independent systems, by contrast, place no such restriction: the words used in the testing phase need not appear in the enrollment phase, making them more secure than text-dependent systems.

Working Of a Speaker Recognition System 


The speaker recognition problem is a classification problem. A set of classes needs to be 
defined first, and then based on the observations one of these classes is chosen. 

 
 
 
Let C denote the set of all classes. In speaker identification systems, C typically consists of 
the enrolled subject population possibly augmented by a class denoting the unknown 
subject. 
 
We calculate the posterior probability of the input stream belonging to each class c in the set of classes C, and choose the class that maximizes it.
 
  
On the other hand, in speaker authentication systems, C reduces to a two-member set, consisting of the class corresponding to the claimed user and the general population (impostor class).

Here, too, we choose the class (client or impostor) that maximizes the posterior probability of the input stream belonging to it.
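The decision rule described above can be sketched as a simple maximum-a-posteriori choice. The class labels and posterior values below are purely illustrative:

```python
def map_decision(posteriors):
    """Pick the class c in C with the highest posterior P(c | observation).
    `posteriors` maps class label -> posterior probability."""
    return max(posteriors, key=posteriors.get)

# Speaker identification: one class per enrolled speaker (plus "unknown").
ident = {"alice": 0.62, "bob": 0.30, "unknown": 0.08}
print(map_decision(ident))        # -> alice

# Speaker authentication: two classes, client vs. impostor.
auth = {"client": 0.41, "impostor": 0.59}
print(map_decision(auth))         # -> impostor (claim rejected)
```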
 
Calculation of Performance of an AV Biometric System

As discussed in the seminar, two commonly used error measures for verification performance are :-
 
i) FAR (False Acceptance Rate) :- This error rate denotes how often our biometric system fails to detect impostors. It is given by:

FAR = Ia / I

where Ia = number of impostor claims accepted and I = total number of impostor claims.
 
 
ii) FRR (False Rejection Rate) :- This error rate denotes how often our biometric system rejects a valid client claim. It is given by:

FRR = Cr / C

where Cr = total number of valid client claims rejected and C = total number of client claims.
 
 
We can see that for high-security applications, such as the military, our main aim is to minimize the false acceptance rate, while for low-security applications, such as offices, we need to reduce the FRR.
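A minimal sketch of these two error measures, using the definitions above (the claim counts are made-up illustrative numbers):

```python
def far(impostor_accepted, impostor_total):
    """False Acceptance Rate: FAR = Ia / I."""
    return impostor_accepted / impostor_total

def frr(client_rejected, client_total):
    """False Rejection Rate: FRR = Cr / C."""
    return client_rejected / client_total

# Illustrative counts: 1000 impostor claims with 3 wrongly accepted;
# 500 client claims with 10 wrongly rejected.
print(far(3, 1000))   # 0.003
print(frr(10, 500))   # 0.02
```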
 
 

 
 
 

How will we FUSE Audio and Video ?

Various techniques were discussed in the seminar, which I will briefly cover in this report.

Pre-Mapping Fusion (Early Integration) :-

Pre Mapping fusion can be divided into sensor data level and feature level fusion. In ​sensor 
data level fusion​ , the sensor data obtained from different sensors of the same modality is 
combined. Weighted summation and mosaic construction are typically utilized in order to 
enable sensor data level fusion. In the weighted summation approach the data is first 
normalized, usually by mapping them to a common interval, and then combined utilizing 
weights. 

For example, weighted summation can be utilized to combine multiple visible and infrared images, or to combine acoustic data obtained from several microphones of different types and quality. Mosaic construction can be utilized to create one image from images of parts of the face obtained by several different cameras.
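A toy sketch of sensor-data-level fusion by weighted summation: the 2×2 "images", the [0, 1] normalization interval, and the 0.7/0.3 weights are all illustrative assumptions, not values from the report:

```python
import numpy as np

def weighted_sum_fusion(frames, weights):
    """Combine sensor data from the same modality: normalize each
    frame to a common interval [0, 1], then take a weighted sum."""
    fused = np.zeros_like(frames[0], dtype=float)
    for frame, w in zip(frames, weights):
        lo, hi = frame.min(), frame.max()
        normalized = (frame - lo) / (hi - lo)   # map to [0, 1]
        fused += w * normalized
    return fused

visible = np.array([[10., 200.], [50., 120.]])   # toy visible-light image
infrared = np.array([[0., 90.], [30., 60.]])     # toy infrared image
fused = weighted_sum_fusion([visible, infrared], [0.7, 0.3])
print(fused)
```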

Feature-level fusion represents the combination of the features obtained from different sensors. Joint feature vectors are obtained either by weighted summation (after normalization) or concatenation (e.g., by appending a visual feature vector to an acoustic one, with normalization, in order to obtain a joint feature vector). The features obtained by the concatenation approach are usually of high dimensionality, which can affect reliable training of a classification system (the curse of dimensionality) and ultimately recognition performance. In addition, concatenation does not allow for modeling the reliability of the individual feature streams. For example, in the case of audio and visual streams, it cannot take advantage of information that might be available about the acoustic or visual noise in the environment. Furthermore, the audio and visual feature streams should be synchronized before the concatenation is performed.
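Feature-level fusion by concatenation can be sketched as follows; the feature dimensions and the z-score normalization choice are illustrative assumptions:

```python
import numpy as np

def feature_level_fusion(audio_vec, visual_vec):
    """Concatenate per-frame audio and visual feature vectors into one
    joint vector (the two streams must already be synchronized)."""
    # Z-score normalize each stream so neither dominates by scale.
    a = (audio_vec - audio_vec.mean()) / audio_vec.std()
    v = (visual_vec - visual_vec.mean()) / visual_vec.std()
    return np.concatenate([a, v])

audio = np.random.rand(13)    # e.g., 13 MFCC-like coefficients
visual = np.random.rand(6)    # e.g., 6 lip-shape parameters
joint = feature_level_fusion(audio, visual)
print(joint.shape)            # 19-dimensional joint vector
```

Note how the joint vector's dimensionality is the sum of the two streams' dimensions, which is exactly why concatenation can run into the curse of dimensionality.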

Midst-Mapping Fusion (Intermediate Integration) :- 

With midst-mapping fusion, information streams are processed during the procedure of 
mapping the feature space into the opinion or decision space. 

It exploits the temporal dynamics contained in the different streams, thus avoiding problems resulting from vector concatenation, such as the curse of dimensionality and the requirement of matching frame rates.

Furthermore, stream weights can be utilized in midst-mapping fusion to account for the 
reliability of different streams of information. 

Post-Mapping Fusion (Late Integration) :-

Post-mapping fusion approaches are grouped into decision and opinion fusion (the latter also referred to as score-level fusion). With decision fusion, classifier decisions are combined in order to obtain the final decision by utilizing majority voting, combination of ranked lists, or AND and OR operators. In majority voting the final decision is made when the majority of the classifiers reach the same decision.

The number of classifiers should be chosen carefully in order to prevent ties (e.g., for a two 
class problem, such as speaker verification, the number of classifiers should be odd). 

In ranked-list combination fusion, the ranked lists provided by each classifier are combined in order to obtain the final ranked list. There exist various approaches for combining ranked lists.

In AND fusion the final decision is made only if all classifiers reach the same decision. This type of fusion is typically used for high-security applications where we want to achieve very low FA rates, by allowing higher FR rates.

On the other hand, when the OR fusion method is utilized, the final decision is made as soon as one of the classifiers reaches a decision. This type of fusion is utilized for low-security applications where we want to achieve lower FR rates and prevent causing inconvenience to the registered users, by allowing higher FA rates.
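The three decision-fusion rules above can be sketched as follows (the three classifier votes are illustrative):

```python
def majority_vote(decisions):
    """Accept when more than half of the classifiers accept.
    Use an odd number of classifiers to prevent ties."""
    return sum(decisions) > len(decisions) / 2

def and_fusion(decisions):
    """Accept only if every classifier accepts (low FAR, higher FRR)."""
    return all(decisions)

def or_fusion(decisions):
    """Accept if any classifier accepts (low FRR, higher FAR)."""
    return any(decisions)

votes = [True, False, True]   # e.g., face, voice, lip-dynamics classifiers
print(majority_vote(votes))   # True
print(and_fusion(votes))      # False
print(or_fusion(votes))       # True
```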

Unlike decision fusion, in opinion (score-level) fusion methods the experts do not provide final decisions but only opinions (scores) on each possible decision.

The opinions are usually first normalized by mapping them to a common interval and then combined utilizing weights (e.g., either weighted summation or weighted product fusion). The weights are determined based on the discriminating ability of the classifier and the quality of the utilized features (usually affected by the feature extraction method and/or the presence of different types of noise).

For example, when audio and visual information are employed, the acoustic SNR and/or 
the quality of the visual feature extraction algorithms are considered in determining 
weights. After the opinions are combined, the class that corresponds to the highest opinion 
is chosen. 
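A minimal sketch of weighted-summation opinion fusion: the class names, scores, and the 0.6/0.4 weights are illustrative; in practice the weights would reflect quantities such as the acoustic SNR and visual feature quality:

```python
def score_fusion(audio_scores, visual_scores, w_audio=0.6, w_visual=0.4):
    """Normalize each expert's scores to [0, 1], combine them with
    reliability weights, and pick the class with the highest opinion."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {c: (s - lo) / (hi - lo) for c, s in scores.items()}

    a, v = normalize(audio_scores), normalize(visual_scores)
    fused = {c: w_audio * a[c] + w_visual * v[c] for c in a}
    return max(fused, key=fused.get)

audio = {"alice": 2.1, "bob": 0.4, "carol": 1.0}    # audio expert scores
visual = {"alice": 0.9, "bob": 0.2, "carol": 0.5}   # visual expert scores
print(score_fusion(audio, visual))                  # alice wins overall
```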

 
 
 

Conclusions And Challenges 


In contrast to the abundance of audio-only databases, there exist only a few databases 
suitable for AV biometric research. This is because the field is relatively young, but also due 
to the fact that AV corpora pose additional challenges concerning database collection, 
storage, distribution, and privacy. 
 
Most of the commonly used databases in the literature were collected by a few university groups or individual researchers with limited resources; as a result, they usually contain a small number of subjects and have relatively short duration.
 
AV databases usually vary greatly in the number of speakers, vocabulary size, number of sessions, nonideal acoustic and visual conditions, and evaluation measures. This makes the comparison of different visual features and fusion methods, with respect to the overall performance of an AV biometric system, difficult.
 
There is, therefore, a great need for new, standardized databases and evaluation measures that would enable fair comparison of different systems and represent realistic nonideal conditions. Experiment protocols should also be defined in a way that avoids biased results and allows for fair comparison of different person recognition systems.
 
 
 
 
 
------------------------------- END of REPORT ------------------------------------------ 

 
 
 

 
