
FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO

A Dissertation

Submitted to the Graduate School


of the University of Notre Dame
in Partial Fulfillment of the Requirements
for the Degree of

Doctor of Philosophy

by

Deborah Thomas

Kevin W. Bowyer, Co-Director

Patrick J. Flynn, Co-Director

Graduate Program in Computer Science and Engineering


Notre Dame, Indiana
July 2010

© Copyright by

Deborah Thomas
2010
All Rights Reserved

FACE RECOGNITION FROM SURVEILLANCE-QUALITY VIDEO

Abstract
by
Deborah Thomas
In this dissertation, we develop techniques for face recognition from surveillance-quality video. We handle two specific problems that are characteristic of such video, namely uncontrolled face pose changes and poor illumination. We conduct a study that compares face recognition performance using two different types of probe data, acquired in two different conditions. We describe approaches to evaluate the face detections found in the video sequence to reduce the probe images to those that contain true detections. We also augment the gallery set with synthetic poses generated using 3D morphable models. We show that we can exploit the temporal continuity of video data to improve the reliability of the matching scores across probe frames. Reflected images are used to handle variable illumination conditions to improve recognition over the original images. While there remains room for improvement in the area of face recognition from poor-quality video, we have shown some techniques that help performance significantly.

CONTENTS

FIGURES . . . v

TABLES . . . viii

CHAPTER 1: INTRODUCTION . . . 1
  1.1 Description of surveillance-quality video . . . 1
  1.2 Overview of our work . . . 3
  1.3 Organization of the dissertation . . . 5

CHAPTER 2: PREVIOUS WORK . . . 6
  2.1 Current evaluations . . . 6
  2.2 Pose handling . . . 8
  2.3 Illumination handling . . . 13
  2.4 Other issues . . . 20
  2.5 How this dissertation relates to prior work . . . 28

CHAPTER 3: EXPERIMENTAL SETUP . . . 29
  3.1 Sensors . . . 29
    3.1.1 Nikon D80 . . . 30
    3.1.2 Surveillance camera installed by NDSP . . . 30
    3.1.3 Sony IPELA camera . . . 31
    3.1.4 Sony HDR Camcorder . . . 31
  3.2 Dataset . . . 33
    3.2.1 NDSP dataset . . . 33
    3.2.2 IPELA dataset . . . 35
    3.2.3 Comparison dataset . . . 38
  3.3 Software . . . 43
    3.3.1 FaceGen Modeller 3.2 . . . 43
    3.3.2 IdentityEXPLORER . . . 43
    3.3.3 Neurotechnoligija . . . 45
    3.3.4 PittPatt . . . 45
    3.3.5 CSU's preprocessing and PCA software . . . 46
  3.4 Performance metrics . . . 47
    3.4.1 Rank one recognition rate . . . 47
    3.4.2 Equal error rate . . . 47
  3.5 Conclusions . . . 49

CHAPTER 4: A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING POOR QUALITY DATA . . . 50
  4.1 NDSP dataset: Baseline performance . . . 51
    4.1.1 Experiments . . . 51
    4.1.2 Results . . . 51
  4.2 Comparison dataset . . . 53
    4.2.1 Experiments . . . 53
    4.2.2 Results . . . 54
  4.3 Conclusions . . . 65

CHAPTER 5: HANDLING POSE VARIATION IN SURVEILLANCE DATA . . . 66
  5.1 Pose handling: Enhanced gallery for multiple poses . . . 66
  5.2 Score-level fusion for improved recognition . . . 69
    5.2.1 Description of fusion techniques . . . 69
  5.3 Experiments . . . 73
  5.4 Results . . . 74
    5.4.1 NDSP dataset . . . 74
    5.4.2 IPELA dataset . . . 80
  5.5 Conclusions . . . 83

CHAPTER 6: HANDLING VARIABLE ILLUMINATION IN SURVEILLANCE DATA . . . 85
  6.1 Acquisition setup . . . 86
  6.2 Reflecting images to handle uneven illumination . . . 86
    6.2.1 Averaging images . . . 91
  6.3 Comparison approaches . . . 94
  6.4 Experiments . . . 96
    6.4.1 Test dataset . . . 96
    6.4.2 Face detection . . . 98
    6.4.3 Experiments . . . 98
  6.5 Results . . . 99
  6.6 Conclusions . . . 101

CHAPTER 7: OTHER EXPERIMENTS . . . 103
  7.1 Face detection evaluation . . . 103
    7.1.1 Background subtraction . . . 105
    7.1.2 Approach to pick good frames: Gestalt clusters . . . 108
    7.1.3 Results: Comparing performance on entire dataset and datasets pruned using background subtraction and gestalt clustering . . . 111
  7.2 Distance metrics and number of eigenvectors dropped . . . 116
    7.2.1 Experiments . . . 117
    7.2.2 Results . . . 117

CHAPTER 8: CONCLUSIONS . . . 119

APPENDIX A: GLOSSARY . . . 121

APPENDIX B: POSE RESULTS . . . 123

APPENDIX C: ILLUMINATION RESULTS . . . 130

BIBLIOGRAPHY . . . 135

FIGURES

1.1 Example showing the problem of variable illumination
1.2 Example showing the variable pose in two frames of a video clip
1.3 Example showing the low resolution of the face in the frame, when the subject is too far from the camera
1.4 Example showing the face to be out of view of the camera
3.1 Camera to capture gallery data: Nikon D80 . . . 30
3.2 Surveillance camera: NDSP camera . . . 31
3.3 Surveillance camera: Sony IPELA camera . . . 32
3.4 High-definition camcorder: Sony HDR-HC7 . . . 32
3.5 Gallery image acquisition setup . . . 34
3.6 Example frames from the NDSP camera . . . 36
3.7 Example frames from the IPELA camera . . . 37
3.8 Example frames from IPELA camcorder for the Comparison dataset . . . 39
3.9 Example frames from the Sony HDR-HC7 camcorder . . . 40
3.10 FaceGen Modeller 3.2 Interface . . . 44
3.11 Example of CMC curve . . . 48
3.12 Example of ROC curve . . . 49
4.1 Baseline performance for the NDSP dataset . . . 52
4.2 Detections on surveillance video data acquired indoors . . . 57
4.3 Detections on surveillance video data acquired outdoors . . . 58
4.4 Detections on high-definition video data acquired indoors . . . 59
4.5 Detections on high-definition video data acquired outdoors . . . 60
4.6 Results: ROC curve comparing performance when using high-definition and surveillance data (Indoor video) . . . 61
4.7 Results: ROC curve comparing performance when using high-definition and surveillance data (Outdoor video) . . . 62
4.8 Results: CMC curve comparing performance when using high-definition and surveillance data (Indoor video) . . . 63
4.9 Results: CMC curve comparing performance when using high-definition and surveillance data (Outdoor video) . . . 64
5.1 Frames showing the variable pose seen in a video clip (the black dots mark the detected eye locations) . . . 67
5.2 Synthetic gallery poses . . . 68
5.3 Change in rank matrix for a new incoming image . . . 71
5.4 Results: Comparing rank one recognition rates when adding poses of increasing degrees of off-angle poses . . . 75
5.5 Results: Comparing rank one recognition rates when using frontal, +/-6 degree and +/-24 degree poses . . . 76
5.6 Results: Comparing rank one recognition rate when using fusion techniques to improve recognition . . . 78
5.7 Examples of poorly performing images . . . 81
6.1 Setup to acquire probe data and resulting illumination variation on the face . . . 87
6.2 Comparison of gallery and probe images . . . 88
6.3 Reflection algorithm . . . 89
6.4 Example images: original image, reflected left and reflected right . . . 92
6.5 Average intensity of each column . . . 93
6.6 Reflection algorithm . . . 94
6.7 Example images: original image and averaged image . . . 95
6.8 Example images: original image and quotient image . . . 97
7.1 Example eye detections . . . 104
7.2 Structuring element used for erosion and dilation . . . 106
7.3 Example subject: Ground truth and Viisage locations . . . 109
7.4 Results: Rank one recognition rates when using the entire dataset . . . 113
7.5 Results: Rank one recognition rates when using the dataset after background subtraction . . . 114
7.6 Results: Rank one recognition rates when using the dataset after background subtraction and gestalt clustering . . . 115
B.1 CMC curves: Comparing fusion techniques approaches using a single frame . . . 124
B.2 ROC curves: Comparing fusion techniques using a single frame . . . 125
B.3 CMC curves: Comparing approaches exploiting temporal continuity, using rank-based fusion . . . 126
B.4 ROC curves: Comparing fusion techniques exploiting temporal continuity, using rank-based fusion . . . 127
B.5 CMC curves: Comparing fusion techniques exploiting temporal continuity, using score-based fusion . . . 128
B.6 ROC curves: Comparing fusion techniques approaches exploiting temporal continuity, using score-based fusion . . . 129
C.1 CMC curves: Comparing illumination approaches using a single frame . . . 131
C.2 ROC curves: Comparing illumination approaches using a single frame . . . 132
C.3 CMC curves: Comparing illumination approaches exploiting temporal continuity . . . 133
C.4 ROC curves: Comparing illumination approaches exploiting temporal continuity . . . 134

TABLES

2.1 PREVIOUS WORK . . . 23
3.1 FEATURES OF CAMERAS USED . . . 33
3.2 SUMMARY OF DATASETS . . . 42
4.1 COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO USING PITTPATT . . . 55
4.2 COMPARISON DATASET RESULTS: COMPARISON OF RECOGNITION RESULTS ACROSS CAMERAS USING PITTPATT . . . 56
5.1 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION . . . 79
5.2 RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION ON THE IPELA DATASET . . . 82
6.1 RESULTS: COMPARING RESULTS FOR DIFFERENT ILLUMINATION TECHNIQUES . . . 100
7.1 COMPARING DATASET SIZE OF ALL IMAGES TO BACKGROUND SUBTRACTION APPROACH AND GESTALT CLUSTERING APPROACH . . . 112
7.2 RESULTS: PERFORMANCE WHEN VARYING DISTANCE METRICS AND NUMBER OF EIGENVECTORS DROPPED . . . 118

CHAPTER 1
INTRODUCTION

Face recognition from video is an important area of biometrics research today.


Most of the existing work focuses on recognition from video where the images are of high resolution, contain faces in a frontal pose and the lighting conditions are optimal. However, face recognition from video surveillance has become
an increasingly important goal as more and more video surveillance cameras are
installed in public places. For example, the Metropolitan Police Department has
installed 14 pan, tilt, zoom (PTZ) cameras around the Washington D. C. area
[12]. Also, there are 2,397 cameras installed in Manhattan [30]. Face recognition using such video is a very challenging problem because of the low resolution and poor lighting conditions of such video and the presence of uncontrolled movement. In
this dissertation, we focus on recognition in the presence of uncontrolled pose and
lighting in probe data.

1.1 Description of surveillance-quality video


We describe surveillance-quality video based on four different features. The
four characteristics are (1) variable illumination, (2) variable pose of the subjects
in the video, (3) the low resolution of the faces in the video and (4) obstructions
of the faces in the video.

Figure 1.1. Example showing the problem of variable illumination

Firstly, such video is affected by variable illumination. Oftentimes, surveillance cameras are pointed toward doorways where the sun is streaming in, or the
camera may be in a poorly-lit location. This can change the intensity of the image,
even causing different parts of the image to be illuminated differently, which can
cause problems for the recognition system. In Figure 1.1, we show an example
frame of such video affected by variable illumination.
The second feature of surveillance video is the variable pose of the subject in
the video. The subject is often not looking at the camera and the camera may
be mounted to the ceiling. Therefore, the subject may not be in a frontal pose in the video. While a lot of work has been done using images where the subject is looking directly at the camera, there is a need to explore recognition when the subject is not looking at the camera. In Figure 1.2, we show two such examples.
Another surveillance video characteristic is the low resolution of the face. Usually, such video is of low resolution and covers a large scene. Furthermore, the

Figure 1.2. Example showing the variable pose in two frames of a video
clip

camera may be located far from the subject. Hence, the subject's face may be small, causing the number of pixels on the subject's face to be low, making it
difficult for robust face recognition. In Figure 1.3, we show an image where the
subject is too far from the camera for reliable face recognition.
The last feature of surveillance-quality video is obstruction to the human face.
A perpetrator may be aware of the presence of a camera and try to cover their face
to prevent the camera from capturing their face. Hats, glasses and makeup can
also be used to change the appearance of the face to cause problems for recognition
systems. Sometimes, the positioning of the camera may cause the face to be out
of view of the camera frame as seen in Figure 1.4.

1.2 Overview of our work


In this dissertation, we focus on variable pose and illumination. One theme
that we exploit throughout this dissertation is temporal continuity in the surveillance video.

Figure 1.3. Example showing the low resolution of the face in the frame, when the subject is too far from the camera

Figure 1.4. Example showing the face to be out of view of the camera

One feature of video data that still images lack is the temporal continuity between the frames of the data. The identity of the subject will not
change in an instant, so the multiple frames available can be used for recognition. The matching scores between a pair of probe and gallery subjects can be
made more robust by using decisions about a previous frame for the current one.
First, we compare recognition performance when using surveillance video to performance when using high-resolution video in our probe dataset. We also devise a
technique to evaluate the face detections to prune the dataset to true detections,
to improve recognition performance. We use a multi-gallery approach to make the
recognition system more robust to variable pose in the data. We generate these
poses using synthetic morphable models. We then create reflected images in order
to mitigate the effects of variable illumination.
By combining these techniques we show that we can handle some of the issues
of variable pose and illumination in surveillance data and improve recognition over
baseline performance.
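
As a concrete illustration of the temporal-continuity idea, the sketch below (in Python, with assumed function and parameter names) accumulates per-frame similarity scores with an exponentially weighted running average, so that the decision for the current frame is informed by previous frames. The specific fusion rules used in this dissertation are described in Chapter 5; this is only a minimal example of carrying evidence across frames.

    import numpy as np

    def accumulate_scores(frame_scores, decay=0.8):
        """Combine per-frame match scores over a video clip.

        frame_scores: iterable of 1-D arrays, one array per frame, where
        entry g is the similarity of that frame to gallery subject g
        (higher is better).  An exponentially weighted running average
        carries the decision from previous frames into the current one.
        The decay value is an illustrative assumption."""
        running = None
        for scores in frame_scores:
            scores = np.asarray(scores, dtype=float)
            running = scores if running is None else decay * running + (1 - decay) * scores
        return running  # final fused score per gallery subject

    # Toy example: three frames, four gallery subjects.
    frames = [np.array([0.2, 0.7, 0.1, 0.3]),
              np.array([0.3, 0.6, 0.2, 0.2]),
              np.array([0.1, 0.8, 0.3, 0.2])]
    fused = accumulate_scores(frames)
    print("predicted gallery subject:", int(np.argmax(fused)))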

1.3 Organization of the dissertation


The rest of the dissertation is organized as follows: Chapter 2 describes previous work done in the area. In Chapter 3, we describe the sensors and dataset and
the software used in our experiments. We study the effect of poor quality video on
recognition in Chapter 4. Chapters 5 through 7 describe the work we have done
in this dissertation. Finally, we end with our conclusions in Chapter 8.

CHAPTER 2
PREVIOUS WORK

In this chapter, we describe previous work that looks at face recognition from
unconstrained video. We first describe three studies that explore face recognition
from video. We look at two problems, namely uncontrolled pose and poor lighting
conditions. We then describe different approaches that have been used to handle
both of these problems.

2.1 Current evaluations


Three different studies that address face recognition from video are: the FRVT
2002 Evaluation, FRVT 2006 Evaluation and the Foto-Fahndung report. The
FRVT 2002 report describes the use of three-dimensional morphable models with
video and documents the benefits of using them for face recognition. FRVT 2006
reports on face recognition performance under controlled and uncontrolled lighting. The Foto-Fahndung report describes the face recognition performance of
three different pieces of software, when the data comes from video acquired by
a camera looking at an escalator in a German train station. We describe these
studies in further detail below.
In the FRVT 2002 Evaluation Report [32], face recognition experiments are
conducted in three new areas (three-dimensional morphable models, normalization

and face recognition from video). The first experiment compares face recognition
performance when using still images in the probe set to using 100 frames from a
video sequence, while the subject is talking with varied expression. The video is
similar to that of a mugshot with the added component of change in expression.
The gallery is a set of still images. Among all the participants in FRVT 2002,
except for DreamMIRH and VisionSphere, recognition performance is better when
using a still image rather than when using a video sequence. They observe that
if the subject were walking toward the camera, there would be a change in size
and orientation of the face that would be a further challenge to the system. In
this work, we focus on uncontrolled video, where data is captured using a surveillance camera in uncontrolled lighting conditions, hence performance is expected
to be poor. They also conclude that 3D morphable models provide only slight
improvement over 2D images.
In 2006, the FRVT 2006 Evaluation Report [34] compared face recognition
when using 2D and 3D data. It also explores face recognition when using controlled and uncontrolled lighting. When using 3D data, the algorithms were able
to meet the FRGC [33] goal of an improvement of an order of magnitude over
FRVT 2002. To test the effect of lighting, the gallery data was captured in a
controlled environment, whereas the probe data was captured in an uncontrolled
lighting environment (either indoors or outdoors). Cognitec, Neven Vision, SAIT
and Viisage outperformed the best FRGC results achieved, with SAIT having a
false reject rate between 0.103 and 0.130 at a false accept rate of 0.001. The performance of FRVT participants when using uncontrolled probe data matches that
of the FRVT participants of 2002 when using controlled data. However, they also
show that illumination condition does have a huge effect on performance.

The Foto-Fahndung report [9] evaluates performance of three recognition systems when the data comes from a surveillance system in a German railway station.
They report recognition performance in four distinct conditions based on lighting and movement of the subjects and show that while face recognition systems
can be used in search scenarios, environmental conditions such as lighting and
quick movements influence performance greatly. They conclude that it is possible
to recognize people from video, provided the external conditions are right, especially lighting. They also state that high recognition performance can be achieved
indoors, where the light does not change much. However, drastic changes in
lighting conditions affect performance greatly. They state that "High recognition performance can be expected in indoor areas which have non-varying light conditions. Varying light conditions (darkness, black light, direct sunlight) cause a sharp decrease in recognition performance. A successful utilization of biometric face recognition systems in outdoor areas does not seem to be very promising for search purposes at the moment." [9] They suggest the use of 3D face recognition
technology as a way to improve performance.

2.2 Pose handling


Zhao and Chellappa [51] state that researchers have handled rotation problems
in three ways: (1) Using multiple images per person when they are available (2)
Using multiple training images but only one database image per subject when
running recognition and (3) Using a single image per subject where no training is
required.
Zhou et al. [54] apply a condensation approach to solve numerically the problem of face recognition from video. They point out that most surveillance video

is of poor quality and low image resolution and has large illumination and pose
variations. They believe that the posterior probability of the identity of a subject varies over time. They use a condensation algorithm that determines the
transformation of the kinematics in the sequence and the identity simultaneously,
incorporating two conditions into their model: (1) motion in a short time interval
depends on the previous interval, along with noise that is time-invariant and (2)
the identity of the subject in a sequence does not change over time. When they use
a gallery of 12 still images and 12 video sequences as probes, they achieve 100%
rank one recognition rate. However, the small size of the dataset may contribute
to the high accuracy.
In a later work, they extend this approach to apply to scenarios where the
illumination of probe videos is different from that of the gallery [21], which is also
made up of video clips. Each subject is represented as a set of exemplars from
a video sequence. They use a probabilistic approach to determine the set of the
images that minimizes the expected distance to a set of exemplar clusters and
assume that in a given clip, the identity of the subject does not change, Bayesian
probabilities are used over time to determine the identity of the faces in the frames.
A set of four clips of 24 subjects each walking on a treadmill is used for testing.
The background is plain and each clip is 300 frames long. They achieve 100%
rank one recognition rate on all four combinations of clips as probe and gallery.
Chellappa et al. build on this in [52]. They incorporate temporal information
in face recognition. They create a model that consists of a state equation, an
identity equation (containing information about the temporal change of the identity) and an observation equation. Using a set of four video clips with 25 subjects
walking on a treadmill, (from the MoBo [16] database), they train their model

on one or two clips per subject and use the remaining for testing. They are able
to achieve close to 100% rank one recognition rate overall. They expand on this
work [53] to incorporate both changes in pose within a video sequence and the
illumination change between a gallery and probe. They combine their likelihood
probability between frames over time which improves performance overall. In a
set of 30 subjects, where the gallery set consists of still images, they achieved 93%
rank one recognition rate.
Park and Jain [31] use a view synthesis strategy for face recognition from
surveillance video, where the poses are mainly non-frontal and the size of the faces
is small. They use frontal pose images for their gallery, whereas the probe data
contains variable pose. They propose a factorization method that develops 3D
face models from 2D images using Structure from Motion (SfM). They select a
video frame in which the pose of the face is the closest to a frontal pose, as a
texture model for the 3D face reconstruction. They then use a gradient descent
method to iteratively fit the 3D shape to the 72 feature points on the 2D image.
On a set of 197 subjects, they are able to demonstrate a 40% increase in rank one
recognition performance (from 30% to 70%).
Blanz and Vetter [10] describe a method to fit an image to a 3D morphable
model to handle pose changes for face recognition. Using a single image of a person, they automatically estimate 3D shape, texture and illumination. They use
intrinsic characteristics of the face that are independent of the external conditions
to represent each face. In order to create the 3D morphable model, they use a
database of 3D laser scans that contains 200 subjects from a range of demographics. They build a dense point-to-point correspondence between the face model and
a new face using optical flow. Each face is fit to the 3D shape using seven facial


feature points (tip of nose, corners of eyes, etc.). They try to minimize the sum of
squared differences over all color channels from all pixels in the test image to all
pixels in the synthetic reconstruction. On a set of 68 subjects of the PIE database
[40], they achieve 95% rank one recognition rate when using the side view gallery.
Using the FERET set, with 194 subjects, they achieve 96% rank one recognition
when using the frontal images as gallery and the remaining images as probes.
Huang et al. [48] use 3D morphable models to handle pose and illumination
changes in face video. They create 3D face models based on three training images
per subject and then render 2D synthetic images to be used for face recognition.
They apply a component-based approach for face detection that uses 14 independent component classifiers. The faces are rotated from 0 to 34 degrees in increments of 2 degrees using two different illuminations. At each instance, an image is saved. Out of
the 14 components detected, nine are used for face recognition. The recognition
system consists of second degree polynomial Support Vector Machine classifiers.
When they use 200 images of six different subjects, they get a true accept rate of
90% at a false accept rate of 10%.
Beymer [8] uses a template based approach to represent subjects in the gallery
when there are pose changes in the data. He first applies a pose estimator based
on the features of the face (eyes and mouth). Then, using the nose and the eyes,
the recognition system applies a transform to the input image to align the three
feature points with a training image. When using 930 images for training the
detector and 520 images for testing, the features are correctly detected 99.6% of
the time. For recognition, a feature-level set of systems is used for each eye, nose
and mouth. The probe images are compared only to those gallery images closest
to its pose. Then he uses a sum of correlations of the best matching eye, nose and


mouth templates to determine the best match. On the set of 62 subjects, when
using 10 images per subject with an inter-ocular distance of about 60 pixels in the
images, the rank one recognition rate is 98.39%. However, this is a relatively large
inter-ocular distance for good face recognition and not usually typical of faces in
surveillance quality video.
Arandjelovic and Cipolla [3] deal with face movement and observe that most
strategies use the temporal information of the frames to determine identity. They
propose a strategy that uses Resistor Average Distance (RAD), which is a measure
of dissimilarity between two disjoint probabilities. They claim that PCA does not
capture true modes of variation well and hence a Kernel PCA is used to map the
data to a high-dimensional space. Then, PCA can be applied to find the true
variations in the data. For recognition, the RAD between the distributions of
sets of gallery and probe points is used as a measure of distance. They test their
approach on two databases. One database contains 35 subjects and the other
contains 60 subjects. In both datasets, the illumination conditions are the same
for training and testing. They achieve around 98% rank one recognition rate on
the larger dataset.
Thomas et al. [43] use synthetic poses and score-level fusion to improve recognition when there is variable pose in the data. They show that recognition can
be improved by exploiting temporal continuity. The gallery dataset consists of
one high-quality still image per subject. Using the approach in [10] to generate
synthetic poses, the gallery set is enhanced with multiple images per subject. A
dataset of 57 subjects is used, which contains subjects walking around a corner in
a hallway. When they use the original gallery images and treat each probe image
as a single frame with no temporal continuity, they achieve a rank one recognition rate of 6%. However, by adding synthetic poses and exploiting temporal
continuity, they improved rank one recognition performance to 21%.

2.3 Illumination handling


Zhou et al. [55] separate the strategies to handle changes in illumination into
three categories. The first set of approaches is called subspace methods. These
approaches are most commonly used in recognition problems. Some common examples of this class of approaches are PCA [44] and LDA [50]. However, the
disadvantage of such techniques is that they are tuned to the illumination conditions that they are trained on. When the gallery set consists of still images taken
indoors under controlled lighting conditions and the probe set is of surveillance
quality video acquired under uncontrolled lighting conditions, recognition performance is poor. The second set of approaches is reflectance model methods. A
Lambertian reflectance model is used to model lighting. The disadvantage of this
approach is that it is not as effective an approach when the subjects in the testing
set are not encountered in the training set. The third set of approaches uses 3D
models for representation. These models are robust to illumination effects. However, they require a sensor that can capture such data or the data needs to be
built based on 2D images.
Adini et al. [2] describe four image representations that can be used to handle illumination changes. They divide the approaches to handle illumination into
three categories: (1) Gray level information to extract a three-dimensional shape
of the object (2) A stored model that is relatively insensitive to changes in illumination and (3) A set of images of the same object under different illuminations.
The third approach may not be realistic given the experiment and the setup.


Furthermore, one may not be able to fully capture all the possible variations in
the data. While it has been shown theoretically that a function invariant to illumination does not exist, there are representations that are more robust than
others [2]. The four representations they consider are (1) the original gray-level
image, (2) the edge map of the image, (3) the image filtered with 2D Gabor-like
filters and (4) the second-order derivative of the gray level image [2]. Some edges
of the image can be insensitive to illuminations whereas others are not. However,
an edge map is useful in that it is a compact representation of the original image. Derivatives of the gray level image are useful because while ambient light
will affect the gray level image, under certain conditions it does not affect the
derivatives. In order to make the images more robust, they divide the face into
two sub parts by creating subregions of the eyes area and the lower part of the
face. They show that in highly variable lighting, the error rate is 100% on raw
gray level images, where there are changes in illumination direction. Performance
improves when using the filtered images. They also show that even though the
filtered images do not resemble the original face, they encode information to improve recognition. However, they conclude that no one representation is sufficient
to overcome variations in illumination. While some are robust to changes along
the horizontal axis, others are more robust along the vertical axis. Hence, the
different approaches need to be combined to exploit the benefits of each of them.
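
As an illustration of the kinds of representations Adini et al. [2] consider, the sketch below computes an edge map and a second-order derivative (Laplacian of Gaussian) of a gray-level image using SciPy. The filter scales are assumed values chosen for illustration, not those used in [2].

    import numpy as np
    from scipy import ndimage

    def illumination_robust_representations(image):
        """Compute two of the representations discussed in [2]: an edge map
        and a second-order derivative of the gray-level image.
        `image` is a 2-D float array (grayscale face)."""
        # Edge map: gradient magnitude from horizontal and vertical Sobel responses.
        gx = ndimage.sobel(image, axis=1)
        gy = ndimage.sobel(image, axis=0)
        edge_map = np.hypot(gx, gy)

        # Second-order derivative: Laplacian of Gaussian at a small, assumed scale.
        second_deriv = ndimage.gaussian_laplace(image, sigma=2.0)
        return edge_map, second_deriv

    # Example on a synthetic face-sized patch.
    face = np.random.rand(64, 64)
    edges, log_image = illumination_robust_representations(face)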
Zhao and Chellappa [51] use 3D models to handle the problems of illumination in face recognition. They create synthesized images acquired under different
lighting and viewing conditions. They develop a 2D prototype image from a 2D
image acquired under variable lighting using a generic 3D model, rather than a
full 3D approach that uses accurate 3D information. For the generic 3D model,


a laser scanned range map is used. They use a Lambertian model to estimate the
albedo value, which they determine using a self-ratio image, which is the illumination ratio of two differently aligned images. Using a 3D generic model, they
bypass the 2D to 3D step, since the pose is fixed in their dataset. When they test
their approach using the Yale database, on a set of 15 subjects with 4 images each
they obtain 100% rank one recognition rate, which was an improvement of about
25% improvement over using the original images (about 75% rank one recognition
rate).
Wei and Lai [47] describe a robust technique for face recognition under varying lighting conditions. They use a relative image gradient feature to represent
the image, which is the image gradient function of the original intensity image,
where each pixel is scaled by the maximum intensity of its neighbors. They use
a normalized correlation of the gradient maps of the probe and gallery images to
determine how well the images match. On the CMU-PIE face database [40], which
contains 22 images under varying illuminations of 68 individuals, they obtain an
equal error rate of 1.47% and show that their approach outperforms recognition
when using the original intensity images.
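
A minimal sketch of the relative image gradient idea is shown below, under the assumption that the gradient magnitude at each pixel is divided by the maximum intensity in a small neighborhood; the window size and the small constant that guards against division by zero are illustrative choices rather than values taken from [47].

    import numpy as np
    from scipy import ndimage

    def relative_image_gradient(image, window=3, eps=1.0):
        """Gradient magnitude at each pixel divided by the maximum intensity
        in its local neighborhood (cf. Wei and Lai [47]).  `window` and `eps`
        are assumed values, not those of the paper."""
        gx = ndimage.sobel(image, axis=1)
        gy = ndimage.sobel(image, axis=0)
        grad_mag = np.hypot(gx, gy)
        local_max = ndimage.maximum_filter(image, size=window)
        return grad_mag / (local_max + eps)   # eps avoids division by zero

    def normalized_correlation(a, b):
        """Matching score: normalized correlation of two gradient maps."""
        a = (a - a.mean()) / (a.std() + 1e-8)
        b = (b - b.mean()) / (b.std() + 1e-8)
        return float((a * b).mean())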
Price and Gee [36] also propose a PCA-based approach to address three issues
that could cause problems in face recognition, namely illumination, expression and
decoration (specifically, glasses and facial hair). They use an LDA-based approach
to handle changes in illumination and expression. They note that subregions of
the face are less sensitive to expression and decoration than the full face. So they
break the face into modular subregions: the full face, the region of the eyes and
the nose and then just the eyes. For each region, they independently determine
the distance from that region to each of the corresponding images in the database.


Hence, they have a parallel system of observations, one for each region mentioned
above. They then use a combination of results as their matching score to determine
the best match. They use a database of 106 subjects with varied illumination,
expression and decoration, where 400 still images are used for training and 276
for testing. When they combine the results from the three observers, using PCA
and LDA, they achieve a rank one recognition rate of 94.2%.
Hiremath and Prabhakar [18] use interval-type discriminating features to generate illuminant invariant images. They create symbolic faces for each subject
in each illumination type based on the maximum and minimum value found at
each pixel for a given dataset. While this is an appearance-based approach, it
does not suffer the same drawbacks as other approaches because it uses interval
type features. Therefore, it is insensitive to the particular illumination conditions
in which the data is captured within the range of illuminations in the training
data. They then use Factorial Discriminant Analysis to find a suitable subspace
with optimal separation between the face classes. They test their approach using
the CMU PIE [40] database and get a 0% error rate. This approach is advantageous in that it does not require a probability distribution of the image gradient.
Furthermore, it does not use any complex modeling of reflection components or
assume a Lambertian model. However, it is limited by the range of illuminations
found in the training data. Therefore, it may not be applicable in cases where
there is a difference in the illuminations between the gallery and probe sets.
Belhumeur et al. [6] use LDA to produce well-separated classes for robustness
to lighting direction and facial expression, and compare their approach to using
eigenfaces (PCA) for recognition. They conclude that LDA performs the best
when there are variations in lighting or even simultaneous changes in lighting


and expression. They also state that "In the [PCA] method, removing the first three principal components results in better performance under variable lighting conditions." [6] Their experiments use the Harvard database [17] to test variation
in lighting. The Harvard database contains 330 images from 5 subjects (66 images
each). The images are divided into five subsets based on the direction of the light
source (0, 30, 45, 60, 75 degrees). The Yale database consists of 16 subjects with
10 images each taken on the same day but with variation in expression, eyewear
and lighting. They use a nearest neighbor classifier for matching, though the
measure used to determine distance was not specified. The variation in expression
and lighting is tested using a leave-one-out error estimation strategy on all 16
subjects. They train the space on nine of the images and then test it using the
image left out and achieve a 0.6% recognition error rate using LDA and a 19.4%
recognition error rate using PCA, with the first three dimensions dropped. They
do mention that the databases are small and more experimentation using larger
databases is needed.
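
The observation about dropping the leading principal components is easy to express in code. The sketch below builds an eigenface subspace from flattened training faces and discards the first few components before projection; the number of components kept and dropped are assumed values for illustration, not a prescription from [6].

    import numpy as np

    def eigenface_space(train_images, n_components=20, drop_first=3):
        """Build a PCA (eigenface) subspace and drop the leading components,
        which tend to capture illumination variation (cf. Belhumeur et al. [6]).
        train_images: (n_samples, n_pixels) array of flattened faces."""
        mean_face = train_images.mean(axis=0)
        centered = train_images - mean_face
        # Rows of vt are the principal directions, ordered by decreasing variance.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[drop_first:drop_first + n_components]   # skip the first few
        return mean_face, basis

    def project(image, mean_face, basis):
        """Project a flattened face into the reduced eigenface space."""
        return basis @ (image - mean_face)

    # Example with random stand-in data: 50 training faces of 32x32 pixels.
    train = np.random.rand(50, 32 * 32)
    mean_face, basis = eigenface_space(train)
    coeffs = project(np.random.rand(32 * 32), mean_face, basis)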
Arandjelovic and Cipolla [5] handle variation in illumination and pose using
clustering and Gamma intensity correction. They create three clusters per subject corresponding to different poses and use locations of pupils and nostrils to
distinguish between the three clusters. Illumination is handled using Gamma intensity correction. Here, the pixels in each image are transformed so as to match
a canonically illuminated image. Pose and illumination are combined by performing PCA on variations of each person's images under different illuminations from a given person's mean image and using simple Euclidean distance as their
distance measure. In order to match subjects to a novel image, they use the ratio
of the probability that the three clusters belong to the same subject over the probability that they belong to a different subject. Their dataset consists of 20 subjects
for training and 40 others for testing, where each subject has 20-100 images in
random motion. They achieve 95% rank one recognition rate using this approach.
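
A minimal sketch of Gamma intensity correction in this spirit is shown below: a single exponent is searched for that brings an image as close as possible to a canonically illuminated reference. The squared-error criterion and the search range are assumptions made for the example, not details taken from [5].

    import numpy as np
    from scipy.optimize import minimize_scalar

    def gamma_correct(image, canonical):
        """Find the exponent gamma that makes image**gamma closest (in squared
        error) to a canonically illuminated reference.  Both inputs are
        grayscale arrays scaled to [0, 1]; the search range is an assumption."""
        def cost(gamma):
            return float(np.mean((image ** gamma - canonical) ** 2))
        res = minimize_scalar(cost, bounds=(0.2, 5.0), method="bounded")
        return image ** res.x, res.x

    # Example: darken a reference image, then recover a correction toward it.
    canonical = np.clip(np.random.rand(64, 64), 0.01, 1.0)
    dark = canonical ** 2.2
    corrected, gamma = gamma_correct(dark, canonical)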
Arandjelovic and Cipolla [4] evaluate strategies to achieve illumination invariance when there are large and unpredictable illumination changes. In these
situations, the difference between two images of the same subject under different
illuminations is larger than that of two images under the same illumination but
of different subjects. Hence, they focus on ways to represent the subject's face
and put more emphasis on the classification stage. They show that both the high
pass filter and the self quotient image operations on the original intensity image
show recognition improvement over the raw grayscale representation of the images,
when the imaging conditions between the gallery and probe set are very different.
However, they also note that while they improve recognition in the difficult cases,
they actually reduce performance in the easy cases. They conclude that the Laplacian of Gaussian representation of the image as described in [2] and a quotient
image representation perform better than using the raw image. They demonstrate
a rank one recognition rate improvement from about 75% using the raw images,
to 85% using the Laplacian of Gaussian representation, to about 90%, using quotient images. Since we are dealing with conditions which change drastically and
where the conditions for gallery and probe data differ, we use these approaches to
improve recognition in this work.
Gross and Brajovic [15] use an illuminance-reflectance model to generate images that are robust to illumination changes. Their model makes two assumptions:
that human vision is mostly sensitive to scene reflectance and mostly insensitive to illumination conditions, and that human vision responds to local changes in contrast rather than to global brightness levels [15]. Since they focus on preprocessing the images based on the intensity, there is no training required. They
test their approach using the Yale database, which contains 10 subjects acquired
under 576 lighting conditions. When using PCA for recognition, they improve
the rank one recognition rate from 60% to 93% when using reflectance images instead of the original intensity images.
Wang et al. [46] expand on the approach in [15] and use self-quotient images to handle the illumination variation for face recognition. The Lambertian model of an image can be separated into two parts, the intrinsic and extrinsic part. If one can estimate the extrinsic part based on the lighting, it can be factored out of the image to retain the intrinsic part for face recognition. The self-quotient image is found by smoothing the image with a kernel and dividing the image pixels by this filtered result. Let F be the smoothing filter and I the original image; then the self-quotient image Q is defined as

Q = I / F[I].

They demonstrate their approach on the Yale and PIE datasets and show improvement over using the intensity images for recognition, from about 50% to about 95% rank one recognition rate.
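
A minimal sketch of the self-quotient image computation is shown below, assuming an isotropic Gaussian as the smoothing filter F; Wang et al. [46] use a weighted, anisotropic smoothing kernel, so this is only an approximation of their operator.

    import numpy as np
    from scipy import ndimage

    def self_quotient_image(image, sigma=4.0, eps=1e-3):
        """Self-quotient image Q = I / F[I], with an isotropic Gaussian as F.
        The sigma and eps values are illustrative choices, not those of [46]."""
        smoothed = ndimage.gaussian_filter(image, sigma=sigma)
        return image / (smoothed + eps)   # eps guards against division by zero

    # Example: apply to a grayscale face scaled to [0, 1].
    face = np.random.rand(96, 96)
    q = self_quotient_image(face)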
Nishiyama et al. [25] show that self-quotient images [46] are insufficient to handle partial cast shadows or partial specular reflection. They handle this weakness
by using an appearance-based quotient image. They use photometric linearization
to transform the image into the diffuse reflection. A linearized image is defined as
a linear combination of three basis images. In order to generate the basis images
to find the diffuse image, different images from other subjects are used. They
acquired images under fixed pose with a moving light source. The reflectance


image is then factored out using the estimated diffuse image. They compare their
algorithm to the self-quotient image and the quotient image on the Yale B database and show that they achieve a rank one recognition rate of 96%, whereas self-quotient images achieve 87% rank one recognition rate and Support Retinex images [37] achieve a rank one recognition rate of 93%.

2.4 Other issues


Howell and Buxton [19] propose a strategy for face recognition when using
low-resolution video. Their goal is to capture similarities of the face over a wide
range of conditions and solve the problem for just a small group (less than 100)
of subjects. The environment is unconstrained in that there are no restrictions on
movement. They use the temporal information of the frames linked by movement
information to match the frames. This allows them to make the assumption
that between two consecutive frames, the identity of the subject will not change
instantly. They use a two-layer, hybrid learning network with a supervised and
unsupervised layer and adjust weights using the Widrow-Hoff delta learning rule.
The network is trained to include the variation that they want their system to
tolerate. From a set of 400 images of 40 people, using 5 images per subject, and
discarding frames that do not include a face, they are able to achieve 95% rank
one recognition rate.
Lee et al. [22] discuss an approach to handle low resolution video using support
vector data description (SVDD). They project the input images as feature vectors
on the spherical boundary of the feature space and conduct face recognition using
correlation on the images normalized based on the inter-ocular distance. They use
the Asian Face database for their experiments and different resolutions, ranging


from 16 x 16 pixels to 128 x 128 pixels and achieve a rank one recognition rate of
92% when using the lowest resolution images.
Lin et al. [23] describe an approach to handle face recognition from video of low
resolution like those found in surveillance. They use optical flow for registration to handle issues of "non-planarity, non-rigidity, self-occlusion and illumination and reflectance variation" [23]. For each image in the sequence, they interpolate
between the rows and columns to obtain an image that is twice the size of the
original image. They then compute optical flow between the current frame and
the two previous and two next images and register the four adjacent images using displacements estimated by the optical flow. Then they compute the mean
using the registered images and the reference images. The final step is to apply a
deblurring Wiener deconvolution filter to the super-resolved image. They test
their approach on the CUAVE database, which contains 36 subjects. When they
reduce the images to 13x18 pixels, their approach (approximately 15% FRR at
1% FAR) performs slightly better than bilinear interpolated images and far outperforms nearest neighbor interpolation. They expand on this work in [24] and
compare their approach to a hallucination approach (assumes a frontal view of
face and works well when faces are aligned exactly). They conclude that while
there is some improvement gained over using the lower-resolution images, a fully automated recognition system is currently impractical, given the performance. Hence, they relax their constraint to a rank ten match and can achieve 87.3% rank ten recognition rate on the XM2VTS dataset, which contains 295 subjects.
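
The registration-and-averaging step of this kind of approach can be sketched as follows, assuming OpenCV's dense Farneback optical flow for registration and a simple Wiener filter for the final deblurring; the parameter values are illustrative and are not those used by Lin et al. [23].

    import cv2
    import numpy as np
    from scipy.signal import wiener

    def super_resolve(frames, ref_index=2):
        """Upsample each frame 2x, register the neighbours of a reference frame
        to it with dense optical flow, average the registered frames, then apply
        a Wiener filter.  frames: list of 8-bit grayscale images (e.g. five
        consecutive frames); all values here are illustrative assumptions."""
        up = [cv2.resize(f, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)
              for f in frames]
        ref = up[ref_index]
        h, w = ref.shape
        grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                     np.arange(h, dtype=np.float32))
        registered = [ref.astype(np.float32)]
        for i, frame in enumerate(up):
            if i == ref_index:
                continue
            flow = cv2.calcOpticalFlowFarneback(ref, frame, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            warped = cv2.remap(frame, grid_x + flow[..., 0], grid_y + flow[..., 1],
                               cv2.INTER_LINEAR)
            registered.append(warped.astype(np.float32))
        averaged = np.mean(registered, axis=0)
        return wiener(averaged, mysize=3)   # mild deblurring of the averaged frame

    # Example: five consecutive 60x80 grayscale frames (random stand-ins here).
    frames = [np.random.randint(0, 256, (60, 80), dtype=np.uint8) for _ in range(5)]
    sr = super_resolve(frames)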
In Table 2.1, we summarize the different approaches along with their assumptions, dataset size and performance. We divide up the works based on the problem
they are trying to solve: (1) variable pose, (2) variable illumination and (3) other problems, such as low resolution on the face. Performance is reported in rank one
recognition rate, unless otherwise specified. Some of the results are reported in
terms of equal error rate (or EER). Also, the results must be viewed in light of
the difficulty of the dataset (data features) and dataset size.


TABLE 2.1

PREVIOUS WORK

Authors: Title | Basic idea | Data features | Dataset size | Performance (1)

APPROACHES TO HANDLE VARIABLE POSE

Zhou: [54] | Posterior probability over time | Constrained video | 12 | 100%
Blanz and Vetter: [10] | 3D morphable model, point-to-point correspondence for similarity | Variable pose | 194 | 96%
Weyrauch, et al.: [48] | 3D models based on 2D training images, render synthetic poses for recognition | 2 different illuminations, faces rotated | 6 | 90%
Park and Jain: [31] | View synthesis strategies, 3D face models using SfM | Non-frontal faces | 193 | 70%
Beymer: [8] | Template-based approach, feature-level system combined using sum of correlations | Pose changes | 62 | 98.39%
Krueger, Zhou: [21] | Exemplar clusters to represent subjects, Bayesian probabilities over time to evaluate identity | Subject on treadmill, frontal video | 24 | 100%
Zhou, Chellappa: [52] | State of identity equation, temporal continuity | Subject on treadmill, frontal video | 25 | About 100%
Zhou, Chellappa: [53] | Likelihood probability between frames over time | Subject on treadmill, frontal video | 30 | 93%
Arandjelovic, Cipolla: [3] | Use Kernel PCA to reduce the dimensionality of images to nearly linear. Apply RAD to calculate distances between two sets of images | Face movement; no illumination change | 36 | About 97%

APPROACHES TO HANDLE VARIABLE ILLUMINATION

Zhao, Chellappa: [51] | Synthetic images acquired under different lighting, use a Lambertian model to handle the albedo | Varied illumination | 15 | 100%
Wei and Lai: [47] | Modified image intensity function, normalized correlation for matching | Varying illuminations | 68 | 1.47% EER
Price and Gee: [36] | LDA based approach using subregions of the face | Varied illumination, expression, decoration | 106 | 94.2%
Hiremath, Prabhakar: [18] | Symbolic interval type features to represent face classes, Factor Discriminant Analysis to reduce dimensions | Different illuminations, still images | 68 | 0% EER
Belhumeur et al.: [6] | LDA for recognition, trained on multiple samples per subject with varying illumination | Variable lighting | 5 | 0.6% EER
Arandjelovic and Cipolla: [4] | Use image filters since the illumination is unpredictable | Variable lighting | 100 | 73.6%
Arandjelovic, Cipolla: [5] | Create Gaussian clusters corresponding to various poses; apply Gamma intensity correction; Euclidean distances to determine difference | Varied illumination | 96 | 95%

APPROACHES TO HANDLE LOW-RESOLUTION VIDEO

Howell, Buxton: [19] | Exploit temporal information of frames, two-layer hybrid learning network, adjusted using the Widrow-Hoff learning rule | Unconstrained environment and movement | 40 | 95%
Lee, et al.: [22] | Support vector data description, correlation for recognition | Low resolution | — | 92%
Lin et al.: [23] | Use optical flow for registration, create single SR frame from 5 registered frames, use PCA with MahCosine distance for recognition | Subject talking | 36 | 15% FRR at 1% FAR

(1) Performance is reported as rank one recognition rate unless otherwise specified.

2.5 How this dissertation relates to prior work


In this dissertation, we focus on the variations in pose and uncontrolled lighting.
To handle the variations in pose, we use a multi-gallery approach to represent all the poses in the dataset. We create synthetic poses to represent those that may be present in our probe set. We then use score-level fusion. This approach requires no training and thus is useful for datasets where the poses in the probe and gallery sets differ. On a dataset of 57 subjects, we achieve a rank one recognition rate of 21%, an improvement over the 6% rank one recognition rate achieved using the baseline approach. The baseline approach used for comparison is described in Section 4.1. This work is described in Chapter 5.
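
To make the multi-gallery idea concrete, the sketch below fuses scores over an enhanced gallery that holds several poses per subject, keeping the best (smallest) distance per subject as the fused score. The minimum-distance rule and the feature representation here are assumptions for illustration; the fusion techniques actually evaluated are described in Chapter 5.

    import numpy as np

    def fuse_multi_pose_scores(probe_feature, gallery):
        """Illustrative score-level fusion over an enhanced multi-pose gallery.
        `gallery` maps subject id -> list of feature vectors (frontal image plus
        synthetic off-angle poses).  For each subject, the smallest distance
        over that subject's poses is kept as the fused score."""
        fused = {}
        for subject_id, pose_features in gallery.items():
            distances = [np.linalg.norm(probe_feature - f) for f in pose_features]
            fused[subject_id] = min(distances)
        return min(fused, key=fused.get), fused   # best match and all fused scores

    # Toy example: two gallery subjects, three poses each, 16-D features.
    rng = np.random.default_rng(0)
    gallery = {"s01": [rng.normal(size=16) for _ in range(3)],
               "s02": [rng.normal(size=16) for _ in range(3)]}
    probe = gallery["s02"][0] + 0.05 * rng.normal(size=16)
    best, scores = fuse_multi_pose_scores(probe, gallery)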
To handle the lighting conditions, we use an appearance-based model, which does not require any training data or knowledge of the model. We create reflected images by using one half of the face and reflecting it over the other half. We then use score-level fusion to combine the two sets of results. We demonstrate an improvement relative to the self-quotient image and quotient image approaches, which assume a Lambertian model. On a dataset of 26 subjects, we show a rank one recognition rate of 49.88% and an equal error rate of 18.27%, whereas baseline performance using the original images is a 38.62% rank one recognition rate and a 19.27% equal error rate. Here, baseline performance is the performance achieved when using the original images as obtained from the surveillance cameras, with no preprocessing. This work is described in Chapter 6.
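
The reflection step itself can be sketched as follows: each probe face yields one image built from the left half and its mirror and one built from the right half and its mirror. This is only a minimal illustration of the idea; alignment, the choice of which reflected image to trust, and the score-level fusion of the two results are detailed in Chapter 6.

    import numpy as np

    def reflected_images(face):
        """Build two reflected faces: one from the left half mirrored onto the
        right, and one from the right half mirrored onto the left.  `face` is a
        2-D grayscale array; an odd-width center column, if any, is handled by
        splitting at width // 2."""
        h, w = face.shape
        half = w // 2
        left = face[:, :half]
        right = face[:, w - half:]
        reflected_left = np.hstack([left, left[:, ::-1]])     # left half + its mirror
        reflected_right = np.hstack([right[:, ::-1], right])  # mirror + right half
        return reflected_left, reflected_right

    # Example usage on a face crop (values in [0, 255]).
    face = np.random.randint(0, 256, size=(80, 64)).astype(np.uint8)
    ref_left, ref_right = reflected_images(face)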


CHAPTER 3
EXPERIMENTAL SETUP

In this chapter, we describe the sensors, data sets and software used in our
experiments. We acquire three different datasets for our experiments. We label
them the NDSP, IPELA and Comparison datasets. The first dataset is used to show baseline performance and is used in our pose and face detection experiments.
The IPELA dataset is used for our reflection experiments to handle pose and
illumination variation. Finally, the Comparison dataset is used to compare face
recognition performance when using high-quality data acquired on a camcorder
and when using data acquired on a surveillance camera.
The rest of the chapter is organized as follows: Section 3.1 describes the different sensors we use in our experiments. We then describe our datasets in Section
3.2. Finally, the software we use is described in Section 3.3.

3.1 Sensors
We capture data using four different sensors. The first camera is a Nikon D80,
used to acquire the gallery data used in the experiments. The second camera is
a PTZ camera installed by the Notre Dame Security Police. The third camera is
a Sony IPELA camera with PTZ capability. The fourth camera is a Sony HDR
camcorder used to capture data as a comparison to the surveillance-quality data.
We describe each in detail below.

Figure 3.1. Camera to capture gallery data: Nikon D80

3.1.1 Nikon D80


The gallery data is acquired on a Nikon D80 [28]. It is a digital single-lens
reflex (SLR) camera. The resolution of the images is 3872 x 2592 pixels. The
camera is shown in Figure 3.1.

3.1.2 Surveillance camera installed by NDSP


The probe data is acquired using a surveillance video camera with PTZ (pan,
tilt, zoom) capability. The camera is part of the NDSP security system and is
attached to the ceiling on the first floor of Fitzpatrick Hall, as seen in Figure
3.2. The resolution of this camera is 640 x 480 pixels. The data is captured in interlaced mode.

Figure 3.2. Surveillance camera: NDSP camera

3.1.3 Sony IPELA camera


We also acquire data on a Sony SNC RZ25N surveillance camera [41]. The resolution of this camera is 640 x 480 pixels and the data is captured in interlaced mode. In Figure 3.3, we show an image of this camera.

3.1.4 Sony HDR Camcorder


For our comparison dataset, we also acquire high quality data on a Sony HDR
camcorder [42]. The video was captured at a frame rate of 29.97 frames per second
in interlaced mode. In Figure 3.4, we show an image of this camcorder.
In Table 3.1, we compare all the cameras used in this dissertation.


Figure 3.3. Surveillance camera: Sony IPELA camera

Figure 3.4. High-definition camcorder: Sony HDR-HC7


TABLE 3.1
FEATURES OF CAMERAS USED

Name used     Model            Resolution   Image size   Interlaced
Nikon D80     Nikon D80        3872x2592    3,732 kb     No
NDSP camera   Not available    640x480      40 kb        Yes
IPELA         Sony SNC RZ25N   640x480      52 kb        Yes
Sony HD       Sony HDR-HC7     1920x1080    466 kb       Yes

3.2 Dataset
We describe three datasets. They are named NDSP, IPELA and Comparison
dataset based on the camera used to acquire them and the experiments for which
they are used.

3.2.1 NDSP dataset


We use two kinds of sensors to acquire data for this dataset. The gallery
data containing high quality still images is acquired using the Nikon D80 camera.
The subject is sitting about two meters from the camera in a controlled well-lit
environment, in front of a gray background. The inter-ocular distance is about
230 pixels, with a range of between 135 and 698 pixels. In Figure 3.5, we show
the set up and two of the images acquired for the gallery.
(a) Acquisition setup for gallery data

(b) Example gallery images

Figure 3.5. Gallery image acquisition setup

The probe data is acquired using the NDSP surveillance camera, located on the first floor of Fitzpatrick Hall. The video consists of a subject entering through a glass door and walking around the corner until they are out of the camera's view.
Each video sequence consists of between 50 and 150 frames. In Figure 3.6 we
show 10 frames acquired from this camera. We see that the illumination is highly
uneven due to the glare of the sun on the subject. The inter-ocular distance is
about 40 pixels on average. The pan, tilt and zoom are not changed during data
acquisition but could vary from day to day since the camera is part of a campus
security system. There are 57 subjects in this dataset. The time lapse between
the probe and gallery data varies from two weeks to about six months.

3.2.2 IPELA dataset


The gallery images are acquired using the Nikon D80, in a well-lit room under
controlled conditions. The subject is sitting about two meters from the camera in
front of a gray background, with a neutral expression, as seen in Figure 3.5. Since
these images are acquired indoors, the illumination is controlled. The inter-ocular
distance is about 300 pixels.
The probe data is acquired from the IPELA camera. The zoom on this camera
is set so that the inter-ocular distance of the subject is about 50 pixels, starting at
about 30 pixels when the subject enters the scene and is farthest from the camera
to about 115 pixels when the subject is closest to the camera. It is mounted on
a tripod set at a height of about five and a half feet. The camera position is not
changed during a day of capture, but may vary slightly from day to day. Each
clip consists of the subject walking around a corner until they are out of the view
of the camera. Therefore, we capture data of the subject in a variety of poses and
face sizes. Each video sequence is made up of 100 to 200 frames.

Figure 3.6: Example frames from the NDSP camera

Figure 3.7: Example frames from the IPELA camera

In Figure 3.7, we show 10 example frames acquired from one subject from one of the clips. The illumination is also uncontrolled. This dataset consists of 104 subjects. The time
lapse between the probe and gallery data is between two weeks and six months.
Splitting up the dataset: In order to test our approach when using the
surveillance data, we use four-fold cross validation. We split up the dataset into
four disjoint subsets, where each set contains 26 subjects. The sets are subject-disjoint. For our experiments, we train the space on three subsets and test on the remaining subset. We use the average of the four scores as our measure of performance of the different approaches.
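As an illustration, the following is a minimal Python sketch of how such a subject-disjoint split could be generated. The function name, the use of Python's random module and the fixed seed are illustrative choices, not a description of the exact tooling used in our experiments.

    import random

    def subject_disjoint_folds(subject_ids, n_folds=4, seed=0):
        """Split a list of subject IDs into n_folds disjoint subsets.

        Every image of a given subject stays inside exactly one fold, so the
        training and test sets never share subjects.
        """
        ids = list(subject_ids)
        random.Random(seed).shuffle(ids)
        return [ids[i::n_folds] for i in range(n_folds)]

    # Example: 104 IPELA subjects -> four folds of 26 subjects each.
    folds = subject_disjoint_folds(range(104))
    for test_fold in range(4):
        test_subjects = set(folds[test_fold])
        train_subjects = [s for f in range(4) if f != test_fold for s in folds[f]]
        # train the face space on train_subjects, evaluate on test_subjects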

3.2.3 Comparison dataset


For each of the subjects in this dataset, we have one high-quality still image, one high-quality video sequence and one video clip acquired from the IPELA surveillance camera. The gallery data is acquired in a well-lit room under controlled conditions. The subject is sitting about two meters from the camera against
a gray background. We show an example image in Figure 3.5.
The IPELA camera and HD Sony camcorder are set up to acquire data in
the same setting. The zoom on the IPELA camera is set so that the inter-ocular
distance of the subject is about 40 pixels on average. It is mounted on a tripod
set at a height of about five and one half feet. The camera position is not changed
during a day of capture, but may vary slightly from day to day. In each clip, the
subject walks toward the left, picks up an object and then walks towards the right
of the frame. Therefore, we capture data of the subject in a variety of poses and
face sizes. We acquired data on three consecutive days. Each video sequence was
made of between 100 and 300 frames. We show examples of each of these images
in Figures 3.8 and 3.9.
The Sony HDR camcorder is also mounted on a tripod set at a height of about
five and a half feet and adjusted according to the height of the subject. This
is captured simultaneously with the surveillance video, and thus consists of the
subject walking to the left, picking up an object and then walking towards the
right of the frame. The interocular distance of this dataset is about 45 pixels with
a range of about 15 pixels to 110 pixels. In Figure 3.9, we show 10 example frames
acquired from one subject from one of the clips.


Figure 3.8: Example frames from the IPELA camera for the Comparison dataset


Figure 3.9: Example frames from the Sony HDR-HC7 camcorder


This dataset contains 176 subjects. Out of the 176 subjects, 78 are acquired
indoors in a hallway on the first floor of Fitzpatrick Hall. One half of the face
is partially lit by the sun. The remaining 98 subjects are acquired outdoors
in uncontrolled lighting conditions. We separate out these datasets to compare
recognition performance when using data acquired indoors rather than outdoors.
The probe and gallery data in this dataset are acquired on the same day. This
dataset partly overlaps with the data of the Multi-Biometric Grand Challenge
(or MBGC) dataset [29], but also includes surveillance data that is not part of
the MBGC dataset.
In Table 3.2, we summarize the details of the datasets we use in this dissertation.


TABLE 3.2
SUMMARY OF DATASETS

NDSP dataset:
Gallery data source: Nikon D80
Probe data source: NDSP-installed surveillance camera
Number of subjects: 57
Frames per probe video: 50 to 150
Acquisition environment of probe data: Fitzpatrick hallway
Activity: subject enters through a glass door and walks around a corner
Time lapse between probe and gallery data: 2 weeks to 6 months

IPELA dataset:
Gallery data source: Nikon D80
Probe data source: Sony IPELA camera
Number of subjects: 104
Frames per probe video: 100 to 200
Acquisition environment of probe data: Fitzpatrick hallway
Activity: subject walks around a corner, down a hallway and out of view of the camera
Time lapse between probe and gallery data: 2 weeks to 6 months

Comparison dataset (surveillance video):
Gallery data source: Nikon D80
Probe data source: Sony IPELA camera
Number of subjects: 176
Frames per probe video: 100 to 300
Acquisition environment of probe data: indoor and outdoor
Activity: subject picks up an object and walks out of camera view
Time lapse between probe and gallery data: same day

Comparison dataset (high-definition video):
Gallery data source: Nikon D80
Probe data source: Sony HD camcorder
Number of subjects: 176
Frames per probe video: 100 to 300
Acquisition environment of probe data: indoor and outdoor
Activity: subject picks up an object and walks out of camera view
Time lapse between probe and gallery data: same day

3.3 Software
We use a variety of software for our work: FaceGen Modeller 3.2, Viisage
IdentityExplorer, Neurotechnologija, PittPatt and CSU's PCA code. They are
described in further detail below:

3.3.1 FaceGen Modeller 3.2


For each gallery image, we create a 3D model using the Nikon image as input
and then rotate the model to get different poses. In order to create the models, we
use the FaceGen Modeller 3.2 Free Version manufactured by Singular Inversions
[26]. The software is based on the work by Vetter et al. [10].
This modeler creates a 3D model using the notion of an average 3D face and
a still frontal image. It is trained on a set of subjects from various demographics
such as age, gender and ethnicity. It requires eleven points to be marked on the
face: centers of the eyes, edges of the nose, the corners of the mouth, the chin,
the point at which the jaw line touches the face visually and the points at which
the ears touch the face. Once the 3D model is rendered using the still image,
different parameters, such as gauntness of cheeks and the jaw line can be tweaked
to represent the particular subject in the 2D image more accurately. The synthetic
3D face can then be rotated to get different views of the face. A screen shot of
the software is shown in Figure 3.10.

Figure 3.10. FaceGen Modeller 3.2 Interface

3.3.2 IdentityEXPLORER
Viisage manufactures an SDK for multi-biometric technology, called IdentityEXPLORER. It provides packages for both face and fingerprint recognition. It is based on Viisage's Flexible Template Matching technology and a new set of powerful multiple biometric recognition algorithms, incorporating a unique combination of biometric tools [45]. We use it for detection and recognition:
1. Detection: It gives the centers of the eyes and the mouth, with an associated
confidence measure in the face localization, ranging from 0.0 to 100.0.
2. Recognition: It takes two images and gives a matching score between the
faces in the two images. The scores range from 0.0 to 100.0, where a higher
score implies a better match.

3.3.3 Neurotechnologija
Neurotechnology [27] manufactures an SDK for face and fingerprint biometrics.
The face recognition package is called Neurotechnologija Verilook. It includes face
detection and face recognition capability. The face detection gives the eye and
mouth locations. The software also includes recognition software, which gives the
matching score between two faces in two images.

3.3.4 PittPatt
PittPatt manufactures a face detection and recognition package [35] that we
use in our comparison experiments. The face detection component is robust to
illumination and pose changes in the data and to a variety of demographics. Along
with its detection capability it can determine the pose of the face. It is able to
capture small faces, such as faces with an inter-ocular distance of eight pixels. The
face recognition component is also robust to a variety of poses and expressions by
using statistical learning techniques. By combining face detection and tracking,
PittPatt can also be used to recognize humans across video sequences.


3.3.5 CSU's preprocessing and PCA software


In order to form a template image of the face that is found in the image, we
use CSUs preprocessing code [13]. We create images that are 65x75 pixels in
size, based on the eye locations found by Viisage, because the subject's face in the surveillance video has an average inter-ocular distance of about 40 pixels. In the
normalization stage, the images are first centered, based on eye locations, and the
mean image of the set is subtracted from each image in the set.
The CSU software also includes an implementation of Principal Component
Analysis for face recognition [44]. We use this software when using reflectance
images as input for recognition to handle illumination effects [15]. The basic PCA
algorithm is described by Turk and Pentland [44]. The process consists of two
parts, the offline training phase and the online recognition phase.
In the offline phase, the eigenspace is created. Each image is unraveled into a vector and each vector becomes a column in an M x N matrix, where N is the number of images and M is the number of pixels. Then the covariance matrix Q is defined as the outer product of this matrix with itself.
The next step is to calculate the eigenvalues and eigenvectors of the matrix Q
and then keep the k eigenvectors with the k largest eigenvalues (which correspond
to the dimensions of highest variation). This defines a k-dimensional eigenspace
into which new images can be projected. In the recognition phase, the normalized
images are projected along their eigenvectors into the k-dimensional face space
and the projected gallery image closest to a projected probe image is the best
match.
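To make the offline and online phases concrete, the following is a minimal Python/NumPy sketch of this kind of eigenface pipeline. It is not the CSU implementation: it uses an SVD of the mean-subtracted data matrix rather than an explicit covariance matrix, and plain Euclidean distance in the projected space, whereas our experiments rely on the CSU code and its distance measures.

    import numpy as np

    def train_eigenspace(images, k):
        """images: N x M array, one vectorized face per row; keep k eigenfaces."""
        X = np.asarray(images, dtype=float)
        mean_face = X.mean(axis=0)
        X = X - mean_face
        # Eigenvectors of the covariance matrix via SVD of the data matrix;
        # the rows of Vt are the eigenfaces, ordered by decreasing variance.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        return mean_face, Vt[:k]

    def project(image, mean_face, eigenfaces):
        """Project a vectorized face into the k-dimensional face space."""
        return eigenfaces @ (np.asarray(image, dtype=float) - mean_face)

    def rank_one_match(probe_vector, gallery_vectors):
        """Return the index of the projected gallery image closest to the probe."""
        d = np.linalg.norm(gallery_vectors - probe_vector, axis=1)
        return int(np.argmin(d))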


3.4 Performance metrics


In this dissertation, we use two different performance metrics to evaluate recognition performance: the rank one recognition rate and the equal error rate. They are shown graphically using cumulative match characteristic (CMC) curves and receiver operating characteristic (ROC) curves, respectively. We describe each metric in further detail below.

3.4.1 Rank one recognition rate


When an image is probed against a set of gallery images, the gallery image that has the highest matching score to that probe image is considered its rank one match. The rank one recognition rate is then defined as the fraction of probe images whose rank one match is their true match.
A CMC curve plots the change in recognition rate as the rank of acceptance is increased. The x-axis ranges from 1 through M, where M is the number of unique gallery subjects, and the y-axis ranges from 0 to 100%. In Figure 3.11, we show an example of such a curve. In this example, there are 26 subjects in the gallery set.
Assume that there are n images in the probe set and m images in the gallery set. Let p be the number of probe images for which the rank one match is its true match; then the rank one recognition rate R is defined as in Equation 3.1.

R = 100 * (p / n)                (3.1)
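As an illustration, the rank one recognition rate of Equation 3.1 can be computed from a probe-by-gallery score matrix as in the following sketch; the array layout and the label lists are assumptions made for the example, not a description of a specific SDK.

    import numpy as np

    def rank_one_rate(scores, probe_labels, gallery_labels):
        """scores[i, j]: matching score of probe i against gallery image j
        (higher is better). Returns the rank one recognition rate in percent."""
        scores = np.asarray(scores)
        best = np.argmax(scores, axis=1)   # rank one gallery match per probe
        correct = sum(gallery_labels[j] == probe_labels[i] for i, j in enumerate(best))
        return 100.0 * correct / len(probe_labels)   # Equation 3.1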

3.4.2 Equal error rate


Another metric we use is the equal error rate of the receiver operating characteristic (ROC) curve. An ROC curve plots the false accept rate against the true accept rate; at each point on the curve, the threshold for accepting a match as genuine is varied. The rate at which the false accept rate equals the false reject rate (one minus the true accept rate) is called the equal error rate. In Figure 3.12, we show an example of such a curve.

Figure 3.11. Example of CMC curve

Figure 3.12. Example of ROC curve

3.5 Conclusions
In this chapter, we discussed the sensors and datasets used in our experiments.
We also described the software we used to support our work. Finally, we closed
with a discussion about the metrics used to evaluate performance.

CHAPTER 4
A STUDY: COMPARING RECOGNITION PERFORMANCE WHEN USING
POOR QUALITY DATA

The sensor used to capture data used for face recognition can affect recognition
performance. Low quality cameras are often used for surveillance, which can
result in poor recognition because of the poor video quality and low resolution
on the face. In this chapter, we conduct two sets of experiments. The first
set of experiments demonstrates baseline performance using the NDSP dataset.
This dataset is captured indoors, where the sunlight streaming through the doors
affects the illumination of the scene. Then we show recognition experiments using
the Comparison dataset, where video data is acquired from two different sources:
a high-quality camcorder and a surveillance camera. We also capture data both
indoors and outdoors to compare performance when acquiring data in different
acquisition settings. We then compare recognition performance when using each
of these two sources of video data as our probe set and show that performance
falls drastically when we use poor quality video and when we move from indoor
to outdoor settings.
The rest of the chapter is organized as follows: First, we describe baseline
performance for the NDSP dataset in Section 4.1. Then, Section 4.2 describes
the experiments we run to compare performance and in Sections 4.2.2 and 4.3, we
describe our results and conclusions.

4.1 NDSP dataset: Baseline performance


We first define baseline performance using the NDSP dataset to show the
difficulty of this dataset. While there is significant research done in the area of
face recognition using high-quality video where the subject is looking directly at the camera, research using poor-quality data with off-angle poses is also
needed. So we define baseline performance for this dissertation to show that it is
a challenging problem.

4.1.1 Experiments
For each subject in the NDSP probe set, we compare each frame of their
probe video clip to the set of gallery images of the same subject. We describe
how we generate the multiple gallery images per subject in Section 5.1. For each
subject, we predetermine the best single probe video frame to use for that person.
We do this by picking the frame that gives us the highest matching score to
its corresponding set of gallery images. This gives us a new image set of 57
images (one image per subject), where each image represents the highest possible
matching score of that subject to the gallery images of the same subject. We use
this oracle set of probe video frames as our probe set. This is an optimistic
baseline, in that a recognition system would not necessarily be able to find the best
frame in each probe video clip. We then run recognition using this set of images
as probes and report the rank one recognition rate as our baseline performance.
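A sketch of this oracle selection is shown below. The dictionary layout and the match_score function stand in for the commercial matcher and data structures actually used; they are illustrative placeholders only.

    def oracle_probe_set(match_score, probe_clips, gallery_images):
        """For each subject, keep the single probe frame whose best score against
        that subject's own gallery images is highest.

        probe_clips[s]    : list of probe frames for subject s
        gallery_images[s] : list of gallery images (poses) for subject s
        match_score(a, b) : similarity between two face images (higher is better)
        """
        oracle = {}
        for s, frames in probe_clips.items():
            oracle[s] = max(
                frames,
                key=lambda f: max(match_score(f, g) for g in gallery_images[s]),
            )
        return oracle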

4.1.2 Results
In Figure 4.1, we show the rank one recognition rates using this set of 57
images, when the images in the gallery set correspond to an off-angle rotation of 0°, +/-6°, +/-12° or +/-18° in yaw from the frontal position. The face is also rotated to +/-6° in the pitch angle.

Figure 4.1. Baseline performance for the NDSP dataset
We see that performance steadily increases as we increase the range of poses
available in the gallery set. We determine the best frame per subject based on its
matching score to all 17 poses. This explains why performance peaks when we use
all 17 poses, since we use all 17 poses to pick the frames that make up the oracle
probe set. This shows that this is a challenging dataset, where performance is
poor even when we pick out the probe frame with the best matching score to its

52

gallery image. Secondly, we demonstrate that using a variety of poses increases


recognition performance.
We show that performance increases as we increase the number of poses, up to the 17 poses at which we stop. So the question arises as to whether or not performance would continue to increase if we were to increase the off-angle of the poses and add more images to our gallery. We did generate additional synthetic poses, but the face detection system was unable to handle poses that were greater than 18° from a frontal position. So those poses were not used by the recognition system and, even if they had been, they would not have been useful for recognition. Furthermore, if the video contained images of the subject in a strictly frontal position, the additional poses would not be useful for recognition. However, as we showed in [43], multiple images can be used to improve recognition even in instances where the subject is in a frontal pose.

4.2 Comparison dataset


For comparison, we run recognition experiments using the Comparison dataset
described in Section 3.2. This set contains high-quality still images as gallery data and two sets of probe data, one acquired on a high-quality camcorder and the
other on a surveillance camera. This dataset also contains data acquired indoors
and outdoors. This shows how the change in lighting can also affect recognition
performance.

4.2.1 Experiments
For our experiments, we use PittPatt's detector and recognition system. Once
we have detected all the faces in the probe and gallery data, we create a single gallery of all the gallery images, with minimal clustering to ensure that each image
is considered a unique subject. Then for each video sequence, we create a gallery
and cluster it so that they all correspond to the same subject. We then run
recognition of each set of videos against the gallery of high-quality still images.
We report results using rank one recognition rate and equal error rate.
Since we cluster the video frames to correspond to one subject, distances are
reported between one sequence and a gallery image. So results are reported per
video sequence, rather than per frame. Our experiments are grouped into four
categories, depending on the sensor used and the acquisition condition in which
the data is acquired.

4.2.2 Results
In this section, we describe the detection and recognition results when we run
the recognition experiments described in Section 4.2.
Detection results: In Table 4.1 we show the results of the face detection
and how many faces were detected in the video sequences. The number of faces
detected in the outdoor video is far lower than in the indoor video. We also notice that the number of faces detected drops as we move from high-quality video to outdoor surveillance video. With the high-quality video indoors, detection is about 50%, and it falls to less than 5% when we move outdoors, using a surveillance
camera. So we see that the type of camera and the acquisition condition affects
face detection performance.
In Figures 4.2 through 4.5, we show an example frame from each acquisition
and camera. We also show some of the thumbnails created after we run detection
on the surveillance and high-definition video (both indoor and outdoor video).

TABLE 4.1
COMPARISON DATASET RESULTS: DETECTIONS IN VIDEO USING PITTPATT

                                Indoor video                      Outdoor video
Performance metric              High-resolution  Surveillance     High-resolution  Surveillance
Number of subjects              78               78               98               98
Total number of frames          59,894           27,564           78,268           32,112
Number of faces detected        29,688           5,508            14,653           1,335
Percentage of faces detected    49.57%           19.98%           18.72%           4.16%

TABLE 4.2
COMPARISON DATASET RESULTS: COMPARISON OF RECOGNITION RESULTS ACROSS CAMERAS USING PITTPATT

                                Indoor video                      Outdoor video
Performance metric              High-resolution  Surveillance     High-resolution  Surveillance
Rank one recognition rate       82.05%           43.28%           68.24%           42.86%
Equal error rate                3.45%            14.48%           6.89%            16.82%

We see that even with the poor-quality video, PittPatt is able to successfully pick out
faces in the sequence.
Recognition results: In Table 4.2, we show the rank one recognition rate and
equal error rate. The number of video sequences in each category corresponds to the number of subjects given in Table 4.1. The surveillance video acquired indoors is the scenario on which we will focus for the experiments in the rest of the dissertation.
In Figures 4.6 through 4.9, we show ROC and CMC curves for each of the
experiments. For each graph, we show performance when keeping the acquisition
condition (indoor or outdoor) constant.
PittPatt's system handles detection and recognition well when using the high-quality data acquired indoors. However, the recognition rate decreases when we move from indoors to outdoors (from about 82% to 68%).


(a) Example video frame

(b) Example detections

Figure 4.2. Detections on surveillance video data acquired indoors


(a) Example video frame

(b) Example detections

Figure 4.3. Detections on surveillance video data acquired outdoors


(a) Example video frame

(b) Example detections

Figure 4.4. Detections on high-definition video data acquired indoors


(a) Example video frame

(b) Example detections

Figure 4.5. Detections on high-definition video data acquired outdoors


Figure 4.6. Results: ROC curve comparing performance when using high-definition and surveillance data (Indoor video)


Figure 4.7. Results: ROC curve comparing performance when using high-definition and surveillance data (Outdoor video)


Figure 4.8. Results: CMC curve comparing performance when using high-definition and surveillance data (Indoor video)


Figure 4.9. Results: CMC curve comparing performance when using high-definition and surveillance data (Outdoor video)


Performance also drops significantly when we move from the high-definition data to the surveillance data. We also see that the drop between indoor and outdoor recognition is smaller when we use the surveillance video; however, recognition is already poor to begin with (about 43% indoors), so there is less room for performance to fall.

4.3 Conclusions
We have shown that even using an oracle dataset, where the images used in the
probe set have the highest matching score to their gallery image, performance is
still poor. Similarly, we have also shown that there is a significant effect on recognition performance when we vary two factors in the acquisition setup. Firstly, the
lighting variations between indoors and outdoors cause a drastic drop in recognition performance. Secondly, the camera used to acquire the data also affects
performance. Using a higher-quality, higher-resolution camera results in better video, which in turn improves recognition performance. However, data
from surveillance cameras is becoming more common, so there is a need to improve
performance even when the quality of the data is poor. This drives the need to
develop strategies to handle poor quality video for face recognition.
In the rest of the dissertation, we focus on using data acquired by a surveillance
camera, indoors, where the illumination is uncontrolled. This reflects many real-world situations, where the camera used may be of poor
quality and may be pointed towards a door, where lighting can affect the images
acquired.


CHAPTER 5
HANDLING POSE VARIATION IN SURVEILLANCE DATA

In this chapter, we address the variable pose in the probe data, which can cause problems for the recognition system. Most of the work in face recognition focuses on frontal video; however, surveillance video contains many non-frontal faces. So there needs to be an emphasis on handling variable pose in
the probe data.
We use synthetic poses to enhance our gallery set. By using score-level fusion
across poses, we can improve recognition over using a single frame. We also
describe two fusion techniques to exploit temporal continuity in the probe data.
The rest of the chapter is organized as follows: Section 5.1 describes the techniques we use to handle variable pose. In Section 5.2 we describe the fusion
techniques used. Then we describe our experiments in Section 5.3 and results and
conclusions in Sections 5.4 and 5.5.

5.1 Pose handling: Enhanced gallery for multiple poses


Since we are dealing with surveillance video, the subject in view is not guaranteed to be in the frontal pose as seen in Figure 5.1. However, the gallery set
is made up of still images of the subjects in a frontal pose. Therefore, we will
be comparing non-frontal poses to frontal poses for recognition. This can cause

66

Figure 5.1. Frames showing the variable pose seen in a video clip (the
black dots mark the detected eye locations)

problems for recognition, because the initial alignment of the probe and gallery
images is a critical step in matching.
One strategy to address this is to add images of the gallery subjects taken in
a non-frontal pose. However, since we do not have such images, we create them
synthetically using 3D morphable models as in [10] and [48]. In this approach, a
PCA space is trained using a set of 3D models of faces from different demographics
(gender, ethnicity and age). Then a 2D image is aligned to the 3D face using a
set of points marked on the face. By adjusting the parameters of the PCA space,
a synthetic 3D face can be generated.
For each gallery image, we create a 3D model using the digital SLR image and
the FaceGen software described in Chapter 3. Once the model has been created,
it can be rotated to get different poses, which more accurately represent the poses
in the probe dataset. For each subject, we create a set of 17 poses using FaceGen.
We used 17 poses to represent the poses present in the probe dataset. Poses at a further angle from a frontal position could not be tolerated by Viisage; hence we stopped at 17 poses. In Figure 5.2, we show the 17 gallery images for one subject: eight with nonzero yaw angles and eight with nonzero yaw and nonzero pitch angles. These images cover a range of about +/-24°. Once we have these images, we can compare each probe to this enhanced gallery set to find the pose that best matches the pose in the probe image.

(a) Original image

(b) Eight images with zero pitch angles

(c) Eight images with non-zero pitch and non-zero yaw angles

Figure 5.2. Synthetic gallery poses


5.2 Score-level fusion for improved recognition


In [43], we show that multi-frame recognition from video outperforms single-frame recognition in video data, when using the average of all matching scores.
Therefore, we apply a similar approach in this work and exploit temporal continuity to improve performance. Since we have multiple images per subject in
the gallery (after generating synthetic images in variable poses), we can combine
matching scores across all gallery images to get more reliable results as in [43].
Similarly, we can exploit the temporal continuity between frames of surveillance
video. This property allows us to increase the confidence in the matching scores
based on the adjacent frames, since we know that the identity of the subject will
not change between successive frames.

5.2.1 Description of fusion techniques


We consider two approaches that exploit temporal continuity to improve recognition. For comparison, we also report performance when treating each frame
separately (where we do not exploit the temporal continuity between frames). We
call this approach a single-frame approach because each frame is treated as an
individual image with no relation to the preceding and succeeding frames. The
first temporal approach accumulates the ranks of the previously seen frames to
weight the current frame scores. The second temporal approach uses a cumulative
score computed from previously seen frames to make a decision about the current
frame. These approaches are described in further detail below:
1. Single frame score: We use this as a baseline approach. For this approach,
there is no fusion between frames of probe data. For each probe frame, we
obtain a set of matching scores between that frame and each of the poses for a particular subject. The matching score for that probe frame to that
particular gallery subject then is the average of all the matching scores of the
poses of that subject. If M(i, P_j(S)) is the matching score between probe frame i and the j-th pose P_j of subject S, then the matching score between probe frame i and subject S, M(i, S), is defined as in Equation 5.1:

M(i, S) = (1 / |J|) Σ_{j ∈ J} M(i, P_j(S))                (5.1)

where J is the set of poses of subject S.

We did compare this fusion technique to using the highest of the matching
scores between a frame and the poses of a given subject. However, using the
average of matching scores was more robust to noise introduced by matching
two different poses than using the highest of the matching scores. We call
this approach Single frame.
2. Rank-based average over time: Each set of frames from one subject is called
a probe set, P. For each set P, the frames are numbered starting at zero.
Each probe set has a rank matrix R where each entry is initialized to zero.
For an image I in a probe set, each match score to a particular gallery
subject is weighted by a rank array based on frames already seen. A higher
rank implies a better match. Similarly, a higher match score implies a closer
match.
For all frames seen before the current frame, we first sort the matching scores
of that frame to each of the gallery subjects. For the x-th subject in this order, we assign it a rank of N − x, where N is the number of gallery subjects. So the ranks range from 0 to N, where the highest rank corresponds to the best-matching subject and 0 to the worst-matching subject. For each gallery subject x, we increment position R(x, y), where x is the gallery subject position and y is the rank assigned. Assume we have four subjects in our gallery set and, for the current frame, the rankings of the gallery subjects are (1, 4, 3, 2). This means that subject 1 has the highest matching score to the frame while subject 2 has the lowest score. In Figure 5.3, we show the values of the rank matrix before and after we update it with these new values.

Previous rank matrix
   Rank   Subject 1   Subject 2   Subject 3   Subject 4
    1         A           E           I           M
    2         B           F           J           N
    3         C           G           K           O
    4         D           H           L           P

Current rankings of the new frame
            Subject 1   Subject 2   Subject 3   Subject 4
  Ranking       1           4           3           2

New rank matrix
   Rank   Subject 1   Subject 2   Subject 3   Subject 4
    1        A+1          E           I           M
    2         B           F           J          N+1
    3         C           G          K+1          O
    4         D           H           L           P

Figure 5.3. Change in rank matrix for a new incoming image
The current set of scores for a probe image to a gallery subject is then
weighted by the number of times that gallery subject has had a particular
rank among frames already seen. These values are found using the rank matrix. We also update the rank matrix based on the rank recognition of
the current frame and the set of gallery subjects, to be used for the next
incoming frame. In Algorithm 1, we show the details of the approach. We
call this approach Rank-based fusion.

Algorithm 1 Rank-based fusion approach
1: I = Set of probe images
2: P = Current probe set
3: S = Set of gallery subjects
4: N = Number of gallery subjects
5: R = Rank matrix, where R(i, j) is the number of probe images of P for which the rank of gallery subject i is j
6: for i ∈ P do
7:     M(i, j) = Matching score of image i with subject j
8:     for j ∈ S do
9:         for r = 0 to N do
10:            WM(i, j) = WM(i, j) + M(i, j) · R(j, r) · (r / N)
11:        end for
12:    end for
13:    for s = 0 to N do
14:        r(s) = Rank of gallery subject s for the current frame
15:        Increment R(s, r(s))
16:    end for
17: end for
18: Return WM

3. Score-based average over time: In this approach, we add up the scores of


each probe in a clip and then make a decision based on these scores instead
of each frame alone. Again, we create probe image sets, P, where each set corresponds to a probe clip. Each probe set has an array of cumulative scores for each of the N subjects in the gallery set. We use this cumulative score to determine the identity of the current frame. The details of the algorithm are shown in Algorithm 2, and a code sketch of both temporal fusion approaches follows Algorithm 2. We call this approach Score-based fusion.

Algorithm 2 Score-based fusion approach
1: I = Set of probe images
2: P = Current probe set
3: S = Set of gallery subjects
4: N = Number of gallery subjects
5: A = Score array, where A(j) is the accumulated matching score, over the frames of P seen so far, for gallery subject j ∈ S
6: for i ∈ P do
7:     M(i, j) = Matching score of image i with subject j
8:     A(j) = A(j) + M(i, j)
9: end for
10: Return A
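The following Python sketch illustrates both temporal fusion approaches on a per-clip score matrix. The score-matrix layout, the r/N rank weighting and the added offset of 1.0 (so that the first frame, which has an empty rank history, is not zeroed out) reflect our reading of Algorithms 1 and 2; they are not a verbatim transcription of the code used in our experiments.

    import numpy as np

    def rank_based_fusion(frame_scores):
        """Weight each frame's matching scores by the rank history of the frames
        already seen (a sketch of Algorithm 1).

        frame_scores[t, j] is the matching score of probe frame t against gallery
        subject j (higher is better).  Returns an array WM of the same shape with
        the temporally weighted scores.
        """
        scores_all = np.asarray(frame_scores, dtype=float)
        n = scores_all.shape[1]
        R = np.zeros((n, n))                 # R[j, r]: times subject j received rank r
        WM = np.zeros_like(scores_all)
        for t, scores in enumerate(scores_all):
            ranks = np.empty(n, dtype=int)
            ranks[np.argsort(scores)] = np.arange(n)    # highest score -> largest rank
            # weight by how often each subject was highly ranked in earlier frames;
            # the added 1.0 is an assumed offset so the first frame is not zeroed out
            rank_weight = (R * (np.arange(n) / n)).sum(axis=1)
            WM[t] = scores * (1.0 + rank_weight)
            R[np.arange(n), ranks] += 1                 # update history with this frame
        return WM

    def score_based_fusion(frame_scores):
        """Algorithm 2: accumulate per-subject scores over the frames of a clip and
        identify the clip from the accumulated array A."""
        A = np.asarray(frame_scores, dtype=float).sum(axis=0)
        return int(np.argmax(A)), A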

5.3 Experiments
In order to measure improvement, we run recognition experiments using the
synthetic poses in our gallery and also exploiting temporal continuity. We demonstrate our approach on the NDSP dataset and the IPELA dataset.
We vary two aspects of the experiment to show performance:
1. Number of poses used to represent the subject in the gallery: We vary the
number of poses used to represent each subject. We start with the original
high-quality image (frontal pose) and add a pose to the right and then to
the left with each iteration.


2. Approach to combine scores: (a) Average score: the average score across poses; (b) Rank-based fusion: using ranks across frames to improve recognition; and (c) Score-based fusion: using scores across frames to improve recognition.

5.4 Results
In this section, we show the results in face recognition performance when we
vary the number of poses used in the gallery set and when we exploit temporal
continuity. First, we discuss the results using the NDSP dataset and then the
results using the IPELA dataset.

5.4.1 NDSP dataset


In Figure 5.4, we show the change in rank one recognition rate as we add a
set of images corresponding to a higher range of poses with regards to the angle
of the face. We start out with using the original image, and with each iteration
add four poses (two each corresponding to +/- 6 additional degrees from a frontal
position).
In Figure 5.5, we show the CMC curves when using the frontal image, images with up to +/-6 degrees from a frontal position, and images with up to +/-18 degrees from a frontal position. The highest recognition rate is achieved when using 5 poses per subject (up to +/-6 degrees from a frontal position) for this dataset, with a rank one recognition rate of 7.49%.
We see that the highest recognition performance for this dataset is achieved
when we use the original image and one set of images of off-angle poses (about +/-6 degrees from a frontal position).

Figure 5.4. Results: Comparing rank one recognition rates when adding poses of increasing degrees of off-angle poses

Figure 5.5. Results: Comparing rank one recognition rates when using frontal, +/-6 degree and +/-24 degree poses

This shows that while performance can be improved by adding synthetic poses, the benefit from using such poses decreases as we move further and further away from a frontal position. Furthermore, the highest performance achieved is about 7.5%. However, as shown in Section 4.1, this is
a challenging dataset, with images acquired in uncontrolled lighting. Therefore,
recognition performance is expected to be poor. As we will discuss later, there are
other factors besides variable pose that affect recognition, since this is a dataset
acquired in uncontrolled conditions.
Exploiting temporal continuity: In Figure 5.6, we show the rank one
recognition rate when using between 1 and 17 gallery images per subject, with
all gallery images generated with morphable models. We also show performance
when using each of the three fusion techniques.
In Table 5.1, we compare the best performance achieved using each of the three
fusion techniques when using the baseline set and the test set for our experiments.
We see that we can improve performance by enhancing the gallery set by using
artificial poses generated using a 3D morphable model. This improves performance
over using a single high-quality image. Therefore, even though the morphed images are not of the same quality as the original image (as seen in Figure 5.2), they
help improve performance. Another aspect that we exploit is the temporal continuity between frames in a video sequence. By weighting the scores based on the
scores of frames seen already, we improve recognition performance over treating
each frame as a single image. This is an advantage of using a video sequence
rather than a set of still images.
Even though we cannot pick out the best frame (the oracle baseline) without using prior knowledge of the dataset, we improve performance over the baseline approach. So by incorporating temporal continuity, we can outperform the baseline performance.


Figure 5.6. Results: Comparing rank one recognition rate when using
fusion techniques to improve recognition


TABLE 5.1
RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION

Dataset        Number of   Number of      Fusion        Rank one            Number of
               subjects    probe frames   technique     recognition rate    poses
Baseline       57          57             Average       15.79%              17
Test dataset   57          3233           Average       2.62%               17
                                          Rank-based    16.49%              17
                                          Score-based   21.06%              17

Reasons for poor recognition: Overall, performance is poor (best recognition rate is about 21%). However, based on baseline performance, we see that this
is a very challenging dataset. One of the main reasons for poor performance is
illumination. We can see this in Figure 5.7 (a), where the subject's face is washed
out due to the glare of the sun. Therefore, even though the eyes are picked out
correctly, there is insufficient variation on the face for good recognition. Another
reason is the actual location of the eyes as seen in (b) of Figure 5.7. Even though
the locations are on the face of the subject, they are not precise and hence the
localization of the image is poor. A third reason for this is seen in (c) of Figure 5.7,
where the subject's face is angled down so much that it is hard for the recognition
system to match it to the right gallery image. Finally, the inter-ocular distance
is about 40 pixels, whereas Viisage requires about 60 pixels between the eyes for
robust detection and recognition. So the conditions are not ideal for recognition
using Viisage. Therefore, even when we do select the eyes correctly, performance
is poor.

5.4.2 IPELA dataset


In Table 5.2, we show results when using the IPELA dataset. We show performance when using the single high-quality still image in the gallery set and
when the gallery set is augmented with 14 poses (corresponding to 14 poses that
the detector can handle). We also show performance when exploiting temporal
continuity.
We show the corresponding ROC and CMC curves for each of these results in
Appendix B.


(a) Poor illumination

(b) Inaccurate eye location

(c) Pose not ideal for face recognition

Figure 5.7. Examples of poorly performing images


TABLE 5.2
RESULTS: COMPARISON OF RECOGNITION PERFORMANCE USING FUSION ON THE IPELA DATASET

Fusion technique     Number of poses   Rank one            Equal error
                     in gallery        recognition rate    rate
Single frame         1                 27.99%              26.10%
                     14                33.88%              26.63%
Rank-based fusion    1                 30.49%              29.47%
                     14                35.91%              29.72%
Score-based fusion   1                 31.11%              20.91%
                     14                41.74%              22.82%

Similar to the NDSP dataset, we show that recognition performance increases


as we add more poses to the gallery set, when we look at rank one recognition
rate. The equal error rate, however, is slightly worse when moving from 1 pose to 14 poses; the gain in rank one recognition rate is significant, while the increase in equal error rate is generally not. Furthermore, when we exploit temporal continuity, we can improve rank
one recognition rate to about 42% and reduce the equal error rate to about 23%
when using score-based fusion, but using rank-based fusion degrades the equal
error rate.

5.5 Conclusions
We have shown that adding synthetic poses can be beneficial for face recognition. Even when they do not retain all the details of the face (such as skin
texture), they can be used to augment the gallery set. While it would be more beneficial to have actual images of the subjects in a range of poses, in instances where those images are not available, we can create synthetic images to accomplish the same purpose. We have also shown that while adding images helps recognition performance up to a point, it can also hinder performance as the poses move further and further away from a frontal pose. So ideally, using between 2
and 5 images (within a small range of variation from a frontal pose) is useful for
recognition.
We have also shown that temporal continuity can be exploited in multiple ways,
using either rank-based fusion or score-based fusion. Again, some approaches are
more useful than others (in our experiments, we showed that score-based fusion
performs the best overall). Score-based fusion is shown to outperform rank-based fusion in other works such as [1] and [14]. When using video data as input, the
temporal continuity between frames should be exploited to improve the reliability
of the matching scores for improved face recognition.


CHAPTER 6
HANDLING VARIABLE ILLUMINATION IN SURVEILLANCE DATA

Face recognition is highly sensitive to variations in face illumination. The


gallery data is often acquired in well-lit conditions. When the probe data is
captured on a surveillance camera, it is often acquired in uncontrolled lighting.
The Foto-Fahndung report [9] discusses experiments using a surveillance system in a German railway station to evaluate performance. They conclude that it is
possible to recognize people provided the external conditions are right, especially
lighting. However, while correct rank one recognition performance is around 60%
during the day, it falls to 10-20% at night.
In this chapter, we illustrate the effect of uncontrolled lighting, when using
surveillance-type data. Our dataset is captured in a hallway where one half of
the face is illuminated more strongly than the other half. We present a method
for reflecting images across the vertical facial midline to mitigate the change in
illumination between the two halves of the face. Finally, we run recognition experiments using these preprocessed images to improve recognition performance over
that obtained from the unevenly lit faces.
The rest of the chapter is organized as follows: Section 6.1 describes the setup,
which explains why the lighting is uneven. We describe the illumination problem
and our technique for preprocessing the images in Section 6.2. Section 6.3 describes the quotient and self-quotient images to which we compare our approach. In Section 6.4 we describe the dataset used in our experiments. Finally, Sections
6.5 and 6.6 describe our results and our conclusions.

6.1 Acquisition setup


In Figure 6.1, we show the position of the surveillance camera used to capture
the probe data. This shows where the light streams in, which illustrates the
lighting problem encountered when capturing video in such a setting. We also
show an example frame at a higher resolution, where it is easier to see that the
left half of the face (right half of image) is washed out because of the glare of the
sunlight on the face. This illumination does not resemble the illumination in the
gallery images. While intensity normalization ensures the range of pixel values is the same for all images (0 to 255), the way intensity varies across the face differs between probe and gallery images. In Figure 6.2, we see this difference
between the gallery image and a few probe images. We also show an inverted
view of these images to make it easier to see the glare on the right side of the
image. Using these images as input can affect the recognition results, since the
intensity variation on the faces in the probe images does not match the change we
see in the gallery images.

6.2 Reflecting images to handle uneven illumination


In order to handle the uneven illumination between the left and right halves
of the face, we reflect all the images (both probe and gallery) across the vertical
midline of the face. We can reflect images in two directions, to the left or to the
right.
(a) Acquisition setup

(b) Example frame

Figure 6.1. Setup to acquire probe data and resulting illumination variation on the face

For the Reflected left images, we retain the left half of the image as is. We then replace every pixel on the right half of the image by a pixel from the left
half, so that the right half is a mirror image of the left half of the image. Let W
be the width of the image. If p(i, j) is the intensity value of the pixel in row i
and column j, and r(i, j) is the corresponding pixel in the Reflected left image,
then r(i, j) is determined as shown in Equation 6.1.

r(i, j) = p(i, j)          if j ≤ W/2
r(i, j) = p(i, W − j)      otherwise                (6.1)

(a) Gallery image

(b) Gallery image: inverted

(c) Example probe images

(d) Example probe images: inverted

Figure 6.2. Comparison of gallery and probe images


Figure 6.3: Reflection algorithm

For the Reflected right image, the equations used are in Equation 6.2.

r(i, j) = p(i, j)          if j ≥ W/2
r(i, j) = p(i, W − j)      otherwise                (6.2)

In Figure 6.3, we show a graphical representation of the reflection process.


Each column is either copied or retained as is, depending on whether we are
running a Reflect left or Reflect right operation and whether or not it is
located on the left half or the right half of the image.
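A minimal NumPy sketch of the two reflection operations is given below. It assumes zero-indexed columns, so the mirror of column j is column W − 1 − j, and it lets the kept half retain the middle column when the width is odd; these conventions are illustrative rather than a transcription of our preprocessing code.

    import numpy as np

    def reflect_left(face):
        """Keep the left half of the face and mirror it onto the right half
        (Equation 6.1); `face` is a 2-D gray-level array."""
        face = np.asarray(face)
        out = face.copy()
        w = face.shape[1]
        # columns w//2 .. w-1 become the mirror image of columns 0 .. w-1-w//2
        out[:, w // 2:] = np.fliplr(face[:, :w - w // 2])
        return out

    def reflect_right(face):
        """Keep the right half of the face and mirror it onto the left half
        (Equation 6.2)."""
        return np.fliplr(reflect_left(np.fliplr(np.asarray(face))))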
We now have three datasets that are used in our experiments. Each of them
is described in detail below:
1. Original dataset: This is our baseline probe set. For training and testing,
we use the original images in this dataset. The images are cropped and
aligned based on eye locations. The face images are cropped to the same size (130 x 150 pixels) and then normalized based on gray-value intensity to
have a dynamic range of 0 to 255.
2. Reflect left dataset: For the left reflected images, the cropped and aligned
images are reflected left over right. We retain the left side of the image as is
and reflect those pixels across the vertical mid-axis so that the right half of
the image is a mirror image of the left half, as defined by Equation 6.1. The
reflected left images are used both for training and recognition for this set.
3. Reflect right dataset: In this set, the cropped and aligned images are
reflected right over left. Similar to the Reflect left dataset, we reflect the
right half of the image over the left and retain the right half of the image as
is, using Equation 6.2. In the Reflect right dataset, the images reflected
right over left are used to train the face space and used for recognition.
In Figure 6.4, we show the resulting images. The first column is the original
image. The middle column is the corresponding image from the Reflect left
dataset and the third column contains the corresponding image from the Reflect
right dataset. We also inverted the images so that it is easier to see the lighting
changes. The gallery images are illuminated with a frontal light source, so the
left and right halves of the face are illuminated equally. However, in the original
probe images, the left side of the face is illuminated differently from the right
side. This illumination is evened out when the images are reflected left over right
or when the images are reflected right over left. However, when we compare the
third column of images in Figure 6.4 to the image in Row 1 of Figure 6.4, the
light and dark parts of the image are inverted, since the light source is not frontal
but rather coming in from the side of the face. This can cause problems for a
recognition system. We also notice a bright line down the midline (corresponding to the dark region in the original image). This is because of the shadow formed down the
middle of the face due to the direction of the lighting.
We further illustrate this effect by looking at the average illumination of each
column in the images. We take the average of each column for each image and
then take the average over all the images. We separate out the probe and the
gallery images. We display these values in a graph in Figure 6.5, and see that the
gallery images are more evenly lit from left to right (though the two halves are not
identical). The average intensity of each column of the gallery image differs from
the probe images. When we reflect the images left over right, the average intensity
per column of the probe images more closely resemble the gallery images, because
the light sources appears to be coming from the same direction, with respect to the
left half of the image. However, when the images are reflected right over left, the
average intensities of the probe and gallery images are also vastly different. This is
because the gallery images are illuminated from the front, while the probe images
are illuminated from the right side, so the light source appears to be coming from
different directions.

6.2.1 Averaging images


Another preprocessing technique is to take the average of the pixels from the
left and right halves of the images. For each pixel in the image, we average
the corresponding pixels from the left and right of the image as shown in Equation
6.3.


(a) Gallery images

(b) Example 1

(c) Example 2

(d) Example 3

(e) Example 4

(f) Example 5

(g) Example 6

(h) Example 7

(i) Example 8

Figure 6.4. Example images: original image, reflected left and reflected right

(a) Original images

(b) Images reflected left over right

(c) Images reflected right over left

Figure 6.5. Average intensity of each column



Figure 6.6: Reflection algorithm

r(i, j) = (p(i, j) + p(i, W − j)) / 2
r(i, W − j) = (p(i, j) + p(i, W − j)) / 2                (6.3)

We show a figure of this preprocessing in Figure 6.6. In these images, the left
half of the image is a mirror image of the right half, reflected across the vertical
midline of the image.
In Figure 6.7, we show the resulting images when we create new images by
taking the average of the left and right pixels.
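The averaging operation of Equation 6.3 reduces to adding the image to its left-right flipped copy, as in this short sketch (a minimal illustration, not the production code):

    import numpy as np

    def average_halves(face):
        """Replace each pixel by the mean of itself and its mirror image across
        the vertical midline (Equation 6.3); the output is left-right symmetric."""
        face = np.asarray(face, dtype=float)
        return (face + np.fliplr(face)) / 2.0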

6.3 Comparison approaches


In this section, we describe two techniques, namely quotient images by Gross
and Brajovic [15] and self-quotient images by Wang et al. [46].
Gross and Brajovic [15] use an illuminance-reflectance model to generate images that are robust to illumination changes as described in Chapter 2. Since
the images are preprocessed based on the intensity, there is no training required.


(a) Gallery images

(b) Example 1

(c) Example 2

(d) Example 3

(e) Example 4

(f) Example 5

(g) Example 6

(h) Example 7

(i) Example 8

Figure 6.7. Example images: original image and averaged image



Wang et al. [46] used self-quotient images to handle the illumination variation
for face recognition. The Lambertian model of an image can be separated into
two parts, the intrinsic and extrinsic part. If we can estimate the extrinsic factor
based on the lighting, it can be factored out of the image, leaving the intrinsic part, which can then be used for face recognition. The self-quotient image is found by smoothing the image with a kernel and dividing the original pixels by the smoothed image. Let F be the smoothing filter and I the original image; then the self-quotient image Q is defined as Q = I / F[I]. We create both quotient images and self-quotient images and then use them as input for PCA to create a face space based on these images and then recognize the probe images. In Figure 6.8, we show the images shown in Figure 6.4 after they are processed to retrieve the quotient image and the self-quotient image.
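For reference, a minimal self-quotient sketch is shown below. It uses an isotropic Gaussian filter from SciPy as the smoothing kernel F, which is a simplification: Wang et al. use a weighted (edge-preserving) filter, and the sigma and epsilon values here are illustrative assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def self_quotient(face, sigma=3.0, eps=1e-3):
        """Self-quotient image: divide the face by a smoothed version of itself so
        that slowly varying illumination is factored out (Q = I / F[I])."""
        face = np.asarray(face, dtype=float)
        smoothed = gaussian_filter(face, sigma)
        return face / (smoothed + eps)   # eps guards against division by zero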

6.4 Experiments
In this section, we describe the dataset used for our experiments and also the
experiments we conduct.

6.4.1 Test dataset


We use the IPELA dataset for this experiment. To isolate the effect of illumination from that of pose, we restrict the probe set to frontal frames. For our gallery, we use one image per subject
in our gallery set, giving us a set of 104 images. For our probe set, we visually
inspect the faces in the video and choose the frames where the subject is seen in
a frontal view. This dataset contains 802 probe images. The probe and gallery
datasets are separated in time from about a week to about six months.


(a) Gallery images

(b) Example 1

(c) Example 2

(d) Example 3

(e) Example 4

(f) Example 5

(g) Example 6

(h) Example 7

(i) Example 8

Figure 6.8. Example images: original image and quotient image



6.4.2 Face detection


We use Neurotechnologija's [27] face detection software to detect faces in both
the probe and gallery data. The software returns the eye centers and the mouth
locations. All the locations are inspected to ensure that they are correctly marked.
These eye locations are then used to crop the faces to be used in the recognition
experiments.

6.4.3 Experiments
In order to test our approach when using the surveillance data, we use four-fold
cross validation. We split up the probe dataset into four subject-disjoint subsets,
where each set contains 26 subjects.
We use CSU's PCA code [13] for our matching experiments. We train the
space on three subsets of probe-type data and all the gallery images and use
the remaining dataset for testing. Three eigenvectors corresponding to the three
largest eigenvalues are dropped. We use covariance distance as a matching score
between two images. Finally, since we may not know in advance which side of the face is more reliably illuminated, we also use the average of the Reflect left and Reflect
right scores for recognition. This gives us three recognition scores:
1. Reflect left results: Performance when using the images reflected left over
right.
2. Reflect right results: Performance when using the images reflected right
over left.
3. Average score results: Performance when using score-level fusion to combine
scores for Reflect left and Reflect right scores for each probe frame.


We then use the average of the four scores (one for each testing set) as our
measure of performance of the different approaches. We also tried using the minimum of matching scores; however, we achieve higher performance when we use the
average of matching scores. We report rank one recognition rate and equal error
rate.

6.5 Results
In Table 6.1, we show the rank one recognition rate and equal error rate of
recognition when using each of the three sets of images, as described in Section
6.2. We also compare our performance to using the approaches in [15] and [46].
Finally, we compare recognition performance when treating each frame separately
and when we exploit temporal continuity between the frames.
In the single-frame approach, we see that the highest rank one recognition rate (49.88%) is achieved when using the images reflected left over right, and the lowest (22.48%) when using the images reflected right over left. When we look at equal error rate, using score fusion to combine the Reflect left and Reflect right scores gives the best rate at 18.80%. We also see that with image-level averaging, performance is similar to using the original images. Thus, score-level fusion performs better than the image-averaging approach.
The poor performance of the Reflect right images can be explained by the fact that the illumination on the face is not similar to the illumination seen in the gallery images. This shows that uneven illumination on the face does affect recognition performance. But if we can preprocess the probe images to have illumination similar to that of the gallery set, we can improve recognition.

TABLE 6.1

RESULTS: COMPARING RESULTS FOR DIFFERENT ILLUMINATION TECHNIQUES

                                      Single frame                        Exploiting temporal continuity
Approach                              Rank one          Equal             Rank one          Equal
                                      recognition rate  error rate        recognition rate  error rate

Original images                       38.62%            19.45%            39.67%            15.66%
Quotient images                       32.51%            27.11%            40.82%            18.32%
Self-quotient images                  21.90%            32.75%            27.27%            24.61%
Reflect left images                   49.88%            18.27%            54.03%            14.13%
Reflect right images                  22.48%            30.65%            21.65%            28.60%
Average images                        38.63%            22.52%            41.74%            16.38%
Average of left and right scores      44.24%            18.80%            49.91%            13.83%

Furthermore, we can use score-level fusion of the left and right reflected images to improve recognition over using the original images. Fusion may not always give the highest performance overall, but it consistently improves on the original images.
Using the quotient images performs better than the self-quotient images, but in the single-frame approach neither improves over using the original images. These techniques assume a single light source and do not model the uncontrolled lighting seen in this video, so they fail to capture the lighting effects in these images.
Furthermore, we improve recognition when we exploit temporal continuity. The rank one recognition rate rises to about 54% when using the Reflect left images and to about 50% when we use the average of the left and right scores. The equal error rate also improves, to 13.83%, when using the average of the left and right scores. The improvement for the quotient images when moving from the single-frame approach to score-level fusion across frames is large; when exploiting temporal continuity, the quotient images show a slight rank one recognition improvement over the original images.

6.6 Conclusions
We have shown, both visually and through experimentation, that the change in illumination results in poor recognition. By creating reflected images, in which the illumination of the probe images resembles that of the gallery images, we can improve recognition performance. In cases where the direction of the light source is unknown, we can reflect images both ways and use score-level fusion to improve recognition over using the original images. We have also shown that on such a dataset, techniques such as the self-quotient and quotient image may hurt performance, as they do not generalize to uncontrolled lighting.


We show that illumination plays a significant role in face recognition performance and is an important reason for the poor recognition rates observed on surveillance data. More strategies need to be developed to handle such video, where the light sources are uncontrolled.


CHAPTER 7
OTHER EXPERIMENTS

In this chapter, we discuss some other issues that we explored for this dissertation. First, we describe a technique to evaluate the detections found by the face detector, using background subtraction and gestalt clustering. Then we discuss how the distance metric used and the number of eigenvectors dropped affect recognition performance.

7.1 Face detection evaluation


The first step in a face recognition system is to run all the frames through
the face detector. This gives us the x and y coordinates of the features used in this work, namely the centers of the eyes and the mouth. However, some of these
detections are flawed because either the incoming frame may not contain a face
or the detector may miss the face completely. In Figure 7.1, we show four such
detections, including three cases of false detections: (1) there is no subject in view;
(2) there is a subject in view but the detector misses it; (3) the detector finds the
body of the subject.
Riopka and Boult [38] show the importance of accurate eye locations in face
recognition. They show the change in performance when the locations of the
eyes (as found by FaceIt) are perturbed in 17 different ways. They also look at


Figure 7.1. Example eye detections: (a) area on the wall, no subject; (b) area on the wall, with subject; (c) subject's body; (d) correct eye locations

"weathered" images, i.e. images taken under conditions such as precipitation, snow and fog, and show that poor eye localization clearly degrades face recognition performance. This is a significant source of error in surveillance video face detection and recognition. Hence, we further reduce the images used as probes by evaluating the accuracy of the face detections. We describe a two-level approach to prune the probe image set to those images in which the face is correctly picked out. The two levels are described in detail below.

7.1.1 Background subtraction


Once we have the eye and mouth centers from Viisage's face detector, the next step is to choose accurate face detections using background subtraction. Since we are using a stationary camera to capture the data, we can easily acquire the background of the scene. Therefore, using the difference between a known background frame and the current frame, we can determine whether or not there is a subject in view and, if so, where the subject is located in the image. Using this information, we can eliminate those detections that fall on the background.
First, we subtract each probe frame from a background frame to create a difference image. Then we threshold the difference image, such that any pixel whose difference value is below 30 is set to 0 and any other pixel is set to 1. The threshold of 30 is chosen empirically. We then erode the image twice and dilate the resulting image twice to remove stray shadows, using the structuring element shown in Figure 7.2. We call the result Di, where i is the frame number within the sequence.
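
A minimal sketch of this difference-image step is shown below. The 3x3 cross used as the structuring element is an assumption standing in for the element of Figure 7.2, which is not reproduced here.

# Sketch of the difference-image step: threshold the absolute difference at 30,
# then erode twice and dilate twice to remove stray shadows.
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

STRUCT = np.array([[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]], dtype=bool)   # stand-in for the element of Figure 7.2

def difference_mask(frame, background, thresh=30):
    """Return D_i: 1 where the frame differs from the background, 0 elsewhere."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    mask = diff >= thresh                       # pixels differing by less than 30 are background
    mask = binary_erosion(mask, structure=STRUCT, iterations=2)
    mask = binary_dilation(mask, structure=STRUCT, iterations=2)
    return mask.astype(np.uint8)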
Viisage's face detector generates a feature vector of coordinates for the eye and mouth centers of the face found in the image. Let F(I) be the feature vector of coordinates of the I-th frame in a video clip. If Lx, Ly, Rx and Ry are the x and y coordinates of the left and right eye locations found by the face detector, respectively, and Mx and My the coordinates of the mouth center, then F(I) is defined in Equation 7.1.

F(I) = (Lx, Ly, Rx, Ry, Mx, My)                                        (7.1)

Figure 7.2. Structuring element used for erosion and dilation

We find the difference image Di using the corresponding frame and a background frame and threshold the image as described above. Using Di and F(I), we determine whether all three locations (left eye, right eye and mouth center) lie on pixels marked as 1. If so, the detection is located on the body of the subject and not on the background of the image.
The details of this step are laid out in Algorithm 3. At this stage, (a) and (b)
in Figure 7.1 are removed from the dataset. From the original dataset of about
25,000 frames of probe video, we reduce the dataset down to 4373 images.


Algorithm 3 Merging background subtraction and Viisage's face detector

I = current frame
O = background frame
F(I) = (Lx, Ly, Rx, Ry, Mx, My)
D = array holding the current image after background subtraction, erosion and dilation
if |O(x, y) - I(x, y)| < 30 then
    D(x, y) = 0   {same as background image}
else
    D(x, y) = 1   {different from background image}
end if
if D(Lx, Ly) = 0 then
    return Not on subject
end if
if D(Rx, Ry) = 0 then
    return Not on subject
end if
if D(Mx, My) = 0 then
    return Not on subject
end if
return Is on subject
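
The test in Algorithm 3 amounts to the following check, sketched here with an assumed (x, y) coordinate convention for the detector output and row/column indexing for the mask.

# Sketch of the check in Algorithm 3: a detection is kept only if the left eye,
# right eye and mouth centers all fall on foreground pixels of D.
def detection_on_subject(D, feature_vector):
    """feature_vector = (Lx, Ly, Rx, Ry, Mx, My); D indexed as D[y, x]."""
    Lx, Ly, Rx, Ry, Mx, My = feature_vector
    for x, y in ((Lx, Ly), (Rx, Ry), (Mx, My)):
        if D[y, x] == 0:          # point lies on the background
            return False          # "Not on subject"
    return True                   # "Is on subject"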


7.1.2 Approach to pick good frames: Gestalt clusters


Zahn [49] describes an approach to find Gestalt clusters using graph-theoretic techniques. The goal is to find a clustering that "does not depend on random choices in the detection algorithm and is not sensitive to the order in which points of S are scrutinized" (here, S is the set of points to be clustered [49]). Since the goal is to find a clustering that favors nearest neighbors, he uses minimal spanning trees to form clusters. A spanning tree of a graph G is a tree composed of all vertices of G, and a minimal spanning tree of G is a spanning tree whose weight is minimum among all spanning trees of G [49]. Once the minimal spanning tree is created, subtrees are pruned if they are connected to the main subtree by an edge with a weight larger than a threshold value T. Each resulting subtree corresponds to a cluster.
The smaller the distance between two nodes, the more likely they are to belong
to the same cluster. Thus, the likelihood that the two points are clustered together
does not depend on their individual distances to a third point (e.g. the centroid of a cluster), but rather on their distance from each other.
Our adaptation of the approach: In Figure 7.3, we show the points corresponding to the manually verified left eye locations and the left eye points found by the face detector. All the eye locations are marked on a single frame to show their location in the scene relative to the eye locations in the other frames. We see that the correct eye locations form a curved line rather than a circular cluster of points. A technique such as k-means clustering would fail here because it looks for compact clusters based on a point's distance from a centroid. The incorrect locations tend to form clusters separated from this curve (outlined with an oval). Hence, we should be able to pick out the continuous curved line as the correct


Figure 7.3. Example subject: Ground truth and Viisage locations

eye locations, separate from the incorrect eye locations (enclosed in the oval). A
technique such as a minimal spanning tree (MST), as described above, will select the continuous line, since the inter-node distance between a node and its nearest neighbor decides whether or not two points are in the same cluster.
In our approach, we use the coordinates of the detected eye locations and the sequence number of the respective frame within a subject clip as the features of our dataset. If Lx, Ly, Rx and Ry are the x and y coordinates of the left and right eye locations, respectively, and t is the time stamp for that image within the video clip of one subject, then the feature vector V for a frame I is defined in Equation 7.2.

V(I) = (Lx, Ly, Rx, Ry, t)                                             (7.2)

We treat each subject probe set separately. A subject probe set is one clip that contains the same subject in all the frames. We create a graph from each probe set; each frame's feature vector corresponds to one node in the graph. The distance between every pair of nodes (eye locations and time stamp) is the Euclidean distance between them in five-dimensional space, where the five dimensions are defined in Equation 7.2.
We use an implementation [39] of Prim's algorithm [20] to find the MST in the graph. The running time of Prim's algorithm is O(|V|^2 + |E|), where |V| is the number of nodes in the graph and |E| is the number of edges. We modify the code slightly to work with our application.
We run this algorithm on each of the 57 subject probe sets. We set the threshold T to 40 pixels; this determines whether or not two nodes are in the same cluster. Our initial node is the rightmost point picked out as the eyes from the set of frames for one subject. At this point in the video clip, the subject is closest to the frontal position, so the likelihood of the face detector picking out the face correctly is high. After this step, we reduce the set of 4373 probe images to a set of 3233 images. This corresponds to removing images such as (c) in Figure 7.1.
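
The following sketch illustrates this clustering step for one subject clip. It uses SciPy's minimum-spanning-tree routine rather than the Prim implementation [39] used in our experiments, so it should be read as an illustration of the idea rather than the exact code.

# Sketch of the gestalt-clustering step: build the 5-D feature vectors of
# Equation 7.2, find a minimum spanning tree, cut every edge heavier than the
# 40-pixel threshold, and keep the cluster containing a seed frame believed to
# be a good detection.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def good_frames(features, seed_index, threshold=40.0):
    """features: (num_frames, 5) array of (Lx, Ly, Rx, Ry, t) vectors."""
    dists = squareform(pdist(features))                  # complete graph, Euclidean weights
    mst = minimum_spanning_tree(dists).toarray()
    mst[mst > threshold] = 0.0                           # prune heavy edges
    # Components of the pruned MST are the gestalt clusters.
    n_comp, labels = connected_components(mst, directed=False)
    return np.flatnonzero(labels == labels[seed_index])  # frames in the seed's cluster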


7.1.3 Results: Comparing performance on the entire dataset and datasets pruned using background subtraction and gestalt clustering
In Table 7.1, we show how the dataset size changes as we prune it using background subtraction and gestalt clustering. We also show recognition results when we use the high-quality still images in our gallery set. The results show that background subtraction and gestalt clustering do help prune the dataset to those images in which the face was correctly detected. We do not need to use the entire dataset in order to achieve optimal performance, and false detections can actually hurt performance overall. Furthermore, the time taken to probe the gallery decreases since we are using a smaller set of images. While the improvement does not seem large, it becomes more evident when we use temporal continuity to determine the true matches; this is discussed further in the next section. In addition, we are unable to approach the peak of the baseline performance when we do not use predetermined information about the dataset, which shows that this is a difficult dataset. The reason is that even though the face is correctly picked out in these frames, other factors, such as the face pose and the illumination in the frames, contribute to the poor performance. However, we show that we can improve performance when we incorporate temporal continuity.
In Figure 7.4, we show rank one recognition rates when using the entire dataset. In Figure 7.5, we show recognition performance when using background subtraction to eliminate bad detections found on the background. Finally, in Figure 7.6, we show performance when eliminating bad detections using both background subtraction and gestalt clustering.


TABLE 7.1

COMPARING DATASET SIZE OF ALL IMAGES TO BACKGROUND SUBTRACTION APPROACH AND GESTALT CLUSTERING APPROACH

Dataset                  Size of dataset   Fusion technique   Rank one recognition rate   Number of poses
Baseline                 57                Average            15.79%                      17
Original dataset         25962             Average            2.62%
                                           Rank-based         3.07%
                                           Score-based        2.3%
Background subtraction   4373              Average            6.63%
                                           Rank-based         10.63%
                                           Score-based        12.99%
Gestalt clustering       3233              Average            7.70%
                                           Score-based        16.49%
                                           Score-based        21.06%

Figure 7.4. Results: Rank one recognition rates when using the entire
dataset


Figure 7.5. Results: Rank one recognition rates when using the dataset
after background subtraction


Figure 7.6. Results: Rank one recognition rates when using the dataset
after background subtraction and gestalt clustering


7.2 Distance metrics and number of eigenvectors dropped


In our experiments, we use the covariance distance metric and drop the eigenvectors corresponding to the three largest eigenvalues. However, there are many distance measures that can be used, and the number of eigenvectors dropped can be varied.
Among the metrics that can be used to measure distance are Euclidean distance, covariance, correlation and MahCosine. Beveridge et al. [7] examined
the effect of different distance measures on performance and compared the best
variant of PCA with the best variant of LDA to see whether there was a statistically significant difference in recognition rates. They used the FERET dataset
for their experiments. They reported that when using PCA for recognition, the
Mahalanobis distance performed better than other measures.
We noticed that in our experiments Euclidean distance performs the worst, while covariance and correlation distance perform about the same. In the experiments below, we compare two of the measures, namely Mahalanobis cosine (MahCosine) distance and covariance [13]. MahCosine is the cosine of the angle between two feature vectors in the space, and covariance is the dot product of the vectors after they have been normalized to unit length. For evaluation, we examine the rank one recognition rate and equal error rate.
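
As an illustration, the two measures can be computed on PCA coefficient vectors as sketched below. We negate the similarities so that smaller values mean better matches; the sign convention and the eigenvalue scaling shown here are our assumptions about a typical implementation, not a transcription of the CSU code [13].

# Sketch of the two distance measures on PCA coefficient vectors u and v.
import numpy as np

def covariance_distance(u, v):
    u = u / np.linalg.norm(u)                 # unit-length normalization
    v = v / np.linalg.norm(v)
    return -np.dot(u, v)                      # negated dot product (smaller = better)

def mahcosine_distance(u, v, eigvals):
    u = u / np.sqrt(eigvals)                  # scale each coefficient by its eigenvalue
    v = v / np.sqrt(eigvals)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return -cos                               # negated cosine of the angle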
Another parameter that can be varied is the number of dropped eigenvectors corresponding to the largest eigenvalues. Since there is illumination variation in the data, the first few eigenvectors may correspond mainly to illumination variation. In [11], Chawla and Bowyer show that dropping as many as eight vectors from the front can improve recognition accuracy. We drop 0, 1, 3 and 5 vectors corresponding to the largest eigenvalues, which is a good range to show how the number of vectors dropped affects performance; after dropping more than five vectors, performance begins to decrease.
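
Dropping the leading eigenvectors amounts to discarding the first few columns of the projection basis, as in the following sketch (array names are illustrative).

# Sketch of dropping the leading eigenvectors before projection.
import numpy as np

def project(images, mean_face, eigvecs, n_drop=3):
    """images: (N, D) rows of vectorized faces; eigvecs: (D, K) columns sorted by
    decreasing eigenvalue. Returns (N, K - n_drop) coefficient vectors."""
    centered = images - mean_face
    return centered @ eigvecs[:, n_drop:]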

7.2.1 Experiments
We use four-fold cross validation to test each combination. We report results when using the MahCosine and covariance distance metrics and when varying the number of eigenvectors dropped from the front. We also show results when we exploit temporal continuity on this dataset.

7.2.2 Results
In Table 7.2, we show performance when we vary the distance metric used and
the number of eigenvectors dropped.
We see that on this dataset the covariance distance metric performs better than the MahCosine distance. We also see that when we drop eigenvectors corresponding to the largest eigenvalues, performance increases. Hence, in our experiments we use covariance distance and drop three eigenvectors when training our recognition model.


TABLE 7.2

RESULTS: PERFORMANCE WHEN VARYING DISTANCE METRICS AND NUMBER OF EIGENVECTORS DROPPED

                                             Single frame                        Exploiting temporal continuity
Distance      Number of                      Rank one          Equal             Rank one          Equal
metric        eigenvectors dropped           recognition rate  error rate        recognition rate  error rate

MahCosine     0 vectors                      16.98             34.95             28.29             25.88
distance      1 vector                       17.12             34.95             28.17             25.77
              3 vectors                      16.98             35.3              27.15             26.44
              5 vectors                      16.46             35.31             27.08             26.63

Covariance    0 vectors                      17.41             30.18             21.24             26.36
distance      1 vector                       26.6              26.7              33.05             21.83
              3 vectors                      27.99             26.1              31.11             20.91
              5 vectors                      28.07             25.49             33.84             19.34

(All values are percentages.)

CHAPTER 8
CONCLUSIONS

We have shown in this dissertation that even though face recognition performance on surveillance-quality data is poor, there are strategies we can use to improve recognition. We list our contributions below:
1. Comparison: We compare recognition performance when using two different
sensors, a high-definition camcorder and a surveillance camera. We also
compare data acquired in two different environments. We show that using
surveillance data for recognition causes recognition performance to drop.
Also, when we move from indoors to outdoors, recognition performance is
hurt.
2. Handling pose variation in surveillance data: We use synthetic poses to enhance our gallery set and then use score-level fusion to combine scores across poses to improve recognition over a single-frame approach. While the multiple poses appear to account for only a small improvement in recognition in this case, we have to consider the other problems in the dataset, such as illumination changes and low resolution. We also exploit the temporal continuity of the video data to further improve recognition.


3. Handling illumination variation in surveillance data: We create reflected images and use score-level fusion to make face recognition performance more robust to varying illumination. We also compare our approach to self-quotient images and quotient images and show that we can outperform these approaches on this dataset.
4. Face detection evaluation: We devise a strategy using background subtraction and gestalt clustering to prune a set of face detections to the set of true detections for improved recognition.
There is still considerable room for improvement in the area of face recognition from surveillance-quality video. We need more systems that are robust to lighting variation, which has a large effect on face recognition. There also needs to be more exploration of other aspects of real-world face recognition, such as techniques to deal with obstructions and with low resolution on the face.


APPENDIX A
GLOSSARY

Definitions of terms:

Cumulative Match Characteristic curve: A graph that shows the correct recognition rate versus rank

False Accept Rate: The ratio of the false positives (see below) to the total number of negative examples

False Negative (fn): Number of outcomes predicted incorrectly to be negative over all positive outcomes in the set

False Positive (fp): Number of outcomes predicted incorrectly to be positive over all negative outcomes in the set

False Reject Rate: The ratio of the false negatives to the total number of positive examples

Identification: One-to-many comparative process of a biometric sample, or a code derived from it, against all of the known biometric reference templates on file

Receiver Operating Characteristic curve: A graph that shows how the verification rate (true positive rate) changes as the false accept rate increases

Verification: The process of comparing a submitted biometric sample against a single biometric reference of a single enrollee whose identity or role is being claimed

Signature: An image that represents a subject


True Negative (tn): Number of outcomes predicted correctly to be negative over all negative outcomes in the set

True Positive (tp): Number of outcomes predicted correctly to be positive over all positive outcomes in the set

Verification Rate: The ratio of the true positives to the total number of positive examples

Equal error rate: The rate at which the False Accept Rate equals the False Reject Rate

Rank one recognition rate: The percentage of probe images for which the correct gallery image has the lowest (best) matching score among all gallery images
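
As an illustration of the last two definitions, the sketch below computes both quantities from match scores, assuming that smaller scores indicate better matches.

# Sketch: rank one recognition rate from a probe-by-gallery score matrix, and
# equal error rate from genuine and impostor score sets.
import numpy as np

def rank_one_rate(scores, true_ids):
    """scores: (num_probes, num_gallery); true_ids: gallery index of each probe."""
    return np.mean(np.argmin(scores, axis=1) == true_ids)

def equal_error_rate(genuine, impostor):
    """genuine, impostor: 1-D arrays of match scores (distances)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([np.mean(impostor <= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(genuine > t) for t in thresholds])    # false rejects
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0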


APPENDIX B
POSE RESULTS

Figure B.1. CMC curves: Comparing fusion techniques using a single frame (panels: Set 1 to Set 4)

Figure B.2. ROC curves: Comparing fusion techniques using a single frame (panels: Set 1 to Set 4)

Figure B.3. CMC curves: Comparing approaches exploiting temporal continuity, using rank-based fusion (panels: Set 1 to Set 4)

Figure B.4. ROC curves: Comparing fusion techniques exploiting temporal continuity, using rank-based fusion (panels: Set 1 to Set 4)

Figure B.5. CMC curves: Comparing fusion techniques exploiting temporal continuity, using score-based fusion (panels: Set 1 to Set 4)

Figure B.6. ROC curves: Comparing fusion techniques exploiting temporal continuity, using score-based fusion (panels: Set 1 to Set 4)

APPENDIX C
ILLUMINATION RESULTS

Figure C.1. CMC curves: Comparing illumination approaches using a single frame (panels: Set 1 to Set 4)

Figure C.2. ROC curves: Comparing illumination approaches using a single frame (panels: Set 1 to Set 4)

Figure C.3. CMC curves: Comparing illumination approaches exploiting temporal continuity (panels: Set 1 to Set 4)

Figure C.4. ROC curves: Comparing illumination approaches exploiting temporal continuity (panels: Set 1 to Set 4)

BIBLIOGRAPHY

1. A. Abaza and A. Ross. Quality based rank-level fusion in multibiometric systems. In BTAS '09: Proceedings of the 3rd IEEE International Conference on Biometrics: Theory, Applications and Systems, pages 459-464. IEEE Press, 2009.

2. Y. Adini, Y. Moses, and S. Ullman. Face recognition: The problem of compensating for changes in illumination direction. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume XIX:7, pages 721-732, 1997.

3. O. Arandjelovic and R. Cipolla. Face recognition from face motion manifolds using robust kernel resistor-average distance. In Computer Vision and Pattern Recognition, volume V, page 88, 2004.

4. O. Arandjelovic and R. Cipolla. Face Recognition, chapter Achieving illumination invariance using image filters. I-Tech Education and Publishing, 2007. ISBN 978-3-902613-03-5.

5. O. Arandjelovic and R. Cipolla. An illumination invariant face recognition system for access control using video. In Proceedings of the British Machine Vision Conference, pages 537-546, 2004.

6. P. N. Belhumeur, J. Hespanha, and D. J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume XIX:7, pages 711-720, 1997.

7. R. Beveridge, K. She, and B. Draper. A nonparametric statistical comparison of Principal Component and Linear Discriminant subspaces for face recognition. In Computer Vision and Pattern Recognition, pages I:535-542, 2001.

8. D. J. Beymer. Face recognition under varying pose. Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA, 1993.

9. BKA. Final Report Foto-Fahndung, 2006. URL http://www.bka.de.

10. V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume XXV:9, pages 1063-1074, 2003.

11. N. V. Chawla and K. W. Bowyer. Random subspaces and subsampling for 2-D face recognition. In CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05) - Volume 2, pages 582-589, Washington, DC, USA, 2005. IEEE Computer Society.

12. M. P. Department. MPD surveillance cameras. http://www.notbored.org/dcpolice.html, 2002.

13. C. S. U. Department of Computer Science. Evaluation of face recognition algorithm. http://www.cs.colostate.edu/evalfacerec, 2003.

14. D. Frank Hsu and I. Taksa. Information Retrieval, 8(3):449-480, 2005.

15. R. Gross and V. Brajovic. An image preprocessing algorithm for illumination invariant face recognition. In 4th International Conference on Audio- and Video-Based Biometric Person Authentication. Springer, June 2003.

16. R. Gross and J. Shi. The CMU motion of body (MoBo) database, 2001. URL citeseer.ist.psu.edu/article/gross01cmu.html.

17. P. Hallinan. A low-dimensional representation of human faces for arbitrary lighting conditions. In Computer Vision and Pattern Recognition, pages 995-999, 1994.

18. P. S. Hiremath and C. J. Prabhakar. Symbolic factorial discriminant analysis for illumination invariant face recognition. In International Journal of Pattern Recognition and Artificial Intelligence, volume XXII, pages 371-387, 2008.

19. J. Howell and H. Buxton. Towards unconstrained face recognition from image sequences. In Automatic Face and Gesture Recognition, pages 224-229, 1996.

20. A. Kershenbaum and R. V. Slyke. Computing minimum spanning trees efficiently. In Proceedings of the ACM Annual Conference, pages 518-527, New York, NY, USA, 1972. ACM. doi: http://doi.acm.org/10.1145/800193.569966.

21. V. Krüger and S. Zhou. Exemplar-based face recognition from video. In Face and Gesture Recognition, pages 182-187, 2002.

22. S. W. Lee, J. Park, and S. W. Lee. Low resolution face recognition based on support vector data description. In Pattern Recognition, volume XXXIX:9, pages 1809-1812, 2006.

23. F. Lin, C. Fookes, V. Chandran, and S. Sridharan. Face recognition from super-resolved images. In Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, pages 667-670, 2005.

24. F. Lin, C. Fookes, V. Chandran, and S. Sridharan. Super-resolved faces for improved face recognition from surveillance video. In International Conference on Biometrics, pages 1-10, 2007.

25. M. Nishiyama, T. Kozakaya, and O. Yamaguchi. Illumination Normalization using Quotient Image-based Techniques, chapter 8, pages 97-108. I-Tech Education and Publishing, 2008.

26. FaceGen Modeller. Singular Inversions, 2007. URL http://www.facegen.com/index.htm.

27. Neurotechnology. Neurotechnologija VeriLook, 2009. URL http://www.neurotechnology.com/verilook.html.

28. Nikon. Nikon D80. URL http://www.nikonusa.com/Find-Your-Nikon/Product/Digital-SLR/25412/D80.html.

29. NIST. Multiple Biometric Grand Challenge, 2008. URL http://face.nist.gov/mbgc/.

30. NYC. NYC Surveillance Camera Project. http://www.mediaeater.com/cameras/info.html, 1998.

31. U. Park and A. K. Jain. 3D model-based face recognition in video. In International Conference on Biometrics, pages 1085-1094, 2007.

32. J. Phillips, P. Grother, D. Blackburn, E. Tabassi, and M. Bone. FRVT 2002 Evaluation Report. http://www.frvt.org/FRVT2002/documents.htm, 2002.

33. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek. The 2005 IEEE workshop on face recognition grand challenge experiments. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, page 45, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2372-2. doi: http://dx.doi.org/10.1109/CVPR.2005.585.

34. J. Phillips, T. Scruggs, A. J. O'Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe. FRVT 2006 Evaluation Report. http://face.nist.gov/frvt/frvt2006/frvt2006.htm, 2006.

35. Pittpatt. Face detection and recognition, 2004. URL http://www.pittpatt.com.

36. J. Price and T. Gee. Towards robust face recognition from video. In Proceedings on Applied Image Pattern Recognition, pages 94-102, 2001.

37. Z. Rahman and G. Woodell. A multiscale retinex for bridging the gap between color images and the human observation of scenes. In IEEE Transactions on Image Processing, volume VI, pages 965-976, 1997.

38. T. Riopka and T. Boult. The eyes have it. In Proceedings of the 2003 ACM SIGMM Workshop on Biometrics Methods and Applications, pages 9-16, 2003.

39. A. Scvortov. Dzone snippets. URL http://snippets.dzone.com/user/scvalex/tag/minimum+spanning+tree.

40. T. Sim, S. Baker, and M. Bsat. The CMU Pose, Illumination, and Expression (PIE) Database. In Proceedings IEEE International Conference on Automatic Face and Gesture Recognition, page 53, 2002.

41. Sony. Sony SNCRZ50N. URL http://pro.sony.com/bbsc/ssr/cat-securitycameras/cat-ip/product-SNCRZ50N/.

42. Sony. HDR-HC7 high definition Handycam. URL http://www.sonystyle.com/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10551&storeId=10151&productId=11039061&langId=-1.

43. D. Thomas, K. W. Bowyer, and P. J. Flynn. Multi-frame approaches to improve face recognition. In Proceedings of the IEEE Workshop on Motion and Video Computing, page 19, 2007.

44. M. Turk and A. Pentland. Face recognition using eigenfaces. In Computer Vision and Pattern Recognition, pages 586-590, 1991.

45. Viisage. IdentityEXPLORER SDK, 2006. URL http://www.viisage.com.

46. H. Wang, S. Z. Li, Y. Wang, and J. Zhang. Self quotient image for face recognition. In ICIP, pages 1397-1400, 2004.

47. S. D. Wei and S. H. Lai. Robust face recognition under lighting variations. In International Conference on Pattern Recognition, pages 354-357, 2004.

48. B. Weyrauch, B. Heisele, J. Huang, and V. Blanz. Component-based face recognition with 3D morphable models. In Computer Vision and Pattern Recognition Workshop, volume V, page 85, 2004.

49. C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. In IEEE Transactions on Computers, volume XX:1, pages 68-86, 1971.

50. W. Zhao, R. Chellappa, and A. Krishnamurthy. Discriminant analysis of principal components for face recognition. In Computer Vision and Pattern Recognition, pages 336-341, 1998.

51. W. Y. Zhao and R. Chellappa. 3D model enhanced face recognition. In International Conference on Image Processing, 2000.

52. S. K. Zhou and R. Chellappa. Probabilistic human recognition from video. In European Conference on Computer Vision, pages III:681-698, 2002.

53. S. K. Zhou and R. Chellappa. Simultaneous tracking and recognition of human faces in video. In International Conference on Acoustics, Speech, and Signal Processing, pages III:225-228, 2003.

54. S. K. Zhou, V. Krueger, and R. Chellappa. Face recognition from video: A condensation approach. In Automatic Face and Gesture Recognition, pages 212-217, 2002.

55. S. K. Zhou, R. Chellappa, and W. Zhao. Uncontrolled face recognition: Preliminaries and Reviews, chapter 2, pages 17-44. Springer, 2006.

This document was prepared & typeset with pdfLaTeX, and formatted with the nddiss2e classfile (v3.0[2005/07/27]) provided by Sameer Vijay.