
An SVM-based Face Detection System using Multiplicative Updates
Clyde Shavers, Robert Li, Gary Lebby
North Carolina A&T State University, Department of Electrical Engineering, Greensboro, North Carolina

Abstract - This paper implements an approach to face detection based on support vector machines (SVM). Our approach uses the multiplicative updates algorithm proposed in [1][2] instead of the conventional approach that uses the Lagrangian to determine the λ-coefficients of the SVM decision function. We implement the simplest SVM, i.e. the hyperplane decision surface is constrained to pass through the origin. This SVM implementation does not incorporate the sum constraint or the box constraint required, respectively, by hard and soft margin classifiers whose hyperplanes do not pass through the origin. Even so, our results yield a ninety percent (or higher) detection rate for each trial.

I Introduction

The face detection system implemented is based on support vector machines (SVM). Our approach is an image-based approach rather than a feature-based approach. More specifically, our approach is a statistical one (see Figure 1), based on the statistical learning theory developed by Vladimir N. Vapnik. Our approach uses the multiplicative updates algorithm proposed in [1] (instead of the conventional approach using the Lagrangian/quadratic programming) to determine the λ-coefficients. Our goal is simply to detect whether or not an image presented to the system contains a face object.

Facial images from the Olivetti Research Lab (ORL) database are used to train the system to detect faces. No preprocessing (i.e. the illumination correction and histogram equalization observed in other face detection studies such as [3]) is performed on the images. A data set composed of 200 examples is constructed from the ORL database. An example from the dataset is composed of:

- a sample x (that represents the image)
- a target value y (that represents face or non-face)

A 20 x 20 window of pixels (a 400-dimension vector) is extracted from the image to create a sample. Pixel gray-scale values range from 0 to 255.

During the training phase, the target y is set to +1 for a sample that represents a facial object. The target y is set to –1 for any sample that is a non-face object. We use 80% of the examples from the dataset to train the SVM and the remaining 20% are used for testing.

Figure 1 provides an overview of the various approaches to face detection. For a detailed discussion of these approaches see [4][5]. Section II provides some background on SVMs, describing geometrically how SVMs work, and describes the proposed multiplicative updates algorithm.

Figure 1 Face Detection Approaches (taxonomy: feature-based approaches comprise low-level analysis (edges, gray-levels, color, motion, generalized measure), feature analysis (feature searching, constellation analysis), and active shape models (snakes, deformable templates, point distribution models (PDMs)); image-based approaches comprise linear subspace methods, neural networks, and statistical approaches)
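To make this data representation concrete, the following minimal NumPy sketch (our illustration; the variable and function names are hypothetical, and the paper itself performed these steps in Mathcad) turns a 20 x 20 gray-scale window into a 400-dimension sample x with target y and partitions the examples 80/20 into training and test sets:

```python
import numpy as np

def make_example(window_20x20, is_face):
    """Flatten a 20 x 20 gray-scale window (pixel values 0-255) into a
    400-dimension sample x with target y = +1 (face) or -1 (non-face)."""
    x = np.asarray(window_20x20, dtype=float).reshape(400)
    y = 1.0 if is_face else -1.0
    return x, y

def split_dataset(X, y, train_fraction=0.8, seed=0):
    """Randomly assign 80% of the examples to training and 20% to testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(train_fraction * len(y))
    return (X[order[:n_train]], y[order[:n_train]],
            X[order[n_train:]], y[order[n_train:]])
```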
II Support Vector Machines

In this application, the support vector machine (SVM) is a two-class classifier system. The SVM may also be employed as a regression technique. The SVM can be trained to classify both linearly separable and non-linearly separable data. The SVM locates the most influential data samples, called support vectors. These are samples (from both classes under consideration) that are closest to the decision surface being constructed.

The idea behind the SVM is to create a hyperplane decision surface located equidistant between two decision boundaries. Each boundary is specified by the location of support vectors that satisfy y_j [w^T x_j + b] ≡ 1, j = 1, ..., N_SV (where N_SV is the number of support vectors). Locating the decision surface in this way produces an optimal hyperplane decision surface between the samples of the two classes, and therefore minimum error in classification.

The SVM is trained to classify samples of only one specific class-of-interest at a time. All other samples which are not classified as the class-of-interest are considered to be in a class outside the class-of-interest.

Figure 2 SVM Architecture [6] (diagram: an input test vector and the support vectors are mapped to φ(x) and φ(x_i); the kernel values K(x, x_i) are scaled by the coefficients y_i λ_i and summed to produce the output)

Figure 2 shows the architecture of an SVM used to realize the linear SVM decision function in equation (1).

A. SVMs and Statistical Learning Theory

SVMs are an extended realization of the empirical risk minimization (ERM) inductive principle called structural risk minimization (SRM). SRM is an inductive principle based in statistical learning theory. SRM seeks to bound the actual risk (i.e. expected risk or true risk) associated with a set of approximator functions Sn by minimizing the confidence interval, as shown in Figure 3.

B. VC Dimensions and Risk Bounds

The VC dimension h (named for Vapnik and Chervonenkis) is a measure of the complexity of a group of nested approximator functions Sn, i.e. of the SVM decision function. The VC dimension is an indicator of the order or degree of the SVM decision function and weight vector w. An SVM decision function with high h is subject to overfitting, which means the SVM does not generalize well and the generalization error C(m,h;η) is increased. The bound on the actual risk is minimized by the choice of VC dimension h. In other words, there is an optimal choice of VC dimension h that minimizes the actual risk.

There are two SRM strategies for minimizing the bound on the actual risk. One strategy is to keep the confidence interval fixed and minimize the empirical risk. However, SVMs implement a slightly different strategy. The SVM strategy is based on keeping the empirical risk fixed (i.e. close to zero misclassification error) and minimizing the generalization error (a.k.a. the confidence interval) using the VC dimension h [7].

The empirical risk is maintained close to zero by the very structure of the SVM. From a geometrical perspective, the empirical risk is minimized by minimizing ||w||. This is because minimizing ||w|| maximizes the margin between support vectors. Maximizing the margin improves classification performance. This implication is supported in the next section.

Minimizing the generalization error then becomes a matter of finding the optimal value of the VC dimension h. Figure 3 shows an upper bound on the actual risk obtained by choosing the approximator set with the optimal VC dimension h from the nested set of approximator functions Sn. The bound on the actual risk is the sum of the empirical risk and the generalization error [8].

Figure 3 Upper bound on true risk [7] (plot: classification error versus VC dimension h; the true risk R(α) is bounded by the empirical risk Remp(α) plus the confidence interval C(m,h;η); small h underfits, large h overfits, and the optimal h* selects the structure S* from the nested sets S1, ..., Sn)
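The paper does not write out the confidence interval C(m,h;η). For reference, one standard form of this bound from statistical learning theory [8], holding with probability 1 − η for m training samples and VC dimension h, is:

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2m}{h}+1\right)-\ln\frac{\eta}{4}}{m}}
```

where the square-root term plays the role of the generalization error C(m,h;η) discussed above.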
SVMs are trained as function approximators that can be used for prediction/classification or for interpretation (i.e. regression) when one wants to understand the structure of an underlying system.

Figure 4 SVM – Learning Machine Model (diagram: a data generator produces samples x, the system produces responses y, and the learning machine, a function approximator, estimates f(x,λ))

SVMs are based on statistical learning theory, where learning machine models are constructed with the goal of finding a function approximator that estimates the unknown input-output dependency of a system f(x,λ) based on a set of observable samples (see Figure 4). Once an SVM has been trained to capture the underlying system structure, it may be used to classify new unknown samples. Multi-class classification is possible by specifying a different class as the class-of-interest each time the technique is applied. In our face detection application there are only two classes to be discriminated, i.e. faces versus non-faces.

C. Geometrical View of How SVM Works

Linearly Separable Case:

Statistical learning involves finding a set of functions that best approximate the unknown input-output dependency of a system based on a limited number of observations.

Figure 5 Linearly Separable Samples (diagram: the hyperplanes ⟨w, x⟩ + b = +1, 0, −1 in the (x1, x2) plane, the margin M between them, the weight vector w, the support vectors x1, x2, x3, Class 1 with y = +1 and Class 2 with y = −1)

The SVM is an implementation of such a set of functions, known as hyperplanes. These hyperplanes form a decision surface used to classify samples of two distinct classes. More specifically, the SVM implements the decision function that is used to construct these hyperplanes (see Figure 5). In the linearly separable case, this decision function is given by

f(x) = sgn(⟨w, x⟩ + b)    (1)

where w is a weight vector normal to the hyperplane, x is a sample from the training set, and b is a scalar.

During training, the SVM finds support vectors (i.e. x1, x2 and x3) that satisfy y_j [⟨w, x_j⟩ + b] ≡ 1, j = 1, ..., N_SV (see Figure 5). The SVM must also satisfy the constraint y_i [⟨w, x_i⟩ + b] ≥ 1, i = 1, ..., l, where l indicates the number of training samples.

The idea behind the SVM is to find those samples (i.e. support vectors from the two classes) closest to the hyperplane, but with the largest margin of separation between them. This largest-margin criterion is fundamental to the SVM's optimality. The margin itself is given by M = 2 / ||w||.

In order to find the optimal separating hyperplane the SVM minimizes ||w||, while satisfying the constraint y_i [⟨w, x_i⟩ + b] ≥ 1. The problem therefore becomes a nonlinear optimization problem with constraints, i.e. a quadratic programming (QP) problem. This type of problem can be solved by formulating the primal Lagrangian given below.

L_p(w, b, λ) = (1/2) w^T w − Σ_{i=1}^{l} λ_i { y_i [w^T x_i + b] − 1 }    (2)

We locate the optimal saddle point of the primal Lagrangian by taking its derivatives with respect to the variables w and b and setting them equal to zero. We can then rewrite the primal Lagrangian as the dual Lagrangian. Notice that the Lagrangian is now written in terms of the λ_i-coefficients.

L_d(λ) = Σ_{i=1}^{l} λ_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} y_i y_j λ_i λ_j x_i^T x_j    (3)

The λ_i-coefficients may be found via quadratic programming (QP) or another traditional method. Instead of the traditional method, this paper proposes the multiplicative update technique described in section E to locate these coefficients.
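Whichever method is used, the quantity being maximized over the nonnegative λ_i-coefficients is the dual objective in (3). A minimal NumPy sketch that evaluates L_d for a candidate coefficient vector, using our own variable names and the linear (dot-product) kernel, is:

```python
import numpy as np

def dual_objective(lam, X, y):
    """Evaluate the dual Lagrangian L_d(lambda) of equation (3)
    for the linear (dot-product) kernel."""
    G = X @ X.T                    # Gram matrix: G[i, j] = x_i . x_j
    A = np.outer(y, y) * G         # A[i, j] = y_i y_j x_i . x_j
    return lam.sum() - 0.5 * lam @ A @ lam
```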
Assuming the λ_i-coefficients have been determined (using either the traditional approach or the proposed multiplicative updates algorithm), the decision function can be expressed as

f(x) = Σ_{i=1}^{m} λ_i y_i Φ(x_i) · Φ(x) + b    (4)
where Φ is a mapping function that maps the samples from input space to a higher-dimension feature space. The mapping function used is generally one of the three kernel functions described in section D. The decision function may be rewritten as

f(x) = Σ_{i=1}^{m} λ_i y_i K(x, x_i) + b    (5)

For the linearly separable case the polynomial kernel function of degree 1 is used. For classification, the SVM decision function can be written as an indicator function, i.e.

F(x) = sgn( f(x) ) = sgn( Σ_{i=1}^{m} λ_i y_i K(x, x_i) + b )    (6)

If the selected kernel function can support the bias term b, then the indicator function can be rewritten as

F(x) = sgn( f(x) ) = sgn( Σ_{i=1}^{m} λ_i y_i K(x, x_i) )    (7)

A pictorial representation of this function is given as the SVM architecture shown in Figure 2.
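As an illustration of equation (6), the following sketch (our code, not taken from the paper; the default kernel is the degree-1 polynomial mentioned above, and the bias defaults to zero) classifies a single test sample given the trained coefficients:

```python
import numpy as np

def svm_indicator(x, X_train, y_train, lam, b=0.0, kernel=None):
    """Indicator function F(x) = sgn(sum_i lam_i y_i K(x, x_i) + b) of
    equation (6); defaults to the degree-1 polynomial kernel of (8)."""
    if kernel is None:
        kernel = lambda u, v: u @ v + 1.0     # polynomial kernel with d = 1
    k = np.array([kernel(x, xi) for xi in X_train])
    f = np.sum(lam * y_train * k) + b
    return 1 if f >= 0 else -1
```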
Nonlinearly Separable Case:

In the nonlinearly separable case, the samples cannot be properly classified (or separated) with the linear decision function given in (1). However, the samples are not necessarily overlapping (i.e. inter-mingled) as they are in the non-separable case discussed later.

A solution to the nonlinearly separable case is to use an appropriate kernel function (e.g. a Gaussian RBF or a high-degree polynomial, i.e. d > 1) to map the samples to a higher-dimensional feature space where they are linearly separated (see Figure 6). Kernel functions are given in section D.

Figure 6 Nonlinearly Separable Samples (diagram: samples mapped to the feature space via x_i = Φ(x_i))

Another solution considered in the nonlinearly separable case is to add a slack variable to the linear decision function (i.e. the linear SVM). Figure 7 shows an example of how the decision function behaves when a slack variable is added.

Figure 7 Non-separable overlapping samples

D. Kernel Function

Selection of an appropriate kernel function is used to map the samples to a higher-dimensional feature space where the samples may be separated by a linear hyperplane. These kernel functions are designated as K(x_i, x_j), and Φ(x_i) · Φ(x_j) = K(x_i, x_j). Three general kernel functions are:

Polynomial (of degree d)

K(x, x_i) = [ (x^T x_i) + 1 ]^d    (8)

Gaussian RBF (Radial Basis Function)

K(x, x_i) = exp( −||x − x_i||^2 / (2σ^2) )    (9)

Sigmoidal

K(x, x_i) = tanh( x^T x_i + b )    (10)
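A direct NumPy transcription of (8)-(10) might look as follows; the default parameter values d = 2, σ = 1 and b = 1 are purely illustrative:

```python
import numpy as np

def poly_kernel(x, xi, d=2):
    """Polynomial kernel of degree d, equation (8)."""
    return (x @ xi + 1.0) ** d

def rbf_kernel(x, xi, sigma=1.0):
    """Gaussian RBF kernel, equation (9)."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xi, b=1.0):
    """Sigmoidal kernel, equation (10)."""
    return np.tanh(x @ xi + b)
```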
E. Multiplicative Updates

The SVM implements the decision function f(x) = sgn( K(w*, x) + b ), where w* is a solution vector (or coefficient vector). The solution vector is rewritten in terms of the λ_i-coefficients, i.e. w* = Σ_i λ_i y_i x_i.
There are various methods for calculating these coefficients. In this paper, the coefficients are calculated using the proposed multiplicative update algorithm. The algorithm is given as

λ_i ← λ_i [ (1 + √(1 + 4 (A^+ λ)_i (A^− λ)_i)) / (2 (A^+ λ)_i) ]    (11)

A^+_{ij} = A_{ij} if A_{ij} > 0, and 0 otherwise    (12)

A^−_{ij} = |A_{ij}| if A_{ij} < 0, and 0 otherwise    (13)

A_{ij} = y_i y_j K(x_i, x_j)    (14)

where K(x_i, x_j) is a dot-product kernel function.
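A compact NumPy sketch of equations (11)-(14), iterated from a strictly positive starting point, is given below. The iteration count and the initialization are our choices for illustration and are not specified in the paper:

```python
import numpy as np

def multiplicative_updates(X, y, kernel, n_iter=500):
    """Estimate the lambda-coefficients of the through-origin SVM by
    iterating the multiplicative update of equation (11)."""
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    A = np.outer(y, y) * K                    # equation (14)
    A_pos = np.where(A > 0, A, 0.0)           # equation (12)
    A_neg = np.where(A < 0, -A, 0.0)          # equation (13): |A_ij| for negative entries
    lam = np.ones(n)                          # strictly positive starting point
    for _ in range(n_iter):
        p = A_pos @ lam                       # (A+ lambda)_i
        q = A_neg @ lam                       # (A- lambda)_i
        lam = lam * (1.0 + np.sqrt(1.0 + 4.0 * p * q)) / (2.0 * p)  # equation (11)
    return lam
```

Because each factor in (11) is nonnegative, the update keeps the λ-coefficients nonnegative automatically; with the hyperplane constrained through the origin there is no additional sum constraint to enforce.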

In this experiment we have chosen the simplest SVM implementation, where the decision surface (i.e. the separating hyperplane) is constrained to pass through the origin as in [1]. The SVM in this experiment is implemented as

f(x) = Σ_{i=1}^{l} λ_i y_i K(x_i, x) + b    (15)

where the hyperplane decision surface passes through the origin.

During SVM training, input samples (i.e. vectors) from the training set are input to the multiplicative update algorithm. The algorithm processes the samples and generates a set of λ-coefficients that are used as scaling factors. A coefficient is generated for each corresponding training vector. Only the vectors that are support vectors receive a non-zero valued coefficient. The coefficients for all other vectors are determined to be equal to zero. The support vectors themselves become the archetypes for facial images. Test vectors (i.e. vectors not previously observed) presented to the SVM are essentially compared to these archetypes for classification.

The support vectors are the vectors closest to the decision surface. These vectors influence the contours of the decision surface.

During SVM testing, test vectors are presented to the SVM. The kernel function calculates a correlation-like value that indicates the resemblance between the test vector and the training vectors. The larger the value, the greater the resemblance between the test vector and the support vector. The λ_i-coefficients associated with each training vector during the training phase are used to scale the products in the kernel function. Specifically, any test vector found to closely resemble a support vector is scaled by the non-zero coefficient given to that support vector. Otherwise the test vector is scaled by a coefficient with value zero. Figure 11 shows the nonzero λ_i-coefficients corresponding to the support vectors.

III Experiment Results

A. Overview of Face Detection Process

The face detection process consists of the following steps (a code sketch summarizing them is given after the list):

1. Input face and non-face images (i.e. sample data).
2. Scale the images.
3. Extract a 20 x 20 subimage.
4. Create vectors from the subimages.
5. Add target labels to the vectors (i.e. designate each as face or non-face).
6. Partition the sample data into "training samples" and "test samples".
7. Input the "training samples" to the SVM program.
8. Train the SVM.
9. Calculate the Lagrangian coefficients.
10. Randomize the test data.
11. Input the test data and Lagrangian coefficients to the SVM algorithm.
12. Run the SVM to detect faces.
13. Record results.
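Steps 1-13 can be collected into a short driver routine. The sketch below reuses the split_dataset, multiplicative_updates and svm_indicator sketches given earlier; extract_20x20_subimage is a hypothetical stand-in for the scaling and extraction actually performed in Mathcad:

```python
import numpy as np

def run_face_detection_experiment(images, labels, kernel, seed=0):
    """Sketch of steps 1-13: build labeled sample vectors, train the SVM with
    multiplicative updates, then classify the held-out test samples."""
    # Steps 1-5: scale the images, extract 20 x 20 subimages, vectorize, label
    X = np.array([extract_20x20_subimage(img).reshape(400) for img in images])
    y = np.array([1.0 if lab == "face" else -1.0 for lab in labels])
    # Step 6: partition the sample data into training and test samples (80/20)
    X_tr, y_tr, X_te, y_te = split_dataset(X, y, train_fraction=0.8, seed=seed)
    # Steps 7-9: train the SVM, i.e. compute the lambda-coefficients
    lam = multiplicative_updates(X_tr, y_tr, kernel)
    # Steps 10-13: classify the (shuffled) test data and record the results
    preds = np.array([svm_indicator(x, X_tr, y_tr, lam, kernel=kernel) for x in X_te])
    accuracy = np.mean(preds == y_te)
    return preds, accuracy
```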
B. Preliminary Setup

The data set for this application is composed of one hundred and sixty face images and forty non-face images. The facial images were obtained from the ORL database. Images were initially read into the Mathcad program and labeled M1 through M41 (only the first 3 of the face images are shown below).

Figure 8 Original Images (the first three face images, M1, M2 and M3)
The original images were scaled by a factor of three in order to extract a 20 x 20 pixel sub-image. The 20 x 20 sub-image yields a 400-dimension vector. An additional element was appended to each 400-dimensional vector to produce a vector with dimension 401. This additional element is the target value used to designate the image as face or non-face. This element is set to +1 for a face and –1 for a non-face. These vectors form the dataset. Eighty percent (80%) of the dataset is designated as training examples and the remaining 20% is designated as test examples. Some of the scaled images are shown in Figure 9. The corresponding 20 x 20 sub-images are shown in Figure 10.

Figure 9 Scaled Images (M1a, M2a, M3a)

Figure 10 Subimages (20 x 20) (M1b, M2b, M3b)
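This preparation step can be approximated in code as follows. The paper performed it in Mathcad and does not specify the exact scaling routine, so the 3 x 3 block averaging and the central crop below are assumptions:

```python
import numpy as np

def image_to_labeled_vector(image, is_face):
    """Scale an image down by a factor of three (3 x 3 block averaging),
    take a central 20 x 20 sub-image, flatten it to 400 elements, and
    append the target value (+1 face, -1 non-face) as element 401."""
    img = np.asarray(image, dtype=float)
    h = (img.shape[0] // 3) * 3
    w = (img.shape[1] // 3) * 3
    small = img[:h, :w].reshape(h // 3, 3, w // 3, 3).mean(axis=(1, 3))
    r0 = (small.shape[0] - 20) // 2
    c0 = (small.shape[1] - 20) // 2
    sub = small[r0:r0 + 20, c0:c0 + 20]          # central 20 x 20 sub-image
    target = 1.0 if is_face else -1.0
    return np.concatenate([sub.reshape(400), [target]])   # 401-element vector
```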
C. SVM training

Eighty percent of the dataset is used for training the SVM. Training resulted in a set of coefficients. A graphical representation of the coefficients is shown in Figure 11. Notice that, as expected, the coefficients tend to zero for the non-support vectors.

Figure 11 λ-Coefficients (plots of the coefficient vectors AlphaV00, AlphaV01, AlphaV02 and AlphaV04 versus the index i; nonzero values occur only at the support vectors)

IV Conclusion

For each set of images tested, 90% or better were classified correctly in each trial. We found that it is important to include non-face examples among the training examples. The work is based on previous work done by Sung and Poggio, which resulted in a well-known successful implementation of a face detection system. In their work, a bootstrapping method is used where false positive detections are inserted into the training samples as non-face examples. This step helps to more accurately train the SVM to recognize non-face images. Their work also included preprocessing, where variation in illumination brightness is compensated. Our work did not include these steps. It is possible that preprocessing similar to that done by Sung and Poggio would allow the misclassified images to be correctly classified. This work proved the application of the multiplicative updates approach to be a viable alternative in determining the λ-coefficients.

References

[1] Fei Sha, Lawrence K. Saul, and Daniel D. Lee, "Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines", Technical Report MS-CIS-02-19, University of Pennsylvania.
[2] Fei Sha, Lawrence K. Saul, and Daniel D. Lee, "Multiplicative Updates for Large Margin Classifiers".
[3] E. Osuna, R. Freund, and F. Girosi, "Training Support Vector Machines: an Application to Face Detection", Proceedings of CVPR'97, June 17, 1997.
[4] E. Hjelmas and B. K. Low, "Face Detection: A Survey", Computer Vision and Image Understanding, April 17, 2001.
[5] Rama Chellappa, "Human and Machine Recognition of Faces: A Survey", Proceedings of the IEEE, Vol. 83, No. 5, May 1995.
[6] M. A. Hearst, "Support Vector Machines", Trends and Controversies, IEEE Intelligent Systems.
[7] Harry Wechsler, et al., Face Recognition – From Theory to Applications, Springer-Verlag Berlin Heidelberg, 1998.
[8] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag New York, Inc., 2000.
