CHAPTER 1

INTRODUCTION

Pattern recognition approaches have achieved measurable success in the domain of visual detection. Examples include face, automobile, and pedestrian detection [10], [11], [13], [1], [9]. Each of these approaches uses machine learning to construct a detector from a large number of training examples. The detector is then scanned over the entire input image in order to find a pattern of intensities which is consistent with the target object. Experiments show that these systems work very well for the detection of faces, but less well for pedestrians, perhaps because images of pedestrians are more varied (due to changes in body pose and clothing). Detection of pedestrians is made even more difficult in surveillance applications, where the resolution of the images is very low (e.g. there may be only 300-500 pixels on the target). Though improving pedestrian detection using better functions of image intensity is a valuable pursuit, I take a different approach.

This paper describes a pedestrian detection system that integrates intensity information with motion information. The pattern of human motion is well known to be readily distinguishable from other sorts of motion. Many recent papers have used motion to recognize people and, in some cases, to detect them [8], [10], [7], [3]. These approaches have a rather different flavor from the face/pedestrian detection approaches mentioned above. They typically try to track moving objects over many frames and then analyze the motion to look for periodicity or other cues. Detection-style algorithms, in contrast, are fast, perform exhaustive search over the entire image at every scale, and are trained on large datasets to achieve high detection rates and very low false positive rates.
In this paper we apply a detection-style approach using information about motion as well as intensity. The implementation described is very efficient, detects pedestrians at very small scales (as small as 18x36 pixels), and has a very low false positive rate. The system is trained on full human figures and does not currently detect occluded or partial human figures.

My approach builds on the detection work of Viola and Jones [10]. Novel contributions of this paper include the development of an extremely efficient representation of image motion and the implementation of a state-of-the-art pedestrian detection system that operates using AdaBoost and Haar-like features.

1.1 OBJECTIVE

The main goal of this project is to develop a system capable of detecting pedestrians. The project covers the study and implementation of AdaBoost pedestrian detection with Haar-like features, and an analysis of the performance of the AdaBoost system for different dataset sizes. The project includes the development of an extremely efficient representation of image motion and the implementation of a state-of-the-art pedestrian detection system that operates using AdaBoost and Haar-like features.

1.2 PROJECT SCOPE

In this work, pictorial structures that represent a simple model of a person are used for this purpose, as was done in previous works [9], [10], [3]. However, rather than using a simple rectangle segment template, a constant-colored rectangle, or learned appearance models specific to individuals for part detection, I train a discriminative classifier for each body segment to detect candidate segments in images.

An overview of the algorithm is shown in Figure 1.2.1. A large number of training examples, both positive and negative, are used to train binary classifiers for each body segment using the AdaBoost algorithm. After training, the classifiers are scanned over new input images. The detections from the input image for each segment are passed to an algorithm that determines the best configurations consisting of one of each segment. The configuration cost is computed efficiently as a function of the segment cost from the classifier and the kinematic cost of combining pairs of segments as specified by some deformable model.

1) Input: training examples (x_i, y_i), i = 1, ..., N, with positive (y_i = 1) and negative (y_i = 0) examples.

2) Initialization: weights ω_{1,i} = 1/(2m) for negative and 1/(2l) for positive examples, with m negative and l positive examples.

3) For t = 1, ..., T:

   a) Normalize all weights.

   b) For each feature j, train a classifier h_j with error ε_j = Σ_i ω_i |h_j(x_i) − y_i|.

   c) Choose h_t with the lowest error ε_t.

   d) Update weights: ω_{t+1,i} = ω_{t,i} β_t^{1−e_i}, where e_i = 0 if x_i is correctly classified, e_i = 1 otherwise, and β_t = ε_t / (1 − ε_t).

4) Final strong classifier: h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ 0.5 Σ_{t=1}^{T} α_t, and 0 otherwise, with α_t = log(1/β_t).

Figure 1.2.1: AdaBoost Algorithm

Pictorial structure configurations are considered valid if their cost is below a
predetermined threshold. I examine the ability of an AdaBoost detector to find body
segments as well as the utility of enforcing kinematic constraints on pedestrian
detections. The following sections describe the details of each component of the
detection framework. Many of the ideas used in this work have been presented
previously. I cite the original authors of each algorithm, but reproduce many equations
and algorithms for completeness.

1.3 APPLICATION OF MOTION DETECTION AND TRACKING SYSTEM

Detecting and tracking a moving object in a dynamic video sequence has been a vital aspect of motion analysis. Such detection and tracking systems have become increasingly important due to their applications in various areas, including communication (video conferencing), transportation (traffic monitoring and autonomous driving vehicles), security (premise surveillance), and industry (dynamic robot vision and navigation).

Specifically, the main application targeted by the proposed detecting and tracking algorithm is an autonomous driving system. The system can be used to automatically detect and track any moving object that exists in traffic scenes, such as moving cars on a highway, within the view of the moving camera.

1.4 THESIS OUTLINE

In general, there are six main chapters in this report:

Chapter 1 provides readers a first glimpse at the basic aspects of the research undertaken, such as the objectives, scope, problem formulation, and the application targeted by the developed moving object detection and tracking system.

Chapter 2 gives an insight into the existing vision-based moving object detection and tracking algorithms developed by various researchers, and subjectively classifies them into four categories.

Chapter 3 elaborates on the methodology of the proposed detection and tracking algorithm. This chapter explains each main stage of the developed detection system, such as Haar-like feature matching, the integral image, and testing for moving objects by AdaBoost classification.

Chapter 4 describes the development of the prototype, including the training and performance datasets used by the pedestrian detection system.

Chapter 5 is mainly devoted to demonstrating the experimental results and performance of the proposed detection and tracking algorithm on some off-line image sequences.

Chapter 6 deals with the summary and conclusions of the research. A number of research findings obtained from the empirical results of the implemented detection and tracking system are also discussed. Lastly, some realistic extensions as well as possible enhancements for the research are provided.

CHAPTER 2

LITERATURE REVIEW

In general, the existing approaches and algorithms formulated to deal with object detection and tracking from images taken by a non-stationary, moving camera can be subjectively classified into a few categories: AdaBoost classification, Haar-like features, correlation-based matching techniques, and gradient-based techniques. The basic idea behind all these techniques is that the motion of the moving object differs from the motion of the background.

The choice of an algorithm that will perform well depends on many considerations. Of primary concern is the selection of an efficient model well suited to the requirements of the particular desired application.

2.1 AdaBoost Classification

AdaBoost is used both to select the features and to train the classifiers. The AdaBoost learning algorithm is used to boost the classification performance of a simple learning algorithm (e.g. it might be used to boost the performance of a simple perceptron). It does this by combining a collection of weak classification functions to form stronger classifiers.

AdaBoost, short for Adaptive Boosting, is a machine learning algorithm. It is a meta-algorithm, and can be used in conjunction with many other learning algorithms to improve their performance. AdaBoost is adaptive in the sense that subsequently built classifiers are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. On the other hand, it is less susceptible to the overfitting problem than most learning algorithms.

AdaBoost calls a weak classifier repeatedly in a series of rounds t = 1, ..., T. For each call, a distribution of weights D_t is updated to indicate the importance of each example in the data set for the classification. On each round, the weights of each incorrectly classified example are increased (or, alternatively, the weights of each correctly classified example are decreased), so that the new classifier focuses more on those examples. Below is the boosting algorithm for learning a query online. T hypotheses are constructed, each using a single feature. The final hypothesis is a weighted linear combination of the T hypotheses, where the weights are inversely proportional to the training error.

- Given example images (x_1, y_1), ..., (x_n, y_n), where y_i = 0, 1 for negative and positive examples respectively.

- Initialize weights w_{1,i} = 1/(2m), 1/(2l) for y_i = 0, 1 respectively, where m and l are the number of negatives and positives respectively.

- For t = 1, ..., T:

  1. Normalize the weights,

        w_{t,i} ← w_{t,i} / Σ_{j=1}^{n} w_{t,j}

     so that w_t is a probability distribution.

  2. For each feature j, train a classifier h_j which is restricted to using a single feature. The error is evaluated with respect to w_t:

        ε_j = Σ_i w_i |h_j(x_i) − y_i|

  3. Choose the classifier h_t with the lowest error ε_t.

  4. Update the weights:

        w_{t+1,i} = w_{t,i} β_t^{1−e_i}

     where e_i = 0 if example x_i is classified correctly, e_i = 1 otherwise, and

        β_t = ε_t / (1 − ε_t)

  5. The final strong classifier is:

        h(x) = 1 if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, and 0 otherwise

     where α_t = log(1/β_t).
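To make the steps above concrete, the following is a minimal C sketch of the same loop. It assumes the feature values have already been computed into an array, and it uses the weighted mean of each feature as a fixed decision threshold; the data layout, the threshold choice, and all names here are illustrative assumptions, not the implementation used in this project.

#include <math.h>
#include <stdlib.h>

/* Illustrative data layout (an assumption, not the report's): feat[j][i] is
   the value of feature j on example i, y[i] in {0,1}, n examples, F features,
   T boosting rounds. */
typedef struct { int feature; double threshold; double beta; } WeakClf;

/* Single-feature weak classifier: h_j(x) = 1 if the feature value exceeds
   the threshold, 0 otherwise. */
static int weak_predict(double value, double threshold)
{
    return value > threshold ? 1 : 0;
}

void adaboost_train(double **feat, const int *y, int n, int F, int T, WeakClf *out)
{
    double *w = (double *)malloc(n * sizeof(double));
    int i, j, t, m = 0, l = 0;

    for (i = 0; i < n; i++) { if (y[i]) l++; else m++; }
    for (i = 0; i < n; i++)                    /* w_{1,i} = 1/(2m) or 1/(2l) */
        w[i] = y[i] ? 1.0 / (2.0 * l) : 1.0 / (2.0 * m);

    for (t = 0; t < T; t++) {
        double sum = 0.0, best_err = 2.0, best_thr = 0.0;
        int best_j = 0;

        for (i = 0; i < n; i++) sum += w[i];   /* step 1: normalize weights */
        for (i = 0; i < n; i++) w[i] /= sum;

        for (j = 0; j < F; j++) {              /* step 2: error of each h_j */
            double thr = 0.0, err = 0.0;
            for (i = 0; i < n; i++) thr += w[i] * feat[j][i];
            for (i = 0; i < n; i++)
                err += w[i] * abs(weak_predict(feat[j][i], thr) - y[i]);
            if (err < best_err) {              /* step 3: keep the lowest error */
                best_err = err; best_j = j; best_thr = thr;
            }
        }

        out[t].feature = best_j;               /* the chosen h_t */
        out[t].threshold = best_thr;
        out[t].beta = best_err / (1.0 - best_err);   /* assumes err < 0.5 */

        for (i = 0; i < n; i++) {              /* step 4: w_{t+1,i} = w_{t,i} * beta^(1-e_i) */
            int e = abs(weak_predict(feat[best_j][i], best_thr) - y[i]);
            w[i] *= pow(out[t].beta, 1.0 - e);
        }
    }
    free(w);
}

/* Step 5, the strong classifier: returns 1 if sum_t alpha_t h_t(x) is at
   least 0.5 * sum_t alpha_t, where alpha_t = log(1/beta_t) and x is the
   feature vector of one example. */
int strong_predict(const WeakClf *clf, int T, const double *x)
{
    double score = 0.0, half = 0.0;
    int t;
    for (t = 0; t < T; t++) {
        double alpha = log(1.0 / clf[t].beta);
        score += alpha * weak_predict(x[clf[t].feature], clf[t].threshold);
        half += 0.5 * alpha;
    }
    return score >= half;
}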

2.2 Haar-like Features

Historically, for the task of object recognition, working with only image intensities (i.e. the RGB pixel values at each and every pixel of the image) made the task computationally expensive. A publication by Papageorgiou et al. discussed working with an alternate feature set instead of the usual image intensities. This feature set considers rectangular regions of the image and sums up the pixels in each region. The value thus obtained is used to categorize images. For example, suppose we have an image database with human faces and buildings. It is possible that if the eye and hair regions of the faces are considered, the sum of the pixels in this region would be quite high for the human faces and arbitrarily high or low for the buildings.

The value for the latter would depend on the structure of the building and its environment, while the values for the former would be roughly the same. We could thus categorize all images whose Haar-like feature in this rectangular region falls in a certain range of values as one category, and those falling outside this range as another. This might roughly divide the set of images into one group having many faces and few buildings, and another having many buildings and few faces. This procedure could be carried out iteratively to further divide the image clusters [8], [11], [12].

A slightly more complicated feature is to break the rectangular image region into either a left half and a right half, or a top half and a bottom half, and take the difference in the sum of pixels across these halves. This modified feature set is called the 2-rectangle feature set.
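As a rough illustration of the 2-rectangle feature just described, the sketch below computes, by direct pixel summation, the difference between the sums of the left and right halves of a rectangular region; the function names and the 8-bit row-major image layout are assumptions for the example, and Section 3.7 shows how the same sums are obtained much faster with the integral image.

/* Sum of the pixels inside a w x h rectangle with top-left corner (x, y);
   img is an 8-bit grayscale image with the given row stride. The caller is
   assumed to keep the rectangle inside the image. */
static long rect_sum(const unsigned char *img, int stride,
                     int x, int y, int w, int h)
{
    long s = 0;
    int r, c;
    for (r = y; r < y + h; r++)
        for (c = x; c < x + w; c++)
            s += img[r * stride + c];
    return s;
}

/* 2-rectangle feature: left half minus right half. Positive values mean the
   left half of the region is brighter than the right half. */
long two_rect_feature(const unsigned char *img, int stride,
                      int x, int y, int w, int h)
{
    long left  = rect_sum(img, stride, x,         y, w / 2, h);
    long right = rect_sum(img, stride, x + w / 2, y, w / 2, h);
    return left - right;
}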

2.3 Correlation-Based Matching Technique

The idea of this approach is to partition an image I_t into segmented regions and to search, for each of the segmented regions, for the corresponding region in the successive image I_{t+1} at time t+1.
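As a rough sketch of this region matching between frames (using a simple sum of absolute differences as a stand-in for the correlation score; all names and the search strategy are illustrative, not the cited systems' implementations):

#include <limits.h>
#include <stdlib.h>

/* Dissimilarity between a w x h region of frame I_t at (x0, y0) and a
   candidate position (x1, y1) in frame I_{t+1}; both are 8-bit grayscale
   images with the same row stride, and the caller keeps regions in bounds. */
static long region_sad(const unsigned char *it, const unsigned char *it1,
                       int stride, int x0, int y0, int x1, int y1, int w, int h)
{
    long sad = 0;
    int r, c;
    for (r = 0; r < h; r++)
        for (c = 0; c < w; c++)
            sad += labs((long)it[(y0 + r) * stride + (x0 + c)]
                      - (long)it1[(y1 + r) * stride + (x1 + c)]);
    return sad;
}

/* Exhaustive search in a (2*radius+1)^2 window around (x0, y0); on return,
   (*dx, *dy) is the displacement of the best-matching region in I_{t+1}. */
void match_region(const unsigned char *it, const unsigned char *it1, int stride,
                  int x0, int y0, int w, int h, int radius, int *dx, int *dy)
{
    long best = LONG_MAX;
    int u, v;
    *dx = 0; *dy = 0;
    for (v = -radius; v <= radius; v++)
        for (u = -radius; u <= radius; u++) {
            long s = region_sad(it, it1, stride, x0, y0, x0 + u, y0 + v, w, h);
            if (s < best) { best = s; *dx = u; *dy = v; }
        }
}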

The detection and tracking algorithm proposed by Volker Rermann in [8][9] utilizes both correlation and features as matching techniques to detect and track moving objects in the scene, intended for robotic and vehicle guidance applications. In order to provide a more robust approach, color regions that could not be matched at the feature level are matched at the pixel level by the integration of a correlation-based mechanism.

From his findings, with the feature-based matching technique, nearly 90% of the image area can be matched in an efficient and accurate way. The correlation-based matching technique, which requires more elaborate processing, is thus applied to only a small fraction of the image area. It yields very accurate displacement vectors and most of the time finds the corresponding color regions [7].

2.4 Gradient-Based Technique

Gradient-based methods, especially optical-flow methods, have been shown to be a powerful tool for scene analysis in the case of a static camera [1][2]. The main principle of the optical-flow method is to solve the so-called optical flow constraint (OFC) equation. The equation states that the temporal gradient of the moving object is equal to the spatial gradient multiplied by the speed of the moving object in the x and y directions. Both gradients can be computed easily, leaving the speeds in both directions as the unknowns.
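In symbols, the constraint just described is commonly written as follows (the standard formulation, stated here for completeness, since the text gives it only in words):

    I_x u + I_y v + I_t = 0

where I_x and I_y are the spatial gradients of the image intensity, I_t is its temporal gradient, and (u, v) is the unknown speed of the moving object in the x and y directions. Because this is one equation with two unknowns per pixel, additional constraints (such as smoothness of the flow field) are needed to solve for (u, v).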

2.5 Selection of Technique

The technique chosen for this research is AdaBoost classification with Haar-like features. Basically, this technique needs less computation time and is easier to implement compared to other techniques. Although the correlation-based technique is less affected by noise and illumination changes, it suffers from high computational complexity due to summations over the entire template. Some researchers use the gradient-based technique for detecting a moving object, but this technique is relatively sensitive to local or global illumination changes.

CHAPTER 3

METHODOLOGY

This chapter gives an overview of the equipment and development tools used, the total cost planning, and the project planning involved in developing the entire pedestrian detection system. In addition, this chapter presents in detail the methodology of the proposed vision-based pedestrian detection system. Subsequently, the pedestrian detection modules are elaborated. Explanation of the detection module focuses on the Haar-like feature prototypes, the integral image, and the classification function with the cascade of classifiers.

3.1 Equipment Used for Project Development

The proposed moving object detection system has been implemented on an Intel® Pentium® M processor 725 2.1 GHz PC with 2 GB RAM running Windows XP. The off-line images were acquired through a Sony color CCD camera. Each frame of the acquired images is converted to 8-bit grayscale format and finally to 1-bit binary format before undergoing further processing. Each frame is fixed at a size of 256 x 256 pixels.

The EuroCard Picolo frame grabber is utilized in the project setup. The task of the
frame grabber is to convert the electrical signal from the camera into a digital image
that can be processed by a computer. The frame grabber can be programmed to send the
image data without intervention of the central processing unit (CPU) to the memory
(RAM) of the PC where the images are processed.

The entire motion detection and tracking program has been developed using Microsoft Visual C++ version 6.0, including the GUI (Graphical User Interface), together with OpenCV. Visual C++ is a fully integrated editor, compiler, and debugger; hence, it can be used to create a complex software system.

3.2 Project Planning

Table 3.2.1 gives an overview of the main time schedule for the developed detection system:

No | Week                       | Task                                                                                          | Task For
1  | Week 4: 4-10 Feb 2008      | Research about the image processing process                                                   | Muzammil
2  | Week 5: 11-17 Feb 2008     | Research about the features of image processing: 1) feature differencing technique; 2) feature-based matching technique | Muzammil
3  | Week 6: 18-24 Feb 2008     | Research about the features of image processing: 1) AdaBoost technique; 2) Haar-like features | Muzammil
4  | Week 7: 25-29 Feb 2008     | Selection of technique                                                                        | Muzammil
5  | Week 8: 10-16 March 2008   | Submit summary                                                                                | Muzammil
6  | Week 9: 17-23 March 2008   | Matching operation: first stage (research)                                                    | Muzammil
7  | Week 10: 24-30 March 2008  | Research: matching operation of motion detection and tracking                                 | Muzammil
8  | Week 11: 1-6 April 2008    | Preparation of paperwork                                                                      | Muzammil
9  | Week 12: 7-13 April 2008   | Submit paperwork and presentation                                                             | Muzammil
10 | Week 13: 14-20 April 2008  | Find and identify the hardware: first stage (camera)                                          | Muzammil
11 | Week 14: 21-27 April 2008  | Find and identify the hardware: second stage (software)                                       | Muzammil
12 | Break                      | Semester break                                                                                | -
13 | Week 1: 14-20 July 2008    | Matching operation: second stage (hardware)                                                   | Muzammil
14 | Week 2: 21-27 July 2008    | Setup software: first stage (OpenCV and C++)                                                  | Muzammil
15 | Week 3: 28-31 July 2008    | Setup software: second stage (OpenCV and C++)                                                 | Muzammil
16 | Week 4: 4-10 August 2008   | Matching operation: third stage (software and hardware)                                       | Muzammil
17 | Week 5: 11-17 August 2008  | Testing project: first stage (warm up)                                                        | Muzammil
18 | Week 6: 18-24 August 2008  | Matching operation: fourth stage (upgrade)                                                    | Muzammil
19 | Week 7: 25-29 August 2008  | Testing project: second stage (in situation)                                                  | Muzammil
20 | Week 8: 8-14 Sept 2008     | Matching operation: fifth stage (upgrade)                                                     | Muzammil
21 | Week 9: 15-21 Sept 2008    | Testing project: third stage (final); result analysis                                         | Muzammil
22 | Week 10: 22-28 Sept 2008   | Matching operation: final stage; result analysis                                              | Muzammil
23 | Week 11: 1-5 Oct 2008      | Preparation of report                                                                         | Muzammil
24 | Week 12: 6-12 Oct 2008     | Preparation of presentation                                                                   | Muzammil
25 | Week 13: 13-19 Oct 2008    | Presentation                                                                                  | Muzammil
26 | Week 14: 20-26 Oct 2008    | Submit report                                                                                 | Muzammil

Table 3.2.1

3.3 Gantt Chart

Figures 3.3.1 through 3.3.7 give an overview of the Gantt chart for the project:

Figure 3.3.1

Figure 3.3.2

Figure 3.3.3

Figure 3.3.4

Figure 3.3.5

Figure 3.3.6

Figure 3.3.7

3.4 Cost Planning

Table 3.4.1 gives an overview of the cost of the developed detection system:

No | Item     | Description                                                                              | Quantity | Unit Price | Amount
1  | Hardware | Web camera (frame size of 256 x 256 pixels)                                              | 1        | RM 250.00  | RM 250.00
2  | Hardware | Computer: Intel® Pentium® M processor 725 2.1 GHz PC with 2 GB RAM running Windows XP    | 1        | None       | None
3  | Software | Microsoft Visual C++ version 6.0                                                         | 1        | RM 20.00   | RM 20.00

Total (Ringgit Malaysia): RM 270.00

Table 3.4.1

3.5 System Overview

The block diagram in Figure 3.5.1 gives an overview of the main stages in the developed detection system, using pedestrian detection as the example.

Figure 3.5.1

3.6 Haar-like Prototypes

A Haar-like feature is represented by a template (the shape of the feature), its coordinates relative to the search window origin, and the size of the feature (its scale). A subset of the feature prototypes used is shown in Figure 3.6.1. Each feature is composed of two or three joined "black" and "white" rectangles, which can be upright or rotated by 45 degrees. The Haar-like feature value is calculated as a weighted sum of two components: the pixel gray-level sum over the black rectangles and the sum over the whole feature area (all black and white areas). The weights of these two components are of opposite sign and, for normalization purposes, their absolute values are inversely proportional to the areas.
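In symbols, the weighted sum just described can be written as:

    feature = ω_black · Σ_{(x,y) ∈ r_black} I(x,y) + ω_all · Σ_{(x,y) ∈ r_all} I(x,y)

where I(x,y) is the pixel gray level, sign(ω_black) = −sign(ω_all), |ω_black| is inversely proportional to the area of the black rectangles, and |ω_all| is inversely proportional to the whole feature area. With this normalization, ω_black · Area(r_black) + ω_all · Area(r_all) = 0, so the feature responds with zero to a perfectly uniform image patch.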

Figure 3.6.1

3.7 Integral Image
Rectangle features can be computed very rapidly using an intermediate representation of the image which we call the integral image. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

    ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)

where ii(x, y) is the integral image and i(x, y) is the original image. Using the following pair of recurrences:

    s(x, y) = s(x, y − 1) + i(x, y)
    ii(x, y) = ii(x − 1, y) + s(x, y)

(where s(x, y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0), the integral image can be computed in one pass over the original image. Using the integral image, any rectangular sum can be computed in four array references. Clearly, the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums, they can be computed in six array references, eight in the case of the three-rectangle features, and nine for four-rectangle features.
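A minimal sketch of the two recurrences and the four-reference rectangle sum in C (the row-major array layout and the names are mine, not the report's):

/* Build the integral image ii and the cumulative row sum s from an 8-bit
   image img of size w x h, using the two recurrences above; s(x, -1) and
   ii(-1, y) are treated as 0. */
void integral_image(const unsigned char *img, long *ii, long *s, int w, int h)
{
    int x, y;
    for (x = 0; x < w; x++)
        for (y = 0; y < h; y++) {
            long s_prev  = (y > 0) ? s[(y - 1) * w + x]  : 0;  /* s(x, y-1)  */
            long ii_prev = (x > 0) ? ii[y * w + (x - 1)] : 0;  /* ii(x-1, y) */
            s[y * w + x]  = s_prev + img[y * w + x];
            ii[y * w + x] = ii_prev + s[y * w + x];
        }
}

/* Sum over the rectangle with top-left (x0, y0) and bottom-right (x1, y1),
   inclusive, using exactly four array references (out-of-range corners
   fold to 0). */
long rect_sum_ii(const long *ii, int w, int x0, int y0, int x1, int y1)
{
    long A = (x0 > 0 && y0 > 0) ? ii[(y0 - 1) * w + (x0 - 1)] : 0;
    long B = (y0 > 0) ? ii[(y0 - 1) * w + x1] : 0;
    long C = (x0 > 0) ? ii[y1 * w + (x0 - 1)] : 0;
    long D = ii[y1 * w + x1];
    return D - B - C + A;
}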
One alternative motivation for the integral image comes from the "boxlets" work of Simard et al. [6][7][8]. The authors point out that, in the case of linear operations (e.g. f · g), any invertible linear operation can be applied to f or g if its inverse is applied to the result. For example, in the case of convolution, if the derivative operator is applied to both the image and the kernel, the result must then be double integrated:

    f ∗ g = ∬ (f′ ∗ g′)

The authors go on to show that convolution can be significantly accelerated if the derivatives of f and g are sparse (or can be made so). A similar insight is that the invertible linear operation can be applied to f if its inverse is applied to g:

    (f′′) ∗ (∬ g) = f ∗ g

Viewed in this framework, computation of the rectangle sum can be expressed as a dot product, i · r, where i is the image and r is the box car image (with value 1 within the rectangle of interest and 0 outside). This operation can be rewritten as

    i · r = (∬ i) · r′′

The integral image is in fact the double integral of the image (first along rows and then along columns). The second derivative of the rectangle (first in the row direction and then in the column direction) yields four delta functions at the corners of the rectangle. Evaluation of the second dot product is thus accomplished with four array accesses.

3.8 Cascade of Classifiers

This section describes an algorithm for constructing a cascade of classifiers [3] which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances. Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon, in order to achieve low false positive rates.

A cascade of classifiers is a degenerate decision tree where, at each stage, a classifier is trained to detect almost all objects of interest while rejecting a certain fraction of the non-object patterns.

Each stage was trained using the AdaBoost algorithm. AdaBoost is a powerful machine learning algorithm that can learn a strong classifier based on a (large) set of weak classifiers by re-weighting the training samples. The feature-based classifier that best classifies the weighted training samples is added at each round of boosting. As the stage number increases, the number of weak classifiers needed to achieve the desired false alarm rate at the given hit rate also increases.
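At detection time the cascade is evaluated as sketched below, reusing the WeakClf type and weak_predict from the AdaBoost sketch in Section 2.1 (the Stage layout and the per-stage threshold field are assumptions for illustration):

#include <math.h>

/* One cascade stage: a boosted classifier with its own weak classifiers
   and its own acceptance threshold. */
typedef struct { WeakClf *weak; int n_weak; double threshold; } Stage;

/* A sub-window (given as its vector of feature values) is classified as a
   positive only if every stage accepts it; most negatives are rejected by
   the early, cheap stages, which is where the speed comes from. */
int cascade_classify(const Stage *stages, int n_stages, const double *features)
{
    int s, t;
    for (s = 0; s < n_stages; s++) {
        double score = 0.0;
        for (t = 0; t < stages[s].n_weak; t++) {
            const WeakClf *c = &stages[s].weak[t];
            double alpha = log(1.0 / c->beta);
            score += alpha * weak_predict(features[c->feature], c->threshold);
        }
        if (score < stages[s].threshold)
            return 0;                  /* rejected: stop computing immediately */
    }
    return 1;                          /* accepted by all stages */
}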

3.9 Training a Cascade

The cascade training process involves two types of tradeoffs. In most cases, classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time, classifiers with more features require more time to compute. In principle one could define an optimization framework in which: i) the number of classifier stages, ii) the number of features in each stage, and iii) the threshold of each stage are traded off in order to minimize the expected number of evaluated features. Unfortunately, finding this optimum is a tremendously difficult problem [6][7][8][9].

In practice, a very simple framework is used to produce an effective classifier which is highly efficient. Each stage in the cascade reduces the false positive rate as well as the detection rate.

A target is selected for the minimum reduction in false positives and the maximum allowable decrease in detection. Each stage is trained by adding features until the target detection and false positive rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for the false positive and detection rates is met, as sketched below.
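A skeleton of that stage-adding loop, reusing the Stage type from the previous sketch (train_stage is a hypothetical helper standing in for one round of boosted stage training; none of this is the report's actual training code):

/* Hypothetical: grows stage s with weak classifiers until it reaches a false
   positive rate of at most f_stage and a detection rate of at least d_stage
   on a validation set. */
void train_stage(Stage *s, double f_stage, double d_stage);

void build_cascade(Stage *stages, int max_stages,
                   double f_target,   /* overall false positive rate target */
                   double f_stage,    /* max false positive rate per stage  */
                   double d_stage)    /* min detection rate per stage       */
{
    double F = 1.0;                   /* false positive rate of the cascade */
    int n = 0;
    while (F > f_target && n < max_stages) {
        train_stage(&stages[n], f_stage, d_stage);
        F *= f_stage;                 /* each stage multiplies the cascade's
                                         false positive rate by at most f_stage */
        n++;
    }
}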

3.10 Training Process

The training process uses AdaBoost to select a subset of features and construct the classifier. In each round the learning algorithm chooses from a heterogeneous set of filters, including the appearance filters, the motion direction filters, the motion shear filters, and the motion magnitude filters.

The AdaBoost algorithm also picks the optimal threshold for each feature, as well as the α and β votes of each feature. The output of the AdaBoost learning algorithm is a classifier that consists of a linear combination of the selected features. For details on AdaBoost, the reader is referred to [12], [5].

The important aspect of the resulting classifier to note is that it mixes motion and appearance features. Each round of AdaBoost chooses, from the total set of motion and appearance features, the feature with the lowest weighted error on the training examples. The resulting classifier balances intensity and motion information in order to maximize detection rates.

Figure 3.10.1: Cascade architecture. Input is passed to the first classifier, which decides true or false (pedestrian or not pedestrian). A false determination halts further computation and causes the detector to return false. A true determination passes the input along to the next classifier in the cascade. If all classifiers vote true then the input is classified as a true example. If any classifier votes false then computation halts and the input is classified as false. The cascade architecture is very efficient because the classifiers with the fewest features are placed at the beginning of the cascade, minimizing the total required computation.

Viola and Jones [14] showed that a single classifier for face detection would require too many features and thus be too slow for real-time operation. They proposed a cascade architecture to make the detector extremely efficient (see Figure 3.10.1). We use the same cascade idea for pedestrian detection. Each classifier in the cascade is trained to achieve very high detection rates and modest false positive rates. Simpler detectors (with a small number of features) are placed earlier in the cascade, while complex detectors (with a large number of features) are placed later in the cascade.

Detection in the cascade proceeds from simple to complex. Each stage of the
cascade consists of a classifier trained by the AdaBoost algorithm on the true and false
positives of the previous stage. Given the structure of the cascade, each stage acts to
reduce both the false positive rate and the detection rate of the previous stage. The key
is to reduce the false positive rate more rapidly than the detection rate.

A target is selected for the minimum reduction in false positive rate and the maximum allowable decrease in detection rate. Each stage is trained by adding features until the target detection and false positive rates are met on a validation set. Stages of the cascade are added until the overall target for the false positive and detection rates is met.

Figure 3.10.2: A small sample of positive training examples. A pair of image patterns comprises
a single example for training.

CHAPTER 4

PROTOTYPE/PRODUCT DEVELOPMENT

This chapter gives an overview of the stages of prototype development and the technical material used in developing the entire pedestrian detection system. Subsequently, the pedestrian detection modules (coding, diagrams, flow charts) are elaborated. Explanation of the pedestrian detection development focuses on the Haar-like feature prototypes, the integral image, and the classification function with the cascade of classifiers.

4.1 Training Dataset of Pedestrian

The dataset contains a collection of pedestrian and non-pedestrian images. It is made available for download on the internet for benchmarking purposes, in order to advance research on pedestrian classification. The dataset consists of two parts. The first is a base dataset containing a total of 5000 pedestrian and 5000 non-pedestrian samples cut out from video images and scaled to a common size of 18x36 pixels. Pedestrian images were obtained by manually labeling and extracting the rectangular positions of pedestrians in video images. Video images were recorded at various (daytime) hours and locations with no particular constraints on pedestrian pose or clothing, except that pedestrians are standing upright and are fully visible. The non-pedestrian images are patterns representative of typical preprocessing steps within a pedestrian classification application, extracted from video images known not to contain any pedestrians. We chose to use a shape-based pedestrian detector that matches a given set of pedestrian shape templates to distance-transformed edge images (i.e. with a comparatively relaxed matching threshold).

The second part is an additional collection of non-pedestrian images: 1200 video images not containing any pedestrians, intended for the extraction of additional negative training examples.

Figure 4.1.1: Sample of Training Dataset of Pedestrian

4.2 Performance Dataset of Pedestrian

I generated a performance database of grayscale pedestrian images by running our pedestrian detector and tracker on several hours of video data and cropping a fixed window of size 500 × 300 pixels around the pedestrian in the middle frame of each tracked sequence. This method yielded 50 images in which the pedestrians were of roughly the same size. The crop window was positioned such that the pedestrian was centered in the bottom third of the image. I chose this position as a reference point to ensure matching was done with only pedestrian features and not background features.

Figure 4.1.2: Sample of Performance Dataset of Pedestrian

4.3 Training Performance

I used six of the sequences to create a training set from which I learned both a dynamic pedestrian detector and a static pedestrian detector. The other two sequences were used to test the detectors. The dynamic detector was trained on consecutive frame pairs, while the static detector was trained on static patterns and so only uses the appearance filters described above. The static pedestrian detector uses the same basic architecture as the face detector described in [14].

Figure 4.3.1: Sample frames from each of the 6 sequences we used for training. The manually
marked boxes over pedestrians are also shown.

Each stage of the cascade is a boosted classifier trained using a set of 5000 positive examples and 5000 negative examples. Each positive training example is a pair of 18 x 36 pedestrian images taken from two consecutive frames of a video sequence. Negative examples are similar image pairs which do not contain pedestrians. Positive examples are shown in Figure 3.10.2. During training, all example images are variance normalized to reduce contrast variations. The same variance normalization computation is performed during testing, as sketched below.
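Variance normalization of a patch can be implemented as below (a minimal sketch; the report does not give its exact normalization code, and the flat-patch guard is my own assumption):

#include <math.h>

/* Normalize a grayscale patch of n pixels in place to zero mean and unit
   variance, reducing contrast variation between examples. */
void variance_normalize(float *patch, int n)
{
    float mean = 0.0f, var = 0.0f;
    int i;
    for (i = 0; i < n; i++) mean += patch[i];
    mean /= n;
    for (i = 0; i < n; i++) {
        float d = patch[i] - mean;
        var += d * d;
    }
    var /= n;
    if (var > 1e-6f) {                       /* guard against flat patches */
        float inv_sigma = 1.0f / sqrtf(var);
        for (i = 0; i < n; i++) patch[i] = (patch[i] - mean) * inv_sigma;
    } else {
        for (i = 0; i < n; i++) patch[i] = 0.0f;
    }
}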

Each classifier in the cascade is trained using the original 5000 positive
examples plus 5000 false positives from the previous stages of the cascade. The
resulting classifier is added to the current cascade to construct a new cascade with a
lower false positive rate. The detection threshold of the newly added classifier is
adjusted so that the false negative rate is very low.

The cascade training algorithm also requires a large set of image pairs to scan
for false positives. These false positives form the negative training examples for the
subsequent stages of the cascade. We use a set of 5000 full image pairs which do not
contain pedestrians for this purpose. Since each full image contains about 50,000
patches of 20x15 pixels, the effective pool of negative patches is larger than 20 million.
The static pedestrian detector is trained in the same way on the same set of images. The
only difference in the training process is the absence of motion information. Instead of
image pairs, training examples consist of static image patches.

CHAPTER 5

RESULTS

This section demonstrates some of the tested image sequences that highlight the effectiveness of the proposed detection system. These experimental results were obtained using the proposed detection algorithm discussed in Chapter 3 and Chapter 4. The proposed algorithm includes Haar-like features and AdaBoost classification. The system can detect moving objects in a 300 x 500 pixel image at 17 frames/s [7][8][9] on a 2.0 GHz PC with 2 GB RAM running Windows XP. The output of the system is represented by a yellow rectangle, meaning that in the search sub-window corresponding to that rectangle the output of the cascade of classifiers was true; in other words, it detected the object. Figures 5-A and 5-B are examples of the output.

Figure 5-A: Output of the pedestrian detection system.

Figure 5-B: Detecting pedestrians in an indoor environment.

5.1 Stage

The final classifier cascade consists of 11 boosted classifiers. The classifiers are sorted by their numbers of weak classifiers, so each one has as many or more weak classifiers than the previous stage. Since the computational complexity of each classifier rises with its number of features, the images are first classified by the classifiers with a small size. This demonstrates the efficiency of the AdaBoost cascade: due to the arrangement of the boosted classifiers according to their size, only a reduced number of samples reach the higher classifiers in the cascade, which are more expensive to compute.

5.2 Number of Classifiers

The results in Figure 5.2.1 show that the number of weak classifiers grows with the size of the dataset. The reason for using datasets of different sizes is to show that an AdaBoost system trained on more data produces a larger number of weak classifiers, and that an AdaBoost system with more weak classifiers produces better performance.

Figure 5.2.1: Number of weak classifiers


5.3 Percentage Hit and Missed

The training and test datasets were obtained by running our pedestrian detector on several hours of video and extracting sequences of images for each tracked pedestrian; the extracted features depend on the dataset used.

Figure 5.3.1 shows the accuracy for each dataset. The 100-example data collection produces the lowest performance, with a Hit percentage of only 11.89% and the highest Missed percentage, 85.11%. At this level of performance, the result obtained by the AdaBoost system is not superior with respect to accuracy, giving the lowest performance of the machine learning configurations tested.

In contrast, with the 4500-example data collection, the classifier achieves 89.36% Hits and 10.64% Missed in detection and tracking respectively, preserving the higher performance of the AdaBoost system.

Figure 5.3.1: Percentage of Hits and Missed

5.4 Percentage False Alarm

We can see in Figure 5.4.1 that all the data collections produced a high percentage of false alarms. The difference in false alarm percentage between the 100-example data collection and the 4000-example data collection is only 5.03%. This value shows that the classifier has a problem.

One problem with the classifier returned by AdaBoost is that the threshold is much too low. This is because the prior probability of a pedestrian is much lower than the prior probability of a non-pedestrian, but the AdaBoost algorithm assumes they have equal priors.

Attempts were made to adjust the threshold automatically based on a holdout set, but not enough negative examples were present to do this accurately. Instead, the threshold was manually adjusted until the false positive rate was qualitatively low enough. While the holdout set had a dismal detection rate, the performance on actual images was much better, because many windows overlap each pedestrian, giving the classifier multiple chances to detect a given pedestrian. The thresholds for the cascade were manually chosen so that as many negative examples as possible were rejected while still allowing almost all positive examples through.

Figure 5.4.1: Percentage of False Alarm

5.5 Result in Real Application

In the real application, the detection results fall into three classes: correct detection, correct detection with false detection, and entirely false detection. These three outcomes depend on the performance of the dataset, which has been discussed in sub-chapters 5.2, 5.3, and 5.4. Figures 5.5.1, 5.5.2, and 5.5.3 give an overview of the real-time application of the pedestrian detection system.

Figure 5.5.1: Correct detection

Figure 5.5.2: Correct detection with false alarm

Figure 5.5.3: Entirely False Detection

5.6 Future Work

The boosted sum-of-pixels feature technique introduced by Viola and Jones has many potential uses. One such use would be to introduce it into a particle filter context, where the sum-of-pixels classifier could be used to estimate a likelihood. This would enable an extremely simple parameterization of pixel coordinates, scale, and velocity. It would also increase the robustness of the Viola and Jones algorithm and make it faster, since full image searches would no longer be necessary. The simplicity of the parameterization would also be a huge benefit over more complex contour-based parameterizations. Another possible application of this algorithm is behavior classification. For this to work, however, longer-term motion analysis might be necessary. Perhaps looking at N successive frames instead of 2 would improve classification performance.

CHAPTER 6

CONCLUSION AND RECOMMENDATION

I have presented an approach for object detection which minimizes computation time
while achieving high detection accuracy. Preliminary experiments, which will be
described elsewhere, show that highly efficient detectors for other objects, such as
pedestrians, can also be constructed in this way.

This paper brings together new algorithms, representations, and insights which are quite generic and may well have broader application in computer vision and image processing. The first contribution is a new technique for computing a rich set of image features using the integral image. In order to achieve true scale invariance, almost all object detection systems must operate on multiple image scales.

The integral image, by eliminating the need to compute a multi-scale image pyramid, significantly reduces the initial image processing required for object detection. In the domain of pedestrian detection the advantage is quite dramatic: using the integral image, pedestrian detection is completed before an image pyramid can even be computed. While the integral image should also have immediate use for other systems which have used Haar-like features (such as Papageorgiou et al. [12]), it can foreseeably have an impact on any task where Haar-like features may be of value. Initial experiments have shown that a similar feature set is also effective for the task of parameter estimation, where the position of a pedestrian or the pose of a pedestrian is determined.

The second contribution of this paper is a technique for feature selection based on AdaBoost. An aggressive and effective technique for feature selection will have an impact on a wide variety of learning tasks. Given an effective tool for feature selection, the system designer is free to define a very large and very complex set of features as input for the learning process. The resulting classifier is nevertheless computationally efficient, since only a small number of features need to be evaluated at run time. Frequently the resulting classifier is also quite simple; within a large set of complex features it is more likely that a few critical features can be found which capture the structure of the classification problem in a straightforward fashion.

The third contribution of this paper is a technique for constructing a cascade of classifiers which radically reduces computation time while improving detection accuracy. Early stages of the cascade are designed to reject a majority of the image in order to focus subsequent processing on promising regions. One key point is that the cascade presented is quite simple and homogeneous in structure. Previous approaches to attentive filtering, such as that of Itti et al., propose a more complex and heterogeneous mechanism for filtering [13]. Similarly, Amit and Geman propose a hierarchical structure for detection in which the stages are quite different in structure and processing [14]. A homogeneous system, besides being easy to implement and understand, has the advantage that simple tradeoffs can be made between processing time and detection performance.

Finally, this paper presents a set of detailed experiments on a difficult pedestrian detection dataset which has been widely studied. This dataset includes pedestrians under a very wide range of conditions, including variation in illumination, scale, pose, and camera. Experiments on such a large and complex dataset are difficult and time consuming. Nevertheless, systems which work under these conditions are unlikely to be brittle or limited to a single set of conditions. More importantly, conclusions drawn from this dataset are unlikely to be experimental artifacts.

REFERENCES

[1] Syed Abdul Rahman. PhD Thesis: "Moving Object Feature Detector & Extractor Using a Novel Hybrid Technique". University of Bradford, Bradford, UK, 1997.

[2] Wan Ayub Bin Wan Ahmad. Master Thesis: "Menjejaki Objek Yang Bergerak Dalam Satu Jujukan Imej" [Tracking a Moving Object in an Image Sequence]. Universiti Teknologi Malaysia, Skudai, Malaysia, 2002.

[3] Yeoh Phaik Yong. Master Draft Thesis: "Integration of Projection Histograms for Real Time Tracking of Moving Objects". Universiti Teknologi Malaysia, Skudai, Malaysia, 2003.

[4] Bernd Heisele, W.Ritter and U.Krebel, “Tracking Non-Rigid, Moving Objects
Based on Color Cluster Flow”. Proc. Computer Vision and Pattern Recognition,
pages 253-257, San Juan, 1997

[5] Bernd Heisele, “Motion-Based Object Detection and Tracking in Color Image
Sequences”. Fourth Asian Conference on Computer Vision, pages 1028-1033,
Taipei, 2000

[6] B. Heisele, T. Serre, S. Mukherjee and T. Poggio. “Feature Reduction and


Hierarchy of Classifiers for Fast Object Detection in Video Images”.
Proceedings of 2001 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR 2001), Kauai, Hawaii, Vol. 2, 18-24, December
2001.

[7] Yoav Rosenberg and Michael Werman, “Real-Time Object Tracking from a
Moving Video Camera: A Software Approach on a PC”. 4th IEEE Workshop
on Applications of Computer Vision (WACV'98), New Jersey, 1998

[8] Paul Viola and Michael Jones. "Rapid Object Detection using a Boosted Cascade of Simple Features". Accepted Conference on Computer Vision and Pattern Recognition, 2001.

[9] Goncalo Monteiro, Paulo Peixoto, and Urbano Nunes. "Vision-Based Pedestrian Detection using Haar-like Features". Institute of Systems and Robotics, University of Coimbra, Polo II, Portugal, 2002.

[10] Paul Viola and Michael Jones. "Robust Real-Time Object Detection". Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing and Sampling, Canada, July 2001.

[11] Haar A. Zur Theorie der orthogonalen Funktionensysteme, Mathematische


Annalen, 69, pp 331-371, 1910.

[12] Papageorgiou, Oren and Poggio. "A General Framework for Object Detection". International Conference on Computer Vision, 1998.

[13] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for
rapid scene analysis. IEEE Patt. Anal. Mach. Intell., 20(11):1254–1259,
November 1998.

[14] Y. Amit, D. Geman, and K. Wilder. Joint induction of shape features and tree
classifiers,1997.

[15] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.

[16] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. An extended abstract appeared in Computational Learning Theory: Second European Conference, EuroCOLT '95, pages 23–37, Springer-Verlag, 1995.

[17] Johannes Fürnkranz and Gerhard Widmer. Incremental reduced error pruning. In Machine Learning: Proceedings of the Eleventh International Conference, pages 70–77, 1994.

[18] Geoffrey W. Gates. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, pages 431–433, 1972.

[19] Peter E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515–516, May 1968.

[20] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–91, 1993.

[21] Jeffrey C. Jackson and Mark W. Craven. Learning sparse perceptrons. In Advances in Neural Information Processing Systems 8, 1996.

[22] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, 1996.

[23] Eun Bae Kong and Thomas G. Dietterich. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning, pages 313–321, 1995.

[24] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[25] J. Ross Quinlan. Bagging, boosting, and C4.5. In Proceedings, Fourteenth National Conference on Artificial Intelligence, 1996.

[26] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.

[27] Patrice Simard, Yann Le Cun, and John Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems, volume 5, pages 50–58, 1993.

APPENDIX

A-1: Coding

#ifndef _EiC
#include "cv.h"
#include "highgui.h"
//#include "highguid.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <math.h>
#include <float.h>
#include <limits.h>
#include <time.h>
#include <ctype.h>
#endif

#ifdef _EiC
#define WIN32
#endif

static CvMemStorage* storage = 0;


static CvHaarClassifierCascade* cascade = 0;

void detect_and_draw( IplImage* image );

const char* cascade_name =


"haarcascade_frontalface_alt.xml";
/* "haarcascade_profileface.xml";*/

int main( int argc, char** argv )


{
CvCapture* capture = 0;
IplImage *frame, *frame_copy = 0;
int optlen = strlen("--cascade=");
const char* input_name;

if( argc > 1 && strncmp( argv[1], "--cascade=", optlen ) == 0 )


{
cascade_name = argv[1] + optlen;
input_name = argc > 2 ? argv[2] : 0;
}
else
{
fprintf( stderr,
"Usage: facedetect --cascade=\"<cascade_path>\" [filename|camera_index]\n" );

return -1;
/*input_name = argc > 1 ? argv[1] : 0;*/
}

// Only for XML file only


//cascade = (CvHaarClassifierCascade*)cvLoad( cascade_name, 0, 0, 0 );
cascade = cvLoadHaarClassifierCascade( cascade_name,
cvSize(/*18*/18,/*36*/36) ); // SIZE!!!!!!!!

if( !cascade )
{
fprintf( stderr, "ERROR: Could not load classifier cascade\n" );
return -1;
}
storage = cvCreateMemStorage(0);

if( !input_name || (isdigit(input_name[0]) && input_name[1] == '\0') )


capture = cvCaptureFromCAM( !input_name ? 0 : input_name[0] - '0' );
else
capture = cvCaptureFromAVI( input_name );

cvNamedWindow( "result", 1 );

if( capture )
{
for(;;)
{
if( !cvGrabFrame( capture ))
break;
frame = cvRetrieveFrame( capture );
if( !frame )
break;
if( !frame_copy )
frame_copy = cvCreateImage( cvSize(frame->width,frame->height),
IPL_DEPTH_8U, frame->nChannels );
if( frame->origin == IPL_ORIGIN_TL )
cvCopy( frame, frame_copy, 0 );
else
cvFlip( frame, frame_copy, 0 );

detect_and_draw( frame_copy );

if( cvWaitKey( 10 ) >= 0 )


break;
}

cvReleaseImage( &frame_copy );
cvReleaseCapture( &capture );
}
else
{
const char* filename = input_name ? input_name : (char*)"lena.jpg";
IplImage* image = cvLoadImage( filename, 1 );

if( image )
{
detect_and_draw( image );
cvWaitKey(0);
cvReleaseImage( &image );
}
else
{
/* assume it is a text file containing the
list of the image filenames to be processed - one per line */
FILE* f = fopen( filename, "rt" );
if( f )
{
char buf[1000+1];
while( fgets( buf, 1000, f ) )
{
int len = (int)strlen(buf);
while( len > 0 && isspace(buf[len-1]) )
len--;
buf[len] = '\0';
image = cvLoadImage( buf, 1 );
if( image )
{
detect_and_draw( image );
cvWaitKey(0);
cvReleaseImage( &image );
}
}
fclose(f);
}
}
}

cvDestroyWindow("result");

return 0;
}

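/* Scan the loaded cascade over the input image and draw a yellow rectangle
   around each detection; the annotated image is shown in the "result" window. */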
void detect_and_draw( IplImage* img )
{
int scale = 1;
IplImage* temp = cvCreateImage( cvSize(img->width/scale,img->height/scale), 8,
3 );
CvPoint pt1, pt2;
int i;

//cvPyrDown( img, temp, CV_GAUSSIAN_5x5 );


cvClearMemStorage( storage );

if( cascade )
{
//CvSeq* faces = cvHaarDetectObjects( img, cascade, storage,
// 1.1, 2, CV_HAAR_DO_CANNY_PRUNING,
// cvSize(40, 40) );
CvSeq* faces = cvHaarDetectObjects( img, cascade, storage,
1.1, 2, CV_HAAR_DO_CANNY_PRUNING,
cvSize(40, 40) );

for( i = 0; i < (faces ? faces->total : 0); i++ )


{
CvRect* r = (CvRect*)cvGetSeqElem( faces, i );
pt1.x = r->x*scale;
pt2.x = (r->x+r->width)*scale;
pt1.y = r->y*scale;
pt2.y = (r->y+r->height)*scale;
cvRectangle( img, pt1, pt2, CV_RGB(255,255,0), 3, 8, 0 );
}
}

cvShowImage( "result", img );


cvReleaseImage( &temp );
}

#ifdef _EiC
main(1,"facedetect.c");
#endif
