
Hand Detection and Tracking in an Active Vision System

Yuliang Zhu

A thesis submitted to the Faculty of Graduate Studies in partial fulfilment of the requirements for the degree of

Master of Science

Graduate Program in Computer Science, York University, North York, Ontario, June 2003

Copyright by Yuliang Zhu 2003

Hand Detection and Tracking in an Active Vision System

Approved by Supervising Committee:

Hand Detection and Tracking in an Active Vision System

Yuliang Zhu, M.Sc. York University, 2003

Supervisor: Prof. John Tsotsos

As the impact of modern computer systems on everyday life increases, human-computer interaction (HCI) has become more and more important. In fact, as computing, communication, and display technologies progress, existing HCI techniques, such as mice and keyboards, limit the speed and naturalness of our interaction and may become a bottleneck in the effective usage of computers. A potential avenue for natural interaction is the use of human gesture and gaze. One domain of application is video-conferencing. In most current teleconferencing or distance learning systems, the camera is fixed or is controlled by an operator. However, in natural communication between people, gesture, facial expression and body language play important roles. We use a number of methods to direct the visual attention of those with whom we interact. One very common tool is to point with a finger to items of interest. Motivated by the above ideas, this thesis presents a visual hand tracker, which detects and tracks a hand in a pointing gesture by using the CONDENSATION algorithm

applied to image sequences. The background may be highly cluttered, and the stereo cameras move actively. By utilizing the parameters of the camera system, the 3D orientation of the hand is calculated using epipolar geometry. The tracker estimates the translation, rotation and scaling of the hand contour in the two image sequences captured from a pair of active cameras mounted on a robotic head. It achieves a best tracking accuracy of 12 dB measured by signal-to-noise ratio. The average error in the estimation of rotation in the vertical plane is less than 7 degrees. Due to errors in the calibration of the active stereo cameras, the resolution in depth is about 10 cm at a distance of 1 meter.

Acknowledgments
The author wishes to thank Professor John Tsotsos, who directed and supported all the research work on my thesis, and Professor Richard Wildes, Professor Minas Spetsakis and Professor Doug Crawford, who gave me helpful advice. I really appreciate the support from all the lab members, including Kosta Derpanis, Yongjian Ye, Erich Leung, Jack Gryns, Kunhao Zhou, Markus Latzel, Marc Pomplun, Albert Rothenstein, Bill Kapralos, Yueju Liu and Andrei Rotenstein. Special thanks to IRIS, NSERC and PRECARN for funding this project. A huge thanks to my friends from the online outdoor club, who made my graduate school life much more enjoyable. This thesis is dedicated to my parents. Without their support and encouragement, I would not have been able to make it this far.

Yuliang Zhu

York University June 2003


Contents
Abstract
Acknowledgments
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Motivation
1.2 Goals
1.3 Contributions
1.4 Thesis Outline

Chapter 2 Review of Related Work
2.1 Detecting Motion with an Active Camera
2.2 Skin Blob Tracking
2.3 Active Contour
2.4 Mean Shift
2.5 Optical Flow
2.6 Kalman Filter
2.6.1 Standard Kalman Filter
2.6.2 Extended Kalman Filter
2.7 CONDENSATION Algorithm
2.7.1 Probability distribution
2.7.2 Stochastic Dynamics
2.7.3 Measurement Model
2.7.4 Propagation of state density
2.7.5 Factored Sampling
2.8 Summary

Chapter 3 CONDENSATION Hand Tracker
3.1 Initialization
3.2 Hand Detection
3.3 Shape Representation
3.4 State Space
3.5 Dynamic Model
3.6 Measurement Model
3.7 3D Orientation of the Hand
3.7.1 Refining the Result
3.7.2 Epipolar geometry
3.7.3 Correspondence
3.7.4 3D Orientation
3.8 System Architecture and Implementation

Chapter 4 Experiments and Discussion
4.1 Accuracy of Tracking
4.2 Computational Complexity
4.3 Experimental Results of Tracking on Real Images
4.3.1 Performance of Tracker with Low Cluttered Background
4.3.2 Performance of Tracker with Lightly Cluttered Background
4.3.3 Performance of Tracker with Highly Cluttered Background
4.4 Experiments on 3D orientation
4.5 Summary

Chapter 5 Discussion and Future Work
5.1 Future Work

List of Figures
1.1 The binocular head (TRISH-2) in GestureCAM
1.2 The degrees of freedom of the camera system
1.3 System diagram
2.1 The process of the CONDENSATION algorithm
3.1 Skin color in RGB space
3.2 Skin color distribution in normalized RG space
3.3 Skin color in HSV space
3.4 Skin color distribution in HS space
3.5 Raw image taken from camera in RGB
3.6 Image filtered by skin color model
3.7 Subtraction of the two frames
3.8 Detection of the skin color edge
3.9 Representation of the hand contour after initialization: the dots represent the control points of the curve, the dashed curve is constructed from the points
3.10 State space parameters
3.11 The distribution of hypotheses in translation (the points on the top and left are samples of the distribution of translation on the x and y axes, evolved from the previous iteration)
3.12 The distribution of hypotheses on rotation (points on the bottom are samples evolved from the previous iteration), when there are no changes in other parameters
3.13 The distribution of the scaling (points on the right are samples evolved from the previous iteration), when there are no changes in other parameters
3.14 The distribution of the state in translation, rotation and scaling, evolved from the previous iteration. The points on the top and left are samples of the distribution of translation on the x and y axes. The points on the bottom are samples of the distribution of rotation. The points on the right are samples of the distribution of scaling
3.15 Distance from the object to each of the cameras can be maintained roughly equal when the cameras fixate on the object
3.16 The normals along the hand contour
3.17 The normals along the hand contour; the arrows show the direction from interior to exterior of the hand shape
3.18 The normals along the contour for measurement of the features
3.19 Measurement line along the hypothetical contour. The shaded part illustrates the real hand region, while the black curve indicates the hypothetical contour which is measured. The dashed line with two arrows measures the nearest feature point to the hypothetical contour. The solid line with one arrow shows the measurement taken from the interior to the exterior portion of the contour
3.20 Epipolar geometry of the camera system
3.21 Finding correspondence along the epipolar line
3.22 View overlapping when cameras verge
3.23 Transformations to the head coordinate system
3.24 3D orientation of the hand
3.25 Tracker Diagram
4.1 Accuracy vs. number of samples: the solid curve shows the accuracy of tracking a hand with no cluttered background (0.03% of the pixels are skin color), the red curve the one with a lightly cluttered background (3.74% of the pixels are skin color), and the thick green curve the one with a highly cluttered background (9.35% of the pixels are skin color); the error bars are the standard deviation of the results of the experiments
4.2 Computational complexity vs. number of samples: the error bars are the standard deviation of the experimental results
4.3 Frame 5
4.4 Frame 15
4.5 Frame 25
4.6 Frame 35
4.7 Frame 45
4.8 Frame 55
4.9 Frame 65
4.10 Frame 30
4.11 Frame 40
4.12 Frame 50
4.13 Frame 60
4.14 Frame 70
4.15 Frame 80
4.16 Frame 90
4.17 Frame 20
4.18 Frame 30
4.19 Frame 40
4.20 Frame 50
4.21 Frame 60
4.22 Frame 70
4.23 Frame 80
4.24 System setup for experiment on 3D orientation
4.25 Real distances vs. estimated distance
4.26 Orientation in xz plane
4.27 The orientation of the arm vertical
4.28 Arm orientation projected in the xz plane. Measurements are taken at 860mm (red), 106mm (black) and 125mm (green), and -45° (blue), -22.5° (cyan), 22.5° (yellow) and 45° (pink). The vertex in each color represents the position of the elbow in each experiment
4.29 A 3D view of the experimental results on tracking shown in Figure 4.28. Measurements are taken at 860mm (red), 106mm (black) and 125mm (green), and -45° (blue), -22.5° (cyan), 22.5° (yellow) and 45° (pink)

List of Tables
4.1 Experimental result on accuracy
4.2 Experimental result on complexity

Chapter 1 Introduction
1.1 Motivation

In order to satisfy the increasing need for groups of people to communicate quickly and efficiently over distance, computer supported cooperative work (CSCW) applications have made significant progress during the past decades. With the development of the Internet, almost everyone can enjoy such technology almost everywhere. One of the important issues in CSCW is the human-computer interface. The primary means of input for Human Computer Interaction (HCI) are keyboards, mice and joysticks. However, as the complexity of applications increases, requirements are emerging that the conventional interfaces cannot satisfy. Importing natural means of human communication and interaction into HCI is one approach to designing easier-to-use, more effective input methods. A trend in HCI enhancement is importing human-body-based communication and interaction methods into the interfaces. Gesture recognition (Gutta et al. [20], Pavlovic et al. [36]), sign language recognition (Starner and Pentland [46], Starner et al. [47]) and emotion recognition (Cowie et al. [12], Rosenblum et al. [40]) are examples of the many research areas that can be utilized in communicating with the computer in a structured manner. Moreover,

with the advancement in processing speed and display technology, more sophisticated interaction methods like immersive virtual reality and telepresence, which are collectively called virtual reality in Benford et al. [3], are emerging. These techniques require a precise estimate of the human body pose, and animation of a copy of the human body at different scales on the display devices. Most human-hand-based HCI techniques require the estimation and tracking of hand pose over time. Hand pose data is analyzed for mainly two purposes: communication, which interprets hand shape and motion as gestures, as in Pavlovic et al. [36]; and manipulation, which interprets hand shape and motion as a manipulation tool, as in Rauterberg et al. [38]. This requires measurement of a set of parameters related to hand kinematics and dynamics to make decisions about the interaction between the hand and virtual objects. Electro-mechanical sensing using gloves provided a first solution to the hand pose estimation and tracking problem. However, this technique has some limitations and is too expensive for most applications. A more recent solution uses non-contact computer vision techniques, e.g., in Rehg and Kanade [39], which corresponds to estimating the pose of the hand using one or more image sequences that capture the hand shape and motion in real time. Another important issue related to the design of the system is the set of constraints applied on the input of the system. One has to apply segmentation to locate the hand and/or fingers, or feature extraction to detect fingertip and/or joint locations in the input images. There are two types of constraints applied on the input: 1) background constraints, 2) foreground constraints. Background constraints refer to the constraints on the environment in which the hand will be tracked. Usually a

static background where no other object is moving is used; another choice is a uniform background, which makes hand localization much easier, e.g., in Oka et al. [35]. Foreground constraints refer to the hand itself. To increase the robustness of feature extraction, Dorner [16] and Dorfmuller-Ulhaas and Schmalstieg [15] use gloves with markers. These constraints make such systems less flexible and less robust in the environments to which they are applied. The most advanced applications of CSCW are trying to make the communication or collaboration more realistic and intelligent. In applications such as video-conferencing, distance learning and telemedicine systems, it has been found very useful for the participants to share pointing and other gestures over shared documents or objects, e.g., the collaborative environment designed by Leung et al. [31]. Normally, current video subsystems within the above systems simply capture a scene from fixed zoom, pan and tilt settings of the camera or cameras regardless of the action of the speaker and audience. Some of them can be adjusted by an operator or participant; however, the functionality is very limited and far from intelligent. For example, in a distance learning system, the lecturer could walk around on the platform, point out some words on the blackboard and so on. In a common classroom, the audience follows the speaker's body and what is being pointed out. Furthermore, people in the audience put up their hand when they have a question, and the lecturer should be able to see who may have a question and select one to respond. Meanwhile all the others know who is asking what question. Without advanced features such as active tracking or zooming based on changing circumstances, even today's high-end teleconferencing systems only provide remote manual camera control or audio-based adjustment. Currently, no system is known to actively respond to visual cues for

attention. GestureCAM was designed to address this challenge.

1.2 Goals

GestureCAM is an active stereovision system that is able to detect and track faces and hands, and interpret gestures in a real-world environment. It acts like an active observer or a virtual cameraman. In an active stereovision system like GestureCAM, tracking a hand in a pointing gesture is not easy because the background may be highly cluttered and the cameras are always active (under the control of a tracking program). Therefore, simple background subtraction algorithms for detecting a moving object are useless during tracking. Additionally, against a highly cluttered background, the states of the moving object are ambiguous and multi-modal, i.e., the distribution of the states can be affected by noise. The dynamic model of the object may be non-linear, and especially when camera motion is considered, the model becomes even more complicated.

1.3 Contributions

This implementation of a hand tracker works in a highly cluttered visual environment (for example, a lab with lots of books, devices and furniture), tracking a hand in a pointing gesture. With the help of depth information, it can compute the 3D direction of the pointing finger, i.e., where the hand is pointing. The hand tracker extracts the translation, rotation and scaling parameters. The 3D orientation of the hand is based

on the tracking result, and is used to guide the cameras in that direction. The system consists of a robotically controlled binocular head called TRISH-2 (Figures 1.1 and 1.2), connected to a dual Pentium II PC platform, which acts as its server. Each color camera mounted on the robot head can be controlled independently. Two video streams are captured by an Imaging Technologies S-Video frame grabber, with a resolution of 512x480 pixels and a color depth of 24 bits. The cameras can be used independently or as a stereo pair. There are 4 mechanical degrees of freedom: head pan, independent eye vergence, and head tilt. The server computer connects directly with the motors and cameras (control part) of the robotic head through motor control cards and a serial port, respectively. The client computer, where the application is running, has two video inputs from the cameras and can send TCP/IP packets to the server to set and get the parameters of the motors and cameras, so that the head is controlled by the application (as in figure 1.3).

Figure 1.1: The binocular head (TRISH-2) in GestureCAM

Figure 1.2: The degrees of freedom of the camera system

Figure 1.3: System diagram

1.4 Thesis Outline

The following chapters will provide more detailed descriptions of the hand tracker:

Chapter 2 provides a review of related work in detection and tracking. Analyses of algorithms such as skin color blob tracking, active contours, mean shift, Kalman filtering and conditional density propagation are presented. The summary gives the reasons why the CONDENSATION algorithm is chosen so that the tracker can achieve the goals introduced in section 1.2.

Chapter 3 presents detailed models and algorithm descriptions of the CONDENSATION tracker, including hand detection, hand shape representation, the dynamic model, the measurement model and the calculation of 3D orientation.

Chapter 4 presents the experimental results of the tracker working in different conditions, such as different lighting, clutter and occlusion. More detailed experiments on the depth and orientation calculations are shown last.

Chapter 5 provides the conclusions and future research work.

Chapter 2 Review of Related Work

Tracking has been studied extensively in the computer vision literature. There are large numbers of applications which apply different techniques to track different targets in different conditions. Features such as color, edges and intensity gradients in each image, and information over multiple consecutive images, can be helpful for tracking individual objects or for performing more general motion segmentation. Building probabilistic models to describe the likely motion and appearance of an object of interest is a promising approach. More details of some of these techniques are given in the following sections.

2.1 Detecting Motion with an Active Camera

When the background is uniform or does not change, detection of a moving object can easily be done by subtracting two frames. In particular, when the initial background is stored, the whole motion of the object can be recovered in a similar way. The problem is that if the object stops moving for a while, this tracking strategy may lose it. In the system presented by Francois and Medioni [18], background pixel values are modelled as multi-dimensional Gaussian distributions in HSV color space. The value observed for each pixel in a new frame is compared to the current

corresponding distribution. The pixels on the moving object in the image are then grouped into connected components. The distribution is updated using the latest observation. The assumption is that the object does not appear in the first frames, which are used for constructing the background distribution. For a moving camera, the whole background changes from frame to frame. The algorithm just described cannot work, because both the object (foreground) and the background change together. There is no reference frame for eliminating the background pixels. Burt et al. [8] applied a dynamic motion analysis technique based on optical flow. At the early level of the analysis, flow vectors are computed between frames. At the intermediate level, a sequence of focal probes examines the motion in different parts of the region. Finally, at the highest level, a global motion model is built and updated continuously. In order to apply this technique to real-time tasks, such as autonomous vehicle navigation, special hardware is needed to compute the motion. A similar framework was proposed by Tao et al. [49], who implemented a dynamic motion layer tracker by modelling and estimating the layer representation of appearance, motion, segmentation and shape using the expectation maximization algorithm over time. In this work, a Gaussian shape prior is chosen to specifically develop a near real-time tracker for vehicle tracking in aerial videos. In Philomin et al. [37], a shape model and the CONDENSATION algorithm were employed to track pedestrians from a moving vehicle. A class of training shapes was represented by a Point Distribution Model. Then, Principal Component Analysis (PCA) was used to analyze the training set and detect the shape during tracking. If the shape of the

object varies significantly, a large set of training contours has to be used, leading to increased computation in the tracking process. When the object moves in a wide area, its appearance changes significantly with respect to a relatively fixed camera. In Yachi et al. [53], the tracker utilized parameters of the camera and related them to contour models of different head appearances. This allowed them to adaptively select the model to deal with the variations in head appearance due to human activities. It achieved an accuracy of about 80% in detecting the head contour using an ellipse model.

2.2 Skin Blob Tracking

Skin color of the hand and face has long been used as a good feature for tracking. Different color spaces have been applied in systems that track and detect faces or hands, such as RGB (Jones and Rehg [26]), normalized RGB (Yang et al. [54]), HSV (Herpers et al. [21, 22], Sandeep and Rajagopalan [41], Sigal et al. [42]), YUV (Kampmann [29]) and so on. Almost all of them are transformed from raw RGB space to obtain robustness against changes in lighting conditions. Jones and Rehg [26] defined a generic color distribution model in RGB space for skin and non-skin classes by using sets of photos from the web. In Herpers et al. [21, 22], skin color blobs are detected by a method using scan lines and a Bayesian algorithm. The computations of the skin color detection module are based on HSV color values converted from the 24-bit RGB video signal. A radial scan line detection algorithm was developed for real-time tracking. It scans outward from the center of a region of


interest along concentric circles with a particular step width. If the arc between two radial scan points is longer than a particular threshold, a new radial scan line positioned between them and with intermediate orientation is introduced. The insertion of a new scan line is iterated whenever the distance between two neighboring scan points is above the threshold. Then connected blobs are merged, while blobs that are too small are eliminated. In the system of Yang et al. [54], an adaptation technique estimated the new parameters for the mean and covariance of the multivariate Gaussian skin color distribution by using a linear combination of previous parameters. The algorithm was implemented in normalized RGB color space. It can track a person's face at 30 frames/second while the person walks, jumps, sits and rises.
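The general shape of such an adaptation step can be sketched as follows. This is only an illustration of the idea, not the exact weighting scheme of Yang et al. [54]; the per-frame parameter estimates and their weights are assumed to be supplied by the tracker.

```python
import numpy as np

def estimate_frame_params(r, g, mask):
    """Mean and covariance of normalized (r, g) over the skin pixels of one frame."""
    samples = np.stack([r[mask], g[mask]], axis=0)      # 2 x N sample matrix
    return samples.mean(axis=1), np.cov(samples)

def adapt_skin_model(history, weights):
    """Combine the (mean, covariance) estimates of the last few frames with a
    linear combination, so the model follows gradual changes in lighting."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                      # normalize the combination weights
    mean = sum(wi * m for wi, (m, _) in zip(w, history)) # adapted mean in (r, g)
    cov = sum(wi * c for wi, (_, c) in zip(w, history))  # adapted covariance
    return mean, cov
```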

2.3 Active Contour

Active contours or "snakes" have been used for tracking deformable contours and for segmenting rigid or non-rigid objects. Tracking is done by minimizing snake energies such as internal energy, image energy and external energy. Usually, the snake is initialized near the object of interest and attracted toward the contour of the object by forces depending on the intensity gradient. The energy of a contour c = c(s) is given by

E = \int \left( \alpha(s) E_{cont} + \beta(s) E_{curv} + \gamma(s) E_{image} \right) ds,

where \alpha(s), \beta(s), \gamma(s) control the relative influence of the corresponding energy terms. E_{cont}, E_{curv} and E_{image} are the energy terms that encourage continuity, smoothness and edge influence, respectively.
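For a contour sampled at N discrete points, these terms are commonly approximated as in the sketch below. It is a generic discretization for illustration only, not the exact formulation of the trackers discussed in this section; the constant weights and the use of a precomputed gradient-magnitude image are assumptions.

```python
import numpy as np

def snake_energy(pts, grad_mag, alpha=1.0, beta=1.0, gamma=1.0):
    """Discrete approximation of E = integral(alpha*E_cont + beta*E_curv + gamma*E_image) ds
    for a closed contour given as an (N, 2) array of (x, y) points.
    grad_mag is an image of gradient magnitudes; strong edges lower the energy.
    The contour points are assumed to lie inside the image."""
    prev_pts = np.roll(pts, 1, axis=0)
    next_pts = np.roll(pts, -1, axis=0)
    d = np.linalg.norm(pts - prev_pts, axis=1)                           # spacing between neighbours
    e_cont = (d - d.mean()) ** 2                                         # continuity: keep spacing even
    e_curv = np.linalg.norm(prev_pts - 2 * pts + next_pts, axis=1) ** 2  # smoothness of the curve
    xi = pts[:, 0].astype(int)
    yi = pts[:, 1].astype(int)
    e_image = -grad_mag[yi, xi]                                          # edge attraction term
    return np.sum(alpha * e_cont + beta * e_curv + gamma * e_image)
```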


The classic snake algorithm does not operate well if there are large differences in the position or form of the object between successive images. The snake may fall into local minima while moving to new positions. In Kim and Lee [30], the tracker utilizes image flow, which gives rough information on the direction and magnitude of the moving objects through a correlation process between two images, to make the snake jump to the new location. The success of tracking is largely based on the calculation of the image flow. Unfortunately, this can become complicated in active vision, i.e., in situations with moving cameras. The whole image is moving, including both the foreground and the background. It is hard to distinguish the motion of the object of interest when it moves at a speed similar to that of the camera. Jehan-Besson et al. [25] proposed a general framework for region-based active contours to segment moving objects with a moving camera. The image domain is made up of two parts: the foreground, containing the objects to segment, and the background. The boundary between the two domains is a curve, that is, the contour of the object. For each of the parts there is a descriptor of the energy. Normally, the descriptor is proportional to the difference of statistical fits to object and background. The minimization of the total energy gives the boundary, i.e., the contour of the object. The camera motion is modelled by 6 parameters in rotation and translation. Based on such a model, each pixel in the previous frame is projected to its new position in the current frame in order to compensate for the camera motion.


2.4 Mean Shift

The mean shift algorithm is a simple iterative procedure that shifts each data point to the average of the data points in its neighborhood. The data could be visual features of the object such as color, texture and gradient. Their statistical distributions characterize the object of interest; e.g., in Comaniciu et al. [11] the spatial gradient of the statistical measurement is exploited. In Bradski [7], a modified mean shift algorithm named Continuously Adaptive Mean Shift (CamShift) is applied, which finds the center and size of the colored object on a color probability image of the frame. The probability image is created via a histogram model of skin color or other specific colors. The tracker moves and resizes the search window until its center converges with the center of mass. The mean shift algorithm depends on the lower-level feature detections. A highly cluttered background may distract the tracker from the object. Without describing the moving objects by states and a mechanism of prediction and correction, the tracker cannot distinguish occluded objects. In Comaniciu and Ramesh [10], a combination of mean shift and a Kalman filter does a better job. In my case, a contour or shape, which can give the orientation of the object, is required instead of a box surrounding the object.
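As an illustration of the general technique (not the original implementation in Bradski [7]), a skin-colour CamShift loop in OpenCV looks roughly like the following sketch; the hue-histogram back-projection plays the role of the colour probability image, and `init_window` is assumed to be supplied by a detector.

```python
import cv2

def camshift_track(frames, init_window):
    """Track a skin-coloured blob with CamShift. `frames` yields BGR images;
    `init_window` is (x, y, w, h) around the object in the first frame."""
    first = next(frames)
    hsv = cv2.cvtColor(first, cv2.COLOR_BGR2HSV)
    x, y, w, h = init_window
    roi = hsv[y:y + h, x:x + w]
    hist = cv2.calcHist([roi], [0], None, [180], [0, 180])       # hue histogram of the object
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    window = init_window
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)  # colour probability image
        box, window = cv2.CamShift(prob, window, criteria)         # shift/resize the search window
        yield box, window
```

Note that the result is only a rotated box around the blob, which is why a contour-based representation is preferred in this thesis.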


2.5 Optical Flow

Optical flow has long been used as a way both to estimate dense motion fields over the entire visible region of an image sequence (Beauchemin and Barron [2]), and to segment areas of consistent flow into discrete objects (Kalafatic et al. [28]). In order to explore how the optical flow may be estimated, the image sequence is modelled by an intensity function, I(x, y, t), which varies continuously with position (x, y) and time t. After expanding the intensity function in a Taylor series and ignoring the higher order terms, we obtain

I(x + dx, y + dy, t + dt) = I(x, y, t) + \frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt.

If the brightness value at (x + dx, y + dy, t + dt) is really a translation of the brightness value at (x, y, t), then I(x + dx, y + dy, t + dt) = I(x, y, t). So it must follow that

\frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt = 0,

i.e.,

-\frac{\partial I}{\partial t} = \frac{\partial I}{\partial x} u + \frac{\partial I}{\partial y} v,   (2.1)

where u = dx/dt and v = dy/dt are the x and y components of the optical flow. Equation (2.1) is known as the optical flow constraint equation (also called the image brightness constancy equation) and may be written as

-I_t = \nabla I \cdot \mathbf{v},

where I_t = \partial I / \partial t, \nabla I = (I_x, I_y) = (\partial I / \partial x, \partial I / \partial y), and \mathbf{v} = (u, v) is the optical flow vector with components (u, v).

The term I_t is the rate of change of the grey level image function with respect to time for a given image point (i.e., it is the temporal image gradient). The spatial and temporal gradients, \nabla I and I_t, may all be measured from the images in an image sequence for a particular pixel. The equation implies that the time rate of change of the intensity of a point in the image is the spatial rate of change in the intensity multiplied by the velocity with which the brightness point is moving in the image plane. However, as there is only one equation and two unknowns (the two components of optical flow), it is not possible to estimate both components of the optical flow from the local spatial and temporal derivatives. This is referred to as the aperture problem and may be understood by considering the edge of an object moving below a small aperture. The velocity, \mathbf{v}, of the edge has a component along the edge, \mathbf{v}_\parallel, and a component perpendicular to the edge, \mathbf{v}_\perp. But only the component of the velocity in the direction perpendicular to the edge (parallel to the spatial gradient) can be observed and estimated. This is the problem that arises when using only the local spatial and temporal derivatives to estimate the optical flow. Only the component of optical flow in the direction of the maximum spatial derivative, \mathbf{v}_\perp, may be estimated and, from the fundamental flow constraint equation, it can be shown to be given by:

\mathbf{v}_\perp = \frac{-I_t \, \nabla I}{\|\nabla I\|^2}.   (2.2)
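Equation (2.2) can be evaluated directly from image derivatives; the following sketch is illustrative only, and the small constant `eps` is an assumption added to avoid division by zero where the gradient vanishes.

```python
import numpy as np

def normal_flow(prev, curr, eps=1e-6):
    """Estimate the optical-flow component along the spatial gradient
    (equation 2.2): v_perp = -I_t * grad(I) / |grad(I)|^2."""
    I = curr.astype(float)
    Iy, Ix = np.gradient(I)              # spatial derivatives dI/dy, dI/dx
    It = I - prev.astype(float)          # temporal derivative between the two frames
    mag2 = Ix ** 2 + Iy ** 2 + eps       # squared gradient magnitude
    u = -It * Ix / mag2                  # x component of the normal flow
    v = -It * Iy / mag2                  # y component of the normal flow
    return u, v
```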

In order to solve the optical flow constraint equation it is necessary either to apply regularization, assuming the motion changes smoothly over an image region, or to parameterize the motion in an entire region using a low-dimensional model, for example an affine model. However, these assumptions may not be satisfied in an active vision system with moving cameras. When the cameras follow the object, they may cause large global motion. Smith [43, 44] and Smith and Brady [45] built a system based on feature-based image

motion estimation. It tracks vehicles in video taken from a moving platform. 2D features such as corners and edges are extracted to compute the optical flow. The clusters of flow vectors which are spatially and temporally significant provide the object motion information. The tracker was implemented on a PowerPC-based image processing system, which makes real-time performance possible. Based on an analysis of idealized gesture movements, Derpanis [14] modelled the optical flow parametrically as an affine transformation. Using a robust hierarchical motion estimator to capture the unknown parameters, it can handle motion larger than one pixel and avoid local minima. Skin colour was used to restrict the region of support to image data that arises from the hand.

2.6 Kalman Filter

The behavior of a dynamic system can be described by the evolution of a set of variables, often called state variables. In practice, the individual state variables of a dynamic system cannot be determined exactly by direct measurements; instead, we usually find that the measurements that we make are functions of the state variables and that these measurements are corrupted by random noise. The system itself may also be subjected to random disturbances. It is then required to estimate the state variables from the noisy observations. If we denote the state vector by x_t, the measurement vector by z_t and an optional control input by u_t, a dynamic system (in discrete-time form) can be described by

x_t = f(x_{t-1}, u_t, w_t)   (2.3)

with a measurement that is z_t = h(x_t, v_t), where the random variables w_t and v_t represent the process and measurement noise respectively. They are assumed to be independent, white, and Gaussian distributed: p(w) \sim N(0, Q), p(v) \sim N(0, R). In practice, the system noise covariance Q and measurement noise covariance R matrices are usually determined on the basis of experience and intuition. In general, these noise levels are determined independently. We assume then that there is no correlation between the noise process of the system and that of the observation, that is, E[w_t v_t^T] = 0.

2.6.1 Standard Kalman Filter

In Welch and Bishop [51], if the process is linear, it can be described as: xt = Axt + But + wt1 , and the measurement is given as zt = Hxt + vt . A priori and a posteriori estimate errors are dened as e = xt x , and et = xt xt , t t where x is the a priori state estimate given the knowledge of the process prior to t time t, and xt is the a posteriori state estimate at time t, given measurement zt . The a priori estimate error covariance is then Pt = E[e eT ] and the a posteriori t t estimate error covariance is Pt = E[et eT ]. The a posteriori error e can be calculated t t by the function e = K(zt H x ), i.e., xt = x + K(zt H x ), where K is the gain t t t t or lending factor matrix that minimizes the a posteriori error covariance Pt . The dierence (zt H x ) is called the measurement innovation, or the residual. One of t the popular forms of K is given by Kt = Pt H T (HPt H T + R)1 . Specically, lim Kt = H 1 , lim Kt = 0.
R 0 Pt 0

17

The Kalman filter estimates the state of a discrete-time controlled process by using a form of recursive solution: the filter estimates the process state at some time and then obtains feedback in the form of measurements. The equations for the Kalman filter fall into two groups: the time update equations (predictor),

\hat{x}_t^- = A \hat{x}_{t-1} + B u_t, \qquad P_t^- = A P_{t-1} A^T + Q;

and the measurement update equations (corrector),

K_t = P_t^- H^T (H P_t^- H^T + R)^{-1}, \qquad \hat{x}_t = \hat{x}_t^- + K_t (z_t - H \hat{x}_t^-), \qquad P_t = (I - K_t H) P_t^-.

The Kalman filter gives a linear, unbiased, and minimum error variance algorithm to optimally estimate the unknown state of a linear dynamic system from noisy data taken at discrete real-time intervals. Thus, the gain matrix is proportional to the uncertainty in the estimate and inversely proportional to that in the measurement. If the measurement is very uncertain and the state estimate is relatively precise, then the residual is dominated mainly by the measurement noise and little change in the state estimate should be made. On the other hand, if the uncertainty in the measurement is small and that in the state estimate is big, then the residual contains considerable information about errors in the state estimate and a strong correction should be made to the state estimate. Based on Kalman filtering, a finger and lip tracking system was developed by Blake and Isard [5] to estimate coefficients of a B-spline. Measurements are made to find the minimum distance to move the spline so that it lies on a maximal gradient portion of the image. These measurements are used as the next input to the Kalman filter. In order to be robust to clutter, the parameters of the motion model are trained from examples. In their experiments, almost all the motions are oscillatory rigid motions. The background clutter affects the tracking result significantly. A Kalman filter was also applied in the system of Martin et al. [33], where hand shape and position are tracked with the cue of skin color and basic hand geometrical features. The resulting system provides robust and precise tracking which operates continuously at approximately 5 frames/second on a 150 MHz Silicon Graphics Indy.
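The time-update and measurement-update equations above translate directly into code. The following is a minimal sketch of one predict/correct cycle, assuming no control input (B = 0):

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict/update cycle of the standard Kalman filter.
    x, P : previous posterior state estimate and covariance
    z    : current measurement
    A, H : process and measurement matrices
    Q, R : process and measurement noise covariances"""
    # Time update (predictor)
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Measurement update (corrector)
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain (blending factor)
    x_new = x_pred + K @ (z - H @ x_pred)       # correct the prediction with the residual
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```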

2.6.2 Extended Kalman Filter

If the process function (2.3) is not linear, or a linear relationship between x and z cannot be written down, the so-called Extended Kalman Filter (EKF) can be applied (Azoz et al. [1], Dellaert and Thorpe [13]). The EKF approach is to apply the standard Kalman filter (for linear systems) to nonlinear systems with additive white noise by continually updating a linearization around the previous state estimate, starting with an initial guess. In other words, we only consider a linear Taylor series approximation of the system function at the previous state estimate and of the observation function at the corresponding predicted position. This approach gives a simple and efficient algorithm to handle a nonlinear model. However, convergence to a reasonable estimate may not be obtained if the initial guess is poor or if the disturbances are so large that the linearization is inadequate to describe the system. The EKF thus gives only an approximation to optimal non-linear estimation. It has a fundamental flaw: the distributions of the random variables are no longer normal after undergoing a non-linear transformation, so large errors may be introduced into the posterior mean and covariance of the transformed Gaussian.


The UKF, or Unscented Kalman Filter, which was used in Stenger et al. [48], applies the unscented transformation proposed by Julier and Uhlmann [27] to approximate a Gaussian random variable, and is accurate to at least second order of the distribution. The tracker estimated the pose of a 3D hand model in front of a dark background at a frame rate of 3 frames/second. The uni-modal Gaussian distribution assumption in Kalman filters, including the EKF and UKF, may be a great disadvantage in some tracking problems, for example, multi-modal object tracking. The computational complexity of a Kalman filter increases sharply when the number of tracked objects increases. In active vision systems, the motion of both object and camera makes the distribution of the state more complicated and unpredictable.

2.7 CONDENSATION Algorithm

The Conditional Density Propagation algorithm presented in Isard [23] and Isard and Blake [24] is a Bayesian filtering method that uses a factored-sampling-based density representation. It samples and propagates the posterior density of the states over time. There is no assumption on the state probability density function; in other words, it can work with arbitrary probability density functions. For example, based on the CONDENSATION algorithm, the system of Meier and Ade [34] tracks multiple objects with multiple hypotheses in range images. In Isard and Blake [24], an importance sampling function was introduced as an extension of the standard CONDENSATION algorithm, to improve the efficiency of the


factored sampling. In order to robustly track sudden movement, the process noise of the motion model could be very high, so that the predicted clusters in state space become larger. Therefore, to populate these larger clusters with enough samples to permit effective tracking, the sample set size must be increased, which also increases the computational complexity. Importance sampling applies when auxiliary knowledge is available in the form of an importance function describing which areas of the state space contain most information about the posterior. In the sampling stage, two given probabilities determine the choice among standard sampling, importance sampling and reinitialization. The hand blobs in Isard and Blake [24] were detected by using a Gaussian prior in RGB color space. The importance function, which was a mixture of Gaussians, gave more weight to the predictions that were near the center of the hand blob. A second order auto-regressive process was used for the motion model.

2.7.1 Probability distribution

At time t, an object is characterized by a state vector X_t. Its history is \mathcal{X}_t = \{X_1, ..., X_t\}. Similarly, the set of features in the image at time t is denoted by Z_t, with history \mathcal{Z}_t = \{Z_1, ..., Z_t\}. There is no assumption on the density distribution, i.e., p(X_t) can be a non-Gaussian or multi-modal function, which cannot be described in closed form.


2.7.2 Stochastic Dynamics

The object dynamics are assumed to form a temporal Markov chain, i.e., p(X_t \mid \mathcal{X}_{t-1}) = p(X_t \mid X_{t-1}), t > 1, which means that the current state X_t depends only on the immediately preceding state X_{t-1} and not on any state prior to t-1. The dynamics of the evolution are described by a stochastic difference equation, for example, X_t = A X_{t-1} + B W_t. The deterministic part of the equation, defined by A, models the system knowledge, while the stochastic part, defined by B W_t, models the uncertainties caused by factors such as noise.

2.7.3 Measurement Model

The observations of the features Z_t are assumed to be independent, both mutually and with respect to the dynamics. This is expressed as

p(\mathcal{Z}_{t-1}, X_t \mid \mathcal{X}_{t-1}) = p(X_t \mid \mathcal{X}_{t-1}) \prod_{i=1}^{t-1} p(Z_i \mid X_i).   (2.4)

After integrating over X_t, (2.4) becomes p(\mathcal{Z}_{t-1} \mid \mathcal{X}_{t-1}) = \prod_{i=1}^{t-1} p(Z_i \mid X_i), so that

p(\mathcal{Z}_t \mid \mathcal{X}_t) = \prod_{i=1}^{t} p(Z_i \mid X_i),   (2.5)

p(\mathcal{Z}_t \mid \mathcal{X}_t) = p(Z_t, \mathcal{Z}_{t-1} \mid \mathcal{X}_t) = p(Z_t \mid \mathcal{Z}_{t-1}, \mathcal{X}_t)\, p(\mathcal{Z}_{t-1} \mid \mathcal{X}_t).   (2.6)

Integrating over Z_t on both sides of equation (2.5), we get

\int_{Z_t} p(\mathcal{Z}_t \mid \mathcal{X}_t) = \int_{Z_t} \prod_{i=1}^{t} p(Z_i \mid X_i) = \left( \int_{Z_t} p(Z_t \mid X_t) \right) \prod_{i=1}^{t-1} p(Z_i \mid X_i).

Further, we get

p(\mathcal{Z}_{t-1} \mid \mathcal{X}_t) = \prod_{i=1}^{t-1} p(Z_i \mid X_i).   (2.7)

Substituting the second term in (2.6) with (2.7) and the left side of (2.6) with (2.5),

\prod_{i=1}^{t} p(Z_i \mid X_i) = p(Z_t \mid \mathcal{Z}_{t-1}, \mathcal{X}_t) \prod_{i=1}^{t-1} p(Z_i \mid X_i),

p(Z_t \mid X_t) = p(Z_t \mid \mathcal{Z}_{t-1}, \mathcal{X}_t).   (2.8)

From (2.4), we know p(\mathcal{Z}_{t-1}, X_t \mid \mathcal{X}_{t-1}) = p(X_t \mid \mathcal{X}_{t-1})\, p(\mathcal{Z}_{t-1} \mid \mathcal{X}_{t-1}), so that

p(X_t \mid \mathcal{Z}_{t-1}, \mathcal{X}_{t-1}) = \frac{p(\mathcal{Z}_{t-1}, X_t \mid \mathcal{X}_{t-1})}{p(\mathcal{Z}_{t-1} \mid \mathcal{X}_{t-1})} = p(X_t \mid \mathcal{X}_{t-1});   (2.9)

by using the Markov assumption, (2.9) finally gives

p(X_t \mid \mathcal{Z}_{t-1}, \mathcal{X}_{t-1}) = p(X_t \mid \mathcal{X}_{t-1}) = p(X_t \mid X_{t-1}).   (2.10)

2.7.4 Propagation of state density

According to the assumptions that the process is a Markov chain and that the observations are independent of the state, the conditional state density is given by p(X_t \mid \mathcal{Z}_t). Following Bayes' rule and using (2.8), we can derive the formula for calculating p(X_t \mid \mathcal{Z}_t):

p(X_t \mid \mathcal{Z}_t) = \frac{p(Z_t \mid X_t, \mathcal{Z}_{t-1})\, p(X_t \mid \mathcal{Z}_{t-1})}{p(Z_t \mid \mathcal{Z}_{t-1})} = k_t\, p(Z_t \mid X_t, \mathcal{Z}_{t-1})\, p(X_t \mid \mathcal{Z}_{t-1}) = k_t\, p(Z_t \mid X_t)\, p(X_t \mid \mathcal{Z}_{t-1}),   (2.11)

where k_t is a normalization factor. Integrating the left side of equation (2.11) over X_{t-1} leaves it unchanged,

\int_{X_{t-1}} p(X_t \mid \mathcal{Z}_t) = p(X_t \mid \mathcal{Z}_t),

and integrating the right side of equation (2.11) over X_{t-1} gives

\int_{X_{t-1}} p(Z_t \mid X_t)\, p(X_t \mid \mathcal{Z}_{t-1}) = p(Z_t \mid X_t)\, p(X_t \mid \mathcal{Z}_{t-1}).

Thus,

p(X_t \mid \mathcal{Z}_t) = k_t\, p(Z_t \mid X_t)\, p(X_t \mid \mathcal{Z}_{t-1}).   (2.12)

The second term in equation (2.12) is calculated as follows:

p(X_t \mid \mathcal{Z}_{t-1}) = \int_{X_{t-1}} p(X_t \mid X_{t-1}, \mathcal{Z}_{t-1})\, p(X_{t-1} \mid \mathcal{Z}_{t-1}).   (2.13)

Substituting the first term on the right side of the equation by (2.10), we derive the following equation:

p(X_t \mid \mathcal{Z}_{t-1}) = \int_{X_{t-1}} p(X_t \mid X_{t-1})\, p(X_{t-1} \mid \mathcal{Z}_{t-1}).   (2.14)

Equations (2.12) and (2.14) give the propagation of the conditional state density from p(X_{t-1} \mid \mathcal{Z}_{t-1}) to p(X_t \mid \mathcal{Z}_t), on which the dynamical model p(X_t \mid X_{t-1}) is superimposed. The observation model density p(Z_t \mid X_t) is normally non-Gaussian because of the background, so when it is applied in (2.12), the state density given the observation Z_t is also generally non-Gaussian. The dynamics of the objects could be driven by a non-linear process and the system noise could also be non-Gaussian.

2.7.5 Factored Sampling

One of the key techniques in the CONDENSATION algorithm is factored sampling, introduced in Grenander et al. [19]. Generally the density p(X_t \mid \mathcal{Z}_t) cannot be evaluated simply in closed form. Additionally, the state space is multi-dimensional (in our case, 4 dimensions). The factored sampling method is used to find an approximation to a probability density. A set of samples S_{t-1}^{(n)} is drawn from the density of the previous time step t-1. According to the dynamic model density p(X_t \mid X_{t-1}), a new set of samples S_t^{(n)} is generated for the time step t. The probability or weight of each sample after the measurement is given by p(Z_t \mid X_t = S_t^{(n)}). After normalizing the weights, the density at time step t is

P(X_t \mid \mathcal{Z}_t) = \frac{P(Z_t \mid X_t = S_t^{(n)})\, P(X_t \mid \mathcal{Z}_{t-1})}{\sum_{n=1}^{N} P(Z_t \mid X_t = S_t^{(n)})} = \frac{P(Z_t \mid X_t = S_t^{(n)}) \sum_n P(X_t \mid X_{t-1} = S_{t-1}^{(n)})\, P(X_{t-1} \mid \mathcal{Z}_{t-1})}{\sum_{n=1}^{N} P(Z_t \mid X_t = S_t^{(n)})}.   (2.15)

The entire sample set S_t^{(n)} is used for the next iteration. The whole process of sampling and propagation is shown in figure 2.1.
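One iteration of this sampling-propagation-measurement loop can be sketched as a generic particle-filter step. This is illustrative only; `dynamics` and `likelihood` are hypothetical stand-ins for the dynamic and measurement models developed in Chapter 3, and the incoming weights are assumed to be normalized.

```python
import numpy as np

def condensation_step(samples, weights, dynamics, likelihood, rng=np.random):
    """One CONDENSATION iteration over an (N, d) array of state samples.
    dynamics(s)   : propagate one sample through the stochastic motion model
    likelihood(s) : evaluate p(Z_t | X_t = s) on the current image"""
    n = len(samples)
    # 1. Resample from the previous weighted set (factored sampling)
    idx = np.searchsorted(np.cumsum(weights), rng.rand(n))
    idx = np.minimum(idx, n - 1)                       # guard against rounding at the end
    resampled = samples[idx]
    # 2. Predict: apply the dynamic model (deterministic drift + random diffusion)
    predicted = np.array([dynamics(s) for s in resampled])
    # 3. Measure: weight each hypothesis by the observation density
    new_weights = np.array([likelihood(s) for s in predicted])
    new_weights = new_weights / new_weights.sum()
    return predicted, new_weights
```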

2.8 Summary

From the above analysis and comparison of current tracking techniques, we found that the CONDENSATION algorithm can handle the hand tracking problem in our system. Image-based tracking such as background subtraction, a skin color blob tracker or a mean shift tracker can be implemented easily, but in most cases their output has to be fed into a higher-level model to estimate the motion of objects. Normally, they track the object as a rectangle or an oval, which is too coarse for estimating a hand in a particular gesture. The large displacement of pixels in the images due to the moving camera means that the small-motion assumption of optical flow cannot be satisfied; in our tracking system, this problem can be alleviated by obtaining the camera status from the server. The Kalman filters, including the EKF and UKF, assume a uni-modal Gaussian distribution in the state space. This may rule them out in some tracking problems, for example, multiple object tracking, and the computational complexity of a Kalman filter increases sharply when the number of tracked objects increases.

Figure 2.1: The process of the CONDENSATION algorithm: the previous posterior p(X_{t-1} \mid \mathcal{Z}_{t-1}) is propagated to the prediction p(X_t \mid \mathcal{Z}_{t-1}), the observation density p(Z_t \mid X_t) is evaluated, and the new posterior is p(X_t \mid \mathcal{Z}_t) = k_t\, p(Z_t \mid X_t)\, p(X_t \mid \mathcal{Z}_{t-1}).

Based on factored sampling of the state probability density function, a CONDENSATION tracker can estimate the position of the hand contour on the skin-color-filtered image. Theoretically, the more samples that are taken from the previous distribution, the more accurate the tracking. The detailed design and implementation of the tracker is presented in the next chapter.


Chapter 3 CONDENSATION Hand Tracker

From the previous chapter on related work, we find that the CONDENSATION algorithm makes no assumption on the state density function nor on the object motion model, which makes it suitable for dealing with the tracking problem in an active vision system. In the following sections, a hand tracker based on the CONDENSATION algorithm is presented. It tracks a hand in a rigid pointing gesture using a binocular sequence of images taken from the active vision system described in Chapter 1. The cameras can pan, tilt, verge and fixate on the object during tracking, and these parameters can be retrieved from the server over the network. This tracking problem, however, is complicated because there is significant camera motion, and the hand moves according to unpredictable/unknown motion models. We want to make no assumptions about how the camera is moving or about the viewing angle. Hence it is not feasible to break up the dynamics (motion model) into several different motion classes as introduced in Blake et al. [6] and learn the dynamics of each class and the class transition probabilities. We need a general model that is able to cope with the wide variety of motions exhibited by both the camera and the hand. The tracker works in a normal lab room environment, in which there is fluorescent


lighting from the ceiling and an incandescent lamp in front of the hand. One reason to set up a secondary light source is that the automatic cameras have no setting for backlighting (the primary source of light comes from behind the subject), and without the extra light source the object is too dark. The subjects normally face the cameras with their hand stretched out, pointing to an object with their index finger in a rigid gesture. The hand pose projected onto the image plane is assumed to show an obvious finger and palm part. Otherwise, for example, when the finger points at the camera, there is no distinct finger appearance in the image, and as a result the tracking will fail. Moreover, the hand is assumed to point to an object on the same side of space as the body, i.e., the right/left hand always points to objects on the right/left side. When the hand changes its pointing direction from the right to the left, the tracker cannot estimate the hand state evolved from the initial state. In the initialization stage, the tracker stops the movement of the cameras and detects the motion of the hand, which is assumed to be the only moving object at the beginning, as detailed in section 3.1. In other words, the subject waves his/her hand to tell the tracker there is a gesture to be tracked. In section 3.2 a normalized Gaussian distribution of skin color is built from samples of pixels on the hand. In each iteration, hypotheses of the hand state, derived by sampling the previous state space, propagate through a dynamic model. Each hypothesis is measured on the skin color map of the image, so that the distribution of the state is reshaped according to the observations. The motion of the camera can be estimated from the parameters of the stereo camera system, so that the relative motion of the camera to the hand can be


cancelled out. The dynamic and measurement models are presented in detail in sections 3.5 and 3.6.

3.1 Initialization

In the original CONDENSATION algorithm proposed by Isard [23], the templates for the trackers are initialized by hand. But in GestureCAM, the template of the hand should be initialized without much human intervention. At the beginning, we make the assumption that the only motion is due to a hand. To get the initial position of the hand and bootstrap the tracker, we freeze the cameras for a moment and take the difference of two successive frames. A template of the hand can then be extracted within the detected region by applying the skin color filter introduced in section 3.2. This means that the hand is assumed to be the only moving object with skin color within the first several frames, while the cameras are static. Meanwhile, the position of the hand in the images is used to initialize the state space, setting the value of X_0 = [x_0, y_0, r_0, s_0]^T.
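A minimal sketch of this bootstrapping step is given below. It is illustrative only: `skin_mask` is a hypothetical stand-in for the skin colour filter of section 3.2, the motion threshold is an assumed value, and the hand is assumed to actually move between the two frames.

```python
import cv2

def init_hand_region(frame0, frame1, skin_mask, motion_thresh=25):
    """Bootstrap the tracker while the cameras are frozen: the hand is assumed
    to be the only moving, skin-coloured object in the first two frames."""
    g0 = cv2.cvtColor(frame0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(g0, g1)                                   # frame difference
    _, moving = cv2.threshold(diff, motion_thresh, 255, cv2.THRESH_BINARY)
    x, y, w, h = cv2.boundingRect(cv2.findNonZero(moving))       # box around the detected motion
    hand = skin_mask(frame1)[y:y + h, x:x + w]                   # skin pixels inside the box
    return (x, y, w, h), hand                                    # motion box and initial hand template
```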

3.2 Hand Detection

Human hands usually have a similar skin color to the face of their owner. A general color model generated from statistics of human skin color may work in most situations. But different lighting, for example natural sunlight versus artificial indoor


lighting, may weaken the result of such color detection. It has been observed that human skin colors cluster in a small region of a color space and differ more in intensity than in color. A skin color distribution can be characterized by a multivariate normal distribution in a normalized color space (Yang et al. [54]). Figure 3.1 shows in RGB space a typical aggregated color occurrence distribution from a set of skin color pixels. Each point in the figure designates the presence of a color with coordinates (R, G, B). Based on an analysis of the distribution of skin color in different color spaces, past research (Bergasa et al. [4], Cheng et al. [9], Fang and Tan [17]) came to the conclusion that normalized RGB space is suitable for skin color detection. In this space the individual color components are independent of the brightness of the image and robust to changes in illumination. The transformation from RGB to normalized RGB space is simple and fast. One of its disadvantages is that it is very noisy at low intensities due to the nonlinear transformation. A threshold obtained from experiments is always applied to remove this effect. Another commonly used color space, HSV (distribution shown in figures 3.3 and 3.4), has similar properties, but the conversion from standard RGB costs more than the conversion to normalized RGB. In this implementation, pixels in RGB color space are transformed to a normalized color space by the equations in 3.1. Since r + g + b = 1, b can be represented by r and g. Therefore, the 3-dimensional RGB color space is converted to the 2-dimensional normalized RGB color space.


r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}, \qquad b = \frac{B}{R+G+B}   (3.1)
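In code, the conversion in (3.1) is a per-pixel normalization. The sketch below is illustrative, and the low-intensity cutoff is an assumed value standing in for the experimentally chosen threshold mentioned above.

```python
import numpy as np

def to_normalized_rg(rgb, min_intensity=30):
    """Convert an RGB image (H, W, 3) to normalized (r, g) chromaticity.
    Pixels darker than min_intensity are masked out, because the
    normalization is unstable (noisy) at low intensities."""
    rgb = rgb.astype(float)
    s = rgb.sum(axis=2)
    valid = s > min_intensity
    s = np.maximum(s, 1e-6)                      # avoid division by zero
    r = np.where(valid, rgb[..., 0] / s, 0.0)
    g = np.where(valid, rgb[..., 1] / s, 0.0)
    return r, g, valid
```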

By taking sample pixels from pictures of 11 different subjects' hands under 3 different lighting conditions (fluorescent lamp, incandescent lamp, and both), we found that they cluster in normalized RGB space as shown in figure 3.2.

Figure 3.1: Skin color in RGB space

Figure 3.2: Skin color distribution in normalized RG space

0.8

0.6

V
0.4 0.2 0 1 0.8 0.6 0.4 0.2 S 0 0 0.4 0.2 H 0.6 0.8 1

Figure 3.3: Skin color in HSV space


Figure 3.4: Skin color distribution in HS space


The mean vector $m$ and covariance matrix $\Sigma$ of both the R and G channels of the skin color can be calculated by selecting a region where the hand is located:

$$m = [\bar{r}, \bar{g}]^T, \qquad \Sigma = \begin{bmatrix} \sigma_r^2 & \sigma_{rg} \\ \sigma_{rg} & \sigma_g^2 \end{bmatrix}$$

where $\bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_i$, $\bar{g} = \frac{1}{N}\sum_{i=1}^{N} g_i$ ($N$ is the number of pixels) and $\sigma_{rg} = \rho\,\sigma_r\sigma_g$ ($\rho$ is the correlation of $r$ and $g$). Then a bivariate normal distribution model of the skin color is constructed as $N(m, \Sigma)$. Equation 3.2 gives the density function:

$$p(r, g) = \frac{1}{2\pi\sigma_r\sigma_g\sqrt{1-\rho^2}}\exp\left(-\frac{z}{2(1-\rho^2)}\right) \qquad\qquad (3.2)$$

where

$$z = \frac{(r-\bar{r})^2}{\sigma_r^2} - \frac{2\rho(r-\bar{r})(g-\bar{g})}{\sigma_r\sigma_g} + \frac{(g-\bar{g})^2}{\sigma_g^2}, \qquad \rho = \frac{\sigma_{rg}}{\sigma_r\sigma_g} = \mathrm{cor}(r, g)$$

The skin color map, or probability-of-skin-color image (figure 3.6), is the result of back projecting the distribution onto the raw image (figure 3.5). To minimize the search region for the hand, we assume that at the initial stage the only moving object in the scene is the hand, so that it can be detected easily by subtracting the first two frames. Since the robotic head can be fully controlled by the application, the cameras can be stopped for the subtraction operation whenever it is necessary (initialization or re-initialization). As shown in figure 3.7, the box is the region where hand motion occurred between the two frames.

Then the skin color filter is applied to this region. A binary image of the hand is segmented out by using a threshold. For a normal distribution, the probability of encountering a point outside $3\sigma$ is less than 0.3% ($\sigma$ is the standard deviation). From the distribution function (3.2), when $(r-\bar{r}) \in [-3\sigma_r, 3\sigma_r]$ and $(g-\bar{g}) \in [-3\sigma_g, 3\sigma_g]$, the minimum of $z$ is $18(1-\rho)$. The threshold can be computed by substituting this $z$ into (3.2). In order to deal with changes in lighting during tracking, at the end of each iteration the mean values of $r$ and $g$ are updated by sampling the pixels within the tracked contour. After applying the morphological close operation to remove small holes in the hand, the largest connected component within that box is extracted as the initial hand shape, shown in figure 3.8. Because the segmentation depends on the color filter, if the hand moves against a background in skin color, the tracker cannot distinguish it from the background. Therefore, we assume that there is no large continuous skin color area in the background with respect to the size of the hand shape.
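The conversion of equation 3.1 and the density of equation 3.2 translate directly into code. The C++ sketch below assumes the model parameters have already been estimated from sample pixels; the structure names and the low-intensity cutoff are illustrative rather than the values used in the implementation.

#include <cmath>

const double PI = 3.14159265358979323846;

// Parameters of the bivariate normal skin model N(m, Sigma) in normalized
// RG space (equation 3.2): means, standard deviations and correlation rho.
struct SkinModel {
    double meanR, meanG, sigmaR, sigmaG, rho;
};

// Convert an RGB pixel to normalized (r, g) coordinates (equation 3.1).
// Low-intensity pixels are rejected because the normalization is noisy there.
bool normalizeRGB(double R, double G, double B, double minSum,
                  double& r, double& g) {
    double sum = R + G + B;
    if (sum < minSum) return false;
    r = R / sum;
    g = G / sum;
    return true;
}

// Skin color density p(r, g) from equation 3.2.
double skinDensity(const SkinModel& m, double r, double g) {
    double dr = (r - m.meanR) / m.sigmaR;
    double dg = (g - m.meanG) / m.sigmaG;
    double z = dr * dr - 2.0 * m.rho * dr * dg + dg * dg;
    double norm = 1.0 / (2.0 * PI * m.sigmaR * m.sigmaG * std::sqrt(1.0 - m.rho * m.rho));
    return norm * std::exp(-z / (2.0 * (1.0 - m.rho * m.rho)));
}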

Figure 3.5: Raw image taken from camera in RGB

Figure 3.6: Image filtered by the skin color model


Figure 3.7: Subtraction of the two frames

Figure 3.8: Detection of the skin color edge


3.3 Shape Representation

After the hand is segmented from the skin color map of the raw image, the shape should be represented in a certain way so that it can be fed into the tracker. To track a hand in a rigid pointing gesture, a parametric curve that smoothly fits the contour of the shape could be a good representation. A B-spline is a generalization of the Bézier curve. Let a vector, known as the knot vector, be defined as $T = \{t_0, t_1, t_2, \ldots, t_m\}$, where $T$ is a nondecreasing sequence with $t_i \in [0, 1]$, and define control points $P_0, \ldots, P_n$. Define the degree as $p = m - n - 1$. The knots $t_{p+1}, \ldots, t_{m-p-1}$ are called internal knots. Define the basis functions as

$$N_{i,0}(t) = \begin{cases} 1 & \text{if } t_i \le t < t_{i+1} \text{ and } t_i < t_{i+1} \\ 0 & \text{otherwise} \end{cases}$$

$$N_{i,p}(t) = \frac{t - t_i}{t_{i+p} - t_i} N_{i,p-1}(t) + \frac{t_{i+p+1} - t}{t_{i+p+1} - t_{i+1}} N_{i+1,p-1}(t)$$

Then the curve defined by $C(t) = \sum_{i=0}^{n} P_i N_{i,p}(t)$ is a B-spline.

There are many important properties inherent in a B-spline curve. First, the curve is completely controlled by the control points. Second, the curve can have different degrees without affecting the number of control points. Third, if a control point is moved, only the segments around this control point are affected. In this application, the tracker needs to find the nearest edge in the skin color map of the image, where the hand contour is supposed to be. So a cubic B-spline, which is closer to its control polygon, is a better choice. A hand shape is extracted by filtering the image with the skin color model. By scanning the extracted hand shape from top to bottom at a given interval of pixels, a sequence of control points on the contour is extracted. The curve passing through the points is then generated, which will be used to measure the distance to the closest edge with skin color. To help the measurement model put more weight on the finger part of the contour, the selection of the control points is taken unevenly. In other words, the shape will fit the finger better than the palm part of the hand, because the index finger gives more information about the orientation.
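A minimal C++ sketch of the Cox-de Boor recursion defined above is given below; the Point2D type and the function names are illustrative, and the recursive form is written for clarity rather than efficiency.

#include <vector>

struct Point2D { double x, y; };

// Cox-de Boor recursion for the basis function N_{i,p}(t).
double basis(int i, int p, double t, const std::vector<double>& knots) {
    if (p == 0)
        return (knots[i] <= t && t < knots[i + 1] && knots[i] < knots[i + 1]) ? 1.0 : 0.0;
    double left = 0.0, right = 0.0;
    double d1 = knots[i + p] - knots[i];
    double d2 = knots[i + p + 1] - knots[i + 1];
    if (d1 > 0.0) left  = (t - knots[i]) / d1 * basis(i, p - 1, t, knots);
    if (d2 > 0.0) right = (knots[i + p + 1] - t) / d2 * basis(i + 1, p - 1, t, knots);
    return left + right;
}

// Evaluate the B-spline curve C(t) = sum_i P_i N_{i,p}(t) for control points ctrl.
Point2D evaluate(double t, int degree,
                 const std::vector<Point2D>& ctrl,
                 const std::vector<double>& knots) {
    Point2D c = {0.0, 0.0};
    for (size_t i = 0; i < ctrl.size(); ++i) {
        double w = basis(int(i), degree, t, knots);
        c.x += w * ctrl[i].x;
        c.y += w * ctrl[i].y;
    }
    return c;
}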

Figure 3.9: Representation of the hand contour after initialization: the dots represent the control points of the curve, the dashed curve is constructed from the points

3.4 State Space

For a given time $t$ the control points of the contour curve, introduced in section 3.3, give the estimated position of the hand. The tracker generates hypotheses of the points to match to the underlying raw image features. The process model of the system specifies the likely dynamics of the curve over time, relative to the first detected contour, i.e., the template represented as a B-spline curve. A tracker could conceivably be designed to allow arbitrary variations in control point positions over time. This would allow maximum flexibility in deformation to accommodate moving shapes. However, particularly for complex shapes requiring many control points to describe them, this is known to lead to instability in tracking. In my application, the moving hand is in a rigid pointing gesture. Therefore, the state vector at a given time $t$, denoted $X_t = (x_t, y_t, r_t, s_t)$, describes the degrees of freedom in translation $(x, y)$, rotation $r$ and scaling $s$. All these parameters are estimated and measured in the image. For each point on the curve, there is a transformation from the initial state $X_0$. $O_I$, $O_0$ and $O_t$ are the origins of the raw image, template and prediction coordinates, respectively. The control points on the contour of the hypothesis are calculated as follows. The initial state is given by $X_0 = [x_0, y_0, r_0, s_0]^T$. The state at time $t$ is given by $X_t = [x_t, y_t, r_t, s_t]^T$. The rotation matrix in the image plane at time $t$ is

$$R_t = \begin{bmatrix} \cos\theta_t & -\sin\theta_t \\ \sin\theta_t & \cos\theta_t \end{bmatrix}$$

where $\theta_t$ is the rotation from the initial pose. At the beginning,

$$R_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

The scaling parameter is $s_t$ with the initial value $s_0 = 1$.

The translation parameter is $t_t = [x, y]^T$, where $x$ and $y$ are the translation of the origin $O_t$ from $O_0$, i.e., $t_t = [x, y]^T = O_t - O_0$. Finally, the transformation from coordinates with origin $O_0$ to coordinates with origin $O_t$ is given by the formula $u_t = u_0 R_t s_t + t_t$, where $u_t = [x_t, y_t]^T$. Figure 3.10 shows these definitions.

Figure 3.10: State space parameters

This is based on the assumption that the components of the hand motion, including translation, rotation and scaling, are independent of each other. So the probability density of the state at time $t$ is given by $p(X_t) = p(x_t, y_t, r_t, s_t) = p(x_t)p(y_t)p(r_t)p(s_t)$.

Therefore, the state of the hand motion is represented in a 4-dimensional space. Without prior knowledge of the state distribution, a 4-variate normal distribution is set at the beginning. The initial probability density is given as $p(X_0) = p(x_0)p(y_0)p(r_0)p(s_0)$, where $x_0 \sim N(0, \sigma_x)$, $y_0 \sim N(0, \sigma_y)$, $r_0 \sim N(0, \sigma_r)$ and $s_0 \sim N(0, \sigma_s)$. The $\sigma_x$, $\sigma_y$, $\sigma_r$ and $\sigma_s$ are initialized with empirically chosen values. In each iteration of the tracking, the density of the state is sampled.
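A short C++ sketch of the transformation $u_t = u_0 R_t s_t + t_t$ applied to every template control point is given below; the type names are illustrative and the rotation is assumed to be given in radians.

#include <cmath>
#include <vector>

struct Point2D { double x, y; };

// State X_t = [x_t, y_t, r_t, s_t]: translation, rotation and scaling of the
// contour relative to the template.
struct State { double tx, ty, rot, scale; };

// Apply u_t = u_0 R_t s_t + t_t to every template control point u_0, giving
// the control points of the hypothesized contour at time t.
std::vector<Point2D> transformTemplate(const std::vector<Point2D>& tmpl, const State& X) {
    double c = std::cos(X.rot), s = std::sin(X.rot);
    std::vector<Point2D> out;
    out.reserve(tmpl.size());
    for (size_t i = 0; i < tmpl.size(); ++i) {
        Point2D u;
        u.x = X.scale * (c * tmpl[i].x - s * tmpl[i].y) + X.tx;   // rotate, scale, translate
        u.y = X.scale * (s * tmpl[i].x + c * tmpl[i].y) + X.ty;
        out.push_back(u);
    }
    return out;
}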

3.5 Dynamic Model

The hypotheses in the state space are propagated from the previous time step according to the system model. The dynamic model of the system describes the features of the motion, which is used by the tracker to predict the next state. It can be defined as $X_t = AX_{t-1} + Bw_t$, where $X_t$ is the object state vector, $A$ and $B$ can be learned by experiment, and $w_t$ is a random noise variable. A simple linear model works for smooth motion of the object. Complicated motion can be modelled by extending the model to higher orders. When the background is too noisy and the motion of the camera is introduced, the standard deviation of the noise variable should be set to a larger value. To deal with the wide variety of motions exhibited by both the camera and the object, a more general model is applied, as in Philomin et al. [37]. In their system there was no assumption about how the camera moves. By using a zero-order motion model with large process noise and quasi-random sampling, the tracker concentrates the samples in large regions around highly probable locations from the previous time step. The quasi-random sampling technique generates points that span the sample space so that the points are maximally far away from each other, which gives better sampling results than the standard random algorithm. The deviation of the predictions is related to the speed of the changes in translation, rotation and scaling. When the object moves at a constant speed, the sigma of the distribution can be fixed. An adaptive parameter of the distribution changes during acceleration (both increasing and decreasing speed). All the distributions are initialized as Gaussians. The sampling and propagation algorithm may change the shape of the distribution over iterations. Intuitively, the sigma is set such that the next step of the translation should lie within a circle whose radius is a little more than the speed multiplied by the time interval. So in the equation $X_t + C_t = A(X_{t-1} + C_{t-1}) + Bw_t$, $A = 1$, and $B$ is given by $B_t = KB_{t-1}$, where $C_t$ is the relative motion of the object introduced by the camera motion and $K$ is the coefficient for all the state parameters.

$$K = \frac{|X_t + C_t - A(X_{t-1} + C_{t-1})|}{|X_{t-1} + C_{t-1} - A(X_{t-2} + C_{t-2})|}$$

where $C_t$ is the state of the camera at time $t$. The following figures show the distribution of the hypotheses in state space. Figure 3.11 shows the distribution in translation on the x and y axes when the hand does not change its orientation and scale. The black shape is the real hand position in the image, while the contours are samples of the state space propagated from the previous iteration. The envelope of the distribution is no longer Gaussian after iteration. Figures 3.12 and 3.13 show the distribution of the samples in rotation and scaling respectively, when the rest of the state parameters are assumed constant. Finally, Figure 3.14 shows the distribution of all the samples in state space. The rotation distribution is shown over the angles rotated from the initial pose. The middle point represents zero, while negative and positive values mean counterclockwise and clockwise rotation, respectively. The scaling distribution is computed over the ratio of the hypothesized contour size to the initial one.
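A possible implementation of the propagation step is sketched below. It assumes a zero-order model with per-component Gaussian noise whose scale is multiplied by the adaptive coefficient K; the base deviations are illustrative, and std::random is used as a modern stand-in for whatever random number generator the original Visual C++ implementation used.

#include <random>

// State X_t = [x_t, y_t, r_t, s_t].
struct State { double tx, ty, rot, scale; };

// Zero-order dynamic model: the prediction is the previous state (plus the
// known camera-induced shift) with Gaussian process noise whose deviation is
// scaled by the coefficient K when the motion accelerates or decelerates.
struct Dynamics {
    State noiseSigma;          // base deviations, e.g. {8.0, 8.0, 0.05, 0.02}
    std::mt19937 rng;

    State propagate(const State& prev, const State& cameraShift, double K) {
        std::normal_distribution<double> nx(0.0, K * noiseSigma.tx);
        std::normal_distribution<double> ny(0.0, K * noiseSigma.ty);
        std::normal_distribution<double> nr(0.0, K * noiseSigma.rot);
        std::normal_distribution<double> ns(0.0, K * noiseSigma.scale);
        State next;
        next.tx    = prev.tx    + cameraShift.tx    + nx(rng);
        next.ty    = prev.ty    + cameraShift.ty    + ny(rng);
        next.rot   = prev.rot   + cameraShift.rot   + nr(rng);
        next.scale = prev.scale + cameraShift.scale + ns(rng);
        return next;
    }
};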


Figure 3.11: The distribution of hypotheses in translation (the points on the top and left are samples of the distribution of translation on the x and y axes, evolved from the previous iteration).


Figure 3.12: The distribution of hypotheses on rotation (points on the bottom are samples evolved from previous iteration), when there are no changes in other parameters.


Figure 3.13: The distribution of the scaling (points on the right are samples evolved from previous iteration), when there are no changes in other parameters.


Figure 3.14: The distribution of the state in translation, rotation and scaling, evolved from the previous iteration. The points on the top and left are samples of the distribution of translation on the x and y axes. The points on the bottom are samples of the distribution of the rotation parameter. The points on the right are samples of the distribution of the scaling parameter.


In an active vision system, when the camera fixates on an object and follows its motion, a global movement is present in each image. For our specific system, we can get the status of the motors, so we can utilize this to cancel out the motion of the camera. However, there are still problems. One is that the rotation axis of the camera (i.e., the motor driving the camera) may not intersect the optical axis of the lens. So the rotation of the motor causes not only rotation of the camera, but also translation of the image plane. The second is that the distance from the rotation center to the image plane is estimated by experiment. The errors in this estimation may introduce errors in the depth calculation. In normal cases, the hand is at the same distance from both of the cameras. Actually, this is one of the camera system's goals: fixating the target. Because the robotic head can pan at the neck and each of the cameras (eyes) can also pan independently, the distance from the object to each of the cameras can be maintained roughly equal (see figure 3.15). Therefore, the size of the hand in both images, i.e., the scaling parameter in the tracker, generally changes at a similar speed, except in the situation where the hand suddenly moves toward one camera more quickly than the system motors move. We assume that this situation rarely happens. So the distribution of the scaling, and especially the mean of the scaling, is almost the same in both of the cameras.



Figure 3.15: The distance from the object to each of the cameras can be maintained roughly equal when the cameras fixate on the object


3.6 Measurement Model

The likelihood between the observation and the hypothesis can be evaluated by taking the normals along the hypothesized contour (see figure 3.16) and calculating the distance to the nearest skin color edge (the edge between skin and non-skin color). The observation process defined by $p(Z_t|X_t)$ is hard to estimate from data, so a reasonable assumption is made that $p(Z|X)$ is specified as a static function. Assuming that any true target measurement is unbiased and normally distributed, the observation density is given as

$$p(Z_t|X_t) \propto \exp\left(-\sum_{m=1}^{M}\frac{1}{2\sigma^2}\min|z(s_m) - r(s_m)|\right) \qquad\qquad (3.3)$$

where $\min|z(s_m) - r(s_m)|$ is the distance from the hypothesis point to the nearest edge, and $\sigma$ is the deviation, proportional to the size of the search window along the normal.
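A sketch of this measurement step is given below; the SkinTest predicate stands for a thresholded lookup in the skin color map, and the step size and the penalty used when no edge is found are illustrative choices rather than the thesis's exact settings.

#include <cmath>
#include <vector>

struct Point2D { double x, y; };

// One measurement site on the hypothesized contour: a point and the unit
// normal pointing from the interior of the hand shape to the exterior.
struct MeasureSite { Point2D p; Point2D normal; };

// Observation likelihood of equation 3.3: probe along each normal (from the
// interior outwards) for the nearest skin/non-skin transition and accumulate
// the distances from the contour point.
template <class SkinTest>
double observationLikelihood(const std::vector<MeasureSite>& sites,
                             const SkinTest& isSkin,
                             double halfLength, double sigma) {
    double sum = 0.0;
    for (size_t k = 0; k < sites.size(); ++k) {
        const MeasureSite& s = sites[k];
        double nearest = halfLength;                 // no edge found -> maximum penalty
        bool prevSkin = isSkin(s.p.x - halfLength * s.normal.x,
                               s.p.y - halfLength * s.normal.y);
        for (double d = -halfLength + 1.0; d <= halfLength; d += 1.0) {
            bool skin = isSkin(s.p.x + d * s.normal.x, s.p.y + d * s.normal.y);
            if (prevSkin && !skin) {                 // skin -> non-skin transition
                nearest = std::fabs(d);
                break;
            }
            prevSkin = skin;
        }
        sum += nearest;
    }
    return std::exp(-sum / (2.0 * sigma * sigma));
}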

Figure 3.16: The normals along the hand contour

This approach is similar to what Blake and Isard [5] used in the implementation of their trackers. The only difference is that we introduce skin color information, while in their implementation the edges are extracted from grey scale images. The points inside the closed hand contour contain more information about the hand than those outside the contour, which may be introduced by the image background. A normal hand shape is illustrated in figure 3.17.

Figure 3.17: The normals along the hand contour; the arrows show the direction from the interior to the exterior of the hand shape

The points which are inside the palm should be closer to the center of the palm than the exterior ones. Similarly, the points which are inside the finger region should be closer to the middle points between the two edges. We augment the measurement model by probing for the skin color edge along the normals from a point inside the hand shape to the outer part. We measure the nearest edge where the skin color probability changes sharply from high to low; that is, the probing along the normal now carries not only the information about the position of the edge but also the direction of the normal. In our case, the direction is from the region within the hand to the outside surrounding area. Usually, part or all of a measurement line lies in the interior of a target.
Many extensions and modifications have been applied to the CONDENSATION algorithm to improve performance and expand its usefulness. MacCormick and Blake [32] have proposed the use of a contour discriminant, which is a metric associated with each sample. The idea behind the contour discriminant is that each sample represents some contour configuration in the image and that there is some likelihood that each configuration matches the true target and some other likelihood that the configuration matches clutter in the image. The contour discriminant is a ratio of likelihoods that indicates how much more target-like a configuration is than clutter-like, given the observations. The main difference between the method proposed by Isard and Blake [24] and the contour discriminant method involves the assumptions made regarding the distribution of observed features in the image. Both methods calculate likelihoods by defining normals to the contour under consideration and searching for edges along these 1-D normals. Isard and Blake [24] assume that features along these normals in the interior of the contour are distributed similarly to the features on the exterior of the contour along the normals. This effectively treats the contour as a wire-frame, disregarding any knowledge of the interior. MacCormick and Blake [32] assume different distributions for the interior and the exterior of a contour, assuming that interior features are due to the target and exterior features are due to clutter. This provides a more accurate model for the probability of features, since the interior feature distribution can be determined by making measurements of the target interior in the first image of the sequence. The contour discriminant is much more computationally expensive, but MacCormick and Blake [32] claim that it gives much better performance. For a hand in a pointing gesture, the finger conveys more information about the orientation. If the measurements taken along the contour are uniformly distributed, small changes in the shape of the palm will distract the tracker. Therefore, we measure the likelihood by probing along unevenly distributed normals on the contour (Figure 3.18). This means we take more measurements on the finger than on the palm of the hand, which makes the matching of the finger more important; in other words, there is a bias toward the finger.

Figure 3.18: The normals along the contour for measurement of the features

Under the assumption that the hand tracked in our system is in a rigid pointing gesture, i.e., the relative positions of the index finger, palm and part of the wrist do not change, a simple measurement was applied. Based on the assumption that all the parts of the hand within the contour are in skin color, a measurement line from the interior to the exterior will first pass the contour of the hand (shown in Figure 3.19 as a solid unidirectional line). If the estimation of the likelihood is done along the measurement line from the hypothetical contour, shown in Figure 3.19 as a dashed bidirectional line, clutter which is in skin color may affect the measurement. In the figure, where the clutter is closer to the hypothetical contour, it is much more easily picked up by a bidirectional measurement line.



Figure 3.19: Measurement line along the hypothetical contour. The shaded part illustrates the real hand region, while the black curve indicates the hypothetical contour which is measured. The dashed line with two arrows measures the nearest feature point to the hypothetical contour. The solid line with one arrow shows the measurement taken from interior to exterior portion of the contour.



When all samples of $P(Z_t|X_t)$ are less than a certain probability, it means that the confidence in the hypotheses is so low that the object is out of sight or has been lost by the tracker. This threshold determines the tolerance to error in tracking. If it is set too high, the result is going to be more accurate, but reinitialization may be too frequent. On the other hand, if the threshold is set too low, background clutter distracts the tracking easily. Because the hand contour is represented by spline curves, there are errors between such curves and the real contour due to the interpolation. Therefore, the hypothetical contours never perfectly match the real hand contour. In other words, $P(Z_t|X_t)$ cannot reach 1. The length of the normals used in the measurement model is set to 10 pixels, with the middle point on the hypothetical contour. So the width of the finger part in the image has to be more than 5 pixels. The first iteration of the tracking process, right after the initialization, gives the best estimation of the state ($P_{init}$). When the tracker follows the hand, background clutter and hand motion introduce noise into the result. By analyzing experimental results under different lighting and background conditions, we found that when the probabilities of the states fall below 0.6 of the largest initial probability, the tracker has lost the hand. So we set $0.6 P_{init}$ as the threshold. A simple way to reinitialize is to stop the cameras and detect the motion of the hand again with the same constraints as in the initialization stage of the tracker. The tracker reinitializes itself by sampling the whole image with the help of the skin color map, which is extracted by the skin color filter presented in section 3.2.


3.7 3D Orientation of the Hand

The hypothesis with the highest evaluation after the measurement step gives the position of the tracked hand. After refining the hand contour, a more precise shape of the hand is extracted. Based on the epipolar geometry in Xu [52], pairs of correspondences are found by correlation, followed by calculation of the positions of the points on the hand in 3D space.

3.7.1 Refining the Result

For the upcoming step of calculating the depth, the more precisely the finger is located, the more accurately the result can be computed. Precision can be degraded by inaccuracy in color edge detection, focus, changes in lighting, non-rigid hand movement, or tracking errors. All these factors affect the result. A refining process is initiated which localizes the hand contour nearest to the tracking result. By searching for the edge within a predefined region along the contour, a refined contour of the hand can be detected. It operates similarly to the measurement phase of the CONDENSATION tracker.

3.7.2 Epipolar Geometry

The setting of the two cameras in the system is shown in figure 1.2. Its epipolar geometry is shown in figure 3.20. Using the perspective projection model, $C$ and $C'$ are the optical centers of the left and right cameras, and $I$ and $I'$ are the image planes; the epipolar plane is the plane in which the object point $M$, $C$ and $C'$ lie. $m$ and $m'$ are the projected points of $M$ on the two image planes. $e$ and $e'$ are the epipoles, the points where the baseline $CC'$ intersects the left and right image planes. The two lines $l_m$ and $l_{m'}$ are the epipolar lines of $m$ and $m'$, i.e., the projections of the rays through $MC$ and $MC'$. They are also the intersections of the epipolar plane with the two image planes. The above points can be denoted as the following vectors: $M = [X, Y, Z]^T$, $M' = [X', Y', Z']^T$, $m = [x, y, z]^T$, $m' = [x', y', z']^T$, where $M$ and $m$ are points expressed in the first camera coordinate system while $M'$ and $m'$ are in the second camera coordinate system. The focal lengths of the two cameras are $f$ and $f'$; thus $z = f$ and $z' = f'$. The space point is projected to $m = \frac{f}{Z}M$ and $m' = \frac{f'}{Z'}M'$ in the two image planes. A point in the first camera coordinate system can be expressed by a point in the second camera coordinate system through a rotation $R$ followed by a translation $T$, that is,

$$[X, Y, Z]^T = R[X', Y', Z']^T + T \qquad\qquad (3.4)$$

where

$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$$

and $T = [t_X, t_Y, t_Z]^T$, and $R$ satisfies $RR^T = I$.

Similarly to equation (3.4), we have $M = RM' + T$. Because of the coplanarity of the vectors $T$, $M$ and $M - T$, we have $M^T[T \times (RM')] = 0$.


Figure 3.20: Epipolar geometry of the camera system



The above equation can be rewritten as

$$M^T E M' = 0 \qquad\qquad (3.5)$$

where

$$E = [T]_\times R = \begin{bmatrix} 0 & -t_Z & t_Y \\ t_Z & 0 & -t_X \\ -t_Y & t_X & 0 \end{bmatrix} R,$$

which is also called the Essential Matrix, is a function of the rotation and the translation between the two cameras.

Dividing equation (3.5) by $ZZ'$ gives $m^T E m' = 0$. $Em'$ is the projective line in the first image that goes through the point $m$. So given a point in one of the images, we can get the epipolar line in the other image. In our system geometry, shown in figure 1.2, there is a translation between the centers of the two cameras along the baseline, denoted $t_X$, and a relative rotation $\theta$ between the two optical axes, because the two cameras tilt synchronously. Therefore,

$$E = \begin{bmatrix} 0 & -t_Z & 0 \\ t_Z & 0 & -t_X \\ 0 & t_X & 0 \end{bmatrix} \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}$$

Since the essential matrix relates corresponding image points expressed in the camera coordinate systems, to calculate the epipolar line in a pixel coordinate system within the reference frame we need the intrinsic matrix of the camera, which is constructed from previous experiments (not discussed in this thesis) in which the focal length is fixed. The intrinsic matrix, which transforms normalized coordinates to pixel coordinates, can be expressed as

$$A = \begin{bmatrix} f k_u & -f k_u \cot\theta & u_0 \\ 0 & f k_v / \sin\theta & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

where $k_u$ and $k_v$ are the ratios between the units of the camera coordinates and the pixel coordinates, $(u_0, v_0)$ is the principal point in pixel image coordinates, and $\theta$ is the angle between the two image axes. The intrinsic matrices of the right and left cameras are denoted $A$ and $A'$, respectively. $\tilde{m} = Am$ and $\tilde{m}' = A'm'$ are the points in pixel image coordinates. They satisfy the epipolar equation, so $\tilde{m}^T A^{-T} E A'^{-1}\tilde{m}' = 0$. So the fundamental matrix $F$ is given as

$$F = A^{-T} E A'^{-1}. \qquad\qquad (3.6)$$

The projective epipolar line $l_m$ in the left reference frame corresponding to a point $\tilde{m}'$ in the right image can then be calculated by the equation $l_m = F\tilde{m}'$.
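The construction of E, F and an epipolar line can be sketched as follows in C++. The code assumes that the inverse intrinsic matrices have already been computed offline from the calibration, and the tiny matrix helpers are written out only to keep the example self-contained; all names are illustrative.

#include <cmath>

// Tiny 3x3 matrix / 3-vector helpers.
struct Mat3 { double m[3][3]; };
struct Vec3 { double v[3]; };

Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 r = {};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

Vec3 multiply(const Mat3& a, const Vec3& x) {
    Vec3 r = {};
    for (int i = 0; i < 3; ++i)
        r.v[i] = a.m[i][0] * x.v[0] + a.m[i][1] * x.v[1] + a.m[i][2] * x.v[2];
    return r;
}

// Essential matrix for the verged stereo pair: E = [T]x * Ry(theta), with
// T = [tX, 0, tZ] and a relative rotation theta about the vertical axis.
Mat3 essential(double tX, double tZ, double theta) {
    Mat3 Tx = {{{0, -tZ, 0}, {tZ, 0, -tX}, {0, tX, 0}}};
    Mat3 R  = {{{std::cos(theta), 0, std::sin(theta)},
                {0, 1, 0},
                {-std::sin(theta), 0, std::cos(theta)}}};
    return multiply(Tx, R);
}

// Fundamental matrix F = A^{-T} E A'^{-1} (equation 3.6); AinvT and ApInv are
// the precomputed A^{-T} and A'^{-1}.  The epipolar line in the left image of
// a right-image pixel mp is then obtained as multiply(F, mp).
Mat3 fundamental(const Mat3& AinvT, const Mat3& E, const Mat3& ApInv) {
    return multiply(multiply(AinvT, E), ApInv);
}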

3.7.3 Correspondence

The corresponding pixels on the hand in the two images are found by calculating the correlation within the region where the epipolar line and the hand overlap, as shown in figure 3.21. We used a normalized correlation algorithm to compute the correlation coefficient, or score, of two pixels in the left and right images. The formula is given as

$$\mathrm{Score}(m_1, m_2) = \frac{\mathrm{Cov}(m_1, m_2)}{\sqrt{\mathrm{Var}(m_1)\,\mathrm{Var}(m_2)}} \qquad\qquad (3.7)$$

where

$$\mathrm{Cov}(m_1, m_2) = \frac{\sum_{i=-n}^{n}\sum_{j=-m}^{m}[I_1(u_1+i, v_1+j) - \bar{I}_1(u_1, v_1)][I_2(u_2+i, v_2+j) - \bar{I}_2(u_2, v_2)]}{(2n+1)(2m+1)}$$

$$\bar{I}_k(u_k, v_k) = \frac{\sum_{i=-n}^{n}\sum_{j=-m}^{m} I_k(u_k+i, v_k+j)}{(2n+1)(2m+1)}$$

$$\mathrm{Var}(m_k) = \frac{\sum_{i=-n}^{n}\sum_{j=-m}^{m}[I_k(u_k+i, v_k+j) - \bar{I}_k(u_k, v_k)]^2}{(2n+1)(2m+1)}$$

The pixel $m_2$ in image $I_2$ whose correlation score is the maximum within the search window along the epipolar line $l_1$ is taken as the correspondence of the pixel $m_1$ in $I_1$.
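Equation 3.7 can be implemented directly; the following C++ sketch computes the zero-mean normalized correlation of two windows (the common (2n+1)(2m+1) normalization cancels in the ratio), and the Image type is illustrative. The caller is assumed to keep the windows inside the images.

#include <cmath>
#include <vector>

struct Image {
    int width;
    int height;
    std::vector<unsigned char> pixels;
    double at(int u, int v) const { return pixels[v * width + u]; }
};

// Zero-mean normalized cross-correlation of (2n+1)x(2m+1) windows centred on
// (u1, v1) in I1 and (u2, v2) in I2, following equation 3.7.
double nccScore(const Image& I1, int u1, int v1,
                const Image& I2, int u2, int v2, int n, int m) {
    double area = double(2 * n + 1) * (2 * m + 1);
    double mean1 = 0.0, mean2 = 0.0;
    for (int j = -m; j <= m; ++j)
        for (int i = -n; i <= n; ++i) {
            mean1 += I1.at(u1 + i, v1 + j);
            mean2 += I2.at(u2 + i, v2 + j);
        }
    mean1 /= area;
    mean2 /= area;
    double cov = 0.0, var1 = 0.0, var2 = 0.0;
    for (int j = -m; j <= m; ++j)
        for (int i = -n; i <= n; ++i) {
            double d1 = I1.at(u1 + i, v1 + j) - mean1;
            double d2 = I2.at(u2 + i, v2 + j) - mean2;
            cov  += d1 * d2;
            var1 += d1 * d1;
            var2 += d2 * d2;
        }
    if (var1 <= 0.0 || var2 <= 0.0) return 0.0;   // flat window: no reliable score
    return cov / std::sqrt(var1 * var2);
}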

Figure 3.21: Finding correspondence along the epipolar line


Because of the inherent properties of the two cameras, the images differ in their sensitivity to the same color. Furthermore, the cameras are set with automatic focus, exposure, iris and so on. For example, as shown in figure 3.22, when back light appears in area 1, the right camera adapts to the change, which makes the two images differ from each other considerably. Another problem is caused by the high vergence of the cameras, in that the image from each camera shows a different view of the object (in area 3), and such views contain different information about the object. This makes it hard to select correspondences from the candidates when the hand is too close to the robotic head. Normally, in a lab or lecture room, this situation seldom appears and the images from both of the cameras are similar.

Figure 3.22: View overlapping when cameras verge


Based on the assumption that the relative distances among neighboring points on the hand do not change dramatically, the feedback from the depth calculation introduced in the next section can be used to reject some of the unreliable correspondences.

3.7.4 3D Orientation

Based on the parameters of the robotic head and cameras, which include the vergence and tilt of the cameras and the rotation of the neck, the epipolar geometry of the camera system is constructed. With pairs of correspondences in the two images, the depth of each point on the hand can be calculated. But the correspondences cannot be detected perfectly. Occlusion and unequal lighting on the object in each image can make the error of the correspondence detection even worse. For the majority of hand gestures, the depths of the points on the hand do not vary too much; normally the variation is within 10 cm. Since we have the pairs of correspondences in the two images from the previous step, the lines connecting each correspondence and the optical center in each image intersect the corresponding line from the other image at the object point. In figure 3.23, the lines $mO_l$ and $m'O_r$ intersect at $M$. By solving the line equations, we can get the 3D position in the coordinate system $C$, regardless of the pan and tilt.
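Because of calibration and correspondence noise, the two back-projected rays rarely intersect exactly; a common choice (and an assumption here, since the thesis only states that the line equations are solved) is to take the midpoint of the shortest segment between the two rays. A self-contained C++ sketch:

#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 sub(const Vec3& a, const Vec3& b) { Vec3 r = {a.x - b.x, a.y - b.y, a.z - b.z}; return r; }
Vec3 add(const Vec3& a, const Vec3& b) { Vec3 r = {a.x + b.x, a.y + b.y, a.z + b.z}; return r; }
Vec3 scale(const Vec3& a, double s)    { Vec3 r = {a.x * s, a.y * s, a.z * s}; return r; }
double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Rays O_l + s*d_l and O_r + t*d_r through the correspondences; the midpoint
// of the shortest segment between them is returned as the reconstructed M.
Vec3 triangulate(const Vec3& Ol, const Vec3& dl, const Vec3& Or, const Vec3& dr) {
    Vec3 w = sub(Ol, Or);
    double a = dot(dl, dl), b = dot(dl, dr), c = dot(dr, dr);
    double d = dot(dl, w),  e = dot(dr, w);
    double denom = a * c - b * b;                 // ~0 when the rays are parallel
    double s = (denom != 0.0) ? (b * e - c * d) / denom : 0.0;
    double t = (denom != 0.0) ? (a * e - b * d) / denom : 0.0;
    Vec3 p1 = add(Ol, scale(dl, s));
    Vec3 p2 = add(Or, scale(dr, t));
    return scale(add(p1, p2), 0.5);
}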



Figure 3.23: Transformations to the head coordinate system

After applying the two rotation matrices (tilt and pan)

$$R_t = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos b_t & -\sin b_t \\ 0 & \sin b_t & \cos b_t \end{bmatrix}, \qquad R_p = \begin{bmatrix} \cos a_p & 0 & \sin a_p \\ 0 & 1 & 0 \\ -\sin a_p & 0 & \cos a_p \end{bmatrix},$$

the coordinates are transformed to coordinate system A, whose origin is located at the intersection of the baseline and the neck of the head.


Furthermore, the orientation of the hand in 3D space is computed by a simple line equation. If the vector formed by the average location of the fingertip pixels is $F$ and the vector of the average location of the palm pixels is $P$, the orientation of the hand is given by $H = F - P$, as shown in figure 3.24.
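A small C++ sketch of this computation, using angle conventions analogous to those in the experiments of section 4.4; the type and function names are illustrative.

#include <cmath>

struct Vec3 { double x, y, z; };

// Hand direction H = F - P, where F is the mean 3D position of the fingertip
// pixels and P the mean position of the palm pixels, expressed as a rotation
// projected onto the xz plane and an elevation within the vertical plane.
struct HandOrientation { double horizontalDeg, verticalDeg; };

HandOrientation orientation(const Vec3& F, const Vec3& P) {
    Vec3 H = {F.x - P.x, F.y - P.y, F.z - P.z};
    const double RAD2DEG = 180.0 / 3.14159265358979323846;
    HandOrientation o;
    o.horizontalDeg = std::atan2(H.z, H.x) * RAD2DEG;                              // in the xz plane
    o.verticalDeg   = std::atan2(H.y, std::sqrt(H.x * H.x + H.z * H.z)) * RAD2DEG; // elevation
    return o;
}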

Figure 3.24: 3D orientation of the hand


3.8 System Architecture and Implementation

The tracker is intended to operate in the GestureCAM system introduced in chapter 1. Figure 3.25 shows the main process of the system. The processes of the tracker for the right and left cameras are symmetric; the figure only shows the details of one of them, which is circled by a dashed oval. The tracker runs on a normal desktop computer equipped with an AMD 1.2GHz CPU, 512 MB of memory, and two frame grabbers used to capture images from the stereo cameras. The operating system is Windows 2000. The program is developed under Visual C++ 6.0 with the Intel Image Processing Library. Due to the independence of the trackers for the two cameras, they are implemented as two separate threads which sample, propagate and measure the density of the hand state distribution in each image over time. These two threads are synchronized before the calculation of the 3D orientation starts. The whole process works iteratively, beginning with reading image data and ending with a 3D orientation of the hand.
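The two-thread structure described above could be sketched as follows; std::thread is used here as a modern stand-in for the Win32 threads of the original Windows 2000 / Visual C++ 6.0 implementation, and the function bodies are placeholders for the steps detailed earlier in this chapter.

#include <thread>

// Per-camera tracking pass: grab a frame, propagate the samples through the
// dynamic model, and evaluate the measurement model (sections 3.5 and 3.6).
void trackOneCamera(int cameraId) {
    (void)cameraId;
    // grabFrame(cameraId); propagateSamples(); measureSamples();
}

// One iteration of the system loop: the left and right trackers run in
// parallel threads and are joined (synchronized) before the 3D orientation of
// the hand is computed from the pair of contours.
void trackerIteration() {
    std::thread left(trackOneCamera, 0);
    std::thread right(trackOneCamera, 1);
    left.join();
    right.join();
    // searchCorrespondences(); compute3DOrientation();
}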



Figure 3.25: Tracker Diagram


Chapter 4 Experiments and Discussion


In order to analyze the major factors which affect the performance of the hand tracker, a series of experiments is presented in this chapter, followed by analysis and discussion of the results. The hand tracker was run under different lighting and background clutter conditions to show its robustness. The accuracy and complexity of estimating the 3D orientation are presented in the following sections.

4.1 Accuracy of Tracking

Figure 4.1 shows the relation between tracking accuracy and the number of samples used in the CONDENSATION tracker. In this set of experiments, the number of samples used in the tracker ranges from 10 to 5000, while other parameters were kept unchanged. The tracker followed a hand moving over a dark background; in other words, there was almost no background clutter that could distract the tracker. In this controlled situation, the factors affecting the accuracy are the models and the number of samples in the CONDENSATION algorithm. Three kinds of hand motion (translation, rotation and scaling) were examined. An image processing specific measure, from Tissainayagam and Suter [50], is employed to assess the accuracy of the tracking process (i.e., the accuracy of the shape, position and orientation of the tracked contour). Thus the error measure is independent of the contour representation. The hand contour resulting from the tracking process is rendered and filled in with the foreground color (white) into the image $I_{track}$. It is more appropriate to measure the signal in terms of the area of foreground pixels in the ground truth image, which was obtained by applying the same skin color filter to the image frame and manually marking the hand region. The signal and noise are calculated using the following quantities:

$$signal = 2\sum_{images}\sum_{x,y}[I_{ref}(x, y)]^2 \qquad\qquad (4.1)$$

$$noise = \sum_{images}\sum_{x,y}[I_{ref}(x, y) - I_{track}(x, y)]^2$$

$I_{ref}$ is the pixel value at $(x, y)$ in the ground truth image. The pixel value for a background pixel is 0. The scale factor of 2 in the signal value was chosen so that a Signal to Noise Ratio (SNR) of 0 (i.e., signal = noise) would occur if the tracker silhouette consisted of a shape of the same area as the ground truth shape but placed so inaccurately that there is no overlap between the two. This is the worst case scenario, where the tracker has completely failed to track the object. The output SNR (in dB), denoted $SNR_{out}$, is calculated using the following equation:

$$SNR_{out}(dB) = 10\log\frac{signal}{noise} \qquad\qquad (4.2)$$
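Equations 4.1 and 4.2 translate directly into code. The C++ sketch below assumes binary silhouette masks of equal size (1 for hand, 0 for background); the names are illustrative.

#include <cmath>
#include <vector>

typedef std::vector<int> Mask;   // binary silhouette: 1 = hand, 0 = background

// SNR of a tracked silhouette against the ground truth mask, following
// equations 4.1 and 4.2: signal = 2 * sum(ref^2), noise = sum((ref - track)^2).
double snrDb(const Mask& ref, const Mask& track) {
    double signal = 0.0, noise = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        signal += 2.0 * double(ref[i]) * double(ref[i]);
        double d = double(ref[i]) - double(track[i]);
        noise += d * d;
    }
    if (noise == 0.0) noise = 1e-12;     // perfect overlap would give infinite SNR
    return 10.0 * std::log10(signal / noise);
}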

From the graph, we can see that the percentage of successfully tracked frames increases with the number of samples. In particular, when the number of samples goes from 10 to 1000, the accuracy improves significantly. After 1000, the accuracy asymptotically converges; from this point on, having more samples of the state distribution does not yield major improvement. Theoretically, a perfect tracking result would make the noise equal to 0, causing the SNR in (4.2) to go to infinity. In fact, there is inherent noise introduced by the shape representation, because the hand contour is represented by a spline curve which is not exactly the hand shape edge. Five image sequences with different combinations of hand motion were tested. The solid curve in figure 4.1 shows the accuracy of tracking a hand against a clutter-less background, while the dotted curve shows the accuracy with a more highly cluttered background. This experiment shows that increasing the number of samples in the CONDENSATION algorithm can improve the tracking accuracy to a certain extent. Another factor affecting the tracking is the noise in the image. Due to the use of the skin color filter, the more clutter in the background, especially skin-colored clutter, the more likely the tracker is distracted by the noise. In this experiment, different light sources produced different background illumination. The extent of the clutter was measured by the average percentage of pixels in skin color.

number of samples        20     50     100    500    1000   2000   3000   5000
SNR_noclutter (dB)       6.03   6.73   9.14   10.59  10.73  11.75  11.76  11.80
SNR_highclutter (dB)     4.74   6.16   7.80   8.61   8.74   8.82   8.21   9.75

Table 4.1: Experimental result on accuracy



Figure 4.1: Accuracy vs. number of samples: the solid curve shows the accuracy of tracking a hand with an uncluttered background (0.03% of the pixels are skin color), the red curve shows the accuracy with a lightly cluttered background (3.74% of the pixels are skin color), and the thick green curve shows the accuracy with a highly cluttered background (9.35% of the pixels are skin color); the error bars are the standard deviations of the experimental results.


4.2 Computational Complexity

In Isard [23] the use of the random-sampling algorithm causes one iteration of the CONDENSATION algorithm to have formal complexity O(N log N). N is the number of samples for each iteration and log N is the cost of randomly picking a sample from the base sample set by using binary subdivision. The graph in figure 4.2 shows the relation between the number of samples used in the tracker and the computation time, using the same image sequences as in section 4.1. The computation of the tracker in each frame was timed with respect to the different numbers of samples used in the tracker. The curve in the graph shows a roughly linear relation between complexity and number of samples. This result is independent of the complexity of the image or the hand motion. Thus, the computational cost of tracking is stable. The graph in figure 4.1 shows why there is always a trade-off between accuracy and complexity. Since the samples go through the filter independently, the computation of each sample can be parallelized, and the images from each camera can also be processed independently.

number of samples            20     50     100    200    500    1000   2000   3000
time of computation (sec)    0.36   0.54   0.70   0.98   2.21   4.11   7.67   12.04

Table 4.2: Experimental result on complexity
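The O(N log N) behaviour comes from drawing each of the N samples by a binary search over the cumulative weights. A minimal C++ sketch of such a resampling step is given below; the function and variable names are illustrative.

#include <algorithm>
#include <random>
#include <vector>

// Factored sampling step: pick N indices from the previous sample set with
// probability proportional to their weights.  Building the cumulative weight
// table is O(N) and each draw is a binary search, giving O(N log N) per
// iteration as discussed above.
std::vector<int> resampleIndices(const std::vector<double>& weights,
                                 std::mt19937& rng) {
    std::vector<double> cumulative(weights.size());
    double total = 0.0;
    for (size_t i = 0; i < weights.size(); ++i) {
        total += weights[i];
        cumulative[i] = total;
    }
    std::uniform_real_distribution<double> uni(0.0, total);
    std::vector<int> picked(weights.size());
    for (size_t i = 0; i < weights.size(); ++i) {
        double u = uni(rng);
        picked[i] = int(std::lower_bound(cumulative.begin(), cumulative.end(), u)
                        - cumulative.begin());
    }
    return picked;
}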



Figure 4.2: Computational complexity vs. number of samples: the error bars are the standard deviation of the experimental results


4.3 Experimental Results of Tracking on Real Images

Given sequences of images taken from the stereo cameras under different lighting and background clutter, the tracking experiments show how these two factors affect the result. In each of the following images, the distribution of the state (translation in x and y, rotation, scaling) is shown on the top, left, bottom and right boundaries, respectively. Finally, a set of experiments shows the performance of estimating the 3D orientation.

4.3.1 Performance of Tracker with Low Cluttered Background

When there is no skin color clutter in the background, the hand motion can be tracked more accurately in both cameras. In this experiment, an average of 0.03% of the background pixels are skin color. 1000 samples were used in the tracking algorithm. The pairs of images in Figures 4.3 to 4.9 show tracking using a black curtain as background. There is no back lighting in the scene. The hand moved up and down and towards the cameras. The tracker kept track of the hand successfully. The factors that affect the result are the lighting conditions and the vergence of the cameras. These may change the appearance of the hand shape in each camera during tracking, which introduces error for a rigid contour tracker. The frame rate is 10 frames/second. The accuracy of the tracking result is shown by the solid curve in figure 4.1.


Figure 4.3: Frame 5

Figure 4.4: Frame 15


Figure 4.5: Frame 25

Figure 4.6: Frame 35


Figure 4.7: Frame 45

Figure 4.8: Frame 55


Figure 4.9: Frame 65


4.3.2 Performance of Tracker with Lightly Cluttered Background

Without the black curtain, but using only one main light source, the clutter in the background increases. In the following images, the tracker works well until the hand moves in front of the face. About 3.74% of the background pixels are in skin color. When the hand occludes the face, that is, an object in the background with a similar color distribution, the tracker loses the hand. 1500 samples were used in the tracking algorithm. In the pairs of images from 4.10 to 4.16, the cameras moved while the tracker followed the hand. The accuracy of the tracking result is shown by the dotted curve in figure 4.1.

Figure 4.10: Frame 30


Figure 4.11: Frame 40

Figure 4.12: Frame 50


Figure 4.13: Frame 60

Figure 4.14: Frame 70


Figure 4.15: Frame 80

Figure 4.16: Frame 90


4.3.3 Performance of Tracker with Highly Cluttered Background

The images in figures 4.17 to 4.23 are frames from tracking a hand in a lab situation, in which the background is cluttered and both of the cameras move actively. The light sources are a fluorescent lamp on the ceiling and an incandescent lamp in front of the subject, placed to reduce the backlight effect. Approximately 9.35% of the background pixels are in skin color. 1500 samples were used in the tracking algorithm. From the result we found that the motion of the cameras hardly affected the tracking. The accuracy of the tracking result is shown by the thick dashed curve in figure 4.1.

Figure 4.17: Frame 20


Figure 4.18: Frame 30

Figure 4.19: Frame 40


Figure 4.20: Frame 50

Figure 4.21: Frame 60


Figure 4.22: Frame 70

Figure 4.23: Frame 80


4.4 Experiments on 3D Orientation

The tracking of the hand from the camera images is in two dimensions. By using epipolar geometry together with the intrinsic information of the camera system, we can get the 3D location of the object in view. There are mechanical errors introduced by the motors and the camera system; that is, the information we get from the server indicating the state of the motors might not reflect the true rotation of the camera system at the moment the images were captured. When the object is far from the cameras, the error in estimating the location may become very large. The experimental setup is shown in figure 4.24. The thick arrow in the lower image represents the hand and arm, while the point at the end of the arrow indicates where the elbow is located. The arm moves within plane A, which is perpendicular to the xz plane. In other words, the hand moves in a vertical plane. The dashed arrow and planes indicate the possible positions of the arm in the experiments. Plane A changes its orientation and distance to the cameras as shown in the lower diagram. Four different orientations of rotation with respect to the y axis, and 5 angles within plane A, were tested. Furthermore, three positions (z = 860 mm, 1060 mm and 1250 mm) of the plane parallel to the xy plane were used to test the accuracy of the depth. To minimize the effect of other noise during tracking, the background was pure black and the light source was in front of the hand.



Figure 4.24: System setup for experiment on 3D orientation


Figure 4.25 shows the experimental results on distance estimation, i.e., calculating the distance from the camera to the hand along the z axis. The hand moved in a vertical plane parallel to the xy plane. The cameras were fixed with 10 degrees of vergence and 0 degrees of tilt and pan. We found that the estimate is linear in the real distance, with a slope of about 1.3. The deviation of the estimate gets larger as the distance from the camera to the hand increases. There are several major sources of error in the depth estimation. First, the error in the calibration of the stereo cameras is a dominant factor affecting the accuracy. Second, the rotation angle of the motor which drives the camera is not very accurate due to mechanical errors. Third, the corresponding-point search is based on estimating the correlation of the pixels along the epipolar line; mismatching a pair of points may introduce a large error in the calculated depth. Additionally, errors in the tracking results may aggravate the inaccuracy in searching for corresponding points. The figure also shows that the depths of the points on a frontoparallel plane cluster around the estimated value, which gives a resolution of less than 10cm within a distance of 1 meter.


Real depth (mm)         860.00     1060.00    1250.00
Estimated depth (mm)    1088.20    1439.80    1743.40

Figure 4.25: Real distances vs. estimated distance


There are 5 experimental results on estimating the rotation of the arm projected onto the xz plane, shown in figure 4.26, computed using the formula $\arctan\left[\frac{z_h - z_e}{x_h - x_e}\right]$, where $(x_h, y_h, z_h)$ is the average position of the points on the hand and $(x_e, y_e, z_e)$ is the position of the elbow. In the experiment, the vertical plane was placed at 5 different angles, while the elbow was 1 meter from the camera along the z axis. The angles are -45°, -22.5°, 0°, 22.5° and 45°. The cameras were fixed during the experiment with 10 degrees of vergence and 0 degrees of tilt and pan. The deviation of the estimation becomes larger as the plane turns away from the plane parallel to the xy plane. When the hand moved in the 45° or -45° plane, the distortion of the hand shape was more significant than at 0°. The large error in the estimation of the points' z coordinates worsens this calculation.


Figure 4.26: Orientation in xz plane


Figure 4.27 shows the relation between the estimated and the real rotation of the arm within a vertical plane. The angle is computed by

$$angle_v = \arctan\left[\frac{y_h - y_e}{\sqrt{(x_h - x_e)^2 + (z_h - z_e)^2}}\right],$$

where $(x_h, y_h, z_h)$ is the average position of the points on the hand and $(x_e, y_e, z_e)$ is the position of the elbow. The maximal standard deviation is less than 2 degrees. The maximal error from the real value is less than 10 degrees. Compared to the previous graph, which shows a high deviation in estimating rotation in the xz plane, tracking motion in a vertical plane is much more accurate than tracking horizontal motion, since the horizontal estimate relies more heavily on the less accurate depth estimation.


Figure 4.27: The orientation of the arm in the vertical plane



Figure 4.28 shows the estimation of the arm orientation projected on the xz plane. The hand moved in the vertical plane at 3 depth positions and 4 horizontal rotations. Figure 4.29 shows a 3D view of the experiment result. The arm rotating in the vertical plane is illustrated as a line. All the lines converge at a point where the elbow is located.


Figure 4.28: Arm orientation projected onto the xz plane. Measurements are taken at 860mm (red), 1060mm (black) and 1250mm (green), and at -45° (blue), -22.5° (cyan), 22.5° (yellow) and 45° (pink). The vertex in each color represents the position of the elbow in each experiment.


 


Figure 4.29: A 3D view of the experimental results on tracking shown in Figure 4.28. Measurements are taken at 860mm (red), 1060mm (black) and 1250mm (green), and at -45° (blue), -22.5° (cyan), 22.5° (yellow) and 45° (pink).


4.5 Summary

From the above experiments, we found that the following factors affect the performance of the hand tracker.

1. To some extent, increasing the number of samples used in the CONDENSATION algorithm can improve the accuracy.

2. Clutter in the background reduces the accuracy.

3. The linear relation between the number of samples and the computational cost implies that there is a trade-off between accuracy and complexity. Tracking speed is about 1 frame/second on a normal Pentium3 937MHz desktop.

4. As can be seen from the tracking results on the real image sequences, the tracker estimates the translation, rotation and scale of the hand contour in the stereo images. The lighting condition changes the clutter in the background as well as the distribution of the skin color model. The actively moving cameras introduce more noise into the motion model.

5. The images in section 4.3 show the tracking results with the distribution of each dimension of the state space. The shape of the density function changes over tracking iterations. Although initialized as a Gaussian distribution (a reasonable general guess), it becomes non-Gaussian quickly when the background gets cluttered.

6. In the experiments on estimating the hand orientation in 3D space, the noisy depth calculation, caused by errors in camera calibration and in searching for corresponding points, enlarges the errors in estimating the position. The points on the hand seem to cluster when the estimated positions in the vertical plane are projected onto the horizontal plane, as shown in figure 4.28, but the resolution is about 10cm at a distance of 1 meter.

7. The hand motion in 3D space is projected onto the 2D image plane of each camera. The rotation with respect to the z axis is estimated by the rotation parameter of the hand state, while rotations with respect to the y or x axis are reflected in the scaling parameter. The last two kinds of rotation may change the aspect ratio of the hand shape, so that the tracking result becomes less accurate.


Chapter 5 Discussion and Future Work


In this thesis, we presented a hand tracker based on the CONDENSATION algorithm. It tracks a hand in a rigid gesture in an active vision system and gives the 3D orientation of the hand in a lab situation. By applying an adaptive bivariate normal model in normalized RG color space, it reduces the non-skin-color clutter. The initial hand contour is detected by simple subtraction of two consecutive frames followed by skin color filtering and a morphological operation. The tracker estimates the translation, rotation and scaling of such contours according to the CONDENSATION algorithm. Using the factored sampling technique, it can deal with arbitrary distributions of the state. The samples go through the dynamic model, combined with the camera motion, to generate a set of hypotheses. The measurement model estimates the likelihood of each hypothesis given the features in the image. The tracker measures the feature points on the normals along the contour with a bias toward finger points, searching for the nearest feature points from the interior to the exterior based on the knowledge of the hand shape acquired at the initialization stage. The weights of the hypotheses are normalized and the state distribution is updated as the basis for the next iteration.


The weighted mean of the hypotheses is the estimate of the current hand contour state. As a cue for searching for correspondences in the pair of stereo images, the hand contour narrows the range of candidates. By applying epipolar geometry and the correlation algorithm, the 3D positions of the points on the hand are calculated. The tracker based on the CONDENSATION algorithm works well against cluttered and globally moving backgrounds. The accuracy ranges from about 9dB to about 12dB measured by the SNR introduced in section 4.1, using 2000 samples. Due to the noise introduced by the calibration of the stereo cameras and the corresponding points, the average resolution in depth is more than 10 cm when the distance is above 1 meter. Tracking speed is about 1 frame/second on a normal Pentium3 937MHz desktop. In the GestureCAM project, the robotic head may be placed at the back of a lecture room tracking a speaker standing at the front of the room. The tracker can guide the stereo cameras to fixate on the moving hand with appropriate zooming. Zooming changes the focal length, which then needs to be recalibrated. The baseline between the two cameras in the epipolar geometry is far from the object, so the stereo cameras are almost parallel to each other and the major motion is head panning. Since the vergence of the cameras is small, the appearance of the object in both images is almost the same, which helps in searching for corresponding points on the hand shape.


5.1 Future Work

In this implementation we assume that the gesture of the hand is rigid. In the real world, a hand can perform more complicated motions and may appear differently due to changes in lighting and perspective. By extending the state space to higher dimensions, or by applying a deformable model, more types of hand motion and gesture could be tracked. The hand tracker presented in this thesis can only track a single gesture, the full palm with the pointing finger up. It could deal with more gestures and gesture switching if there were a mechanism to evaluate which model best fits the hand gesture. The computational complexity is proportional to the number of samples used in the CONDENSATION tracker. In our implementation, we found that generating the hypothesized contours and the normals takes a large part of the total running time. Since the measurement of each hypothesis is independent, it could be computed by a parallel algorithm in the future in order to run in real time while keeping good accuracy. The accurate calibration of an active stereo camera, estimating the intrinsic and extrinsic parameters, is crucial for tracking objects in 3D space. The complexity of calibrating an active camera is higher than for a passive one; it includes processes such as motorized lens calibration, kinematic calibration and head/eye calibration. A well-calibrated stereo vision system would not only dramatically reduce the complexity of the stereo correspondence problem but also significantly reduce the 3D estimation error.


Bibliography
[1] Y. Azoz, L. Devi, and R. Sharma. Tracking hand dynamics in unconstrained environments. In Proc. Third International Conference on Automatic Face and Gesture Recognition, pages 274-279, Nara, Japan, April 1998.
[2] S. S. Beauchemin and J. L. Barron. The computation of optical flow. ACM Computing Surveys, 27(3):433-467, 1995.
[3] S. Benford, J. Bowers, L. E. Fahlén, C. Greenhalgh, and D. Snowdon. User embodiment in collaborative virtual environments. In Proc. ACM Conf. Human Factors in Computing Systems, CHI, volume 1, pages 242-249, 1995. URL citeseer.nj.nec.com/benford95user.html.
[4] L.M. Bergasa, M. Mazo, A. Gardel, M.A. Sotelo, and L. Boquete. Unsupervised and adaptive Gaussian skin-color model. Image and Vision Computing, 18:987, 2000.
[5] A. Blake and M. Isard. 3D position, attitude and shape input using video tracking of hands and lips. In Proceedings of ACM Siggraph, pages 185-192, 1994.
[6] A. Blake, B. North, and M. Isard. Learning multi-class dynamics, 1998. URL citeseer.nj.nec.com/article/blake98learning.html.
[7] G.R. Bradski. Real time face and object tracking as a component of a perceptual user interface. In Proceedings of Fourth IEEE Workshop on Applications of Computer Vision (WACV) 98, pages 214-219, 1998.


[8] P. J. Burt, J. R. Bergen, R. Hingorani, R. J. Kolczynski, W. A. Lee, A. Leung, J. Lubin, and J. Shvaytser. Object tracking with a moving camera. In IEEE Workshop on Visual Motion, pages 2-12, Irvine, CA, 1989.
[9] H.D. Cheng, X.H. Jiang, Y. Sun, and J. Wang. Color image segmentation: advances and prospects. Pattern Recognition, 34:2259, 2001.
[10] D. Comaniciu and V. Ramesh. Mean shift and optimal prediction for efficient object tracking. In Proceedings of International Conference on Image Processing, volume 3, pages 70-73, 2000.
[11] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR'00), volume 2, pages 142-149, Hilton Head Island, South Carolina, 2000.
[12] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32-80, Jan 2001.
[13] F. Dellaert and C. Thorpe. Robust car tracking using Kalman filtering and Bayesian templates. In Conference on Intelligent Transportation Systems, 1997.
[14] K. G. Derpanis. Vision based gesture recognition within a linguistics framework. Master's thesis, Computer Science, York University, May 2003.
[15] K. Dorfmuller-Ulhaas and D. Schmalstieg. Finger tracking for interaction in augmented environments. In Proceedings of the IEEE and ACM International Symposium on Augmented Reality 2001, pages 55-64, 2001.


[16] B. Dorner. Chasing the colour glove: Visual hand tracking. Master's thesis, Computer Science, Simon Fraser University, June 1994.
[17] Y. Fang and T. Tan. A novel adaptive colour segmentation algorithm and its application to skin detection. In Proceedings of The Eleventh British Machine Vision Conference, volume 1, pages 23-31, September 2000.
[18] A. R. J. Francois and G. G. Medioni. Adaptive color background modeling for real-time segmentation of video streams. In Proceedings of the International Conference on Imaging Science, Systems, and Technology, pages 227-232, Las Vegas, Nevada, 1999.
[19] U. Grenander, Y. Chow, and D. M. Keenan. Hands: a pattern theoretic study of biological shapes. Springer-Verlag, New York, 1991.
[20] S. Gutta, J. Huang, I. Imam, and H. Weschler. Face and hand gesture recognition using hybrid classifiers, 1996. URL citeseer.nj.nec.com/gutta96face.html.
[21] R. Herpers, K. Derpanis, D. Topalovic, J. MacLean, A. Jepson, and J. Tsotsos. Adaptive color background modeling for real-time segmentation of video streams. In Workshop Dynamische Perzeption, Universitaet Ulm, Germany, 2000.
[22] R. Herpers, G. Verghese, K. Darcourt, K. Derpanis, R. F. Enenkel, J. Kaufman, M. Jenkin, E. Milios, A. Jepson, and J. K. Tsotsos. An active stereo vision system for recognition of faces and related hand gestures. In Second Int. Conference on Audio- and Video-based Biometric Person Authentication, pages 217-223, Washington, D.C., 1999.
[23] M. Isard. Visual Motion Analysis by Probabilistic Propagation of Conditional Density. PhD thesis, Robotics Research Group, Department of Engineering Science, University of Oxford, 1998.
[24] M. Isard and A. Blake. ICondensation: Unifying low-level and high-level tracking in a stochastic framework. In European Conference on Computer Vision, pages 893-908, 1998.
[25] S. Jehan-Besson, M. Barlaud, and G. Aubert. Region-based active contours for video object segmentation with camera compensation. In Proceedings of 2001 International Conference on Image Processing, volume 2, pages 61-64, 2001.
[26] M. J. Jones and J. M. Rehg. Statistical color models with application to skin detection. In Computer Vision and Pattern Recognition (CVPR 99), pages 274-280, Ft. Collins, CO, 1999.
[27] S. Julier and J. Uhlmann. A general method for approximating nonlinear transformations of probability distributions, 1996. URL citeseer.nj.nec.com/julier96general.html.
[28] Z. Kalafatic, S. Ribaric, and V. Stanisavljevic. A system for tracking laboratory animals based on optical flow and active contours. In Proc. 11th International Conference on Image Analysis and Processing, ICIAP 2001, pages 334-339, Palermo, Italy, September 2001.
[29] M. Kampmann. Segmentation of a head into face, ears, neck and hair for knowledge-based analysis-synthesis coding of videophone sequences. URL citeseer.nj.nec.com/227514.html.
[30] W. Kim and J. Lee. Visual tracking using snake for object's discrete motion. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation, Seoul, Korea, 2001.
[31] W. H. Leung, K. Goudeaux, S. Panichpapiboon, S.-B. Wang, and T. Chen. Networked intelligent collaborative environment (NetICE). In IEEE Intl. Conf. on Multimedia and Expo, New York, July 2000.
[32] J. MacCormick and A. Blake. A probabilistic contour discriminant for object localisation. In International Conference on Computer Vision, pages 390-395, 1998.
[33] J. Martin, V. Devin, and J. Crowley. Active hand tracking. In IEEE Third International Conference on Automatic Face and Gesture Recognition, FG 98, April 1998. URL citeseer.nj.nec.com/martin98active.html.
[34] E. B. Meier and F. Ade. Tracking cars in range images using the CONDENSATION algorithm. In IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems ITSC'99, pages 129-134, Tokyo, Japan, October 1999.
[35] K. Oka, Y. Sato, and H. Koike. Real-time tracking of multiple fingertips and gesture recognition for augmented desk interface systems. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (FGR'02), pages 429-434, 2002.
[36] V. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):677-695, 1997. URL citeseer.nj.nec.com/pavlovic97visual.html.

108

[37] V. Philomin, R. Duraiswami, and L. Davis. Pedestrian tracking from a moving vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, pages 350355, USA, 2000. [38] M. Rauterberg, M. Bichsel, M. Meier, and M. Fjeld. A gesture based interaction technique for a planning tool for construction and design. In 6th IEEE International Workshop on Robot and Human Communication, pages 212217, September 1997. [39] J. M. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: an application to human hand tracking. In European Conference on Computer Vision, pages 3546, 1994. URL citeseer.nj.nec.com/rehg94visual.html. [40] M. Rosenblum, Y. Yacoob, and L. S. Davis. Human emotion recognition from motion using a radial basis function network architecture. In IEEE Work-

shop on Motion of Non-Rigid and Articulated Objects, pages 4349, 1994. URL citeseer.nj.nec.com/rosenblum94human.html. [41] K. Sandeep and A.N. Rajagopalan. Human face detection in clutURL

tered color images using skin color and edge information. citeseer.nj.nec.com/557854.html.

[42] L. Sigal, S. Sclaro, and V. Athitsos. Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2000. URL citeseer.nj.nec.com/article/sigal00estimation.html. [43] S. M. Smith. Asset-2: visual tracking of moving vehicles. In IEEE Colloquium on Image Processing for Transport Applications, 1993. 109

[44] S. M. Smith. Asset-2: real-time motion segmentation and shape tracking. In Fifth International Conference on Computer Vision, pages 237244, 1995. [45] S. M. Smith and J. M. Brady. Asset-2: real-time motion segmentation and shape tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17: 814 820, Aug 1995. [46] T. Starner and A. Pentland. Real-time American Sign Language recognition from video using hidden Markov models. In International Simposium on Computer Vision, pages 265270, 1995. [47] T. Starner, J. Weaver, and A. Pentland. A wearable computer based American Sign Language recognizer. pages 130137, 1997. [48] B. Stenger, P. R. S. Mendona, and R. Cipolla. Model-based hand tracking c using an unscented Kalman lter. In Proc. British Machine Vision Conference, volume I, pages 6372, Manchester, UK, September 2001. [49] H. Tao, H. S. Sawhney, and R. Kumar. Object tracking with Bayesian estimation of dynamic layer representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):75 89, 2002. [50] P. Tissainayagam and D. Suter. Performance measures for assessing contour trackers. International Journal of Image and Graphics, 2:343359, April 2002. [51] G. Welch and G. Bishop. An introduction to the Kalman lter. Technical Report TR 95-041, Department of Computer Science, University of North Carolina, NC, USA, 2002.
[52] G. Xu. Epipolar geometry in stereo, motion, and object recognition: a unified approach. Kluwer Academic Publishers, 1996.
[53] K. Yachi, T. Wada, and T. Matsuyama. Human head tracking using adaptive appearance models with a fixed-viewpoint pan-tilt-zoom camera. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000.
[54] J. Yang, W. Lu, and A. Waibel. Skin-color modeling and adaptation. In Proceedings of ACCV'98, pages 687-694, 1998.