We introduce a system that tracks a user's hand and head in 3D and in real time for use with a large tiled display. The system uses overhead stereo cameras and does not require the user to wear any equipment. Hands are detected using Haar-like features with an Adaboost classifier, and heads are detected using a Hough transform generalized for circles. Three-dimensional positions for the hand and head are obtained by correspondence matching between the paired cameras. Finally, a 3D vector is extrapolated from the centroid of the head to the hand; the projection of this vector onto the large tiled display gives the pointing location.
Contents
1 Introduction
2 Hardware Setup
  2.1 Stereo Cameras
  2.2 IR Filters
3 Camera Calibration
  3.1 Why is Calibration Needed?
  3.2 Extrinsic Parameters
    3.2.1 Homogeneous Coordinates
    3.2.2 Homogeneous Extrinsic Parameters
  3.3 Intrinsic Parameters
  3.4 Rectification
  3.5 Calibration Implementation
4 Hand Detection
  4.1 Cascaded Adaboost Classifiers
    4.1.1 Adaboost
    4.1.2 Cascaded Classifiers
  4.2 Training Data and Results
5 Head Detection
  5.1 Edge Detection
    5.1.1 Image Gradient
    5.1.2 Canny Edge Detector
    5.1.3 Hough Transform
  5.2 Head Detection Results
6 Tracking
  6.1 Tokens
  6.2 Token Updates
  6.3 Object Selection
7 Stereopsis
  7.1 Overview
  7.2 Epipolar Geometry
  7.3 Correspondence Problem
  7.4 Range Calculations
  7.5 Range Resolution
8 Vector Extrapolation
Chapter 1
Introduction
Today, scientific computing generates large data sets and high-resolution imagery. A biological sample under an electron microscope can easily produce gigabytes of high-resolution 3D images, and advances in remote sensing have made large amounts of high-resolution geospatial imagery widely available. These factors have created the need to visualize such imagery data, and many research groups have built large tiled displays in response. The human-computer interface for large tiled displays, however, remains an open problem: using a traditional mouse to navigate the data is infeasible. The NCMIR group at Calit2, located at the University of California, San Diego, has built an alternative human-computer interaction system using a hand controller and head gear, but requiring such equipment places a restrictive burden on the user.
Our objective is to create a hand tracking system driven by overhead stereo cameras to control a large display. Our system is vision-based and does not require any external sensors on the user. A few constraints ease our objective: because our cameras are fixed overhead, there is little background noise, which aids hand detection. We developed a hand and arm tracking system called HATS. An overview of our system is depicted in Figure 1.1.
Figure 1.1: System Overview
Chapter 2
Hardware Setup
We place our camera system over the user. Other possible configurations would have been placing both cameras in front of the user, or one camera above the user and one in front. We selected the overhead configuration for its better performance and ease of use: since the cameras point at the ground, there is very little background noise. Had we placed the cameras in front of the user, we would have had to deal with a noisy background, e.g. people walking by.
The cameras are mounted 364.57 mm apart and are slightly verged in, in order to have the largest shared field of view. Each camera is verged in by 0.0392 radians, so the two cameras converge on the same point in space about 4.6 m below them. This configuration was chosen because it provides good range resolution while retaining a large shared field of view. The choice of these parameters is discussed further in the 'Stereopsis' chapter.
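As a quick sanity check of these numbers, the depth of the convergence point follows from the vergence geometry: each optical axis is rotated inward by the vergence angle, so the two axes cross where each has swept half the baseline horizontally. A small Python sketch, using the values quoted above:

```python
import math

# Stereo rig geometry from this chapter.
baseline_mm = 364.57   # camera separation
vergence_rad = 0.0392  # inward rotation of each camera

# Each optical axis is tilted inward by the vergence angle, so the axes
# meet where tan(vergence) = (baseline / 2) / depth.
depth_mm = (baseline_mm / 2.0) / math.tan(vergence_rad)

print(round(depth_mm / 1000.0, 2))  # depth of the convergence point in metres
```

The result is consistent with the "about 4.6 m" figure stated above.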
2.2 IR Filters
Infrared filters and illuminators were used in our setup in order to provide an enhanced contrast between
the user, specifically skin sections since they show up with high intensity, and the rest of the background.
Additionally, using infrared for illumination nullifies any variation in the lighting in which this setup will be
used, and can also provide proper illumination for the cameras, while not changing the lighting perceived by
the user. Thus, the setup can function properly in a totally dark room.
The IR filters used block out visible light up to 650 nm and transmit 90% of light in the 730 nm to 2000 nm range (near IR). Note that incandescent light bulbs (and perhaps some other types), as well as the sun, emit light in this range and can cause overexposure in the cameras if not carefully controlled.
Figure 2.1: Two Point Grey Research Dragonfly cameras are used to make a stereo pair. The IR filters improve the contrast of skin tones.
Chapter 3
Camera Calibration
The process of stereopsis allows the depth of each pixel in the image to be calculated and thus the 3D
coordinate of each point to be calculated. The stereopsis process will be described in more detail in Chapter
7 of this report.
Figure 3.1 above shows two different coordinate systems. The transformation between the two coordinate
systems in the diagram is an affine transformation, which consists of a rotation followed by a translation.
y = Rx + t (3.1)
The translation component of the transformation is given by the equation below.

    [ tx ]
T = [ ty ]     (3.3)
    [ tz ]
Under the Euclidean coordinate system an affine transformation has to be expressed as two separate transformations, as shown in Equation 3.1. However, if the image coordinates are first converted to the homogeneous coordinate system, then the transformation can be expressed as a single matrix multiplication.
Equation 3.4 below shows the transformation from the Euclidean coordinate system to the homogeneous coordinate system.

(x, y, z) ⇒ (x, y, z, w) = (x, y, z, 1)     (3.4)

Equation 3.5 below shows the transformation from the homogeneous coordinate system to the Euclidean coordinate system.

(x, y, z, w) ⇒ (x/w, y/w, z/w)     (3.5)
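As a concrete illustration, the two conversions and the single-matrix form of the affine transform can be sketched in Python (the function names are our own, not part of any calibration library):

```python
def to_homogeneous(p):
    """(x, y, z) -> (x, y, z, 1), as in Equation 3.4."""
    x, y, z = p
    return (x, y, z, 1.0)

def to_euclidean(p):
    """(x, y, z, w) -> (x/w, y/w, z/w), as in Equation 3.5."""
    x, y, z, w = p
    return (x / w, y / w, z / w)

def transform(R, t, p):
    """Apply y = R p + t (Equation 3.1) as one 4x4 homogeneous multiply."""
    # Build M = [[R, t], [0, 1]] row by row.
    M = [list(R[i]) + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
    ph = to_homogeneous(p)
    yh = tuple(sum(M[i][j] * ph[j] for j in range(4)) for i in range(4))
    return to_euclidean(yh)
```

With an identity rotation and translation (1, 2, 3), the point (1, 1, 1) maps to (2, 3, 4), exactly as the two-step form of Equation 3.1 would give.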
The following equations describe the transformation from the camera coordinate system to the pixels in the image.

xim = −(f/sx) x + ox     (3.7)
yim = −(f/sy) y + oy     (3.8)

where f is the focal length of the camera, (ox, oy) is the coordinate in pixels of the image centre (the principal point), sx is the width of the pixels in millimetres, and sy is the height of the pixels in millimetres.
When written in matrix form, Equations 3.7 and 3.8 yield the following matrix, which is known as the intrinsic matrix.

        [ −f/sx    0      ox ]
Mint =  [   0    −f/sy    oy ]     (3.9)
        [   0      0      1  ]
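A sketch of this projection in Python, assuming the intrinsic matrix is applied in homogeneous coordinates and followed by the perspective divide of Equation 3.5 (the parameter values in the test below are illustrative, not the calibration values of the actual rig):

```python
def project(p_cam, f, sx, sy, ox, oy):
    """Map a camera-frame 3D point to pixel coordinates.

    f      : focal length (mm)
    sx, sy : pixel width / height (mm)
    ox, oy : image centre (pixels)
    """
    x, y, z = p_cam
    # Multiplying (x, y, z) by M_int (Equation 3.9) and dividing by the
    # third homogeneous coordinate (z) gives the pixel location; the
    # divide by z is what makes distant objects project nearer the centre.
    x_im = (-f / sx) * (x / z) + ox
    y_im = (-f / sy) * (y / z) + oy
    return x_im, y_im
```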
3.4 Rectification
Image rectification is a process that transforms a pair of images so that the images have a common image
plane. Once this process has been performed on the stereo image pair that are captured by the overhead cam-
eras, the stereopsis process reduces to a triangulation problem, which is discussed in more detail in Chapter 7.
The image rectification process makes use of the extrinsic and intrinsic parameters of the two cameras to define the transformation that maps the right camera image onto the same image plane as the left camera image. While this process is required to simplify stereopsis, it does result in the rectified objects being warped and distorted.
The image rectification process is performed by combining the extrinsic and intrinsic parameters of the system into one matrix, known as the fundamental matrix. This matrix encodes the epipolar geometry of the camera pair (mapping a point in one image to its corresponding epipolar line in the other) and is used to derive the transformations between the original images and the rectified images on which stereopsis is then performed. The next section of this chapter discusses the process used to determine the parameters of this matrix.
Chapter 4
Hand Detection
Using Haar-like features for object detection has several advantages over other features or direct image correlation. For one, simple image features such as edges, color, or contours are good for basic object detection, but they tend not to work well as the object gets more complex. Papageorgiou [?] used Haar-like features as the representation of an image for object detection. Since then, Viola and Jones, and later Lienhart, have added an extended set of Haar-like features. They are shown in Figure 4.2.
For each Haar-like feature, the value is determined by the difference between the two differently colored areas. For example, in Fig. 4.2(a), the value is the difference between the sum of all the pixels in the black rectangle and the sum of all the pixels in the white rectangle.
Viola and Jones introduced the "integral image" to allow fast computation of these Haar-like features. The integral image at any (x, y) is the sum of all the pixels above and to the left, as described by Equation 4.1:

ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)     (4.1)

where ii(x, y) is the integral image and i(x, y) is the original image. The integral image can be computed in one pass with the recurrence relations:
s(x, y) = s(x, y − 1) + i(x, y)     (4.2)
ii(x, y) = ii(x − 1, y) + s(x, y)     (4.3)

where s(x, y) is the cumulative row sum, s(x, −1) = 0, and ii(−1, y) = 0.
With the integral image, any of the Haar-like features can be computed quickly. As shown in Figure 4.2, the vertical rectangle edge feature in 4.2(a) is simply 4 − (1 + 3).
Figure 4.2: Haar-like features can be computed quickly using an Integral Image.
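The one-pass recurrence and the constant-time rectangle sum can be sketched in Python as follows; the two-rectangle `edge_feature` at the end is an illustrative Haar-like feature, not the exact feature set used by the detector:

```python
def integral_image(img):
    """ii(x, y) = sum of i(x', y') for x' <= x, y' <= y (Equation 4.1),
    computed in one pass with the row-sum recurrences (4.2) and (4.3)."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # s(x, y): cumulative sum along the current row
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = (ii[y - 1][x] if y > 0 else 0) + row_sum
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle [x0, x1] x [y0, y1],
    using only four lookups in the integral image."""
    a = ii[y0 - 1][x0 - 1] if x0 > 0 and y0 > 0 else 0
    b = ii[y0 - 1][x1] if y0 > 0 else 0
    c = ii[y1][x0 - 1] if x0 > 0 else 0
    return ii[y1][x1] - b - c + a

def edge_feature(ii, x, y, w, h):
    """Two-rectangle vertical edge feature: left half minus right half."""
    left = rect_sum(ii, x, y, x + w // 2 - 1, y + h - 1)
    right = rect_sum(ii, x + w // 2, y, x + w - 1, y + h - 1)
    return left - right
```

Because every rectangle sum costs four lookups regardless of its size, feature evaluation time is independent of the scale of the detection window.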
Each Adaboost classifier in the cascade can reject the search window early. Succeeding classifiers face the more difficult task of distinguishing features, but because the early classifiers reject the majority of search windows, a larger portion of the computational power can be focused on the later classifiers.
The training algorithm can target how much each classifier is expected to find and eliminate. The trade-off is that a higher expected hit rate yields more false positives, while a lower expected hit rate yields fewer false positives.
An empirical study can be found in the work of Lienhart et al. [?].
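The early-rejection control flow described above can be sketched in a few lines of Python; `score_fn` and `threshold` stand in for a trained boosted stage and its learned threshold (hypothetical names, not the API of any particular library):

```python
def cascade_classify(window, stages):
    """Run a detection window through a cascade of boosted stages.

    `stages` is a list of (score_fn, threshold) pairs standing in for
    trained Adaboost stages.  A window is rejected as soon as any stage
    scores below its threshold, so most background windows cost only one
    or two cheap stage evaluations.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early rejection: stop spending time on this window
    return True  # survived every stage: report a detection
```

For example, with `stages = [(sum, 1), (max, 2)]`, the window `[0, 0]` is rejected by the first stage, while `[1, 3]` passes both.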
4.2 Training Data and Results
Below is a list of classifiers trained on different data sets, with qualitative notes on their performance in real-time video capture.

Produced XML cascade file   Image dim  Training size  nstages  nsplits  minhitrate  maxfalsealarm
hand_classifier_6.xml       20x20      357            15       2        .99         .03
hand_pointing_2_10.xml      20x20      400            15       2        .99         .03
wrists_40.xml               40x40      400            15       2        .99         .03
hand_classifier_8.xml       20x20      400            15       2        .99         .03

hand_classifier_6.xml    High hit rate but also high false positives.
hand_pointing_2_10.xml   Low positives but low false positives.
wrists_40.xml            Can detect arms but bounding box too big.
hand_classifier_8.xml    Low positives but low false positives.
Chapter 5
Head Detection
The head detection module relies on the shape of the head being similar to that of a circle. Generally
speaking, the head is the most circular object that is found in the scene when a user is interacting with the
tiled displays. In order to utilise this geometric property, a circle detector is applied to the set of edge points
for the input image. These edge points map out the skeleton or outline of objects in the image, and thus can
be used to match geometric shapes to the objects.
The flowchart above shows the stages that are involved in detecting regions that are likely to contain a head; these stages are discussed in more detail in the following sections.
As mentioned previously, boundaries or edges within images are marked by a sharp change in the image's intensity in the area where an edge lies. These sharp changes in intensity result in the gradient having a large magnitude at the points in the image where an edge occurs.
Given the x and y components of the gradient at each point, the gradient magnitude at that point can be calculated using the following equation.

‖G‖ = √(Gx² + Gy²)     (5.2)

The gradient direction at each point is also required for certain edge detection algorithms such as the Canny edge detector. The equation for determining the direction of the gradient at each point is shown below.

α = atan(Gy / Gx)     (5.3)
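A minimal Python sketch of these two computations. The central-difference gradient is one of several possible estimators (Sobel is another common choice), and `atan2` is used in place of the plain arctangent of Equation 5.3 so that Gx = 0 is handled and the full angular range is recovered:

```python
import math

def gradient(img, x, y):
    """Central-difference estimate of (Gx, Gy) at an interior pixel."""
    gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
    gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return gx, gy

def magnitude_direction(gx, gy):
    """Equations 5.2 and 5.3: gradient magnitude and direction."""
    mag = math.hypot(gx, gy)     # sqrt(Gx^2 + Gy^2)
    alpha = math.atan2(gy, gx)   # robust form of atan(Gy / Gx)
    return mag, alpha
```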
The previous section outlined the method for calculating the gradient at each point, and how to use this information to calculate the magnitude and direction of the gradient at each point in the image. However, applying this operation to the raw image generally results in many spurious edges being detected, and hence an inaccurate or unusable edge image.
These spurious edges are caused by noise that is introduced to the image when it is captured. To reduce the effect of this noise, the image must first be blurred, which involves replacing the value of each pixel with a weighted average calculated over a small window around the pixel. However, this procedure has the unwanted side effect of smearing edges within the image, which makes them harder to detect.
The non-maximum suppression stage of the algorithm loops over all the detected edge pixels in the image and looks at the pixels that lie on either side of each pixel in the direction of the gradient at that point. If the current point is a local maximum it is classified as an edge point; otherwise the point is discarded.
(a) Edge Image with no Blurring (b) Edge Image with Prior Blur
Once the pixels in the image that lie on edges have been found, the pixels need to be processed in order
to find circles that best fit the edge information. Since the set of features (edge points) computed by the
edge detector are from the whole image and are not just localised to the edges that correspond to the head,
the circle matching process is required to be resilient to outliers (features that do not conform to the circle
being matched) within the input data.
The method employed for detecting circles within the camera frames is a generalised Hough transform. The Hough transform is a technique for matching a particular geometric shape to a set of features found within an image; the equation to which the input set of points is matched is shown below.
r² = (x − ux)² + (y − uy)²     (5.4)
Looking at the equation of a circle, and the diagram below it can be seen that three parameters are re-
quired to describe a circle, namely the radius and the two components of the centre point. The Hough
transform method uses an accumulator array (or search space) which has a dimension for each parameter,
thus for detecting circles a 3-dimensional accumulator array is required.
The efficiency of the Hough transform method for matching a particular shape to a set of points is directly
linked to the dimensionality of the search space, and thus the number of parameters that are required to
describe the shape. In light of the requirement to achieve real time processing the images are down-sampled
by a factor of 4 before the circle detector algorithm is applied. Due to the algorithm being third order, this
reduces the number of computations by a factor of 64.
The transform assumes that each point in the input set (the set of edge points) lies on the circumference of a
circle. The parameters of all possible circles, for which the circumference passes through the current point, are
then calculated, and for each potential circle the corresponding value in the accumulator array is incremented.
Once all the points in the input set have been processed, the accumulator array is checked for elements that have enough votes to be considered a circle in the image; the parameters corresponding to these elements are then returned as regions within the image that could potentially contain a head.
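The voting procedure described above can be sketched as follows. The 10° angular step and the vote threshold are illustrative choices, not values from the original system:

```python
import math
from collections import defaultdict

def hough_circles(edge_points, radii, min_votes):
    """Vote each edge point into a (ux, uy, r) accumulator (Equation 5.4).

    For every edge point and every candidate radius, all circle centres
    whose circumference passes through the point receive a vote; cells
    with at least `min_votes` votes are returned as candidate circles.
    """
    acc = defaultdict(int)
    for (x, y) in edge_points:
        for r in radii:
            # Candidate centres lie on a circle of radius r around the
            # edge point; sample that circle at 10-degree steps.
            for step in range(0, 360, 10):
                theta = math.radians(step)
                ux = round(x - r * math.cos(theta))
                uy = round(y - r * math.sin(theta))
                acc[(ux, uy, r)] += 1
    return [cell for cell, votes in acc.items() if votes >= min_votes]
```

Four points on a circle of radius 5 centred at (10, 10) are enough for the true centre cell to collect four votes and be returned.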
5.2 Head Detection Results
The head detection method described above provides a method of identifying potential head regions within
the input scene. Although the detector performs well in most situations, noise and other circular shaped
objects present in the scene occasionally cause the detector to skip a frame, or to return multiple head
regions. In order to overcome these limitations post processing is applied to the results to interpolate when a
frame is missed and to enable the most probable head location to be selected when multiple head regions are
returned by the detector. These post processing techniques are implemented in the object tracking module
which is discussed later on.
Chapter 6
Tracking
Ideally the object detectors employed to detect the head and hand regions within the image would have a 100% accuracy rate, always locating the correct objects within the scene and never misclassifying a region. Realistically, however, this is not the case, and the detection methods employed are inevitably prone to errors.
The object tracker module aims to supplement the detection modules by providing additional processing
which helps to limit the effect of erroneous detections and skipped frames.
The object tracker does this by maintaining the state information of all objects that have been detected
by a particular object detector, within a given time interval, using tokens, and then using this information
to determine which detected region is most likely to correspond to the object that is being tracked.
6.1 Tokens
Within the token tracker tokens are data structures that store the state information of a particular object
that has been detected. The following information is encapsulated within a token.
• Location of the object
• Size of the object (area in pixels)
• Time since the object was last observed (in frames)
• Number of times the object has been observed during its lifetime
When an object is detected it is first compared to the list of tokens, if the properties of the object are a close
enough match to any of the objects described by the tokens then the token is updated with the new state
information. If no match is found, then a new token is initialised with the properties of the newly detected
object. This process is applied to all the objects that are detected by the object detector.
Once all the objects have been processed and the corresponding tokens updated, or initialised, the list
of tokens is filtered to remove any dead tokens. A dead token is any token that corresponds to an object that has not recently been seen. This can occur for three reasons. First, the token could have been initialised for an object that was erroneously detected; in this case, the object is likely to be detected only once and then never again. Second, the object that the token was tracking may no longer be present in the scene. Third, a token dies if the object's properties changed too rapidly between detections, so that no match is made between the existing token and the object, and a new token is initialised to represent it.
Once all the dead tokens have been filtered out of the list, the age of every token (the time since last seen) is incremented, and the process is repeated for the next set of detected objects.
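One tracker iteration can be sketched as follows. The matching and age thresholds (`max_dist`, `max_age`) are illustrative stand-ins; the original system's values are not specified here:

```python
class Token:
    """State of one tracked object (Section 6.1)."""
    def __init__(self, location, size):
        self.location = location
        self.size = size
        self.age = 0     # frames since the object was last observed
        self.votes = 1   # times the object has been observed in its lifetime

def update_tokens(tokens, detections, max_dist=30, max_age=10):
    """Match detections to tokens, spawn tokens for unmatched detections,
    drop dead tokens, then age the survivors."""
    for (loc, size) in detections:
        for tok in tokens:
            dx = loc[0] - tok.location[0]
            dy = loc[1] - tok.location[1]
            if dx * dx + dy * dy <= max_dist * max_dist:
                tok.location, tok.size = loc, size   # close enough: update
                tok.age = 0
                tok.votes += 1
                break
        else:
            tokens.append(Token(loc, size))          # no match: new token
    tokens[:] = [t for t in tokens if t.age < max_age]  # drop dead tokens
    for tok in tokens:
        tok.age += 1
    return tokens

def best_token(tokens):
    """Return the token with the most observations (highest vote count)."""
    return max(tokens, key=lambda t: t.votes)
```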
6.3 Object Selection
At any given moment the object tracker can be keeping track of multiple objects, some of which may have
been correctly classified, and others which may have been erroneously detected. In order for the system to
allow the user to interact with the display, there must be a mechanism to enable the system to discern which
of these objects the ’correct’ object is.
As mentioned before, the ideal case would be for the object detectors to identify the object regions with 100% accuracy. However, with the object tracker supplementing the detection systems, this requirement can be relaxed: for a detector to be admissible, it must meet the following requirement.
• The object must be correctly classified in more frames than erroneous regions are classified
If this criterion is met, then the token tracking the object will receive more updates than any other token, which results in the correct token having a higher vote (observed count) than all the others. The location of the object can therefore be read from the token with the highest vote count.
Chapter 7
Stereopsis
Now that the head and hands have been detected and tracked, obtaining their actual location in space is the last step before the output vector can be calculated. Calculating the depth of the user's hand or head is accomplished by using two cameras mounted on the same horizontal axis a certain distance apart, called the baseline. Similar to how humans perceive depth using two eyes, the computer can calculate depth from two properly aligned cameras. The method through which this is accomplished is explained in this chapter.
7.1 Overview
Using a stereo pair of cameras gives us two perspectives of the same scene. By comparing these two perspectives, we gain additional information about the scene. Assuming there is no major occlusion, we can find a point in the right image and locate that same point in the left image; with these two points we can triangulate the depth of the point in real space. One can imagine that, if two cameras are mounted side by side, the left and right images of a far-away object will appear almost identical. In fact, as the distance of the object from the cameras increases past a certain threshold, there is no discernible difference between the two images. On the other hand, if the object is very close, the two perspectives will be very different. From this difference between the images, the depth can be extracted; thus, the goal is to find and quantify this difference.
image is taken and slid across the entire horizontal line, and the NCC is computed at each horizontal point along the line. The horizontal point with the largest NCC value is then taken to be the corresponding point.
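A sketch of this NCC search in pure Python, operating on flattened 1-D patches for brevity (a real implementation would correlate 2-D windows):

```python
import math

def ncc(a, b):
    """Normalised cross-correlation of two equal-length patches."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0

def best_match(template, row):
    """Slide `template` along a scanline and return the offset of the
    window with the highest NCC score (the corresponding point's column)."""
    w = len(template)
    scores = [(ncc(template, row[x:x + w]), x)
              for x in range(len(row) - w + 1)]
    return max(scores)[1]
```

Because NCC subtracts the mean and divides by the standard deviation, the match score is invariant to brightness and contrast differences between the two cameras.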
• The right-hand coordinate system is used, where the z axis points down and out of the cameras, the x axis points outward, and the y axis points into the page.
• θvergenceL and θvergenceR are the angles by which the cameras are rotated inward.
• f is the focal length of the cameras.
• XL and XR are the known distances between the projected points on the CCD and the centre of the CCD, given by the correspondence process.
• θL and θR are the two angles adjacent to the x-axis. These angles characterize the distance Z.
θL and θR can be calculated from XL, XR, θvergenceL, θvergenceR, and the focal length. They are given by:

θL = arctan(XL / f) + θvergenceL     θR = arctan(XR / f) + θvergenceR
Using similar triangles, Z can be found as:

Z = b / (tan θR − tan θL)
Notice that as θvergenceL,R approach zero, Z becomes inversely proportional to the disparity.
After Z is found, X and Y are easily computed:

X = Z / tan θL     Y = (YL · Z) / f

where YL is analogous to XL, but in the y-direction.
Given these three values (X,Y,Z), a 3D point of the object of interest is formed.
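The formulas above can be transcribed directly into Python. The sign conventions follow the coordinate frame described in the bullet list, so this is a sketch of the computation rather than a drop-in implementation; the values in the test are illustrative:

```python
import math

def triangulate(xl, xr, yl, f, baseline, verge_l=0.0, verge_r=0.0):
    """Recover (X, Y, Z) from a matched point pair using the chapter's
    angle formulation.  xl, xr, yl are CCD offsets from the image centre,
    f is the focal length (same units), baseline is the camera separation.
    """
    theta_l = math.atan(xl / f) + verge_l
    theta_r = math.atan(xr / f) + verge_r
    z = baseline / (math.tan(theta_r) - math.tan(theta_l))
    x = z / math.tan(theta_l)   # as given in the text
    y = yl * z / f
    return x, y, z
```

With zero vergence the depth reduces to the familiar Z = b·f / (XR − XL), i.e. inversely proportional to the disparity, as noted above.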
Chapter 8
Vector Extrapolation
Given a 3D point for the head location and one for the hand location, a vector of where the user is pointing can be extrapolated. Taking the difference between the hand's 3D point and the head's 3D point, and normalizing its magnitude to 1 meter, yields a unit vector in the direction of the display. This vector, along with the 3D location of the user's head, is then output to the interface, where the user's intended pointing direction can be projected onto the tiled display.
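This step is small enough to show in full; a Python sketch:

```python
import math

def pointing_vector(head, hand):
    """Unit vector from the head's 3D point through the hand's 3D point,
    plus the head position as the ray origin for the interface."""
    d = [hand[i] - head[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    unit = [c / norm for c in d]   # magnitude normalised to 1
    return head, unit
```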
Chapter 9
Conclusion
We have presented an approach for a marker-free human computer interaction system with overhead stereo
cameras. We have developed a promising demonstration that can be enhanced to drive large tiled displays.
More work is required for this project to become a truly robust system. Speed needs to be improved and gesture recognition needs to be implemented; we did not get good results for gesture recognition with an overhead camera system. It would be a worthwhile experiment to add a third camera in front of the user solely for gesture recognition. Porting the code to take advantage of multi-core processors such as the Cell Broadband Engine would likely allow parallel computation of classifiers for different gestures.