
Camera Based Hand and Body Driven Interaction with Very Large

Displays

Tom Cassey, Tommy Chheng, Jeffery Lien


{tcassey,tcchheng, jwlien}@ucsd.edu

June 15, 2007


Abstract

We introduce a system that tracks a user's hand and head in 3D and in real time for use with a large tiled display.
The system uses overhead stereo cameras and does not require the user to wear any equipment. Hands are
detected using Haar-like features with an Adaboost classifier, and heads are detected using a Hough transform
generalized for circles. Three-dimensional positions for the hand and head are obtained by correspondence
matching between the paired cameras. Finally, a 3D vector is extrapolated from the centroid of the head to the
hand. The projection of this 3D vector onto the large tiled display gives the pointing location.
Contents

1 Introduction

2 Hardware Setup
   2.1 Stereo Cameras
   2.2 IR Filters

3 Camera Calibration
   3.1 Why is Calibration Needed?
   3.2 Extrinsic Parameters
      3.2.1 Homogeneous Coordinates
      3.2.2 Homogeneous Extrinsic Parameters
   3.3 Intrinsic Parameters
   3.4 Rectification
   3.5 Calibration Implementation

4 Hand Detection
   4.1 Cascaded Adaboost Classifiers
      4.1.1 Adaboost
      4.1.2 Cascaded Classifiers
   4.2 Training Data and Results

5 Head Detection
   5.1 Edge Detection
      5.1.1 Image Gradient
      5.1.2 Canny Edge Detector
      5.1.3 Hough Transform
   5.2 Head Detection Results

6 Tracking
   6.1 Tokens
   6.2 Token Updates
   6.3 Object Selection

7 Stereopsis
   7.1 Overview
   7.2 Epipolar Geometry
   7.3 Correspondence Problem
   7.4 Range Calculations
   7.5 Range Resolution

8 Vector Extrapolation

9 Conclusion and Future Work

Chapter 1

Introduction

Today, scientific computing generates large data sets and high-resolution imagery. A biological sample
under an electron microscope can easily produce gigabytes of high-resolution 3D images. Due to advances in
remote sensing technologies, large volumes of high-resolution geospatial imagery have also become widely
available. These factors have created the need to visualize such imagery, and many research groups
have built large tiled displays. However, the human-computer interface for large tiled displays remains an
open problem: navigating the data with a traditional mouse is infeasible. The NCMIR group at Calit2
at the University of California, San Diego, has built an alternative human-computer interaction system
using a hand controller and head gear, but requiring such equipment places a restrictive burden on the user.

Our objective is to create a hand tracking system that uses overhead stereo cameras to drive a large display.
Our system is vision-based and does not require any external sensors on the user. A few constraints
ease our objective: because our cameras are fixed overhead, there is less background noise, which aids hand
detection. We developed a hand and arm tracking system called HATS. An overview of our system is depicted
in Figure 1.1.

Figure 1.1: System Overview

Chapter 2

Hardware Setup

We place our camera system above the user. Other possible camera configurations were placing both
cameras in front of the user, or one camera above the user and one camera in front of the
user. We selected the overhead configuration for its better performance and ease of use: because the cameras
point at the ground, there is very little background noise. Had we placed the cameras in front of the
user, we would have had to deal with a noisy background, e.g., people walking by.

2.1 Stereo Cameras


Two Point Grey Dragon Fly cameras, shown in Figure 2.1, were used to make our stereo pair. The cameras can run
at a frame rate of 120 fps with a resolution of 640x480 pixels; each has a 6 mm focal length and
a pixel width of 0.0074 mm.

The cameras are mounted 364.57 mm apart and are slightly verged in, in order to have the largest shared
field of view. Each camera is verged in by 0.0392 radians, so the two cameras are focused on the same
point in space about 4.6 m below the cameras. This configuration was chosen because it provides good range
resolution while also giving a large shared field of view. The choice of these
parameters is discussed further in the 'Stereopsis' chapter (Chapter 7).

2.2 IR Filters
Infrared filters and illuminators were used in our setup to provide enhanced contrast between the user,
specifically skin regions, which show up with high intensity, and the rest of the background.
Additionally, using infrared illumination nullifies any variation in the ambient lighting in which the setup is
used, and provides proper illumination for the cameras without changing the lighting perceived by
the user. Thus, the setup can function properly even in a totally dark room.

The IR filters used block visible light up to 650 nm and transmit 90% of light in the 730 nm to
2000 nm range (near IR). Note that incandescent light bulbs (and perhaps some other light sources),
as well as the sun, emit light in this range too, and can cause overexposure in the cameras if not carefully
controlled.

Figure 2.1: Two Point Grey Research Dragon Fly cameras are used to make a stereo pair. The IR filters
help make skin regions stand out under infrared illumination.

Chapter 3

Camera Calibration

3.1 Why is Calibration Needed?


The goal of camera calibration is to determine a transformation between the two independent pixel coordinate
systems of the two cameras and the real-world coordinate system. Once the real-world coordinate of
each pixel can be determined, it is possible to perform stereopsis on the images.

Stereopsis allows the depth of each pixel in the image, and thus the 3D coordinate of each point, to be
calculated. The stereopsis process is described in more detail in Chapter 7 of this report.

3.2 Extrinsic Parameters


The extrinsic camera parameters describe the position and orientation of the camera with respect to the world
coordinate system. In the stereo system we are working with, the world coordinate system is taken to be
the coordinate system of the left camera. Therefore the extrinsic parameters describe the transformation
between the right camera's coordinate system and the left camera's coordinate system.

Figure 3.1: Two Coordinate Systems

Figure 3.1 above shows two different coordinate systems. The transformation between the two coordinate
systems in the diagram is an affine transformation, which consists of a rotation followed by a translation.

y = Rx + t (3.1)

The rotation component of the transformation is given by the equation below


 
R = \begin{bmatrix}
      i_A \cdot i_B & i_A \cdot j_B & i_A \cdot k_B \\
      j_A \cdot i_B & j_A \cdot j_B & j_A \cdot k_B \\
      k_A \cdot i_B & k_A \cdot j_B & k_A \cdot k_B
    \end{bmatrix}   (3.2)

The translation component of the transformation is given by the equation below
 
T = \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}   (3.3)

Under the Euclidean coordinate system an affine transformation has to be expressed as two separate
transformations, as shown in Equation 3.1. However, if the image coordinates are first converted to the
homogeneous coordinate system, the transformation can be expressed as a single matrix multiplication.

3.2.1 Homogeneous Coordinates


The homogeneous coordinate system adds an extra dimension to the Euclidean coordinate system, where the
extra coordinate is a normalising factor.

Equation 3.4 below shows the transformation from the Euclidean coordinate system to the homogeneous
coordinate system.
(x, y, z) ⇒ (x, y, z, w) = (x, y, z, 1)   (3.4)
Equation 3.5 below shows the transformation from the homogeneous coordinate system back to the Euclidean
coordinate system.
(x, y, z, w) ⇒ (x/w, y/w, z/w)   (3.5)

3.2.2 Homogeneous Extrinsic Parameters


 
Combining the rotation and translation into a single matrix in homogeneous coordinates gives the extrinsic
matrix:

M_{ext} = \begin{bmatrix}
            r_{11} & r_{12} & r_{13} & t_x \\
            r_{21} & r_{22} & r_{23} & t_y \\
            r_{31} & r_{32} & r_{33} & t_z \\
            0 & 0 & 0 & 1
          \end{bmatrix}   (3.6)

3.3 Intrinsic Parameters


The intrinsic parameters describe the transformation from the camera coordinates to the pixel coordinates
in the image. This transformation depends on the optical, geometric and digital parameters of the camera
that is capturing a given image.

The following equations describe the transformation from the camera coordinate system to the pixels in
the image.
x_{im} = \frac{-f\,x}{s_x} + o_x ,   (3.7)

y_{im} = \frac{-f\,y}{s_y} + o_y   (3.8)

where f is the focal length of the camera, (o_x, o_y) is the coordinate in pixels of the image centre (the
principal point), s_x is the width of the pixels in millimetres, and s_y is the height of the pixels in millimetres.
When written in matrix form, Equations 3.7 and 3.8 yield the following matrix, which is known as the
intrinsic matrix.

M_{int} = \begin{bmatrix}
            -f/s_x & 0 & o_x \\
            0 & -f/s_y & o_y \\
            0 & 0 & 1
          \end{bmatrix}   (3.9)
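To make the chain of transformations concrete, the following is a minimal sketch (not the project's calibration code) that projects a 3D point expressed in the right camera's coordinate frame into left-image pixel coordinates using an extrinsic matrix of the form (3.6) and an intrinsic matrix of the form (3.9). The numeric rotation, translation, and image-centre values are placeholders, not our calibration results.

```python
import numpy as np

# Placeholder intrinsics: f = 6 mm focal length, 0.0074 mm pixels, assumed 320x240 image centre.
f, sx, sy, ox, oy = 6.0, 0.0074, 0.0074, 320.0, 240.0
M_int = np.array([[-f / sx, 0.0,     ox],
                  [0.0,     -f / sy, oy],
                  [0.0,     0.0,     1.0]])

# Placeholder extrinsics: rotation R and translation t from the right to the left camera frame.
R = np.eye(3)                       # identity rotation, for illustration only
t = np.array([364.57, 0.0, 0.0])    # baseline along x, in mm
M_ext = np.eye(4)
M_ext[:3, :3] = R
M_ext[:3, 3] = t

def project(point_right_mm):
    """Map a 3D point (right-camera frame, mm) to left-image pixel coordinates."""
    p_h = np.append(point_right_mm, 1.0)      # to homogeneous coordinates (Eq. 3.4)
    p_left = M_ext @ p_h                      # right-camera frame -> left-camera frame (Eq. 3.6)
    uvw = M_int @ p_left[:3]                  # camera frame -> pixel coordinates (Eq. 3.9)
    return uvw[:2] / uvw[2]                   # back to Euclidean coordinates (Eq. 3.5)

print(project(np.array([0.0, 0.0, 4600.0])))  # a point roughly 4.6 m below the cameras
```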

3.4 Rectification
Image rectification is a process that transforms a pair of images so that they share a common image
plane. Once this process has been performed on the stereo image pair captured by the overhead cameras,
the stereopsis process reduces to a triangulation problem, which is discussed in more detail in Chapter 7.

The image rectification process makes use of the extrinsic and intrinsic parameters of the two cameras
in order to define the transformation that maps the right camera image onto the same image plane as the
left camera image. Whilst this process is required in order to simplify the stereopsis process, it does result in
the rectified objects being warped and disfigured.

The image rectification process is performed by combining the extrinsic and intrinsic parameters of the
system into one matrix, which is known as the fundamental matrix. This matrix provides the linear
transformation between the original images and the rectified images that can then be used to perform stereopsis.
The next section of this chapter discusses the process used to determine the parameters of this matrix.

3.5 Calibration Implementation


Calibration was performed using the SVS software from Videre, as described further in Section 7.2. Several
images of a checkerboard pattern, held at different positions and angles, are captured by both cameras, and
the software uses these images to estimate the intrinsic and extrinsic parameters of the stereo pair and to
output rectified image pairs that correct for both the camera vergence and the lens distortion.

Chapter 4

Hand Detection

Using Haar-like features for object detection has several advantages over using other features or direct image
correlation. Simple image features such as edges, color or contours are adequate for basic object detection,
but they tend not to work well as the object becomes more complex. Papageorgiou [?] used Haar-like features
as the representation of an image for object detection. Since then, Viola and Jones, and later Lienhart, added
an extended set of Haar-like features. They are shown in Figure 4.1.

Figure 4.1: Haar-like features and an extended set.

In each of the Haar-like features, the value is determined by the difference between the two differently colored
areas. For example, in Figure 4.1(a), the value is the difference between the sum of all the pixels in the black
rectangle and the sum of all the pixels in the white rectangle.
Viola and Jones introduced the "integral image" to allow fast computation of these Haar-like features.
The integral image at any (x, y) is the sum of all the pixels above and to the left, as described by Equation 4.1:

ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')   (4.1)

where ii(x, y) is the integral image and i(x, y) is the original image. The integral image can be computed
in one pass with the recurrence relations:

s(x, y) = s(x, y - 1) + i(x, y)   (4.2)

ii(x, y) = ii(x - 1, y) + s(x, y)   (4.3)

where s(x, y) is the cumulative row sum, s(x, -1) = 0 and ii(-1, y) = 0.
With the integral image, any of the Haar-like features can be computed quickly. As shown in Figure 4.2,
the vertical rectangle edge feature in 4.2(a) is simply 4 - (1 + 3).

Figure 4.2: Haar-like features can be computed quickly using an Integral Image.
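As a minimal illustration (not the project's detection code), the sketch below builds an integral image with NumPy and evaluates a simple two-rectangle Haar-like feature; the rectangle coordinates are arbitrary example values.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, i.e. ii(x, y) of Eq. 4.1."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the rectangle [x0, x1) x [y0, y1) using at most four lookups."""
    total = ii[y1 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

img = np.random.randint(0, 256, size=(480, 640)).astype(np.int64)
ii = integral_image(img)

# A two-rectangle (vertical edge) feature: white half minus black half.
white = rect_sum(ii, 100, 100, 110, 120)
black = rect_sum(ii, 110, 100, 120, 120)
feature_value = white - black
```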

4.1 Cascaded Adaboost Classifiers


4.1.1 Adaboost
The Adaboost algorithm[?] uses a combination of simple weak classifiers to create a strong one. Each simple
classifier is weak because they can only classify perhaps 51% of the training data sucessfully. The final
classifier becomes strong because it weighs the weak classifiers accordingly during the training process.

4.1.2 Cascaded Classifiers


Viola and Jones [?] introduced an algorithm to construct a cascade of Adaboost classifiers, as depicted in
Figure 4.3. The end result gives increased detection rates and reduced computation time.

Figure 4.3: A cascaded classifier allows early rejection to increase speed.

Each Adaboost classifier in the cascade can reject a search window early on. The succeeding classifiers face
the more difficult task of distinguishing the remaining features. Because the early classifiers reject the
majority of search windows, a larger portion of the computational power can be focused on the later classifiers.
The training algorithm can set a target hit rate and false-alarm rate for each stage. The trade-off is
that a higher target hit rate produces more false positives, while a lower target hit rate can produce fewer
false positives.
An empirical study can be found in the work of Lienhart et al. [?].
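OpenCV provides ready-made support for cascades of this kind. The sketch below shows how a trained cascade could be run over a camera frame using the Python bindings; the cascade file name (underscores assumed for one of the classifiers listed in Section 4.2) and the detection parameters are illustrative, not the project's actual code.

```python
import cv2

# Load a trained cascade (e.g. the hand classifier from Section 4.2; file name assumed).
cascade = cv2.CascadeClassifier("hand_classifier_6.xml")

frame = cv2.imread("overhead_frame.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.equalizeHist(frame)  # normalise contrast before detection

# Scan the image at multiple scales; each hit is an (x, y, w, h) bounding box.
hands = cascade.detectMultiScale(frame, scaleFactor=1.1, minNeighbors=3,
                                 minSize=(20, 20))
for (x, y, w, h) in hands:
    cv2.rectangle(frame, (x, y), (x + w, y + h), 255, 2)
```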

4.2 Training Data and Results
Below is a list of the different training runs and a qualitative note on each resulting classifier's performance
in real-time video capture.

Produced XML cascade output file   Image dim   Training size   nstages   nsplits   minhitrate   maxfalsealarm
hand classifier 6.xml              20x20       357             15        2         .99          .03
hand pointing 2 10.xml             20x20       400             15        2         .99          .03
wrists 40.xml                      40x40       400             15        2         .99          .03
hand classifier 8.xml              20x20       400             15        2         .99          .03

hand classifier 6.xml: High hit rate but also high false positives.
hand pointing 2 10.xml: Low positives but low false positives.
wrists 40.xml: Can detect arms but bounding box too big.
hand classifier 8.xml: Low positives but low false positives.

[show good results and false positives]

Chapter 5

Head Detection

The head detection module relies on the shape of the head being similar to a circle. Generally
speaking, the head is the most circular object in the scene when a user is interacting with the
tiled displays. In order to utilise this geometric property, a circle detector is applied to the set of edge points
of the input image. These edge points map out the skeleton or outline of objects in the image, and thus can
be used to match geometric shapes to the objects.

Figure 5.1: Head Detection Flowchart

The flowchart above shows the stages involved in detecting regions that are likely to contain a head.
These stages are discussed in more detail in the following sections.

5.1 Edge Detection[?]


Boundaries between objects in images are generally marked by a sharp change in the image intensity at the
location of the boundary; the process of detecting these changes in image intensity is known as edge detection.
However, boundaries are not the only cause of sharp changes of intensity within images.
The causes of sharp intensity changes (edges) are as follows:
• discontinuities in depth
• discontinuities in surface orientation
• changes in material properties
• variations in scene illumination
The first three of these causes occur (though not exclusively) when there is a boundary between two objects.
The fourth cause, variations in scene illumination, is generally caused by objects occluding the light source,
which results in shadows being cast over regions of the image.

As mentioned previously, boundaries or edges within images are marked by a sharp change in the image
intensity in the area where the edge lies. These sharp changes in image intensity mean that the gradient
has a large magnitude at the points in the image where edges occur.

5.1.1 Image Gradient


Since the images are 2-dimensional, the gradient with respect to the x and y directions must be computed
separately. In order to speed up the computation of the gradient at each point, these two operations are
implemented as kernels, which are convolved with the image to extract the x and y components
of the gradient. The Sobel kernel representations of these two discrete gradient operators are shown below.

G_x = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} , \quad
G_y = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}   (5.1)

Given the x and y components of the gradient at each point, the gradient magnitude at that point can be
calculated using the following equation.

\|G\| = \sqrt{G_x^2 + G_y^2}   (5.2)

The gradient direction at each point is also required for certain edge detection algorithms such as the
Canny edge detector. The equation for determining the direction of the gradient at each point is shown
below.

\alpha = \arctan\left( \frac{G_y}{G_x} \right)   (5.3)
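A small sketch of this gradient computation using OpenCV's Sobel operator (a generic illustration, not the project's implementation; the input file name is assumed):

```python
import cv2
import numpy as np

img = cv2.imread("overhead_frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Convolve with the Sobel kernels to get the x and y gradient components (Eq. 5.1).
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)

magnitude = np.sqrt(gx ** 2 + gy ** 2)   # Eq. 5.2
direction = np.arctan2(gy, gx)           # Eq. 5.3 (atan2 avoids division by zero)
```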

5.1.2 Canny Edge Detector[?]


The Canny edge detector provides a robust method for locating edge points within an image, based
upon the image gradient. The Canny detector also applies additional processing to the calculated gradient
image in order to improve the locality of the detected edges. The Canny edge detector is broken down
into multiple stages, which are outlined in the flowchart below.

Figure 5.2: Canny Edge Detector Flowchart

The previous section outlined the method for calculating the gradient at each point, and how to use
this information to calculate the magnitude and direction of the gradient at each point in the image. However,
applying this operation to the raw image generally results in many spurious edges being detected, and hence
an inaccurate or unusable edge image.

These spurious edges are caused by noise introduced when the image is captured. In order
to reduce the effect of this noise, the image must first be blurred, which involves replacing the value of each
pixel with a weighted average calculated over a small window around the pixel. However, this procedure has
the unwanted side effect of smearing edges within the image as well, which makes edges harder to detect.

Oriented non-maximal suppression


The additional processing provided by the Canny edge detector is known as oriented non-maximal
suppression. This process uses the gradient magnitude and direction calculated previously
to ensure that the edge detector responds only once for each edge in the image, and that the locality
of the detected edge is accurate.

This stage of the algorithm loops over all the detected edge pixels in the image and examines the pixels
lying on either side of the current pixel in the direction of the gradient at that point. If the current point
is a local maximum it is classified as an edge point; otherwise the point is discarded.

Figure: (a) Edge image with no blurring; (b) edge image with prior blurring.

Once the pixels in the image that lie on edges have been found, they need to be processed in order
to find the circles that best fit the edge information. Since the set of features (edge points) computed by the
edge detector comes from the whole image and is not localised to the edges that correspond to the head,
the circle matching process must be resilient to outliers (features that do not conform to the circle
being matched) within the input data.

The method employed for detecting circles within the camera frames is a generalised Hough transform.
The Hough transform is a technique for matching a particular geometric shape to a set of features found
within an image; the equation that the input set of points is matched to is shown below.

r^2 = (x - u_x)^2 + (y - u_y)^2   (5.4)

5.1.3 Hough Transform


The Hough transform is a method that was originally developed for detecting lines within images; the method
can however be extended to detect any shape, and in this case has been used to detect circles.

Looking at the equation of a circle and the diagram below, it can be seen that three parameters are
required to describe a circle, namely the radius and the two components of the centre point. The Hough
transform uses an accumulator array (or search space) with one dimension for each parameter;
thus, for detecting circles, a 3-dimensional accumulator array is required.

Figure 5.3: Circle Diagram

The efficiency of the Hough transform for matching a particular shape to a set of points is directly
linked to the dimensionality of the search space, and thus to the number of parameters required to
describe the shape. In light of the requirement to achieve real-time processing, the images are down-sampled
by a factor of 4 before the circle detection algorithm is applied. Because the algorithm is third order, this
reduces the number of computations by a factor of 64.

The transform assumes that each point in the input set (the set of edge points) lies on the circumference of a
circle. The parameters of all possible circles whose circumference passes through the current point are
then calculated, and for each potential circle the corresponding value in the accumulator array is incremented.

Once all the points in the input set have been processed, the accumulator array is checked for elements
that have enough votes to be considered a circle in the image; the parameters corresponding to
these elements are then returned as regions within the image that could potentially
contain a head.
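A compact sketch of this detection pipeline using OpenCV's built-in circle Hough transform (the parameter values and file name are illustrative; the project's own accumulator implementation may differ):

```python
import cv2

frame = cv2.imread("overhead_frame.png", cv2.IMREAD_GRAYSCALE)

# Down-sample by a factor of 4 to keep the 3D accumulator search tractable.
small = cv2.resize(frame, (frame.shape[1] // 4, frame.shape[0] // 4))
small = cv2.GaussianBlur(small, (5, 5), 1.5)   # suppress spurious edges before detection

# HoughCircles runs a Canny edge detector internally (param1 is its upper threshold);
# param2 is the accumulator vote threshold, and radii are given in down-sampled pixels.
circles = cv2.HoughCircles(small, cv2.HOUGH_GRADIENT, dp=1, minDist=40,
                           param1=100, param2=20, minRadius=8, maxRadius=30)

if circles is not None:
    for (cx, cy, r) in circles[0]:
        # Scale each candidate head region back up to full-resolution coordinates.
        print("candidate head at", (int(cx * 4), int(cy * 4)), "radius", int(r * 4))
```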

5.2 Head Detection Results
The head detection method described above provides a means of identifying potential head regions within
the input scene. Although the detector performs well in most situations, noise and other circular
objects present in the scene occasionally cause the detector to skip a frame, or to return multiple head
regions. In order to overcome these limitations, post-processing is applied to the results to interpolate when a
frame is missed, and to select the most probable head location when multiple head regions are
returned by the detector. These post-processing techniques are implemented in the object tracking module,
which is discussed later on.


Chapter 6

Tracking

Ideally the object detectors employed to detect the head and hand regions within the image would have a
100% accuracy rate and, as such, would always locate the correct objects within the scene and never misclassify
a region. Realistically, however, this will not be the case, and the detection methods employed are
inevitably prone to errors.

The object tracker module aims to supplement the detection modules by providing additional processing
which helps to limit the effect of erroneous detections and skipped frames.

The object tracker does this by maintaining the state information of all objects that have been detected
by a particular object detector, within a given time interval, using tokens, and then using this information
to determine which detected region is most likely to correspond to the object that is being tracked.

6.1 Tokens
Within the object tracker, tokens are data structures that store the state information of a particular detected
object. The following information is encapsulated within a token.
• Location of the object
• Size of the object (area in pixels)
• Time since the object was last observed (in frames)
• Number of times the object has been observed during its lifetime
When an object is detected, it is first compared to the list of tokens. If the properties of the object are a close
enough match to any of the objects described by the tokens, that token is updated with the new state
information. If no match is found, a new token is initialised with the properties of the newly detected
object. This process is applied to all the objects that are detected by the object detector.

6.2 Token Updates


When a match is made between a detected object and a pre-existing token, the token must be updated so
that it reflects the new state of the object. During this update process the count of the number of times the
object has been seen is incremented and the number of frames since the last detection is reset to zero.

Once all the objects have been processed and the corresponding tokens updated or initialised, the list
of tokens is filtered to remove any dead tokens. A dead token is any token that corresponds to an object
that has not recently been seen. This can occur for three reasons. First, the token could have been
initialised for an object that was erroneously detected; in that case the object is likely to be detected only
once and then never again. Second, the object that the token was tracking may no longer be present in the
scene. Third, a token dies if the object's properties changed too rapidly between detections; this results in
no match being made between the existing token and the object, and a new token being initialised to
represent the object.

Once all the dead tokens have been filtered out of the list, the age of all remaining tokens (the time since
last seen) is incremented, and the process is repeated for the next set of detected objects.
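The following is a minimal sketch of this token bookkeeping (field names, thresholds, and the matching rule are assumptions for illustration; the project's actual data structures may differ). It also includes the highest-vote selection rule discussed in the next section.

```python
from dataclasses import dataclass

@dataclass
class Token:
    x: float               # object location (pixels)
    y: float
    area: float            # object size (area in pixels)
    age: int = 0           # frames since the object was last observed
    observations: int = 1  # times the object has been observed in its lifetime

def update_tokens(tokens, detections, max_dist=40.0, max_age=15):
    """Match detections to tokens, spawn new tokens, and prune dead ones."""
    for (dx, dy, darea) in detections:
        match = next((t for t in tokens
                      if abs(t.x - dx) < max_dist and abs(t.y - dy) < max_dist), None)
        if match is not None:
            match.x, match.y, match.area = dx, dy, darea
            match.age = 0
            match.observations += 1
        else:
            tokens.append(Token(dx, dy, darea))
    tokens[:] = [t for t in tokens if t.age <= max_age]   # drop dead tokens
    for t in tokens:
        t.age += 1                                         # age every surviving token
    return tokens

def select_object(tokens):
    """Return the token with the highest vote count (Section 6.3)."""
    return max(tokens, key=lambda t: t.observations, default=None)
```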

6.3 Object Selection
At any given moment the object tracker may be keeping track of multiple objects, some of which have
been correctly classified and others erroneously detected. In order for the system to
allow the user to interact with the display, there must be a mechanism for the system to discern which
of these objects is the 'correct' one.

As mentioned before, the ideal case would be for the object detectors to identify the object regions
with 100% accuracy. However, with the object tracker supplementing the detection systems, this requirement
can be relaxed: for a detector to be admissible, it must meet the following requirements.

• The object must be correctly classified in more frames than erroneous regions are classified

• The object must be classified in more frames than it is missed

If these two criteria are met, the token tracking the object will receive more updates than any other
token, and the correct token will therefore have a higher vote (observation count) than all the
other tokens. The location of the object can thus be found in the token that has the highest vote
count.

Chapter 7

Stereopsis

Now that the head and hands have been detected and tracked, obtaining their actual location in space is
the last step before the final output vector can be calculated. Calculating the depth of the user's hand or
head is accomplished using two cameras mounted on the same horizontal axis a certain distance apart,
called the baseline. Similar to how humans perceive depth using two eyes, the computer can calculate
depth from two cameras that are properly aligned. The method through which this is accomplished is
explained in this chapter.

7.1 Overview
Using a stereo pair of cameras gives us two perspectives of the same scene. By comparing these two
perspectives, we gain additional information about the scene. Assuming there is no major occlusion, we can
find a point in the right image and find that same point in the left image. With these two points we can
triangulate the depth of the point in real space. One can imagine that, if two cameras are mounted side by
side, the left and right images of a far-away object will look almost identical. In fact, as the distance of the
object from the cameras increases past a certain threshold, there is no discernible difference between
the two images. On the other hand, if the object is very close, the two perspectives of the object will be very
different. By using this difference in the images, the depth can be extracted. Thus, the goal is to find and
quantify this difference.

7.2 Epipolar Geometry


An easy way to quantify this difference is to compare the locations at which a point on the object is
projected in the two images. The difference between these two locations gives a measure of how far
away the point is. This difference is called the disparity, and is measured as the distance between the two
projections of the same point onto the CCDs of the cameras. Thus, given a point in one image, it is necessary
to find the corresponding point in the other image. It is desirable to have these two points, representing
the same point in real space, lie on the same horizontal line in the image; this way the search for the
corresponding point can be limited to one dimension.
In order to use stereo cameras to extract depth, it is necessary to have the left and right images calibrated and
aligned so that they share the same horizontal scan lines. That is, any point in real space corresponds
to a point in the right image and a point in the left image on the same horizontal line. Mounting the two
cameras parallel on the same horizontal axis achieves this easily. However, since our cameras are slightly
verged in, the points projected into the two images are not on the same horizontal line. To rectify this problem,
we employ the SVS Videre tool, which performs a transformation on the images that effectively makes the
two cameras parallel. This calibration process requires several images of a checkerboard pattern held
at different angles, and then outputs two rectified images that correct not only for the verged cameras, but
also for the distortion in the lenses.

7.3 Correspondence Problem


Given a point in one image (a hand or a head, for example), how do we locate that same point in the
other image? Since the images have been rectified, the point must lie on the same horizontal line in both
images, so the search for the corresponding point is narrowed down to one dimension. As the measure of
correspondence, the Normalized Cross-Correlation (NCC) is used. A window around the object of interest in one
image is taken and slid across the entire horizontal line, computing the NCC at each horizontal position along
the line. The horizontal position with the largest NCC value is then taken to be the corresponding point.
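A sketch of this one-dimensional NCC search using OpenCV's template matching (cv2.TM_CCORR_NORMED is normalized cross-correlation; the window size, band height, and file names are illustrative assumptions):

```python
import cv2
import numpy as np

def find_correspondence(left, right, x, y, win=21):
    """Given a point (x, y) in the left image, find the matching column in the right image."""
    h = win // 2
    template = left[y - h:y + h + 1, x - h:x + h + 1]
    # Restrict the search to the same horizontal band in the rectified right image.
    band = right[y - h:y + h + 1, :]
    scores = cv2.matchTemplate(band, template, cv2.TM_CCORR_NORMED)
    best_x = int(np.argmax(scores)) + h   # column of the best-matching window centre
    return best_x, x - best_x             # corresponding x and the disparity

left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)
x_right, disparity = find_correspondence(left, right, x=320, y=240)
```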

Figure 7.1: Correspondence Problem

7.4 Range Calculations


Figure 7.2 shows the setup of the cameras. All coordinates will now be with respect to the labeled axes.

Figure 7.2: Camera Setup

• A right-handed coordinate system is used, where the z axis points down and away from the cameras, the
x axis points outward, and the y axis points into the page.

• θ_vergenceL and θ_vergenceR are the angles by which the cameras are rotated inward.

• b is the baseline, or distance between the two cameras.

• f is the focal length of the cameras.

• X_L and X_R are the known distances between the projected points on the CCDs and the centres of the
CCDs, given by the correspondence process.

• θ_L and θ_R are the two angles adjacent to the x-axis. These angles characterize the distance Z.

θ_L and θ_R can be calculated given X_L, X_R, θ_vergenceL, θ_vergenceR, and the focal length. They are
given by:

\theta_L = \arctan\left( \frac{X_L}{f} + \theta_{vergenceL} \right) , \quad
\theta_R = \arctan\left( \frac{X_R}{f} + \theta_{vergenceR} \right)

Using similar triangles, Z can be found as:

Z = \frac{b}{\tan\theta_R - \tan\theta_L}

Notice that as θ_vergenceL,R approach zero, Z becomes inversely proportional to the disparity.
After Z is found, X and Y are easily computed:

X = \frac{Z}{\tan\theta_L} , \quad Y = \frac{Y_L \cdot Z}{f}

where Y_L is analogous to X_L, but in the y-direction.
Given these three values (X, Y, Z), a 3D point of the object of interest is formed.
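A direct transcription of the range equations above into a small helper function (a sketch; the pixel-to-millimetre conversion and the sign conventions for the vergence angles follow Figure 7.2 and are left as assumptions for the caller; camera constants are the nominal values from Chapter 2):

```python
import math

F_MM = 6.0            # nominal focal length (Chapter 2)
BASELINE_MM = 364.57  # nominal baseline (Chapter 2)
PIXEL_MM = 0.0074     # nominal pixel width on the CCD

def range_from_match(xl_px, xr_px, yl_px, verg_l, verg_r):
    """Compute (X, Y, Z) in mm from matched CCD offsets given in pixels from the CCD centre.

    verg_l and verg_r are the signed vergence angles of the left and right cameras; their
    signs depend on the axis conventions of Figure 7.2.
    """
    xl, xr, yl = xl_px * PIXEL_MM, xr_px * PIXEL_MM, yl_px * PIXEL_MM
    theta_l = math.atan(xl / F_MM + verg_l)                  # Section 7.4
    theta_r = math.atan(xr / F_MM + verg_r)
    z = BASELINE_MM / (math.tan(theta_r) - math.tan(theta_l))
    x = z / math.tan(theta_l)
    y = yl * z / F_MM
    return x, y, z
```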

7.5 Range Resolution


The resolution of the range depends on many factors, including how far away the object being resolved is
from the cameras (Z), the baseline of the cameras, the focal length, and the width of a pixel on the CCD.
Figure 7.3 shows the approximately Z^2 relationship of the resolution.
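For reference, in the parallel-camera approximation the range error ΔZ produced by a one-pixel disparity error Δd is commonly written as (a standard result stated here as context, not taken from the original text):

```latex
\Delta Z \approx \frac{Z^2}{f\,b}\,\Delta d
```

which makes the approximately Z^2 growth of the range uncertainty explicit; here f is the focal length, b the baseline, and Δd the pixel width on the CCD.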

Figure 7.3: Range Resolution

Chapter 8

Vector Extrapolation

Given a 3D point for the head location and one for the hand location, a vector indicating where the user is
pointing can be extrapolated. By taking the difference between the hand's 3D point and the head's 3D point
and normalizing its magnitude to 1 meter, a unit vector is formed in the direction of the display.
This vector, together with the 3D location of the user's head, is then output to the interface, where the user's
intended pointing direction can be projected onto the tiled display.
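A sketch of this extrapolation, together with one possible way the interface could project the ray onto a planar display (the plane-intersection step, the example coordinates, and the display plane itself are assumptions for illustration; the report leaves the projection to the interface):

```python
import numpy as np

def pointing_ray(head_xyz, hand_xyz):
    """Unit vector from the head through the hand (Chapter 8)."""
    direction = np.asarray(hand_xyz, dtype=float) - np.asarray(head_xyz, dtype=float)
    return direction / np.linalg.norm(direction)

def intersect_plane(origin, direction, plane_point, plane_normal):
    """Intersect the pointing ray with a display plane given by a point and a normal."""
    origin = np.asarray(origin, dtype=float)
    n = np.asarray(plane_normal, dtype=float)
    t = np.dot(np.asarray(plane_point, dtype=float) - origin, n) / np.dot(direction, n)
    return origin + t * direction

head = np.array([0.0, 0.0, 1200.0])     # illustrative coordinates, in mm
hand = np.array([150.0, 400.0, 900.0])
d = pointing_ray(head, hand)
# Example display plane: passes through (0, 2000, 0) with normal along +y (an assumption).
print(intersect_plane(head, d, plane_point=[0.0, 2000.0, 0.0], plane_normal=[0.0, 1.0, 0.0]))
```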

Chapter 9

Conclusion and Future Work

We have presented an approach to marker-free human-computer interaction using overhead stereo
cameras. We have developed a promising demonstration that can be enhanced to drive large tiled displays.
More work is required for this project to become a truly robust system: speed needs to be improved
and gesture recognition needs to be added. For gesture recognition, we did not get good results with
the overhead camera system; it would be a good experiment to add a third camera in front of the user
solely for gesture recognition. Porting the code to take advantage of multi-core processors such as the Cell
Broadband Engine would likely allow parallel computation of the classifiers for different gestures.

