

Flow Diagram

Input Video -> User Detection -> Face Detection -> Segmentation -> Scaling -> Fitting

User Extraction From The Input Video

User extraction allows us to create an augmented reality environment by
isolating the user area from the video stream and superimposing it onto a virtual
environment in the user interface. Furthermore, it is useful to determine the
region of interest that is also used for skin detection. The Kinect SDK provides
the depth image and the user ID. When the device is working, the depth image is
segmented in order to separate the background from the user. The background is
removed by blending the RGBA image with the segmented depth image: for each
pixel, the alpha channel is set to zero if the pixel does not lie on the user.
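The alpha-blending step above can be sketched as follows, assuming the segmented depth image has already been converted to a boolean user mask aligned with the color frame (the depth-to-color alignment itself is handled by the Kinect SDK and is not shown here; the function name is illustrative):

```python
import numpy as np

def remove_background(rgba_frame, user_mask):
    """Zero the alpha channel wherever the pixel does not lie on the user.

    rgba_frame : H x W x 4 uint8 image from the color stream
    user_mask  : H x W boolean array, True where the segmented depth
                 image labels the pixel as belonging to the user
    """
    out = rgba_frame.copy()
    # Opaque on the user, fully transparent elsewhere.
    out[..., 3] = np.where(user_mask, 255, 0).astype(np.uint8)
    return out
```

When this frame is composited over the virtual environment, the zero-alpha background pixels disappear and only the user remains visible.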

Face Detection:

The main function of this step is to determine whether human faces appear in a given
image, and where these faces are located. The expected outputs of this step are
patches containing each face in the input image. To make the subsequent face
recognition system more robust and easier to design, face alignment is performed
to normalize the scales and orientations of these patches. Besides serving as the
preprocessing for face recognition, face detection can be used for region-of-interest
detection, retargeting, video and image classification, etc. For this purpose the
Viola-Jones algorithm is used.


The Viola-Jones object detection framework, proposed in 2001 by Paul Viola and
Michael Jones, was the first object detection framework to provide competitive
object detection rates in real time. Although it can be trained to detect a variety
of object classes, it was motivated primarily by the problem of face detection.
This algorithm is implemented in OpenCV as cvHaarDetectObjects().

Three methods give the framework its robust object detection rates in real time:

1. Integral image representation allows fast feature computation.

2. AdaBoost (adaptive boosting)-based classifier training procedure.

3. Classifier cascade allows fast rejection of non-face images.

Rectangular features are used instead of raw pixels, and each feature acts as a
weak classifier. Each boosting round selects the optimal feature given the
previously selected features, minimizing an exponential loss.

Three major contributions/phases of the algorithm:

Feature extraction

Classification using boosting

Multi-scale detection algorithm

Feature extraction and feature evaluation.

Rectangular features are used; with a new image representation, their calculation is
very fast. Classifier training and feature selection use a slight variation of a method
called AdaBoost. A combination of simple classifiers is very effective.


There are four basic feature types.

They are easy to calculate: the sum of pixels under the white areas is subtracted
from the sum under the black ones.
A special representation of the sample, called the integral image, makes feature extraction faster.

Summed-area tables
A representation in which any rectangle's sum can be calculated in four accesses
of the integral image.
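A minimal sketch of the summed-area table and the four-access rectangle sum (function names are illustrative):

```python
import numpy as np

def integral_image(img):
    """Summed-area table, padded with a leading zero row/column so that
    ii[y, x] = sum of img[0:y, 0:x] (exclusive bounds)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum over any rectangle in exactly four accesses of the table."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])
```

Because every rectangle sum is four lookups regardless of its size, each Haar-like feature (a difference of two to four rectangle sums) costs constant time.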

Features are extracted from sub-windows of a sample image.
The base size for a sub-window is 24 by 24 pixels.
Each of the four feature types is scaled and shifted across all possible positions
within the sub-window.

In a 24 pixel by 24 pixel sub-window there are ~160,000 possible features to be calculated.





In computer vision, the Lucas-Kanade method is a widely used differential
method for optical flow estimation, developed by Bruce D. Lucas and Takeo
Kanade. It assumes that the flow is essentially constant in a local neighbourhood of
the pixel under consideration, and solves the basic optical flow equations for all the
pixels in that neighbourhood by the least squares criterion.

By combining information from several nearby pixels, the Lucas-Kanade method

can often resolve the inherent ambiguity of the optical flow equation. It is also less
sensitive to image noise than point-wise methods. On the other hand, since it is a
purely local method, it cannot provide flow information in the interior of uniform
regions of the image.


The Lucas-Kanade method assumes that the displacement of the image
contents between two nearby instants (frames) is small and approximately constant
within a neighborhood of the point p under consideration. Thus the optical flow
equation can be assumed to hold for all pixels within a window centered at p.
Namely, the local image flow (velocity) vector (Vx, Vy) must satisfy

Ix(q_i) Vx + Iy(q_i) Vy = -It(q_i),   for i = 1, ..., n

where q_1, ..., q_n are the pixels inside the window, and Ix(q_i), Iy(q_i), It(q_i) are
the partial derivatives of the image I with respect to position x, y and time t,
evaluated at the point q_i and at the current time.

These equations can be written in matrix form A v = b, where

A = [ Ix(q_1)  Iy(q_1) ; ... ; Ix(q_n)  Iy(q_n) ],   v = [ Vx ; Vy ],   b = [ -It(q_1) ; ... ; -It(q_n) ]

This system has more equations than unknowns and thus it is usually over-
determined. The Lucas-Kanade method obtains a compromise solution by the least
squares principle. Namely, it solves the 2x2 system

A^T A v = A^T b

where A^T is the transpose of matrix A. That is, it computes

v = (A^T A)^{-1} A^T b

with the sums in A^T A and A^T b running from i = 1 to n.

The matrix A^T A is often called the structure tensor of the image at the point p.
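The per-window least-squares solve follows directly from this derivation. A sketch, assuming the derivative arrays have already been computed (e.g. by finite differences between frames); the function name is illustrative:

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve the over-determined system A v = b for one window by least
    squares: v = (A^T A)^{-1} A^T b.

    Ix, Iy, It : flattened spatial and temporal derivatives at the
                 n pixels q_1..q_n inside the window.
    Returns (Vx, Vy), or None when the structure tensor is singular
    (a uniform region, where the flow is ambiguous).
    """
    A = np.stack([Ix, Iy], axis=1)   # n x 2 matrix of [Ix(q_i), Iy(q_i)]
    b = -It                          # right-hand side, -It(q_i)
    AtA = A.T @ A                    # 2 x 2 structure tensor
    if abs(np.linalg.det(AtA)) < 1e-12:
        return None
    Vx, Vy = np.linalg.solve(AtA, A.T @ b)
    return Vx, Vy
```

Returning None in the singular case mirrors the limitation noted above: a purely local method has no flow information inside uniform regions.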

Morphology is a technique of image processing based on shapes. The value of each

pixel in the output image is based on a comparison of the corresponding pixel in the
input image with its neighbors. By choosing the size and shape of the neighborhood,
you can construct a morphological operation that is sensitive to specific shapes in
the input image. Morphological operations are especially suited to the processing of
binary images and greyscale images.

Dilation and Erosion

Dilation and erosion are two fundamental morphological operations.

Dilation adds pixels to the boundaries of objects in an image: each background
pixel that has a neighbor in the object is relabeled as an object pixel, making
the object bigger (also called growing).

Erosion removes pixels on object boundaries: boundary pixels are eliminated
from objects in binary images, making objects smaller (also called shrinking).

Opening and Closing

Opening: a single erosion followed by a single dilation by the same operator,

R ∘ A = (R ⊖ A) ⊕ A

Closing: a single dilation followed by a single erosion by the same operator,

R • A = (R ⊕ A) ⊖ A

The tracking algorithm of the Kinect SDK was created using a large training data set
with over 500k frames distributed over many video sequences. The data set was
generated synthetically by rendering depth images of common poses of
actions frequently performed in video games, such as dancing, running or
kicking, along with their labeled ground truth pairs. In order to eliminate
near-duplicate poses in the data set, a small offset of 5 cm is used to create a
subset of poses that are all at least as distant as the offset from each other.
Pixel depths are then normalized before training so that they become translation
independent. After that, training is done to form a decision forest of depth 20.

Figure: Body joints that are used for positioning of the models.


In computer vision, image segmentation is the process of partitioning

a digital image into multiple segments. The goal of segmentation is to simplify
and/or change the representation of an image into something that is more meaningful
and easier to analyze. Image segmentation is typically used to locate objects and
boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the
process of assigning a label to every pixel in an image such that pixels with the same
label share certain visual characteristics.

The result of image segmentation is a set of segments that collectively cover the
entire image, or a set of contours extracted from the image. Each of the pixels in a
region is similar with respect to some characteristic or computed property, such
as color, intensity, or texture.

Skin Segmentation
The model of the costume is superimposed on the top layer; the user always
stays behind the model, which restricts some possible actions of the user, such as
folding arms or holding hands in front of the t-shirt. To solve that issue, skin-
colored areas are detected and brought to the front layer. HSV and YCbCr color
spaces are commonly used for skin color segmentation. Here the YCbCr color
space is preferred, and the RGB images are converted into YCbCr color space by
using the following equations:

Y = 0.299R + 0.587G + 0.114B

Cb = 128 - 0.169R - 0.332G + 0.5B
Cr = 128 + 0.5R - 0.419G - 0.081B (1)
Chai and Ngan report the most representative color ranges of human skin in the
YCbCr color space. A threshold is applied to the color components of the
image within the following ranges:

77 < Cb < 127

133 < Cr < 173
Y < 70 (2)

Since we have the extracted user image as a region of interest, the threshold is
applied only to the pixels that lie on the user. Thus, areas in the background
that may resemble skin color are not processed.
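A sketch of the thresholding restricted to the extracted user region. The conversion and the Cb/Cr ranges follow Eqs. (1)-(2); the luminance bound is left out of this sketch, since the chrominance ranges carry most of the discrimination, and the function name is illustrative:

```python
import numpy as np

def skin_mask(rgb, user_mask=None):
    """Boolean mask of skin-colored pixels via YCbCr thresholding.

    rgb       : H x W x 3 uint8 image
    user_mask : optional H x W boolean region of interest (the
                extracted user); pixels outside it are never marked
    """
    R = rgb[..., 0].astype(np.float32)
    G = rgb[..., 1].astype(np.float32)
    B = rgb[..., 2].astype(np.float32)
    Cb = 128 - 0.169 * R - 0.332 * G + 0.5 * B
    Cr = 128 + 0.5 * R - 0.419 * G - 0.081 * B
    mask = (77 < Cb) & (Cb < 127) & (133 < Cr) & (Cr < 173)
    if user_mask is not None:        # skip skin-like background areas
        mask &= user_mask
    return mask
```

Passing the user extraction mask as user_mask realizes the region-of-interest restriction described above.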


It uses three parameters, they are


It is used to measure the distance of object from screen.


It is used to detect side distance from left and right corner.


It is used to detect rotating angle of the object.

Model Positioning and Rotation

The skeletal tracker returns the 3D coordinates of 20 body joints, in terms of pixel
locations for the x and y coordinates and meters for the depth. The joints used
to locate the parts of the 2D cloth model are seen in the tracking algorithm. One may
notice flickers and vibrations in the joints due to the frame-based recognition
approach. This problem is partially solved by adjusting the smoothing parameters of
the skeleton engine of the Kinect SDK. The rotation angles of the parts of the model
are defined as the angles of the main axes of the body and arms:

θ = atan2(y_joint1 - y_joint2, x_joint1 - x_joint2) (3)

Here the main body axis is defined as the line between the shoulder center and the
hip center, the axes for the arms are defined as the lines between the
corresponding shoulders and elbows, and atan2 is a function similar to the inverse
tangent but defined only in the interval [0, π). For each model part an anchor point
is defined as the center of rotation, which is set to the middle of the
corresponding line of axis.
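The angle of Eq. (3) and the anchor point can be sketched as follows; the helper names are illustrative, and the fold into [0, π) follows the text:

```python
import math

def part_rotation(joint1, joint2):
    """Rotation angle (Eq. 3) of a model part from the two joints that
    define its axis, e.g. shoulder center and hip center for the body.
    Joints are (x, y) pixel coordinates; the result is folded into
    [0, pi) as stated in the text."""
    angle = math.atan2(joint1[1] - joint2[1], joint1[0] - joint2[0])
    return angle % math.pi

def anchor_point(joint1, joint2):
    """Center of rotation: the midpoint of the axis between the joints."""
    return ((joint1[0] + joint2[0]) / 2, (joint1[1] + joint2[1]) / 2)
```

The same two helpers serve the body (shoulder center / hip center) and each arm (shoulder / elbow).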

Model Scaling

A naive approach to model scaling is to employ the distance between the joints
as a scaling factor. This approach is convenient for accommodating height- and
weight-related variations between users. However, the distance between the
joints is not sufficiently accurate to scale the model to simulate the distance from
the sensor. Hence, we adopt a hybrid approach in order to perform shape- and
distance-based scaling with enhanced accuracy. The depth-based scaling variable
DS is defined straightforwardly as

DS = S_model / Z_spine (4)

where S_model is a vector of the default width and height of the model when the
user is one meter distant from the sensor, and Z_spine is the distance of the spine
of the user to the sensor.
The shape-based length of an arm, L^k_arm, is defined as the Euclidean distance
between the corresponding shoulder and the elbow. The width W_body and height
H_body of the body are calculated in a similar fashion, using the distance between
the shoulders and the distance between the shoulder center and the hip center
respectively, as follows:

L^k_arm = sqrt( (x^k_shoulder - x^k_elbow)^2 + (y^k_shoulder - y^k_elbow)^2 )

W^k_arm = α · L^k_arm

W_body = |x_shoulder_left - x_shoulder_right|

H_body = |y_shoulder_center - y_hip_center|

Here x and y are the 2D coordinates of the joints, α is a constant, the width-to-
length ratio of the arms, and
k = {left, right}
We define a scaling function S as a weighted average of the depth- and shape-based
scaling factors:

S^k_arm = K (w1 L^k_arm + w2 DS) / (w1 + w2)

S_body = K (w1 H_body + w2 DS) / (w1 + w2)

where K is an arbitrary parameter to adjust the size for different models as a global
scaling factor, and w1 and w2 are the weights for the shape- and depth-based
scaling factors. An increase in w2 extends the range of displacement from the
sensor but decreases the robustness to weight- and height-based shape
variations of users. In the experiments the parameters are defined as K = 1, w1 = 1
and w2 = 3.
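A sketch of the hybrid scaling, written under the assumption that the weighted average takes the form S = K(w1·shape + w2·DS)/(w1 + w2), with the shape factor being L^k_arm for the arms and H_body for the body; the function names are illustrative:

```python
import math

def depth_scale(s_model, z_spine):
    """DS = S_model / Z_spine (Eq. 4): the default model size at 1 m,
    divided by the spine-to-sensor distance in meters."""
    return s_model / z_spine

def arm_length(shoulder, elbow):
    """Shape-based arm length: Euclidean distance between the 2D
    shoulder and elbow joint coordinates."""
    return math.hypot(shoulder[0] - elbow[0], shoulder[1] - elbow[1])

def hybrid_scale(shape_factor, ds, K=1.0, w1=1.0, w2=3.0):
    """Weighted average of the shape- and depth-based scaling factors.
    Assumed form: S = K * (w1 * shape + w2 * DS) / (w1 + w2), with the
    experimental defaults K = 1, w1 = 1, w2 = 3 from the text."""
    return K * (w1 * shape_factor + w2 * ds) / (w1 + w2)
```

With the default weights, the depth term dominates (w2 = 3), matching the observation that raising w2 extends the usable displacement range at some cost in robustness to body-shape variation.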