
An Effective Fusion Scheme of Spatio-Temporal Features for Human Action Recognition in RGB-D Video

Quang D. Tran
Faculty of Information Technology
University of Science, VNU-HCMC
tdquang@outlook.com

Ngoc Q. Ly
Faculty of Information Technology
University of Science, VNU-HCMC
lqngoc@fit.hcmus.edu.vn

Abstract: We investigate the problem of human action recognition by studying the effects of fusing feature streams retrieved from color and depth sequences. Our main contribution is two-fold. First, we present the 3DS-HONV descriptor, a spatio-temporal extension of the Histogram of Oriented Normal Vectors (HONV), specifically designed to capture joint shape-motion cues from depth sequences. Second, we develop an effective RGB-D feature fusion scheme that exploits information from both color and depth channels to extract expressive representations for action recognition in real scenarios. Despite its simplicity, the 3DS-HONV descriptor performs surprisingly well and achieves state-of-the-art performance on the MSRAction3D dataset, with 88.89% overall accuracy. Further experiments demonstrate that our feature fusion scheme also generalizes well and achieves good results on the one-shot-learning ChaLearn Gesture Dataset (CGD2011).

I. INTRODUCTION

Machine-based human gesture and action understanding is a very challenging task with an enormous range of applications, such as human-computer interaction, user interface design, robot learning, and surveillance [1]. Over the past decades, many approaches have dealt with this problem in many ways. However, many layers of complexity remain to be studied, owing to the extensive range of possible human motions and variations in scene, viewpoint, object occlusion, etc.

Recently introduced RGB-D cameras (e.g. Kinect) are capable of providing high-quality synchronized videos of both color and depth. With its advanced sensing techniques, this technology opens up an opportunity to significantly increase the capabilities of many automated vision-based recognition tasks [2]. It also raises the problem of developing expressive features for the color and depth channels of these sensors, and of effectively fusing them to enrich the information available to later recognition stages. To demonstrate the benefits retrieved from the depth map, as well as the necessity of color information for action and gesture recognition, a representative depth-based descriptor and an effective color-depth feature fusion scheme are presented in this study.

Specifically, the two main contributions of this paper are summarized as follows. First, inspired by the success of [3] on object recognition with a depth-based representation of oriented normals (i.e. HONV), we present an extension of the HONV descriptor to the spatio-temporal domain and apply it in the context of action recognition. Through extensive experiments on depth-only sequences, we show that the presented descriptor characterizes well the joint shape-motion cues of an object by capturing the distribution of surface normal orientations in spherical coordinates. Despite its simplicity, our depth-based descriptor leads to surprisingly good results on the depth-aware MSRAction3D dataset. The second main contribution is a feature fusion framework that profits from multi-modal RGB-D data by combining information from both RGB images and depth maps. To evaluate the presented system, we conduct experiments and compare its performance with several of the best previous studies on the publicly available ChaLearn dataset (i.e. CGD2011) [4].

The remainder of this paper is organized as follows: Section II reviews related work. Section III details a depth-aware spatio-temporal feature description, along with a combination of HOG and an extension of HOF to capture appearance and motion information from RGB-D data. Section IV describes our feature fusion framework for RGB-D data. Extensive experimental results are provided in Section V. Finally, Section VI draws conclusions.

II. RELATED WORKS


Comprehensive reviews of previous studies can be found in [1]. Our discussion in this section is restricted to a few influential and relevant parts of the literature, with a focus on hand-crafted feature extraction and representation. Although most previous work on feature extraction focused on 2D videos [1], several approaches for extracting features from 3D videos have been proposed in the past few years. One of the earliest works on action recognition using a depth sensor was presented in [5], where Li et al. proposed a bag-of-3D-points feature representation for action recognition from depth map sequences, with the 3D points sampled from the silhouettes of the depth maps. In [6], Ni et al. proposed 3D Motion History Images (3D-MHIs) and the Depth-Layered Multi-Channel STIPs (DLMC-STIPs) framework, where STIPs are divided into multiple depth-layered channels and then pooled within their respective depth layers, yielding a multiple-depth-channel histogram representation.
Since color-based spatio-temporal interest point detectors, such as Harris3D [7] or Hessian3D [8], are not necessarily reliable in depth sequences, alternative approaches have been presented that sidestep this step by using the available skeleton joint information instead. For instance, Jiang et al. [9] first extracted the human skeleton using the skeleton tracking algorithm in [2]. They then proposed an LOP feature, which computes local occupancy information based on the 3D point cloud around a particular joint to discriminate different types of interactions, and an actionlet ensemble model to represent each action.

However, skeletal joint data are often unreliable and not always available, especially in scenarios where the camera is mounted on the ceiling. To that end, holistic approaches, in which a global feature is obtained for the entire sequence instead of relying on local points, are considered a way to alleviate the aforementioned issues. Studies that follow this approach include [10], [11], and [12]. In [12], Oreifej and Liu proposed a 4-dimensional extension of surface oriented normal vectors [3] and a quantization method using the 4D 600-cell polychoron (i.e. 120 uniform projectors). This quantization method was inspired by the successful work of [13], which used a 3D polyhedron to quantize 3D oriented gradient vectors. The authors of [12] further carried out a quantization-step refinement, using the classification score in the training phase to perturb the original uniform projectors, since they showed that non-uniform quantization performs much better. As a consequence, although it bypasses the use of a skeleton tracker, their method still achieved state-of-the-art performance on several standard benchmarks (e.g. MSRAction3D) at the time of this paper.

III. FEATURE DESCRIPTION & REPRESENTATION

The recently proposed HONV descriptor [3] encodes the distribution of surface oriented normal vectors computed over the depth map. Tang et al. [3] showed that the field of normal vectors effectively captures the 3D shape structure of an object in depth space. Inspired by this result, in this section we first describe a spatio-temporal extension of HONV, called 3DS-HONV. Despite its simplicity, the 3DS-HONV descriptor captures well the joint shape-motion structures that are essential for learning robust action classifiers in depth sequences. In addition, we introduce an efficient extension of the Histogram of Optical Flow [14], named HOF2.5D, which is able to capture the distribution of not only in-plane movements extracted from RGB data but also z-axis motions obtained from the available depth information. We finally give a brief discussion of the HOG-HOF2.5D descriptor, an early fusion that concatenates HOG and HOF2.5D into one vector. We believe that this aggregation of descriptions can satisfactorily capture an expressive appearance-shape representation for RGB-D data.

Fig. 1. Process of extracting the 3DS-HONV feature descriptor from a Region of Interest (ROI): (a) the surface normal is computed at each point in the ROI, (b) a 3D histogram of the normal distribution in spherical coordinates is constructed, (c) the 3D histograms at all points in the ROI are accumulated.

A. 3D Spherical Histogram of Oriented Normal Vectors (3DS-HONV)

The process of computing the 3DS-HONV descriptor for a specific Region of Interest (ROI) in an action depth sequence is illustrated in Fig. 1. For each ROI, the orientation of the normal vector at each depth point is first computed (Fig. 1(a)), quantized in spherical coordinates using three angles $\theta$, $\phi$, $\psi$, and voted into a 3D histogram $q_j \in \mathbb{R}^{b_\theta \times b_\phi \times b_\psi}$ (Fig. 1(b)), where $b_i$ is the corresponding number of bins. The 3D histograms at all interest points in the ROI are then accumulated to create a histogram of the normal orientation distribution (Fig. 1(c)). The details of this computation are as follows.
1) Spatio-Temporal Surface Oriented Normal Vectors: The depth sequence can be considered as a function $\mathbb{R}^3 \rightarrow \mathbb{R}^1$: $z = d(x, y, t)$ (where $d(\cdot)$ denotes the depth sequence), which constitutes a surface in 4D space represented as the set of points $\{p = (x, y, t, z)\}$ satisfying $S(p) = d(x, y, t) - z = 0$. The normal to the surface $S$ is computed as:

$$ n = \nabla S = (z_x, z_y, z_t, -1) = \left( \frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}, \frac{\partial z}{\partial t}, -1 \right) \qquad (1) $$

where $z_x$, $z_y$, $z_t$ are the first derivatives of the depth map $z$ with respect to $x$, $y$, $t$, which can be calculated using finite-difference approximations. Since only the orientation of the normal depicts the shape of the 4D surface, the computed normal vectors are then normalized to unit length: $\hat{n} = (z_x, z_y, z_t, -1)^T / \left\| (z_x, z_y, z_t, -1) \right\|_2$.
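To make this construction concrete, the following minimal sketch (our illustration, not the authors' implementation) computes the spatio-temporal normal field of Eq. (1) with finite differences and normalizes it to unit length; the (T, H, W) depth-array layout is an assumption.

```python
# Illustrative sketch (not the authors' code): spatio-temporal surface normals
# of Eq. (1) via finite differences; the (T, H, W) depth layout is assumed.
import numpy as np

def spatio_temporal_normals(depth):
    """depth: (T, H, W) float array of a depth sequence z = d(x, y, t)."""
    z_t, z_y, z_x = np.gradient(depth)          # first derivatives over t, y, x
    n = np.stack([z_x, z_y, z_t, -np.ones_like(depth)], axis=-1)   # (T, H, W, 4)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / norm                              # unit normals n_hat
```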
2) Spherical Quantization & 3D Histogram Representation: In our work, the orientation of the spatio-temporal surface normal is characterized by three Euler angles $\{\theta, \phi, \psi\} \in [0; \pi]$ computed in spherical coordinates (see Fig. 1(b)). The Euler angles are a classical way to specify the orientation of an object in space with respect to a fixed set of coordinate axes [15]. According to Euler's rotation theorem [15], any rotation may be described using three angles; therefore, by using just the three Euler angles $\theta$, $\phi$, $\psi$ for quantization, the resulting histogram can encode any surface normal orientation in a rich representation. Compared to quaternion-based quantization (i.e. the 4D polychoron of [12]), Euler-angle-based quantization is simple and intuitive, yet also efficient. The approximate computation of the Euler angles $\theta$, $\phi$, $\psi$ [15] is summarized in (2):

$$ \theta = \tan^{-1}\!\left( z_y / z_x \right), \quad \phi = \tan^{-1}\!\left( \frac{(z_x^2 + z_y^2)^{1/2}}{z_t} \right), \quad \psi = \tan^{-1}\!\left( \frac{1}{(z_x^2 + z_y^2 + z_t^2)^{1/2}} \right) \qquad (2) $$

To create the 3D histogram representation for each depth point, the $[0; \pi]$ interval is subdivided into $b_\theta$, $b_\phi$, $b_\psi$ bins, so that the histogram has a total of $b_\theta \times b_\phi \times b_\psi$ bins, and is then normalized to give the proportion of normals falling into each bin. Through extensive experiments, we find that the bin configuration $\{b_\theta = 5, b_\phi = 5, b_\psi = 6\}$ generally gives the best performance in the classification stage. This leads to a 150-dimensional 3DS-HONV descriptor for each depth point in an action sequence.
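A minimal sketch of this quantization step is given below (our illustration, not the authors' code): it converts the unit normals to the three angles of Eq. (2) and votes them into the 5 x 5 x 6 histogram described above. The use of arctan2 and the wrapping of angles into [0, pi) are implementation assumptions.

```python
# Illustrative sketch (not the authors' code): quantizing normal orientations
# into the b_theta x b_phi x b_psi = 5 x 5 x 6 histogram described above.
import numpy as np

def three_ds_honv(normals, bins=(5, 5, 6)):
    """normals: (..., 4) unit normals (z_x, z_y, z_t, -1)/norm from a ROI."""
    zx = normals[..., 0].ravel()
    zy = normals[..., 1].ravel()
    zt = normals[..., 2].ravel()
    eps = 1e-8
    # Angles of Eq. (2), wrapped into [0, pi) for binning.
    theta = np.arctan2(zy, zx) % np.pi
    phi = np.arctan2(np.sqrt(zx**2 + zy**2), zt) % np.pi
    psi = np.arctan2(1.0, np.sqrt(zx**2 + zy**2 + zt**2) + eps) % np.pi
    hist, _ = np.histogramdd(np.stack([theta, phi, psi], axis=1),
                             bins=bins, range=[(0, np.pi)] * 3)
    hist = hist.ravel()                           # 150-dimensional descriptor
    return hist / (hist.sum() + eps)              # proportion per bin
```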
B. Histogram of Semi-Scene Flow (HOF2.5D)
Though many approaches have focused on the computation of 2D optical flow (OF) and the related 2D Histogram of Flow (HOF), the computation of 3D motion vectors (i.e. scene flow) is still an active research topic and only a few works employ it [16], [17]. The main reason is the computational complexity required by variational methods. Here, we present a simple but effective implementation of semi-scene flow, which we call OF2.5D. This descriptor is not generated from a unified image sequence function f(x, y, z, t) as in [16], [17]; instead, it captures the xy-motions from pairs of RGB images and the z-movements from pairs of depth frames separately.

We assume a calibrated RGB-D setting, in which the position of each pixel in the RGB image can be exactly mapped to the related cloud point in the depth map. Specifically, each pixel $p_t^{RGB} = \{x_t^{RGB}, y_t^{RGB}\}$ belonging to the ROI of an RGB-D frame $F_t$ can easily be reprojected to its corresponding position $p_t^D = \{x_t^D, y_t^D\}$ in the depth map. The HOF2.5D descriptor is computed as follows. For each ROI in an RGB frame $F_{t=1..N-1}^{RGB}$, the $\{V_x, V_y\}$ components of the OF field at every pixel are computed using the Lucas-Kanade algorithm [14]. To create OF2.5D at each calibrated pixel $p_t = \{p_t^{RGB}, p_t^D\}$, we use the available depth maps to compute the $V_z$ component of the flow vector as $V_z = F_{t+1}^D(p_{t+1}^D) - F_t^D(p_t^D)$.

As a result, for each RGB-D frame $F_t$ we obtain a feature descriptor $D = \{D_1, D_2, ..., D_n\}$, where each element $D_i = \{V_x, V_y, V_z\}$ is a 3D vector that satisfactorily captures the 3D motion of a particular pixel belonging to a specific ROI. In addition, to make the resulting flow descriptors invariant to the overall speed of the action, each moving pixel is normalized by its L2-norm.
Fig. 2. Quantization scheme for computing HOF2.5D.

As a final representation for each action sequence, we perform a histogram quantization using the three orthogonal planes xy, xz, and yz, as shown in Fig. 2. Specifically, we compute the orientations of each OF2.5D vector on the three projected planes as:

$$ \theta_1 = \tan^{-1}(V_y / V_x), \quad \theta_2 = \tan^{-1}(V_z / V_x), \quad \theta_3 = \tan^{-1}(V_z / V_y) \qquad (3) $$

We then evenly deploy $b_{\theta_1}$, $b_{\theta_2}$, $b_{\theta_3}$ orientation bins on the three orthogonal planes to generate a histogram representation of each semi-scene flow vector, which we call HOF2.5D. In all experiments we set $b_{\theta_1} = b_{\theta_2} = b_{\theta_3} = 8$, since this setting yields the best result during the parameter grid search. Consequently, for each ROI, accumulating the HOF2.5D descriptors at all pixels gives a 24-bin histogram that captures the distribution of motion flows in that ROI.
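The sketch below illustrates one possible implementation of the OF2.5D flow and its 24-bin histogram for a single ROI (our illustration, not the authors' code); OpenCV's Farneback dense flow is used as a stand-in for the Lucas-Kanade flow mentioned above, and the RGB-depth alignment is assumed to be done already.

```python
# Illustrative sketch (not the authors' code): OF2.5D flow of a ROI and its
# 24-bin HOF2.5D histogram (8 orientation bins per projection plane).
import cv2
import numpy as np

def hof25d(gray_t, gray_t1, depth_t, depth_t1, bins=8):
    """gray_*: consecutive grayscale ROI crops; depth_*: aligned depth crops."""
    flow = cv2.calcOpticalFlowFarneback(gray_t, gray_t1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0], flow[..., 1]
    vz = depth_t1.astype(np.float32) - depth_t.astype(np.float32)   # z-motion

    # L2-normalize each flow vector for invariance to overall action speed.
    mag = np.sqrt(vx**2 + vy**2 + vz**2) + 1e-8
    vx, vy, vz = vx / mag, vy / mag, vz / mag

    hist = []
    for a, b in [(vy, vx), (vz, vx), (vz, vy)]:          # xy, xz, yz planes
        ang = np.arctan2(a, b)                           # orientation on plane
        h, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
        hist.append(h)
    return np.concatenate(hist).astype(np.float32)       # 24-bin descriptor
```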
C. HOG-HOF2.5D
Since the HOF2.5D descriptor mainly encodes 3D motion information from RGB-D data, we further combine it with HOG, a popular appearance descriptor, to augment its ability to capture more semantic information. In detail, for each ROI of an RGB image we compute a one-channel (i.e. grayscale) HOG descriptor and concatenate it with the HOF2.5D computed from the RGB-D data. For the parameter settings used in later experiments (e.g. the size of the local patch around each STIP, the number of space-time ROIs in each patch, etc.) when computing color and depth descriptors such as HOG, HOF, and HOF2.5D at each interest point, we follow the same setup as [7].
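As a simple illustration of this early fusion (not the authors' code), the HOG vector of a grayscale ROI can be concatenated with its HOF2.5D histogram; scikit-image's HOG is used here for convenience and the patch parameters of [7] are not reproduced.

```python
# Illustrative sketch (not the authors' code): early fusion by concatenating a
# grayscale HOG descriptor with the HOF2.5D histogram of the same ROI.
import numpy as np
from skimage.feature import hog

def hog_hof25d(gray_roi, hof25d_hist):
    hog_vec = hog(gray_roi, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2))
    return np.concatenate([hog_vec, hof25d_hist])   # HOG-HOF2.5D descriptor
```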
D. Feature Representation via Sparse Coding
In the sections above, we defined the 3DS-HONV descriptor, which effectively captures joint shape-motion information in depth sequences, while HOG-HOF2.5D approximately depicts both appearance and 3D motion from RGB-D data. However, these descriptors are still likely to be affected by noise, and it is desirable to filter them beforehand. One possible technique, recently investigated in the machine learning community, is sparse coding (SC) [18]. The main idea behind SC is to compress the information into a representation with greater expressive power, obtained by decomposing the signals into linear combinations of a few elements from a given or learned dictionary [19]. SC approaches often involve a preliminary stage, called dictionary learning, where the codewords are learned directly from the data [19]. Specifically, given the set of previously computed descriptors $h = [h_1, ..., h_n]$ (where $h$ can be the 3DS-HONV or HOG-HOF2.5D descriptors), the goal is to learn a dictionary $D$ and a code $U$ that minimize the reconstruction error:

$$ \min_{D, U} \; \| h - DU \|_F^2 + \lambda \| U \|_1 \qquad (4) $$

where $\| \cdot \|_F$ is the Frobenius norm. Fixing $U$, the above optimization reduces to a least-squares problem, while for a given $D$ it is equivalent to a linear regression with an L1 penalty. The solution is computed via the feature-sign search algorithm [18]. Thus, after the feature description stage, each action video of $n$ frames can be described as a sequence of codes $u_1, ..., u_n$ rather than the original descriptors $h$.
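A hedged sketch of this stage is shown below (not the authors' implementation): scikit-learn's DictionaryLearning is used as a stand-in for the feature-sign search solver of [18], with the dictionary size of 200 that is used later for CGD2011.

```python
# Illustrative sketch (not the authors' code): dictionary learning and sparse
# coding of descriptors as in Eq. (4), via scikit-learn's DictionaryLearning.
import numpy as np
from sklearn.decomposition import DictionaryLearning

def sparse_codes(H, n_atoms=200, lam=1.0):
    """H: (n, d) matrix whose rows are 3DS-HONV or HOG-HOF2.5D descriptors."""
    dl = DictionaryLearning(n_components=n_atoms, alpha=lam,
                            transform_algorithm="lasso_lars",
                            transform_alpha=lam, max_iter=50)
    U = dl.fit_transform(H)          # sparse codes u_1, ..., u_n
    D = dl.components_               # learned dictionary D
    return U, D
```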

IV. FEATURE FUSION PIPELINE FOR RGB-D DATA

In this section, we discuss our feature fusion scheme, which is specifically designed to exploit the benefits provided by multi-modal RGB-D data for the task of action recognition. Our feature fusion framework is depicted in Fig. 3 and described in detail below.

Fig. 3. Feature fusion scheme for representing RGB-D data (e.g. CGD2011 [4]). STIPs are detected in the depth channel (after noise removal) and in the RGB channel; 3DS-HONV is computed around the depth interest points, while HOG and HOF2.5D (xy-plane and z-axis motions) are computed around the RGB interest points; the descriptors are encoded as visual codewords via sparse coding (sc-HOF2.5D-HOG, i.e. h_RGB, and sc-3DS-HONV, i.e. h_D) and finally merged by a late fusion scheme for classification.

A. Preprocessing Stage

Since 3D sensors such as the Kinect use structured light to estimate depth, they are prone to noise caused by reflection issues. This noise can significantly degrade the overall performance of a depth-based action recognition framework. Therefore, for each RGB-D action sample, we first apply Gaussian smoothing to relieve the effect of noise in the depth channel.

B. Keypoint Detection

In order to reduce the number of points compared with dense sampling, we use the spatio-temporal interest point (STIP) detector [7], an extension of the well-known Harris detector to the temporal dimension. The STIP detector first computes the second-moment $3 \times 3$ matrix $\mu$ of first-order spatial and temporal derivatives. It then searches for regions in the video with significant eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of $\mu$, combining the determinant and the trace of $\mu$:

$$ H = |\mu| - k \, \mathrm{Tr}(\mu)^3 \qquad (5) $$

where $|\mu|$ is the determinant, $\mathrm{Tr}(\mu)$ the trace, and $k$ a relative importance constant. As we have multimodal RGB-D data, we apply the STIP detector separately on the RGB and depth volumes, obtaining two sets of interest points $S_{RGB}$ and $S_D$.
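For illustration, the sketch below computes the cornerness function of Eq. (5) on a grayscale video volume (our approximation, not the authors' or Laptev's implementation); the smoothing scales and the constant k are assumptions chosen for readability.

```python
# Illustrative sketch (not the authors' code): Harris3D-style cornerness for a
# grayscale video volume, following Eq. (5).
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_cornerness(video, sigma=2.0, tau=1.5, k=0.005):
    """video: (T, H, W) float array; returns the per-voxel cornerness H."""
    # First-order derivatives after spatio-temporal Gaussian smoothing.
    smoothed = gaussian_filter(video, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(smoothed)

    # Entries of the 3x3 second-moment matrix, integrated over a local window.
    win = (2 * tau, 2 * sigma, 2 * sigma)
    m = {}
    for name, a, b in [("xx", Lx, Lx), ("yy", Ly, Ly), ("tt", Lt, Lt),
                       ("xy", Lx, Ly), ("xt", Lx, Lt), ("yt", Ly, Lt)]:
        m[name] = gaussian_filter(a * b, sigma=win)

    # det(mu) and Tr(mu) for every voxel, then H = det - k * trace^3.
    det = (m["xx"] * (m["yy"] * m["tt"] - m["yt"] ** 2)
           - m["xy"] * (m["xy"] * m["tt"] - m["yt"] * m["xt"])
           + m["xt"] * (m["xy"] * m["yt"] - m["yy"] * m["xt"]))
    trace = m["xx"] + m["yy"] + m["tt"]
    return det - k * trace ** 3

# Interest points can then be taken as local maxima of H above a threshold,
# separately for the RGB (grayscale) and depth volumes.
```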

C. Keypoint Description

At this step, we describe the interest points detected in the previous stage. On one hand, for $S_{RGB}$ we compute the state-of-the-art HOG descriptor to encode appearance information around each interest point. On the other hand, for $S_D$ we extract the proposed 3DS-HONV descriptor. For the HOF2.5D descriptor, we use the calibrated color-depth information to compute the description at each interest point in $S_{RGB}$. As mentioned in Section III-D, we then separately apply a dictionary learning and vector quantization (VQ) process via sparse coding to both the 3DS-HONV and HOG-HOF2.5D descriptors. This idea is largely inspired by the success of [19] in applying sparse coding instead of other VQ methods (e.g. BoF) to image representation and classification.

Additionally, for each RGB-D action video, we apply spatio-temporal pyramids in order to introduce geometric and temporal information. For computational efficiency, the spatio-temporal pyramids in our case simply divide the RGB-D video volume into $M \times N \times P$ (i.e. $V_{size}$) non-overlapping cells along the x, y, and t dimensions of the volume, respectively. We then apply the dictionary learning and VQ process independently for each cell of the action volume. In the later experiments on the CGD2011 dataset, for each non-overlapping cell we set the dictionary size to 200 for all color and depth descriptors (e.g. HOG-HOF2.5D, 3DS-HONV, etc.). Consequently, after the dictionary learning and VQ stage in each cell, we obtain 200-dimensional HOG-HOF2.5D and 200-dimensional 3DS-HONV codes for the interest point sets $S_{RGB}$ and $S_D$, respectively. To generate the final representation for each cell, we accumulate each kind of proposed descriptor separately over all interest points belonging to that cell. As a result, after sparse coding, the dimensions of the 3DS-HONV and HOG-HOF2.5D representations for each action volume are $200 \times V_{size}$.

In the next and final classification step, the information given by the different modalities (i.e. 3DS-HONV and HOG-HOF2.5D) is merged using late fusion.
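The per-cell pooling described above can be sketched as follows (our illustration, not the authors' code); the default 2 x 2 x 2 grid follows the CGD2011 setting reported later, the 200-dimensional codes follow the dictionary size described above, and the coordinate conventions are assumptions.

```python
# Illustrative sketch (not the authors' code): accumulate per-interest-point
# sparse codes into an M x N x P spatio-temporal pyramid representation.
import numpy as np

def pyramid_representation(points, codes, video_shape, grid=(2, 2, 2)):
    """points: (n, 3) array of (x, y, t); codes: (n, 200) sparse codes;
    video_shape: (W, H, T). Returns a flattened (M*N*P * 200) representation."""
    M, N, P = grid
    W, H, T = video_shape
    pooled = np.zeros((M * N * P, codes.shape[1]))
    for (x, y, t), u in zip(points, codes):
        # Index of the non-overlapping cell this interest point falls into.
        i = min(int(x * M / W), M - 1)
        j = min(int(y * N / H), N - 1)
        k = min(int(t * P / T), P - 1)
        pooled[(i * N + j) * P + k] += u   # sum-pool codes within each cell
    return pooled.reshape(-1)              # final length: 200 * Vsize

# Example: pooled_d = pyramid_representation(pts_depth, codes_3dshonv, (320, 240, 90))
```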

D. Late Feature Fusion and Classification

The final step of our framework is to predict the class of the query video from the computed feature representations. Since we later conduct our experiments on the CGD2011 dataset [4], where there is only one RGB-D sample per gesture class, a simple kNN classification using the histogram intersection distance is applied:

$$ d_F = 1 - \frac{\sum_i \min\left( H_{query}^F(i), \, H_{model}^F(i) \right)}{\min\left( \sum_i H_{query}^F(i), \, \sum_i H_{model}^F(i) \right)} \qquad (6) $$

where $F \in \{\text{3DS-HONV}, \text{HOG-HOF2.5D}\}$. Finally, we perform classifier-based late fusion to merge the 3DS-HONV and HOG-HOF2.5D histograms into a final distance score:

$$ d = \alpha \, d_{\text{3DS-HONV}} + (1 - \alpha) \, d_{\text{HOG-HOF2.5D}} \qquad (7) $$

where $\alpha$ is a constant relative importance factor.
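A minimal sketch of this classification step is given below (our illustration, not the authors' code); alpha = 0.8 follows the value reported in Section V-B, and the dictionary-style packaging of the two histograms is an assumption.

```python
# Illustrative sketch (not the authors' code): one-shot nearest-neighbour
# classification with histogram intersection distances (Eq. 6) fused by the
# weighted late-fusion rule of Eq. 7.
import numpy as np

def intersection_distance(h_query, h_model):
    """Histogram intersection distance between two non-negative histograms."""
    inter = np.minimum(h_query, h_model).sum()
    return 1.0 - inter / min(h_query.sum(), h_model.sum())

def classify(query, models, alpha=0.8):
    """query: dict with '3dshonv' and 'hoghof25d' histograms;
    models: list of (label, dict) pairs, one per gesture class (one-shot)."""
    best_label, best_d = None, np.inf
    for label, model in models:
        d_depth = intersection_distance(query["3dshonv"], model["3dshonv"])
        d_rgb = intersection_distance(query["hoghof25d"], model["hoghof25d"])
        d = alpha * d_depth + (1.0 - alpha) * d_rgb   # Eq. (7)
        if d < best_d:
            best_label, best_d = label, d
    return best_label
```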

V. EXPERIMENTS
We apply the proposed methods to two related recognition tasks: depth-based human action recognition and one-shot-learning gesture recognition. The purpose of the first experiment, conducted on a depth-only dataset (MSRAction3D), is to verify the capability of the 3DS-HONV descriptor in capturing expressive joint shape-motion cues from depth sequences. The second experiment shows that our proposed feature fusion framework can thoroughly exploit the benefits of multimodal RGB-D data to solve a very challenging problem, namely one-shot-learning gesture recognition. The experimental results demonstrate that our approach obtains good performance on both recognition tasks, particularly when taking full advantage of the color-depth information available from the RGB-D sensor. The approach is not only simple, but also exceeds the performance of several alternative approaches and baselines.

TABLE I
COMPARISON OF OUR METHOD WITH PREVIOUS APPROACHES ON THE MSR-ACTION3D DATASET

Method                                   Accuracy (%)
3DS-HONV (our method)                    88.89
Oreifej and Liu [12] (HON4D)             85.85
Oreifej and Liu [12] (HON4D + D_disc)    88.89
Jiang et al. [9]                         88.2
Yang et al. [20]                         85.5
Klaser et al. [13]                       81.43

Fig. 4. Classification accuracies on MSRAction3D obtained by applying the 3DS-HONV descriptor over different levels of the spatio-temporal pyramid.

Fig. 5. Confusion matrix for the MSRAction3D dataset obtained by applying the 3DS-HONV descriptor.

TABLE II
MEAN LEVENSHTEIN DISTANCE FOR RGB AND DEPTH DESCRIPTORS (AFTER VQ VIA SPARSE CODING)

RGB Desc.                TeLev (%)
HOG                      34.52
HOF                      41.44
HOG-HOF                  33.14

Depth Desc.              TeLev (%)
3DS-HONV                 28.91

RGB-D Desc.              TeLev (%)
HOF2.5D                  33.52
HOG-HOF2.5D              30.27
3DS-HONV/HOG-HOF2.5D     24.89

A. MSR-Action3D Dataset

The MSR-Action3D dataset, introduced in [5], provides both skeleton and depth information. It contains twenty actions (e.g. side kick, pick up, throw, etc.), ten subjects, and a total of 557 action samples. The dataset is challenging due to the small inter-class variation among actions, while the skeleton tracker often fails and the data contain significant noise. We follow the cross-subject test setting, where the first five actors are used for training and the rest for testing. For the classification stage, a non-linear SVM is applied, as in [12]. Note that in the feature description stage we simply perform dense sampling over the whole depth sequence instead of detecting interest points, since the noise in this dataset can dramatically degrade the performance of the STIP detector. In addition, each action sequence S is divided into M x N x P non-overlapping cells over the x, y, t axes, and the 3DS-HONV description is computed independently for each cell. To find the best spatio-temporal pyramid size {M, N, P} for each action volume, we perform a parameter grid search over the range [1:8, 1:8, 1:6] and obtain the results shown in Fig. 4. Setting M = 4, N = 5, P = 3 gives the best recognition performance. Detailed classification results for the individual action classes are given by the confusion matrix in Fig. 5.

Experimental results and comparisons with the previous best methods, as reported in [12], are summarized in Table I. It can be seen that our 3DS-HONV achieves the same performance (88.89%) as the current state-of-the-art HON4D + D_disc proposed in [12], which includes an extremely heavy projector refinement process for the surface normal quantization stage. This shows that the 3DS-HONV descriptor is not only low-cost but also captures joint shape-motion cues from depth sequences well; thus, it can effectively characterize and discriminate between different action classes.
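For reference, the cross-subject protocol and the pyramid grid search could be sketched as follows (our illustration, not the authors' code); the feature extractor is a stand-in callable and scikit-learn's RBF SVC is an assumed choice of non-linear SVM.

```python
# Illustrative sketch (not the authors' code): cross-subject evaluation with a
# grid search over the spatio-temporal pyramid size {M, N, P}.
import itertools
import numpy as np
from sklearn.svm import SVC

def cross_subject_grid_search(sequences, labels, subjects, extract_3dshonv):
    train = [i for i, s in enumerate(subjects) if s <= 5]   # first five actors
    test = [i for i, s in enumerate(subjects) if s > 5]
    best = (None, 0.0)
    for M, N, P in itertools.product(range(1, 9), range(1, 9), range(1, 7)):
        X = np.array([extract_3dshonv(seq, (M, N, P)) for seq in sequences])
        clf = SVC(kernel="rbf").fit(X[train], np.asarray(labels)[train])
        acc = clf.score(X[test], np.asarray(labels)[test])
        if acc > best[1]:
            best = ((M, N, P), acc)
    return best   # e.g. ((4, 5, 3), 0.8889) in the setting reported above
```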
B. ChaLearn Gesture Challenge Dataset (CGD2011)

1) Data description & evaluation metrics: CGD2011 is a recent, comprehensive dataset of multimodal RGB-D videos of human actors performing a variety of gestures, made available to researchers under the Microsoft ChaLearn Gesture Challenge [4]. The goal of the challenge is to build systems that perform gesture recognition from videos with diverse backgrounds using a single sample per gesture (i.e. one-shot learning).

We have used this dataset to test and compare the ability of our proposed feature fusion framework, which thoroughly exploits the benefits provided by RGB-D data, for the task of one-shot gesture learning.

Fig. 6. Performance comparison between the proposed fusion framework and two baseline methods [21], [22] over the 20 CGD development batches. The x-axis represents the different batches, and the y-axis the mean Levenshtein distance (TeLev) of each batch.

TABLE III
COMPARISON WITH SEVERAL BEST PUBLISHED STUDIES OVER 20 DEVELOPMENT DATA BATCHES, USING LEVENSHTEIN DISTANCE

Method                                              TeLev (%)
3DS-HONV/HOG-HOF2.5D (sparse coding) (ours)         24.89
Baseline 1 (template matching) [21]                 62.31
Baseline 2 (PCA-based: principal motion) [22]       43.63
Manifold LSR [23]                                   28.73
MHI [24]                                            30.01
Extended-MHI [24]                                   26.00

Specifically, we used the first 20 development batches out of the several hundred available to assess performance and compare with other methods. Each batch consists of 47 gesture videos, approximately 8-12 of which are used for training and the rest for testing. The videos are captured from a frontal view with the actor roughly centered and no camera motion. Furthermore, every test video contains from 1 to 5 continuous gestures.

The recognition performance is evaluated using the Levenshtein distance [21]: TeLev = (S + D + I)/N, where S is the number of substitutions (misclassifications), D the number of deletions (false negatives), I the number of insertions (false positives), and N the length of the ground-truth sequence.
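A minimal sketch of this metric (our illustration) is given below: the edit distance between the predicted and ground-truth gesture sequences of a test video, normalized by the ground-truth length.

```python
# Illustrative sketch (not the authors' code): the TeLev metric defined above.
def televenshtein(predicted, truth):
    """predicted, truth: lists of gesture labels for one test video."""
    n, m = len(predicted), len(truth)
    # Dynamic-programming table of edit distances between prefixes.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # i deletions
    for j in range(m + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if predicted[i - 1] == truth[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[n][m] / len(truth)          # TeLev = (S + D + I) / N

# Example: televenshtein(["wave", "clap"], ["wave", "point", "clap"]) == 1/3
```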
2) Experimental setup: The procedure for the experiments on this dataset is the same as the feature fusion pipeline discussed in Section IV. Note that, in contrast to the MSRAction3D experiments above, which use dense sampling, here we apply the STIP detector to each video sample in order to reduce the number of points to be processed. The reason for this choice is that this dataset provides both RGB and fairly clean depth data, which makes it possible to exploit the effectiveness of the STIP detector. As suggested by [7], we set the spatio-temporal pyramid size (i.e. M x N x P) to 2 x 2 x 2 cells to achieve the best video representation. For the classifier-based late fusion, the weight $\alpha$ mentioned in Section IV-D is empirically set to 0.8.
3) Performance evaluation: Table II shows a performance comparison between different types of RGB and depth descriptors, as well as several of their combinations. Note that these descriptors are all represented via the dictionary learning and VQ process using sparse coding. It can be seen that the original HOF performs worst (41.44%), whilst its extension HOF2.5D is much more effective (33.52%). Once again, the evaluation of 3DS-HONV shows that it is very expressive and discriminative, yielding a surprisingly good result (TeLev = 28.91%). Consequently, our late fusion scheme combining the two best descriptors, 3DS-HONV and HOG-HOF2.5D, yields a significant improvement in overall accuracy over the rest.

As Fig. 6 reveals, our method significantly outperforms the two baseline algorithms [21], [22] and achieves the best result, with a 24.89% average Levenshtein distance, which also surpasses several of the best published studies, as shown in Table III. This illustrates that our proposed fusion framework can be effectively adopted for one-shot-learning gesture recognition.


VI. CONCLUSIONS

In this paper, we have presented a new spatio-temporal extension of the HONV descriptor, named 3DS-HONV, as well as a modification of HOF that allows it to approximately capture 3D motions from RGB-D data. We then discussed a color-depth feature fusion pipeline, specifically designed to exploit the benefits provided by multimodal RGB-D data. The experiments show that our proposed descriptors and fusion approach perform surprisingly well and match or outperform the best previously published studies on two challenging benchmarks: MSR-Action3D and CGD2011. This study also exposes the potential to easily extend the current fusion framework with other kinds of robust representations.
REFERENCES

[1] J. Aggarwal and M. Ryoo, "Human activity analysis: A review," ACM Comput. Surv., Apr. 2011.
[2] J. Shotton, A. W. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in CVPR, 2011.
[3] S. Tang, X. Wang, X. Lv, T. X. Han, J. M. Keller, Z. He, M. Skubic, and S. Lao, "Histogram of oriented normal vectors for object recognition with a depth sensor," in ACCV, 2012.
[4] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, "ChaLearn Gesture Dataset (CGD2011)," ChaLearn, California, 2011.
[5] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3D points," in CVPR, IEEE, 2012.
[6] B. Ni, G. Wang, and P. Moulin, "RGBD-HuDaAct: A color-depth video database for human daily activity recognition," in ICCV, 2011.
[7] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in CVPR, IEEE, Jun. 2008.
[8] G. Willems, T. Tuytelaars, and L. Van Gool, "An efficient dense and scale-invariant spatio-temporal interest point detector," in ECCV, 2008.
[9] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in CVPR, IEEE Computer Society, 2012, pp. 1290-1297.
[10] A. W. Vieira, E. R. Nascimento, G. L. Oliveira, Z. Liu, and M. F. M. Campos, "STOP: Space-time occupancy patterns for 3D action recognition from depth map sequences," in CIARP, 2012, pp. 252-259.
[11] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in ACM MM, 2012, pp. 1057-1060.
[12] O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," in CVPR, 2013.
[13] A. Klaser, M. Marszałek, and C. Schmid, "A spatio-temporal descriptor based on 3D-gradients," in BMVC, Sep. 2008, pp. 995-1004.
[14] D. Sun, S. Roth, and M. J. Black, "Secrets of optical flow estimation and their principles," in CVPR, IEEE, Jun. 2010, pp. 2432-2439.
[15] J. Diebel, "Representing attitude: Euler angles, unit quaternions, and rotation vectors," 2006.
[16] F. Huguet and F. Devernay, "A variational method for scene flow estimation from stereo sequences," in ICCV, IEEE, 2007.
[17] A. Wedel, T. Brox, T. Vaudrey, and C. Rabe, "Stereoscopic scene flow computation for 3D motion understanding," IJCV, Oct. 2011.
[18] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in NIPS, 2006, pp. 801-808.
[19] J. Yang, K. Yu, Y. Gong, and T. S. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in CVPR, 2009.
[20] J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, "Robust 3D action recognition with random occupancy patterns," in ECCV, 2012.
[21] I. Guyon, V. Athitsos, P. Jangyodsuk, B. Hamner, and H. J. Escalante, "ChaLearn gesture challenge: Design and first results," in CVPR Workshops, 2012, pp. 1-6.
[22] H. J. Escalante and I. Guyon, "Principal motion: PCA-based reconstruction of motion histograms," Technical Memorandum, June 2012.
[23] Y. M. Lui, "A least squares regression framework on manifolds and its application to gesture recognition," in CVPR Workshops, 2012.
[24] D. Wu, F. Zhu, and L. Shao, "One shot learning gesture recognition from RGBD images," in CVPR Workshops, 2012, pp. 7-12.
