ICCAIS 2013
Ngoc Q. Ly
Faculty of Information Technology
University of Science, VNU-HCMC
tdquang@outlook.com
lqngoc@fit.hcmus.edu.vn
Abstract—We investigate the problem of human action recognition by studying the effects of fusing feature streams retrieved from color and depth sequences. Our main contribution is two-fold. First, we present the so-called 3DS-HONV descriptor, a spatio-temporal extension of the Histogram of Oriented Normal Vectors (HONV) specifically designed to capture joint shape-motion cues from depth sequences. Second, we develop an effective RGB-D feature fusion scheme, which exploits information from both color and depth channels to extract expressive representations for action recognition in real scenarios. Despite its simplicity, our 3DS-HONV descriptor performs surprisingly well and achieves state-of-the-art performance on the MSRAction3D dataset, with 88.89% overall accuracy. Further experiments demonstrate that the feature fusion scheme also generalizes well and achieves good results on the one-shot-learning ChaLearn Gesture Dataset (CGD2011).
Through extensive experiments on depth-only sequences, we show that the presented descriptor characterizes the joint shape-motion cues of an object well by capturing the distribution of oriented surface normals in spherical coordinates. Despite its simplicity, our depth-based descriptor leads to surprisingly good results on the depth-aware MSRAction3D dataset. Our second main contribution is a feature fusion framework that exploits multi-modal RGB-D data by combining information from both RGB images and depth maps. To evaluate the presented system, we conduct experiments and compare its performance with several of the best previous studies on the publicly available ChaLearn dataset (i.e., CGD2011) [4].
The remainder of this paper is organized as follows. Section II reviews related work. Section III details the depth-aware spatio-temporal feature description, along with a combination of HOG and an extension of HOF designed to capture appearance and motion information from RGB-D data. Section IV describes our feature fusion framework for RGB-D data. Extensive experimental results are provided in Section V. Finally, Section VI concludes the paper.
I. INTRODUCTION
Machine-based human gesture and action understanding is a very challenging task with an enormous range of applications, such as human-computer interaction, user interface design, robot learning, and surveillance [1]. Over the past decades, many approaches have tackled this problem in different ways. However, many layers of complexity remain to be studied, owing to the extensive range of possible human motions and variations in scene, viewpoint, object occlusion, etc.
Recently introduced RGB-D cameras (e.g., Kinect) are capable of providing high-quality synchronized videos of both color and depth. With its advanced sensing techniques, this technology opens up an opportunity to significantly increase the capabilities of many automated vision-based recognition tasks [2]. It also raises the problem of developing expressive features for the color and depth channels of these sensors, and of effectively fusing them to enrich the information available to later recognition stages. To demonstrate the benefits of the depth map, as well as the necessity of color information for action-gesture recognition, a representative depth-based descriptor and an effective color-depth feature fusion scheme are presented in this study.
Specifically, the two main contributions of this paper are summarized as follows. First, inspired by the success of [3] in object recognition with a depth-based representation of oriented normals (i.e., HONV), we present an extension of the HONV descriptor into the spatio-temporal domain and apply it in the context of action recognition.
Second, we extend the Histogram of Optical Flow [14] into the so-called HOF2.5D, which captures the distribution of not only in-plane movements extracted from RGB data, but also z-axis motions, by utilizing the available depth information. Finally, we briefly discuss the HOG-HOF2.5D descriptor, an early fusion that concatenates HOG and HOF2.5D into a single vector. We believe this aggregation of descriptions satisfactorily captures an expressive appearance-shape representation for RGB-D data.
Given a depth sequence z(x, y, t) with partial derivatives z_x, z_y, and z_t, the orientation of the spatio-temporal surface normal is encoded by three spherical-coordinate angles:

$$\theta_1 = \tan^{-1}\!\left(z_y / z_x\right), \quad \theta_2 = \tan^{-1}\!\left(\sqrt{z_x^2 + z_y^2}\,\big/\,z_t\right), \quad \theta_3 = \tan^{-1}\!\left(\sqrt{z_x^2 + z_y^2 + z_t^2}\right) \qquad (2)$$
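To make this computation concrete, here is a minimal sketch (not the authors' implementation) of Eq. (2) applied to every voxel of a depth sequence. The array layout, the joint binning into a histogram, and the bin counts are our assumptions, since those details fall outside this excerpt.

```python
# Sketch of Eq. (2): 3DS-HONV angles over a depth sequence `depth`,
# assumed to be a (T, H, W) float array of depth maps z(x, y, t).
import numpy as np

def hist_3ds_honv(depth, bins=(8, 8, 8)):
    # Partial derivatives of the depth surface along t, y, x.
    z_t, z_y, z_x = np.gradient(depth.astype(np.float64))

    # Spherical-coordinate angles of the spatio-temporal surface normal.
    theta1 = np.arctan2(z_y, z_x)
    theta2 = np.arctan2(np.sqrt(z_x**2 + z_y**2), z_t)
    theta3 = np.arctan(np.sqrt(z_x**2 + z_y**2 + z_t**2))

    # Quantize the three angles into a joint histogram (one possible
    # binning choice; the paper does not spell it out in this excerpt).
    h, _ = np.histogramdd(
        np.stack([theta1.ravel(), theta2.ravel(), theta3.ravel()], axis=1),
        bins=bins,
        range=[(-np.pi, np.pi), (0, np.pi), (0, np.pi / 2)])
    return h.ravel() / max(h.sum(), 1.0)
```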
We project each OF2.5D vector onto the xy, xz, and yz planes, as shown in Fig. 2. Specifically, we compute the orientations of each OF2.5D vector on the three projected planes as:

$$\theta_1 = \tan^{-1}\!\left(V_y / V_x\right), \quad \theta_2 = \tan^{-1}\!\left(V_z / V_x\right), \quad \theta_3 = \tan^{-1}\!\left(V_z / V_y\right) \qquad (3)$$

Fig. 2. Projections of the OF2.5D motion vector onto the xy, xz, and yz planes.
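A small sketch of Eq. (3) follows, with arctan2 standing in for tan⁻¹ to keep the full angular range; the flow-field layout and the source of the z-motion component are assumptions.

```python
# Sketch of Eq. (3), assuming `flow` is an (H, W, 2) optical-flow field
# (Vx, Vy) from the RGB stream and `dz` holds per-pixel depth change
# used as the z-axis motion Vz.
import numpy as np

def of25d_angles(flow, dz):
    vx, vy = flow[..., 0], flow[..., 1]
    vz = dz
    # Orientations of the 2.5D motion vector on the three projected planes.
    theta1 = np.arctan2(vy, vx)   # xy plane
    theta2 = np.arctan2(vz, vx)   # xz plane
    theta3 = np.arctan2(vz, vy)   # yz plane
    return theta1, theta2, theta3
```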
$$\min_{D,\,U}\; \|h - DU\|_F^2 + \lambda \|U\|_1 \qquad (4)$$

where $\|\cdot\|_F$ is the Frobenius norm. Fixing U, the above optimization reduces to a least-squares problem, while for a given D it is equivalent to a linear regression with an L1-norm penalty. The solution is computed via the feature-sign search algorithm [18]. Thus, after the feature description stage, each action video of n frames is described as a sequence of codes u_1, ..., u_n rather than the original descriptors h.
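A minimal stand-in for this stage, assuming descriptors are stacked row-wise into a matrix: scikit-learn's DictionaryLearning optimizes the same l1-regularized objective of Eq. (4), though with a LARS-based lasso solver rather than the feature-sign search of [18].

```python
# Stand-in for the dictionary learning / sparse coding step of Eq. (4).
import numpy as np
from sklearn.decomposition import DictionaryLearning

descriptors = np.random.rand(500, 192)    # n frames x descriptor dim (toy data)

coder = DictionaryLearning(n_components=200,        # dictionary size used in the paper
                           alpha=1.0,               # l1 weight (lambda in Eq. (4))
                           transform_algorithm='lasso_lars',
                           random_state=0)
codes = coder.fit_transform(descriptors)  # u_1..u_n, one sparse code per frame
```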
We utilize the calibrated color-depth information to compute this description for each interest point in S_RGB. As mentioned in Section III-D, we further apply a dictionary learning and vector quantization (i.e., VQ) process via sparse coding, separately, to both the 3DS-HONV and HOG-HOF2.5D descriptors. This idea is largely inspired by the success of [19] in applying sparse coding instead of other VQ methods (e.g., BoF) to image representation and classification.
Additionally, it is important to note that, in order to introduce geometric and temporal information for each RGB-D action video, we apply spatio-temporal pyramids. For computational efficiency, the spatio-temporal pyramids in our case simply divide the RGB-D video volume into M × N × P (i.e., V_size) non-overlapping cells along the x, y, and t dimensions of the volume, respectively. We then apply the dictionary learning and VQ process independently for each cell of the action volume, as sketched below. In the later experiments on the CGD2011 dataset, for each non-overlapping cell of an action sequence, we set the dictionary size to 200 for all color/depth descriptors (e.g., HOG-HOF2.5D, 3DS-HONV, etc.). Consequently, after the dictionary learning and VQ stage at each cell, we obtain 200-dimensional HOG-HOF2.5D and 200-dimensional 3DS-HONV descriptors for the interest point sets S_RGB and S_D, respectively. To generate the final representation for each cell, we separately accumulate each kind of proposed descriptor over all interest points belonging to that cell. To that end, after sparse coding, the dimensions of the 3DS-HONV and HOG-HOF2.5D representations for each action volume are 200 × V_size.
In the next and final classification step, the information given by the different modalities (i.e., 3DS-HONV and HOG-HOF2.5D) is merged using late fusion.
B. Keypoint detection
Fig. 3. Feature fusion scheme for depicting RGB-D data (e.g., CGD2011 [4]). The pipeline takes color-depth sequences as input; space-time interest points (STIPs) are detected on the depth channel (after depth-noise removal) and on the RGB channel; Harris3D-based HOG, HOF2.5D, and 3DS-HONV descriptors are then computed and represented (e.g., sc-3DS-HONV, h_D); the modalities are finally merged through late fusion schemes for classification.
For each cell, the codes of all interest points are accumulated into a per-modality histogram:

$$H^F = \sum_{k=1}^{n} u_k \qquad (5)$$

The similarity between a query video and a model video is then measured per modality with the normalized histogram intersection:

$$d_F = \frac{\sum_i \min\!\left(H^F_{query}(i),\, H^F_{model}(i)\right)}{\min\!\left(\sum_i H^F_{query}(i),\, \sum_i H^F_{model}(i)\right)} \qquad (6)$$

where $F \in \{\text{3DS-HONV}, \text{HOG-HOF2.5D}\}$. Finally, we perform classifier-based late fusion to merge the 3DS-HONV and HOG-HOF2.5D histograms into a final distance score:

$$d = \alpha \, d_{\text{3DS-HONV}} + (1 - \alpha) \, d_{\text{HOG-HOF2.5D}} \qquad (7)$$
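As one way to read Eqs. (6) and (7) in code, the following hedged sketch computes the normalized histogram intersection per modality and the weighted fusion score; the function names and the value of alpha are illustrative.

```python
# Late fusion of Eqs. (6)-(7): histogram-intersection similarity per
# modality, then a weighted combination with weight `alpha`.
import numpy as np

def hist_intersection(h_query, h_model):
    # Eq. (6): normalized histogram intersection of two representations.
    return np.minimum(h_query, h_model).sum() / min(h_query.sum(), h_model.sum())

def fused_score(hq_honv, hm_honv, hq_hof, hm_hof, alpha=0.5):
    d_honv = hist_intersection(hq_honv, hm_honv)
    d_hof = hist_intersection(hq_hof, hm_hof)
    return alpha * d_honv + (1 - alpha) * d_hof   # Eq. (7)
```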
C. Keypoint description

At this step, we describe the interest points detected in the previous stage. On one hand, for S_RGB we compute the state-of-the-art HOG descriptor to encode appearance information around each interest point. On the other hand, for S_D we extract the proposed 3DS-HONV descriptor. For the HOF2.5D descriptor, we utilize both the calibrated color and depth channels of each interest point in S_RGB.
V. EXPERIMENTS

We apply the proposed methods to two related recognition tasks: depth-based human action recognition and one-shot-learning gesture recognition. The purpose of the first experiment, conducted on a depth-only dataset (i.e., MSRAction3D), is to verify the capability of the 3DS-HONV descriptor to capture joint shape-motion cues from depth data.
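For reference, the TeLev metric used for the second task (Tables II and III) is commonly computed as the total Levenshtein (edit) distance between predicted and ground-truth gesture-label sequences, normalized by the total number of true gestures; the sketch below follows that reading, and its interfaces are illustrative.

```python
# Sketch of the TeLev metric on CGD2011 (assumed definition): total
# Levenshtein distance over all videos, normalized by the number of
# ground-truth gestures, in percent.
def levenshtein(a, b):
    # Classic single-row dynamic-programming edit distance.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def telev(predictions, truths):
    num = sum(levenshtein(p, t) for p, t in zip(predictions, truths))
    return 100.0 * num / sum(len(t) for t in truths)
```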
TABLE I
COMPARISON OF OUR METHOD WITH PREVIOUS APPROACHES ON THE MSRACTION3D DATASET

Method                              Accuracy (%)
3DS-HONV (our method)               88.89
Omar et al. [12] (HON4D)            85.85
Omar et al. [12] (HON4D + D_disc)   88.89
Jiang et al. [9]                    88.2
Yang et al. [20]                    85.5
Klaser et al. [13]                  81.43
Fig. 4. Classification accuracies on MSRAction3D obtained by applying the 3DS-HONV descriptor over different levels of the spatio-temporal pyramid.

TABLE II
TELEV (%) OF RGB, DEPTH, AND RGB-D DESCRIPTORS ON CGD2011

RGB Desc.                 TeLev %
HOG                       34.52
HOF                       41.44
HOG-HOF                   33.14

Depth Desc.               TeLev %
3DS-HONV                  28.91

RGB-D Desc.               TeLev %
HOF2.5D                   33.52
HOG-HOF2.5D               30.27
3DS-HONV/HOG-HOF2.5D      24.89

Fig. 5. The confusion matrix for the MSRAction3D dataset obtained by applying the 3DS-HONV descriptor.
TABLE III
COMPARISON WITH SEVERAL BEST PUBLISHED STUDIES OVER 20 DEVELOPMENT DATA BATCHES, USING LEVENSHTEIN DISTANCE

Method                                          TeLev %
3DS-HONV/HOG-HOF2.5D (sparse coding) (ours)     24.89
Baseline 1 (template matching) [21]             62.31
Baseline 2 (PCA-based: principal motion) [22]   43.63
Manifold LSR [23]                               28.73
MHI [24]                                        30.01
Extended-MHI [24]                               26.00

VI. CONCLUSIONS