
Journal Pre-proof

Skeleton-based Chinese sign language recognition and generation for bidirectional communication between deaf and hearing people

Qinkun Xiao, Minying Qin, Yuting Yin

PII: S0893-6080(20)30040-X
DOI: https://doi.org/10.1016/j.neunet.2020.01.030
Reference: NN 4393

To appear in: Neural Networks

Received date : 31 August 2019


Revised date : 10 December 2019
Accepted date : 27 January 2020

Please cite this article as: Q. Xiao, M. Qin and Y. Yin, Skeleton-based Chinese sign language
recognition and generation for bidirectional communication between deaf and hearing people.
Neural Networks (2020), doi: https://doi.org/10.1016/j.neunet.2020.01.030.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the
addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive
version of record. This version will undergo additional copyediting, typesetting and review before it
is published in its final form, but we are providing this version to give early visibility of the article.
Please note that, during the production process, errors may be discovered which could affect the
content, and all legal disclaimers that apply to the journal pertain.

© 2020 Elsevier Ltd. All rights reserved.



Skeleton-based Chinese Sign Language Recognition and Generation for Bidirectional Communication between Deaf and Hearing People

Qinkun Xiao*, Minying Qin* and Yuting Yin
Department of Electronics and Information Engineering, Xi'an Technological University, Xi'an City, China, 710032
*These authors contributed equally to this work

Abstract—Chinese sign language (CSL) is one of the most widely used sign language systems in the world. As such, the automatic recognition and generation of CSL is a key technology enabling bidirectional communication between deaf and hearing people. Most previous studies have focused solely on sign language recognition (SLR), which only addresses communication in a single direction. There is therefore a need for sign language generation (SLG) to enable communication in the other direction (i.e., from hearing people to deaf people). To achieve a smoother exchange of ideas between these two groups, we propose a skeleton-based CSL recognition and generation framework, built on recurrent neural networks (RNNs), to support bidirectional CSL communication. This process can also be extended to other sequence-to-sequence information interactions. The core of the proposed framework is a two-level probability generative model. Compared with previous techniques, this approach offers a more flexible approximate posterior distribution, which can produce skeletal sequences of varying styles that are recognizable to humans. In addition, the proposed generation method compensates for a lack of training data. A series of experiments in bidirectional communication was conducted on the large 500 CSL dataset. The proposed algorithm achieved high recognition accuracy for both real and synthetic data, with a reduced runtime. Furthermore, the generated data improved the performance of the discriminator. These results suggest that the proposed bidirectional communication framework and generation algorithm constitute an effective new approach to CSL recognition.

Index Terms—CSL, recognition, generation, RNN, bidirectional communication, probability model.

1. Introduction

Vision-based sign language recognition (SLR) is an active area of research within the field of artificial intelligence [1-9]. SLR research has advanced in recent years [10-13] but, excluding recent efforts using RNN-based sequence generation [14, 15], automatic sign language production has not been studied as extensively. The expression and comprehension of signs are complementary abilities, and machine learning-based Chinese sign language (CSL) recognition and generation must be capable of accurately performing both processes.

In this study, an RNN-based framework is proposed for generating (expressing) and recognizing (comprehending) signed skeletal sequences for bidirectional communication between deaf and hearing people, as demonstrated in Fig. 1. Skeletal data require less memory, making it easier to process and store large gesture databases. This is especially beneficial in the generation algorithm, where the computational cost is relatively low, allowing for simple implementation.

Fig. 1. An illustration of bidirectional communication between deaf and hearing people using skeleton-based sign language recognition and sign language generation technology. The first row depicts communication from a deaf person to a hearing person using SLR technology (ŝ denotes predicted labels and k denotes the skeletal sign sequence). The second row represents communication from a hearing person to a deaf person using sign language generation (SLG) technology (s denotes a sign label sequence and k̂ denotes the predicted sequence).
The primary contributions of this study are as follows:

(1) We propose a uniform framework for bidirectional communication between deaf and hearing persons based on skeletal sequence information. This framework could also be used for sequence-to-sequence information interactions such as automated chatting (a chatbot), music recognition and generation, and video recognition and generation.
(2) We propose a sequence generation model using a two-level probability generative construction. This framework is a combination of a variational auto-encoder (VAE) [42] and a Gaussian mixture model (GMM). GMMs have previously been used to model random handwriting sequences [14, 15]. Inspired by these studies, we used a two-level probability model to construct a CSL skeleton sequence generative model. The first level produces random gesture encoding sequences and the second level develops random skeletal gestures based on this encoding sequence. The motivation for this approach was derived from a study by Liu et al. [16], in which a VAE-based method was proposed to model flexible approximate posterior distributions. It is well known that many distributions in nature are not Gaussian; applying a Gaussian mixture distribution instead of a single Gaussian distribution is therefore more appropriate for describing real data. Hence, unlike in the original VAE method, the posterior distributions of latent variables in [16] were established using a Gaussian mixture model instead of a single Gaussian function. However, Liu et al. [16] mainly discussed image data generation and did not consider sequence generation. In this paper, we mainly discuss skeleton-based sequence generation, and we extend [16] into a two-level probability generation model for this purpose. The first-level probability model utilizes VAE-based generation technology, whereas the second-level probability model makes use of GMM-based sampling. The corresponding results suggest that our proposed two-level probability model can generate more diverse skeletal gestures than traditional VAE-based and GAN-based techniques. In the included validation study, our proposed RNN-based model produced recognizable CSL skeleton sequences automatically.
(3) The proposed generation method compensates for a lack of training data in the form of sequence signals by generating large quantities of new CSL skeletal data, which further improves recognition accuracy. Collecting labeled sign language samples is time-consuming and existing training sets are limited. As such, the proposed technique provides additional sequence signals for enhanced training of the neural network.

In summary, a uniform framework for bidirectional communication was proposed, the core of which is a two-level probability generative model. This approach combines a VAE with a GMM to produce increasingly diverse data styles. The proposed generation model compensates for a lack of training data, which further improves recognition accuracy. This framework could also be used for sequence-to-sequence information interactions in multiple fields, such as chatbot support and music identification.

The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 introduces the proposed model, Section 4 presents and analyzes the experimental results, and Section 5 provides our conclusions.
2. Related Work

2.1. Sign Language Recognition
Conventional SLR models can be divided into two groups, based on either hidden Markov model (HMM) or neural network (NN) classification techniques. HMM-based algorithms typically require less training data, but the resulting recognition accuracy is often lower and the computational cost is frequently unsuitable for real-time SLR [17, 18].

In recent years, NN-based SLR methods have become more common, in part because they offer higher recognition accuracy. For example, Huang et al. used a convolutional neural network (CNN) for SLR [5]. Liu et al. developed a long short-term memory (LSTM) model [19] to recognize sign language. While NN-based techniques have proven to be effective for sequential information processing, the associated discriminators often require prohibitively large training sets.

This study develops a skeleton-based CSL bidirectional communication framework and focuses on SLR technologies based on RNNs. In recent years, SLR has been achieved using the Kinect sensor, which can accurately capture color, depth information, and joint locations. For example, Sun et al. [20] used joint locations to develop an SLR discriminator. Wang et al. [21] proposed an SLR approach combining histograms of oriented gradients (HOG) with skeletal features, producing a recognition accuracy of 84.2%. Sun et al. [22] later used a latent support vector machine for SLR, achieving an accuracy of 86.0% across 73 ASL classes. RNNs are promising tools for capturing the sequential dynamics of sign language. Donahue et al. [23] showed that LSTMs (an extension of RNNs) provide significant improvements when supplied with sufficient training data. Du et al. [24] proposed an end-to-end hierarchical RNN for skeleton-based action recognition.

In this paper, an end-to-end SLR system (inspired by LSTM-based SLR methods) is proposed, based on skeleton joint trajectories with bidirectional LSTM (Bi-LSTM). Only the skeletal joints provided by the Kinect were used (without color or depth information).

SLR research typically relies on existing sign language databases for labeled training data [25]. Common sources include the MSRC-12 Kinect gesture dataset [26], the 73-class ASL dataset [20], a 12-gesture American sign language dataset [27], a 10-gesture dataset [28], and a 24-word static ASL dataset [29]. However, the limited size of these resources means they are often insufficient to meet the practical requirements of large-scale SLR. For example, the well-known ChaLearn database contains only 20 gestures [30]. As such, the automated generation of training data would be a significant development in network training and could improve recognition accuracy for a variety of techniques.
2.2. Sequence Generation with RNNs

RNNs have been used to generate sequence information in a variety of fields, including music [31, 32], text [33], and motion capture [34]. RNN models can be trained to generate sequences using sequential training data, which represent a random sampling of a probability density function (PDF) [35]. Predicted data can then be acquired from the learned distribution. However, generated data inevitably differ from real data because the predictive distribution sampling in the RNN exhibits strong randomness and complex reconstructions [36, 37]. As a type of RNN, long short-term memory (LSTM) has recently produced unprecedented results in various sequence processing tasks, including speech and handwriting recognition [38, 39]. In this study, we demonstrate that LSTM memory can be used to generate more complex and realistic gesture sequences.

Additional generation models have been proposed in recent years, including NADE [40], variational DRAW [41], auto-encoders [42], and generative adversarial networks (GANs) [43]. GANs are trained by applying a minimum-maximum optimization framework to actual and generated data. Based on this system, other researchers have proposed extended models for producing high-quality images, such as LAPGAN [44] and DCGAN [45]. VAEs are a scalable generation tool that produce data using an encoding-decoding framework, making them particularly suitable for large-scale datasets [46]. Although these techniques have been successfully applied in multiple fields, they exhibit certain limitations, such as the selection of posterior approximations [46]. These distributions are typically modeled using known probability distribution functions. However, this approach has inherent limitations that prevent its use for complex systems, resulting in a series of studies that have investigated robust posterior distribution (PD) approximations. For example, Tran et al. used a variational Gaussian process (VGP) to approximate PDs [47]. This approach adjusts the shape of the density function according to assigned parameters, in order to match a complex PD. Similarly, ladder variational auto-encoders (LVAEs) use a set of auxiliary latent variables to increase the flexibility of a distribution [48]. The importance weighted auto-encoder (IWAE) [49] and Hamiltonian variational inference (HVI) [50] use a combination of variational reasoning and a Monte Carlo algorithm. HVI combines one or more Markov chain Monte Carlo (MCMC) steps in a variational approximation, which can automatically adapt to a PD. However, HVI is computationally expensive and may be difficult to apply in practical tasks. Rezende and Mohamed proposed normalizing flows (NF), a technique used to specify arbitrarily complex, scalable approximate PDs [51]. Liu et al. used a Gaussian mixture model (GMM) and Householder transformations to acquire a more flexible PD [16].

Inspired by these models, we propose a PD approximation method that combines VAE and GMM sampling to generate sign language skeleton sequence data. The proposed method utilizes the strong one-to-one correspondence of VAEs and the flexible posterior approximation of GMMs. The resulting generated skeleton sequences are new and yet recognizable.

In summary, many effective SLR methods have been discussed; however, bidirectional communication, i.e., the combination of SLR and SLG, has not been investigated. In addition, many generation methods have been proposed for image data, but only a few studies have examined RNN-based sequence generation. Moreover, these techniques have generally been applied to specific problems, such as handwriting generation or sketch drawing, and are therefore not universal and cannot be directly applied to SLG. Therefore, on the one hand, this study focused on natural interaction between deaf and hearing people through a unified framework for bidirectional communication. On the other hand, previous RNN-based sequence generation technology cannot be directly used for SLG; thus, we propose a novel two-level probability model for skeleton-based SLG.
3. Material and Methods

3.1 Material

A standard Kinect RGB-D database (the 500 CSL dataset) was used to evaluate the proposed model [5]. It is the largest CSL repository currently available for recognition and analysis, consisting of 500 different isolated signs such as "head", "body", and "lady" (totaling 125,000 instances).

The associated CSL skeletal sequences are shown in Fig. 2. A sample input skeleton is provided in Fig. 2(a) (reduced from a 25-joint skeleton to a 14-joint skeleton). Fig. 2(b) shows examples of these sequences, which are composed of 25 joints and connecting lines; different colors are used to indicate the links between joints. While the original skeleton included 25 joints, only 14 were considered in this study to reduce runtime and improve algorithm efficiency. This approach is viable because CSL movements are primarily localized to the upper limbs, and gestures rarely involve other parts of the body. The data size is also reduced: assuming each joint occupies B bytes, an isolated sign composed of 50 frames has a total size of 50 × 25 × B = 1250B bytes. By using 14 joints instead of 25, the sign size is reduced to 50 × 14 × B = 700B bytes (a reduction of about 44%). As such, the presented experimental results were calculated using data from only 14 joints.

Fig. 2. An illustration of CSL skeletal sequences in the 500 CSL dataset. (a) A sample input skeleton, reduced from 25 joints to 14. (b) Three isolated CSL signs ("Happening", "Future", and "Situation"), with two samples of each isolated sign.
3.2 Methods

3.2.1 Framework Overview

Fig. 3. The proposed RNN-based CSL recognition and generation model architecture. The left column is the discriminator and the right column is the generator.
The proposed RNN-based skeleton recognition and generation architecture is shown in Fig. 3. The system includes a discriminator and a generator. The discriminator was inspired by recent achievements using bidirectional LSTM [6] and was used to represent CSL context information. The goal of skeleton generation is communication from hearing people to deaf people. RNN-based sequence generation has previously been developed for handwriting, as described in Section 1 [14, 15]. Those results suggest that an RNN generation framework is suitable for sequence data generation. As such, in this paper, we develop an RNN-based CSL skeleton generation model.

The RNN in the discriminator includes input, hidden, and output layers, as shown in the left column of Fig. 3. In the following description, k = (k_1, ..., k_T) represents the skeleton sequence data, h = (h_1, ..., h_T) is the hidden variable sequence data, and s = (s_1, ..., s_T) is the CSL label sequence data. In the discriminator, the skeleton sequence data k was fed into the hidden layers to calculate h and output a predicted label sequence ŝ.

In the generator, the real label s was used to predict a generative data distribution P(k^g|s). The term k^g = (k_1^g, ..., k_n^g) represents the final generated skeleton sequence. We propose a two-level probability model to generate the skeleton sequences k^g. First, the distribution P(h|s) is estimated in the first-level probability model, given a label sequence s. Second, random sampling from P(h|s) produces new data h^g = (h_1^g, ..., h_n^g), which satisfy h^g ~ N(m_h, Σ_h). The h^g terms are then decoded into a first-level generative sequence d^g = (d_1^g, ..., d_T^g). In the second-level probability model, P_gmm(d_i^g) is assumed to be a Gaussian mixture model (GMM) distribution. Random sampling from P_gmm(d_i^g) produces the second-level generative skeleton sequence k^g = (k_1^g, ..., k_T^g), where k_i^g ~ P_gmm(d_i^g) (i = 1, ..., T), and the k^g are recognizable to humans.
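To make the two-level pipeline concrete, the following minimal NumPy sketch traces the flow s → h^g → d^g → k^g with toy stand-ins for the learned networks. All names and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, J = 10, 14                    # frames per sequence, joints per frame
D = 2 * J                        # flattened (x, y) frame dimension

# First level: P(h|s) = N(m_h, Sigma_h), sampled via the reparameterization trick (Eqs. 5-8).
m_h = rng.normal(size=D)                                # stand-in for the output of f_rnn^p1(s)
log_sigma_h = 0.1 * rng.normal(size=D)
h_g = m_h + np.exp(log_sigma_h) * rng.normal(size=D)    # h^g, Eq. (7)

# Stand-in decoder f_rnn^d: unroll h^g into a T-frame first-level sequence d^g (Eq. 9).
d_g = np.tile(h_g, (T, 1)) + 0.05 * rng.normal(size=(T, D))

# Second level: nearest gesture group G* per frame (Eq. 10), then a random draw from
# that group's distribution (Eqs. 11-13); a single Gaussian stands in for the GMM here.
F = 5
centers = rng.normal(size=(F, D))                       # cluster centers of G_1..G_F
k_g = np.empty((T, D))
for t in range(T):
    g_star = np.argmin(np.linalg.norm(centers - d_g[t], axis=1))
    k_g[t] = centers[g_star] + 0.1 * rng.normal(size=D)
print(k_g.shape)                                        # (10, 28): one generated frame per step
```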
3.2.2 Preprocessing of Skeleton Data

Skeleton data were first pre-processed to simplify RNN-based system construction and parameter learning. First, every skeleton sequence k was reduced to T frames using a fuzzy C-means (FCM) clustering method; in this process, we sequentially selected T key frames from the original skeleton data. The CSL dataset (DCSL) contains a total of v sign classes, in which each class includes q skeleton sequences k. The length of each sequence is T, such that k = (k_1, ..., k_T), where k_i = {(j_xi, j_yi)}_{i=1}^{M}. The parameter M indicates the number of joints in skeleton k_i, where j_xi and j_yi are the x- and y-coordinates of the ith joint, respectively. A k-means method was used to cluster the skeletons in each class of DCSL, producing a total of F skeleton groups; the ith group is denoted G_i (i = 1, ..., F). With this convention, a sign skeleton sequence k can be described as:

k = (k_1, ..., k_T),  where k_j ∈ G_i, i ∈ {1, ..., F},  k_j = {(j_xi, j_yi)}_{i=1}^{M},   (1)

where each skeleton sequence k corresponds to a CSL label s.
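As an illustration of this preprocessing step, the sketch below selects T key frames and clusters individual frames into F groups with scikit-learn's k-means. The uniform key-frame picker is a simplification standing in for the FCM-based selection, and all sizes are toy values.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_key_frames(seq, T=36):
    """Reduce a (num_frames, M, 2) joint sequence to T frames, preserving temporal order."""
    idx = np.linspace(0, len(seq) - 1, T).round().astype(int)
    return seq[idx]

def group_skeletons(frames, F=150):
    """Cluster individual skeleton frames into F gesture groups G_1..G_F (cf. Eq. 1)."""
    flat = frames.reshape(len(frames), -1)            # one (2*M)-dim vector per frame
    km = KMeans(n_clusters=F, n_init=10, random_state=0).fit(flat)
    return km.labels_, km.cluster_centers_

# toy usage: 100 raw frames of a 14-joint skeleton -> 36 key frames -> 5 groups
raw = np.random.rand(100, 14, 2)
key = select_key_frames(raw, T=36)
labels, centers = group_skeletons(key, F=5)           # the paper uses F = 150
```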
3.2.3 Discriminator

The skeleton-based CSL discriminator can be represented by probability inference: P(s|k) = P(h|k) · P(s|h), where P(h|k) is a calculation from the input layer to the hidden layer (an encoding calculation) and P(s|h) corresponds to a calculation from the hidden layer to the output layer (a regression calculation). In this study, Bi-LSTM (a type of RNN) was used for CSL recognition. The sequence k was fed into the Bi-LSTM in one direction, from the input layer to the hidden layer [8]. The resulting calculation is given by:

i_t = σ(W_i · [k_t, h_{t−1}] + b_i)
f_t = σ(W_f · [k_t, h_{t−1}] + b_f)
c̃_t = tanh(W_c · [k_t, h_{t−1}] + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (2)
o_t = σ(W_o · [k_t, h_{t−1}] + b_o)
h_t = o_t ⊙ tanh(c_t)

where σ(·) is a sigmoid function, c is the cell gate, i is the input gate, o is the output gate, and f is the forget gate. This study used a bidirectional RNN framework in which the input was a skeleton sequence k and the output was an encoded vector h. The sequence k was fed in both the forward direction (denoted k→) and the backward direction (denoted k←) to the bidirectional RNN, producing the hidden states h = [h→; h←].

In this paper, for convenience, we use f_rnn(x; θ) to represent the RNN functions, where θ represents the system parameters and x is the input data. In the following, we use f_rnn^e(·) to represent the RNN-based encoding function, f_rnn^s(·) the RNN-based softmax function, f_rnn^d(·) the RNN-based decoding function, and f_rnn^p(·) the RNN-based distribution sampling function.

Based on the above definitions, we can fully describe the discriminator as follows. Let h^i be the ith hidden layer state. Using the neural network to recognize the CSL skeleton sequence, we express the two hidden layers in the Bi-LSTM, from the input layer to the hidden layer, as:

h^1 = f_rnn^{e1}(k; θ_{e1}),  h^2 = f_rnn^{e2}(h^1; θ_{e2}),   (3)

where h^1 and h^2 are the first and second hidden layers of data encoded by the Bi-LSTM, respectively. Next, we represent the hidden layer to the output layer as:

ŝ = f_rnn^s(h^2; θ_s),   (4)

where ŝ is the predicted CSL label.
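A minimal PyTorch sketch of this discriminator is shown below; the two stacked bidirectional LSTM layers play the roles of Eqs. (2)-(3) and the linear output layer that of the softmax layer in Eq. (4). Layer sizes and the use of the final time step for classification are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BiLSTMDiscriminator(nn.Module):
    def __init__(self, frame_dim=28, hidden=128, num_classes=500):
        super().__init__()
        # h^1 = f_rnn^e1(k), h^2 = f_rnn^e2(h^1): Eq. (3); each layer is bidirectional (Eq. 2)
        self.lstm1 = nn.LSTM(frame_dim, hidden, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)   # output layer for Eq. (4)

    def forward(self, k):            # k: (batch, T, frame_dim) skeleton sequences
        h1, _ = self.lstm1(k)        # forward/backward states concatenated: [h->; h<-]
        h2, _ = self.lstm2(h1)
        return self.out(h2[:, -1])   # class logits; argmax gives the predicted label s-hat

# toy usage: 4 sequences, T = 36 frames, 14 joints x (x, y) = 28 features per frame
logits = BiLSTMDiscriminator()(torch.randn(4, 36, 28))
s_hat = logits.argmax(dim=1)
```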


3.2.4. Generator

A. First Level

The first-level probability model can be expressed as P(d^g|s), and the corresponding generative calculations are as follows:

(1) Given a sign label s, the recognition model can be used to identify the hidden variables h corresponding to the label s. In other words, there is a probability relationship between s and h. Assuming P(h|s) is a multivariate Gaussian distribution:

P(h|s) = N(h | m_h, Σ_h) = (2π)^{−n/2} |Σ_h|^{−1/2} exp{−(1/2)(h − m_h)^T Σ_h^{−1} (h − m_h)},   (5)

where m_h and Σ_h are the mean and covariance of h, respectively. Using a function to represent this calculation, the sign label s is the input and the distribution parameters are the output:

[m_h, Σ_h] = f_rnn^{p1}(s; θ_{p1}).   (6)

(2) Random sampling of the distribution P(h|s) was used to obtain h^g, such that h^g ~ P(h|s). An exponential operation was used to convert the network outputs into standard deviation parameters, and a standard Gaussian distribution N(0, I) was then used to construct a random vector h^g:

h^g = m_h + Σ_h ⊙ N(0, I),   (7)

where ⊙ denotes the element-wise product. The parameters m_h and Σ_h were then input to the function representing this sampling, producing the output h^g:

h^g = f_rnn^{p2}([m_h, Σ_h]; θ_{p2}).   (8)

(3) The vector h^g was decoded to obtain the first-level skeletal sequence generative data. An RNN function was used to fit this calculation, with input h^g and output d^g:

d^g = f_rnn^d(h^g; θ_d).   (9)
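The three first-level steps can be sketched in PyTorch as follows; the label embedding stands in for f_rnn^{p1}, the reparameterized draw implements Eqs. (7)-(8), and a single-layer LSTM stands in for the decoder f_rnn^d of Eq. (9). All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FirstLevelGenerator(nn.Module):
    def __init__(self, num_classes=500, latent=64, frame_dim=28, T=36):
        super().__init__()
        self.T = T
        self.embed = nn.Embedding(num_classes, 128)
        self.to_params = nn.Linear(128, 2 * latent)      # [m_h, log sigma_h] from s, Eq. (6)
        self.decoder = nn.LSTM(latent, frame_dim, batch_first=True)  # stand-in for f_rnn^d

    def forward(self, s):                                # s: (batch,) integer sign labels
        m_h, log_sigma = self.to_params(self.embed(s)).chunk(2, dim=-1)
        h_g = m_h + torch.exp(log_sigma) * torch.randn_like(m_h)   # Eq. (7): sample h^g
        h_seq = h_g.unsqueeze(1).repeat(1, self.T, 1)    # feed h^g at every decoding step
        d_g, _ = self.decoder(h_seq)                     # first-level sequence d^g, Eq. (9)
        return d_g

d_g = FirstLevelGenerator()(torch.tensor([3, 141]))      # -> shape (2, 36, 28)
```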
B. Second Level

The decoded data d^g = (d_1^g, ..., d_T^g) were acquired from the first-level probability model, which ensured the correct ordering of a CSL skeleton sequence. VAE-based coding and decoding were used to ensure that the decoded and encoded data had the same sequence order (i.e., the decoded sequence d^g, the encoded sequence h^g, and the original skeleton sequence k follow the same pattern). A skeleton gesture group G was then identified for each d_i^g, and the GMM distribution P_gmm(d_i^g) (d_i^g ∈ G) was calculated. Finally, random sampling of P_gmm(d_i^g) was used to produce k^g.

These steps ensure that the final generated skeleton sequence order is correct; moreover, the random sampling ensures the generation of a variety of skeleton gestures. Based on the above analysis, we found the optimal gesture group G* for each d_i^g as follows:

G* = argmin_{G_j, j=1,...,F} ||d_i^g − center(G_j)||,   (10)

where center(G_j) is the clustering center of the jth gesture group G_j and ||·|| denotes the Euclidean distance between vectors. The probability distribution of d_i^g in G* was assumed to be a GMM. As shown in Fig. 4, each component of this model corresponds to a skeletal joint, Joint_i = (j_xi, j_yi), and the PDF of each joint is assumed to be a bivariate Gaussian distribution: P(Joint_i) = N(Joint_i | μ_x, μ_y, σ_x, σ_y, r). The corresponding PDF P_gmm(d_i^g) is written as:

P_gmm(d_i^g) = Σ_{j=1}^{M} π_j · N(Joint_j | μ_xj, μ_yj, σ_xj, σ_yj, r_j),  (d_i^g ∈ G*),   (11)

where M is the number of components in the GMM. The terms μ_xj and μ_yj denote the means, σ_xj and σ_yj represent the standard deviations, and π_j is the component weight [14]. The second-level probability model is shown in Fig. 4.

Fig. 4. An illustration of the GMM representation of P_gmm(d_i^g). The PDF of each joint P(Joint_i) is a bivariate Gaussian distribution, and the PDF of the combination of all joints P(d_i^g) is a GMM.

The d^g = (d_1^g, ..., d_T^g) terms were input to a function used to represent the GMM, and the parameters were output as follows:

Θ_gmm = f_rnn^{gmm}(d^g; θ_gmm),   (12)

where Θ_gmm = {θ_gmm^1, ..., θ_gmm^T} represents the GMM parameters and θ_gmm^i = [π_j, μ_xj, μ_yj, σ_xj, σ_yj, r_j] represents the parameters for each frame d_i^g.

Next, random sampling from P_gmm(d_i^g) was used to acquire k_i^g, with k_i^g ~ P_gmm(d_i^g). Using a function to represent this generative calculation, we input Θ_gmm and output the final generative skeleton sequence k^g = (k_1^g, ..., k_T^g) as follows:

k^g = f_rnn^{p3}(Θ_gmm; θ_{p3}).   (13)
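The second-level step can be sketched as follows, using scikit-learn's GaussianMixture as a stand-in for the frame-wise GMM of Eq. (11); the toy group data, component count, and full covariances are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def second_level_sample(d_i, groups):
    """Pick the nearest gesture group G* for frame d_i^g (Eq. 10), fit a GMM to it
    (Eq. 11), and draw one random skeleton frame from that GMM (Eq. 13)."""
    centers = np.stack([g.mean(axis=0) for g in groups])
    g_star = groups[int(np.argmin(np.linalg.norm(centers - d_i, axis=1)))]
    gmm = GaussianMixture(n_components=3, covariance_type="full",
                          random_state=0).fit(g_star)
    k_i, _ = gmm.sample(1)
    return k_i[0]

# toy usage: 5 gesture groups, each holding 50 frames of a 14-joint (x, y) skeleton
rng = np.random.default_rng(0)
groups = [rng.normal(loc=j, size=(50, 28)) for j in range(5)]
k_i = second_level_sample(rng.normal(size=28), groups)   # one sampled skeleton frame
```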
3.2.5. System Training

Fig. 5. An illustration of system training. The discriminator was first trained, and the generator was then trained according to the loss function loss(s, ŝ).
As shown in Fig. 5, system training can be divided into discriminator training and generator training. The left column of the figure shows the generator neural network model and the right column shows the discriminator.

In our system, generator training was based on a trained discriminator; hence, the discriminator was first trained using data pairs {k_i, s_i} (i = 1, ..., l). In the Bi-LSTM, k_i represents skeletal sequence data and s_i is the corresponding label. Once discriminator training was completed, the discriminator parameters were obtained. Next, as shown in Fig. 5, we linked the generator and discriminator for training purposes: the generated data k̂^g were input to the discriminator, the predicted labels ŝ^g were output, and the generator parameters were adjusted using loss(s_i, ŝ^g) based on a gradient descent method. The term s_i is the original input label and ŝ^g is the label predicted from k̂^g.

The training details are as follows. As Fig. 5 shows, the parameters [m_h, Σ_h] were acquired from a given label s. The distribution P(h|s) was then sampled to obtain h^g, which was decoded to produce d^g. The GMM was sampled to acquire k̂^g, and all parameters in the generator were trained using the back-propagation through time (BPTT) algorithm. In this process, k̂^g was input to the discriminator function f_rnn^D(·) and the predicted label ŝ^g was output:

ŝ^g = f_rnn^D(k̂^g; θ_D),   (14)

where θ_D = {θ_e1, θ_e2, θ_s}.

Using another function f_rnn^G(·) to represent the generator, with the original label s_i as input, the output can be described as:

k̂^g = f_rnn^G(s_i; θ_G),   (15)

where θ_G = {θ_p1, θ_p2, θ_d, θ_gmm, θ_p3}.

The loss function for the combined discriminator-generator system is then given by:

loss(s, ŝ) = loss(s, f_rnn^D(k̂^g; θ_D)) = loss(s, f_rnn^D(f_rnn^G(s; θ_G); θ_D)).   (16)

According to Eq. (16), once the discriminator is trained, the parameters θ_D are fixed. To produce optimal generative results, the parameters θ_G are adjusted using gradient descent:

θ_G ← θ_G − η · ∂loss(s, ŝ)/∂θ_G,   (17)

where η is an adjustment factor (the learning rate).
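A compact PyTorch training loop matching Eqs. (14)-(17) is sketched below: the discriminator parameters θ_D are frozen and only θ_G receives gradients from loss(s, ŝ). The model classes reuse the earlier sketches, and, unlike the paper's full pipeline, the non-differentiable second-level GMM sampling is omitted so gradients can flow end to end; the optimizer and learning rate are illustrative choices.

```python
import torch
import torch.nn as nn

def train_generator(generator, discriminator, labels, epochs=50, lr=1e-3):
    for p in discriminator.parameters():           # theta_D stays fixed (already trained)
        p.requires_grad_(False)
    opt = torch.optim.SGD(generator.parameters(), lr=lr)   # Eq. (17) update rule
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        k_g = generator(labels)                    # k^g = f_rnn^G(s; theta_G), Eq. (15)
        logits = discriminator(k_g)                # s-hat = f_rnn^D(k^g; theta_D), Eq. (14)
        loss = loss_fn(logits, labels)             # loss(s, s-hat), Eq. (16)
        loss.backward()                            # BPTT through the generator only
        opt.step()
    return generator

# toy usage with the earlier sketches (matching frame_dim and T):
# g = train_generator(FirstLevelGenerator(), BiLSTMDiscriminator(), torch.tensor([3, 141]))
```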


4. Experiments

4.1. Discriminator Evaluation

4.1.1 Dataset and Baseline

We used the previously mentioned CSL dataset for discriminator evaluation. The input data included 500 groups and 12,500 samples. The output labels were the sign names corresponding to the 500 signs. We selected the LSTM [19] method as the baseline; the GRU [15] is a gated variant related to the LSTM [19].

4.1.2 Results and Discussion

The skeleton generation model was constructed on top of a Bi-LSTM discriminator, which was tested for CSL recognition. The proposed technique was evaluated on a PC with an Intel Core i5 CPU, 8 GB of RAM, and the Microsoft Windows 10 operating system.
of
346 with an Intel Core i5 CPU with 8 GB of RAM and the Microsoft Windows 10 operating system.
347 Table.1 Isolated CSL words recognition accuracy for different dataset sizes.
Categories, Num. of Samples BI-LSTM LSTM [19] (Baseline) GRU [15]

pro
50,50×250 95.68±0.65% 90.12±0.75% 91.17±0.85%
100, 100×250 90.22±1.18% 86.45±0.95% 87.78±1.02%
500, 500×250 82.55±2.23% 65.11±3.15% 68.12±2.16%

The dataset was randomly divided into three parts: 70% training data, 10% validation data, and 20% test data. Because the data were randomly divided, the testing results differed between runs; to obtain stable CSL recognition accuracy, each experiment was repeated 10 times and the mean and variance were calculated. Discriminator training details are shown in Appendix A. Table 1 shows the comparative results: our method, the LSTM method, and the GRU method were each tested on 50 signs, 100 signs, and 500 signs, respectively. The results show that as the number of classes increased, the recognition accuracy of all methods declined. The main reason may be that as the number of classes increased, there were more confusable classes in the dataset, resulting in lower total recognition accuracy. Overall, the recognition accuracy for 500 signs with our method was 82.55%. The accuracy was higher with the skeletal Bi-LSTM than with the LSTM [19] or GRU [15] methods. This demonstrates the applicability of the proposed framework for identifying isolated CSL words.

We also evaluated the discriminator on other metrics, such as model size and time cost, the results of which are included in Appendix B. These indicate that the system needs little time to complete CSL recognition tasks, which is a key performance metric in a bidirectional communication system. In short, we compared recognition accuracy across different data sizes and time costs, and the results showed that our framework, using skeleton data and the Bi-LSTM, satisfactorily recognized isolated sign words for hearing people.
4.2. Generator Evaluation

4.2.1 Dataset and Baseline

The CSL database was also used to train the RNN-based generative model, including 125,000 training samples from 500 CSL categories. Unlike discriminator training, which required dividing the whole dataset into training, testing, and validation data, generator training used the whole dataset to estimate the system parameters. As a result, the input to the generator is the complete skeleton dataset for each sign and the output is the generator parameters. All skeleton gestures represent frequently used signs performed by a variety of individuals. The proposed RNN model was capable of drawing skeleton sequences for all 500 signs, with each sequence corresponding to an isolated CSL word in the dataset. A sign word label vector s was input at each step of the generation process, and a series of tests was conducted to validate model performance.

The VAE [42] and GAN [43] models were used as baseline methods for performance evaluation. Additional extended algorithms were also included: DRAW [41], IWAE [49], VGP [47], and VAE+HF [16], which are extensions of the VAE [42], and LAPGAN [44] and DCGAN [45], which are extended GANs.
4.2.2. Results and Discussion

A. Generation Details

Based on the proposed two-level probability model, we randomly generated gestures according to gesture category. We regularized the length of all training sequences to 50 frames, enabling the proposed framework to decide when and how to complete the drawing process. All auto-generated CSL skeletons are readable by humans. Moreover, different gesture habits are reflected in the drawing process owing to the GMM modeling of each gesture.

Fig. 6 shows various skeleton frames that were automatically generated for different gestures using the two-level probability model. As the figure indicates, the proposed method generated distinct skeleton gesture data that were unique and recognizable by humans.

Examples of generated CSL skeleton sequences using the proposed two-level probability model are shown in Figs. 7-10. Multiple gesture styles, such as cursive, regular, and fluent, are evident in each generated skeleton sequence. These results demonstrate the model's ability to draw CSL skeleton gestures automatically, with a diversity of generated styles. However, the generated skeletons were not perfect, as there was some logical confusion in the generated skeletons (causing semantic ambiguity in the signs). For example, the gestures in the 1st and 4th rows of Fig. 6 are only slightly different, and the data generated from these two gestures are difficult to distinguish; the same issue is seen in the 3rd and 5th rows. It is worth noting that sequence recognition is based on the combination of multiple frames. As such, we needed a method to estimate the quality of the generated data quantitatively; this topic is discussed later.
Fig. 6. Examples of real and generated skeleton gesture frames. In each row, the first picture shows the real skeleton data and the remaining pictures show generated skeleton data. To display the skeleton structure clearly, each (randomly selected) color denotes a link between one joint and another.
lP
408
rna

(a)
Jou

Time

(b) (c)
409
410 Fig. 7. Illustrations of real and generated skeleton sequences for the word “situation”. These data are displayed as
411 (a) a video, (b) a real skeleton sequence, and (c) generated skeleton sequences (includes 4 sequences).
412
413

Fig. 8. Illustrations of real and generated skeleton sequences for the word "condition". The data are displayed as (a) a video, (b) a real skeleton sequence, and (c) generated skeleton sequences (four sequences).
Fig. 9. Illustrations of real and generated skeleton sequences for the word "future". The data are displayed as (a) a video, (b) a real skeleton sequence, and (c) generated skeleton sequences (four sequences).
Fig. 10. Illustrations of real and generated skeleton sequences for the word "circumstances". The data are displayed as (a) a video, (b) a real skeleton sequence, and (c) generated skeleton sequences (four sequences).
Fig. 11 shows sample training details for the skeleton data generator, based on our proposed two-level probability model. The skeleton joint data used for training included various CSL gestures, and the skeleton gestures were optimized over 50 epochs. As shown in Fig. 11, at the beginning of training the generated skeleton data were disordered and had no meaning. However, as the iterations increased, the generator parameters θ_G = {θ_p1, θ_p2, θ_d, θ_gmm, θ_p3} were continually optimized and the generated data became increasingly recognizable.

Fig. 11. An illustration of training details for the skeleton data generator. From the 1st epoch to the 50th epoch, the automatically drawn skeleton data (isolated CSL sign: "situation") went from being disordered to recognizable.
Fig. 12 shows comparative results for the first-level and second-level probability models. As indicated by the results, the first-level generated skeleton d^g was close to the original skeleton k, with little variation between d^g and k. For natural bidirectional communication, the final generated skeletons must be both recognizable and varied in style. The second-level generated skeleton k̂^g was not only recognizable to humans but also had a variety of styles, thus fully meeting the requirements of natural interaction. These experimental results can be attributed to the following. In the first-level generation, the parameters m_h and Σ_h were related to the original skeleton sequence k; through the VAE-based encoding and decoding process, the sequence data changed partially, but the basic style did not change. In the second-level generation, we used d^g to find an appropriate skeleton gesture group G*, which included different gestures. By establishing a GMM model in G* and then randomly sampling from it, the acquired skeleton k̂^g not only satisfied the varied-style requirement but also remained recognizable.

Fig. 12. A comparison of skeletons generated by the first-level and second-level probability models. (a) The original skeleton sequence k of the isolated CSL word "situation", (b) the skeleton data d^g generated by the first-level probability model, and (c) the skeleton data k̂^g generated by the second-level probability model.
We also evaluated the generator on other performance metrics, such as model size and time cost, the results of which are shown in Appendix B. These indicate that the system needed only a short period of time to finish a CSL skeleton generation task, which is a key performance metric for a bidirectional communication system.
B. Recognizability Comparison

The quality of the generated skeletal data was further analyzed using the trained Bi-LSTM discriminator to determine whether the sequences were identifiable. The proposed generation model was used to produce 100 random skeletons in each of the 500 CSL categories, for a total of 50,000 test samples. These data were divided into three parts: 70% training data, 10% validation data, and 20% test data. All data were fed into the Bi-LSTM discriminator for evaluation. The test results are shown in Table 2. The generated skeleton sequences of most CSL categories were recognized by the Bi-LSTM discriminator, validating the ability of our proposed generative model to draw CSL skeletons correctly.

The average recognition accuracy (test data) across all generated skeletal sequences was 79.12%, indicating that most of the skeletons were recognized correctly. However, this value is lower than the 82.55% achieved with real samples. This performance loss was investigated by examining some of the CSL skeletons generated in each category. The results showed that the majority of problematic skeletal gestures originated in semantically confusing CSL categories; in other words, multiple gestures in different categories were highly similar. The RNN-based generation model failed to capture these subtle details, preventing it from accurately drawing the corresponding skeletons. As a result, the recognition accuracy for generated data was lower than that for actual data. However, generated data were easily recognized in CSL categories not readily confused with other classes.

We also compared the recognition performance of skeleton sequences produced by different generation methods. For a fair comparison, all generators were evaluated with our Bi-LSTM discriminator, and the generated data were again divided into three parts, as shown in Table 2. The comparative results show that data generated by the VAE-based models achieved better performance than the GAN-based models. This is likely because GANs are more suitable for processing image data, whereas VAE-based models may be more advantageous for processing sequential data.
Table 2. Recognition performance comparisons for different generators.

Method | Recognition Accuracy of Generated Data
Our method | 79.12±1.16%
VAE [42] (Baseline 1) | 72.25±2.21%
Variational DRAW [41] | 75.25±2.08%
IWAE [49] | 70.17±2.11%
VGP [47] | 72.78±1.68%
VAE+HF [16] | 78.48±0.51%
GAN [43] (Baseline 2) | 66.32±1.33%
LAPGAN [44] | 65.77±1.89%
DCGAN [45] | 67.56±1.44%

We also compared recognition performance using different system parameters. In this paper, two parameters served as key factors: T (the frame length of the skeleton sequences) and F (the number of skeleton groups).

Parameter influence was evaluated by randomly selecting 4 groups from the 500 CSL categories, with each group containing 10 isolated CSL words; the proposed generation model was used to produce 100 random skeleton sequences for each of the CSL signs (4×10×100 = 4000 samples in total). The data were again divided into three parts using the same process. The first two groups were used to test the effect of parameter T, the skeleton sequence length, and the last two groups were used to test the effect of parameter F, the number of skeleton gesture groups. Initial parameters were set as T=36 and F=150 to reduce computational runtime. As shown in Fig. 13, an increase in the skeleton sequence length T produced generative performance closer to that of the real data. For example, as seen for F=150 in Fig. 13(a), the recognition accuracy (test data) of generated data ("Future") increased from 87.65% to 92.52%; when T=50, real-data recognition accuracy reached 93.33%. As seen in Fig. 14, an increase in the number of clusters (F) likewise produced performance closer to that of the actual data.
pro
496 increased from 87.65% to 92.52%. When T=50, real data recognition accuracy reached 93.33%. As
497 seen in Fig. 14, an increase in the number of clusters (F) produced a performance closer to that of
498 the actual data.
(a) Recognition accuracy of generated data (sign group 1) (b) Recognition accuracy of generated data (sign group 2)
1 1

0.8 0.8

0.6

0.4
re- 0.6

0.4

T=16
T=16
0.2 T=28
0.2 T=28
T=36
T=36
T=50
T=50
lP
Real data
Real data 0
0
e

nd

ce
l

t
s
n

n
ca

ul
m

os
n

io

tio

g
n

d
rm
ic
g

al
t

y
en

es
an
ou
go

Si
lit
in

un
io
ur

ag

Lo

ct

Ai
am

tu

rp
si
Fo

ea
at
n

im
t

Se

R
fic
re

Ar
Ac
So

nt

an
Fu

Pu
pe

yn
tu

Fo
or

va

ni
Tr
ap

Si

g
Ev

Ad

Si
H

499 Isolated CSL words Isolated CSL words

500 Fig. 13. Recognition performance comparisons for different parameters T.


(a) (b)
rna
Jou

501
502 Fig. 14. Recognition performance comparisons for different parameters F.
503
C. Data Augmentation

Since the generative model is capable of producing near-realistic labeled skeleton sequences, an attempt was made to apply it as a data augmentation strategy for the supervised training of the discriminative model. We randomly generated 100 skeleton sequences in each of the 500 classes using the generative model. Afterwards, the real training samples and the generated samples were combined into a larger dataset for re-training the discriminative model, which was then evaluated using the test samples. The combination of the Bi-LSTM and the two-level probability model further increased the recognition performance reported in Table 3, from 82.55% (real data) to 85.24% (real data + generated data). This result validates the quality of the generative model, which is capable of data augmentation to improve the performance of the discriminative model.

Table 3. Recognition performance comparisons for data augmentation based on different compositions of discriminators and generators.

Discriminator | Generator | Accuracy (real) | Accuracy (generated) | Accuracy (real+generated)
Bi-LSTM | Our method | 82.55 ± 2.23% | 79.12 ± 1.16% | 85.24 ± 1.83%
LSTM [19] | Our method | 65.11 ± 3.15% | 61.25 ± 2.21% | 67.33 ± 3.33%
GRU [15] | Our method | 68.21 ± 2.16% | 65.48 ± 0.51% | 69.21 ± 1.73%
Bi-LSTM | VAE [42] (Baseline 1) | 72.25 ± 2.21% | 70.33 ± 2.02% | 72.98 ± 1.97%
LSTM [19] | VAE [42] (Baseline 1) | 62.02 ± 2.89% | 60.02 ± 1.87% | 63.25 ± 3.16%
GRU [15] | VAE [42] (Baseline 1) | 65.11 ± 1.56% | 64.17 ± 1.58% | 67.02 ± 2.28%
Bi-LSTM | GAN [43] (Baseline 2) | 66.32 ± 1.33% | 63.25 ± 0.98% | 68.87 ± 2.57%
LSTM [19] | GAN [43] (Baseline 2) | 61.12 ± 2.21% | 60.17 ± 1.87% | 62.22 ± 2.13%
GRU [15] | GAN [43] (Baseline 2) | 62.31 ± 3.55% | 61.55 ± 1.63% | 63.21 ± 2.93%
5. Conclusion

We presented a novel method for automatically recognizing and drawing CSL skeleton sequences using an RNN conditional generation model trained on skeleton sequence data. We implemented a two-level probability model to describe skeleton sequences and a GMM to describe skeleton gestures. The generation model automatically determined the sequence style and drawing length for each gesture. We tested the proposed method on the 500 CSL dataset, which includes 500 CSL categories and 12,500 training samples. Our main findings are summarized below.

(1) A uniform framework was proposed for skeleton-based bidirectional communication, which can also be extended to other sequence-to-sequence information interactions. Based on the proposed framework, the recognition accuracy for real data in the complete dataset reached 82.55%, and the recognition accuracy for generated data reached 79.12%. The recognition time was approximately 0.003 s and the generation time was ~0.17 s. All results showed that our bidirectional communication framework is feasible.

(2) A two-level probability generative model was proposed. Compared with previous work, the proposed generation model has flexible approximate posterior distributions. We carried out a series of experiments, including generator performance analysis, recognizability analysis, and analysis of training details across iteration epochs, specifically for the first-level and second-level generation. The results showed that the proposed two-level probability model is effective.

(3) The proposed generation method compensated for a lack of training data. When we combined real data and generated data, the recognition accuracy increased by about 3%. These results showed that generated data can be used to improve the performance of a discriminator.

Future research will attempt to improve the quality of the generated skeleton sequences by focusing on confusable CSL categories. In addition, this study investigated recognition and generation only for isolated CSL words; the next step is to study skeleton-based continuous CSL recognition and generation.
Acknowledgments

This work was supported by the Natural Science Foundation of China (Nos. 60972095, 61271362, and 61671362). We thank the anonymous editors for their linguistic assistance during the preparation of this manuscript. Qinkun Xiao and Minying Qin contributed equally to this work.
References

[1] Anwar Shamama, Sinha Subham Kumar, Vivek Snehanshu, Ashank Vishal. Hand gesture recognition: A survey. Lecture Notes in Electrical Engineering, 2019, 511: 365-371.
[2] Kumar D. Anil, Sastry A.S.C.S., Kishore P.V.V., Kumar E. Kiran, Kumar M. Teja Kiran. S3DRGF: Spatial 3-D relational geometric features for 3-D sign language representation and recognition. IEEE Signal Processing Letters, 2019, 26(1): 169-173.
[3] Avola Danilo, Bernardi Marco, Cinque Luigi, Foresti Gian Luca, Massaroni Cristiano. Exploiting recurrent neural networks and Leap Motion controller for the recognition of sign language and semaphoric hand gestures. IEEE Transactions on Multimedia, 2019, 21(1): 234-245.
[4] Camgoz Necati Cihan, Hadfield Simon, Koller Oscar, Ney Hermann, Bowden Richard. Neural sign language translation. CVPR, 2018, 7784-7793.
[5] Huang Jie, Zhou Wengang, Li Houqiang, Li Weiping. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[6] Huang Jie, Zhou Wengang, Zhang Qilin, Li Houqiang, Li Weiping. Video-based sign language recognition without temporal segmentation. AAAI, 2018, 2257-2264.
[7] Kumar E. Kiran, Sastry A.S.C.S., Kishore P.V.V., Kumar M. Teja Kiran, Kumar D. Anil. Training CNNs for 3-D sign language recognition with color texture coded joint angular displacement maps. IEEE Signal Processing Letters, 2018, 25(5): 645-649.
[8] Kumar Eepuri Kiran, Kishore P.V.V., Kumar Maddala Teja Kiran, Kumar Dande Anil, Sastry A.S.C.S. Three-dimensional sign language recognition with angular velocity maps and connived feature ResNet. IEEE Signal Processing Letters, 2018, 25(12): 1860-1864.
[9] Hong Cheng, Lu Yang, Zicheng Liu. A survey on 3D hand gesture recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2016, 29(9): 1659-1673.
[10] Li Yuan, Wang Xinggang, Liu Wenyu, Feng Bin. Deep attention network for joint hand gesture localization and recognition using static RGB-D images. Information Sciences, 2018, 441: 66-78.
[11] Saqlain Shah Syed Muhammad, Abbas Naqvi Husnain, Khan Javed I., Ramzan Muhammad, Zulqarnain, Khan Hikmat Ullah. Shape based Pakistan sign language categorization using statistical features and support vector machines. IEEE Access, 2018, 6: 59242-59252.
[12] Koller Oscar, Zargaran Sepehr, Ney Hermann, Bowden Richard. Deep Sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. International Journal of Computer Vision, 2018, 126(12): 1311-1325.
[13] Kumar Pradeep, Roy Partha Pratim, Dogra Debi Prosad. Independent Bayesian classifier combination-based sign language recognition using facial expression. Information Sciences, 2018, 428: 30-48.
[14] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2014.
[15] Xuyao Zhang, Fei Yin, Yanming Zhang, Chenglin Liu, Yoshua Bengio. Drawing and recognizing Chinese characters with recurrent neural network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 849-862.
[16] Liu GuoJun, Liu Yang, Guo MaoZu, Li Peng, Li MingYu. Variational inference with Gaussian mixture model and householder flow. Neural Networks, 2019, 109: 43-55.
[17] Sait Celebi, Ali Selman Aydin, Talha Tarik Temiz, Tarik Arici. Gesture recognition using skeleton data with weighted dynamic time warping. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2013, 620-625.
[18] D. Guo, W. Zhou, H. Li, M. Wang. Online early-late fusion based on adaptive HMM for sign language recognition. ACM Trans. Multimedia Comput., Commun., 2017, 14(1): 8.
[19] Tao Liu, Wengang Zhou, Houqiang Li. Sign language recognition with long short-term memory. ICIP, 2016, 2871-2875.
[20] C. Sun, T. Zhang, B. Bao, C. Xu, T. Mei. Discriminative exemplar coding for sign language recognition with Kinect. IEEE Transactions on Cybernetics, 2013, 43(5): 1418-1428.
[21] H. Wang, X. Chai, Y. Zhou, X. Chen. Fast sign language recognition benefited from low rank approximation. IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2015, 1: 1-6.
[22] Chao Sun, Tianzhu Zhang, Bing-Kun Bao, Changsheng Xu. Latent support vector machine for sign language recognition with Kinect. ICIP, 2013, 4190-4194.
[23] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389, 2014.
[24] Y. Du, W. Wang, L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. CVPR, 2015, 1110-1118.
[25] Cao Dong, Ming Leu, Zhaozheng Yin. American sign language alphabet recognition using Microsoft Kinect. CVPR Workshops, 2015, 44-52.
[26] Simon Fothergill, Helena Mentis, Pushmeet Kohli, Sebastian Nowozin. Instructing people for training gestural interactive systems. SIGCHI Conference on Human Factors in Computing Systems, 2012, 1737-1746.
[27] Alexey Kurakin, Zhengyou Zhang, Zicheng Liu. A real time system for dynamic hand gesture recognition with a depth sensor. European Signal Processing Conference, 2012, 1975-1979.
[28] Zhou Ren, Junsong Yuan, Zhengyou Zhang. Robust hand gesture recognition based on finger-earth mover's distance with a commodity depth camera. ACM Conference on Multimedia, 2011, 1093-1096.
[29] Hanjie Wang, Xiujuan Chai, Yu Zhou, Xilin Chen. Fast sign language recognition benefited from low rank approximation. IEEE Conference and Workshops on Automatic Face and Gesture Recognition, 2015, 1-6.
[30] Sergio Escalera, Xavier Baro, Jordi Gonzalez, et al. ChaLearn looking at people challenge 2014: Dataset and results. ECCV Workshop, 2014, 459-473.
[31] D. Eck, J. Schmidhuber. A first look at music composition using LSTM recurrent neural networks. Technical Report No. IDSIA-07-02, 2002, 1-11.
[32] N. Boulanger-Lewandowski, Y. Bengio, P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. ICML, 2012.
[33] I. Sutskever, J. Martens, G. Hinton. Generating text with recurrent neural networks. ICML, 2011.
[34] I. Sutskever, G. E. Hinton, G. W. Taylor. The recurrent temporal restricted Boltzmann machine. NIPS, 2008, 1601-1608.
[35] John F. Kolen, Stefan C. Kremer. Gradient flow in recurrent nets: The difficulty of learning long term dependencies. Wiley-IEEE Press, 2001.
[36] G. W. Taylor, G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. ICML, 2009, 1025-1032.
[37] S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation, 1997, 9(8): 1735-1780.
[38] A. Graves, J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. NIPS, 2008, 21.
[39] A. Graves, A. Mohamed, G. Hinton. Speech recognition with deep recurrent neural networks. ICASSP, 2013.
[40] H. Larochelle, I. Murray. The neural autoregressive distribution estimator. AISTATS, 2011.
[41] K. Gregor, I. Danihelka, A. Graves, D. Rezende, D. Wierstra. DRAW: A recurrent neural network for image generation. ICML, 2015.
[42] D. Kingma, M. Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
[43] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. NIPS, 2014.
[44] E. Denton, S. Chintala, A. Szlam, R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. arXiv:1506.05751, 2015.
[45] A. Radford, L. Metz, S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
[46] Roberts, A., Engel, J., & Eck, D. Hierarchical variational autoencoders for music. NIPS, 2017.
[47] Tran, D., Ranganath, R., & Blei, D. M. The variational Gaussian process. ICLR, 2016.
[48] Sonderby, C. K., Raiko, T., Maaloe, L., Sonderby, S. K., & Winther, O. Ladder variational autoencoders. NIPS, 2016, 29: 3738-3746.
[49] Burda, Y., Grosse, R., & Salakhutdinov, R. Importance weighted autoencoders. ICLR, 2016.
[50] Salimans, T., Kingma, D., & Welling, M. Markov chain Monte Carlo and variational inference: Bridging the gap. ICML, 2015, 37: 1218-1226.
[51] Rezende, D. J., Mohamed, S., & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014, 32(2): 1278-1286.
Appendix A: Discriminator training details

Sample recognition training details are shown in Figs. 1-3, with confusion matrices included in Figs. 2 and 3. We show training details only for the 50 CSL dataset, a sub-dataset of the 500 CSL dataset that includes only 50 sign categories. Fig. 1 shows a Bi-LSTM-based CSL discriminator training process using skeleton data (50 categories and 50×250 = 12,500 skeleton sequence samples). The dataset was divided into three parts: 70% training data, 10% validation data, and 20% test data. Recognition accuracy for the test data was 95.68%, with an elapsed time of 4 min 32 s over 200 training epochs on a single GPU. Figs. 2 and 3 show the recognition confusion matrices for the 50 CSL signs on the validation and test data, respectively. As seen in the figures, the CSL recognition accuracy is high and the Bi-LSTM-based classifier achieved good identification performance across the different CSL classes.

Fig. 1. Training accuracy and loss curves for a Bi-LSTM-based discriminator using the 50 CSL skeleton dataset.

Fig. 2. A recognition confusion matrix for the validation data of the 50 CSL skeleton dataset. The validation data comprised 10% of the total dataset (12500 × 10% = 1250 samples). The recognition accuracy of the validation data was 94.25%.

Fig. 3. A recognition confusion matrix for the test data of the 50 CSL skeleton dataset. The test data comprised 20% of the total dataset (12500 × 20% = 2500 samples). The recognition accuracy of the test data was 95.68%.

Appendix B: Performance comparisons for system size and time cost

Table 1. Performance comparisons for discriminator size and time cost.

Categories | Size (sample data) | Train time | Test time (per sample) | Size (discriminator)
50 | 251 MB | 2 m 42 s | 0.0004 s | 3.62 MB
100 | 476 MB | 4 m 46 s | 0.0004 s | 6.89 MB
300 | 1.29 GB | 7 m 03 s | 0.0003 s | 19.9 MB
500 | 2.21 GB | 14 m 52 s | 0.0003 s | 33.3 MB

As shown in Table 1, for the 500-class case, the size of the original data was 2.21 GB, whereas the size of the discriminator was only 33.3 MB. A small discriminator is important in a system for bidirectional communication between hearing and deaf people. The table also shows that, regardless of how long it took to train a discriminator, the testing time was short (at most 0.0004 s per sample).

As shown in Table 2, for each isolated CSL word, the original data size was 4-5 MB, whereas the size of the generator was only ~0.035 MB. A small generator is likewise important for such a system. The table also shows that, regardless of the time required to train the generator, the testing time was always short (approximately 0.17 s).

Table 2. Performance comparisons for generator size and time cost.

Sign | Size (skeleton) | Train time | Test time | Size (generator)
Happening | 4.58 MB | 42.05 s | 0.18 s | 0.034 MB
Future | 5.05 MB | 48.26 s | 0.17 s | 0.038 MB
Situation | 6.16 MB | 45.13 s | 0.20 s | 0.037 MB
Condition | 5.01 MB | 55.69 s | 0.21 s | 0.031 MB
Dynamic | 6.07 MB | 51.21 s | 0.23 s | 0.031 MB
Form | 6.55 MB | 48.21 s | 0.19 s | 0.035 MB
Sound | 5.51 MB | 48.36 s | 0.17 s | 0.032 MB
Advantage | 4.87 MB | 44.12 s | 0.19 s | 0.041 MB
Reality | 5.74 MB | 41.58 s | 0.18 s | 0.044 MB
Actual | 5.54 MB | 40.11 s | 0.16 s | 0.041 MB
Conflict of Interest Statement

Compliance with ethical standards

Author Qinkun Xiao declares that he has no conflict of interest.
Author Minying Qin declares that she has no conflict of interest.
Author Yuting Yin declares that she has no conflict of interest.