
23rd Telecommunications forum TELFOR 2015 Serbia, Belgrade, November 24-26, 2015.

Recognizing emotions from videos by studying facial expressions, body postures and hand gestures
Mihai Gavrilescu

Mihai Gavrilescu is with the Department of Telecommunications, University Politehnica of Bucharest, 1-3 Iuliu Maniu Blvd., 061071, Bucharest 6, Romania (e-mail: mike.gavrilescu@gmail.com).

Abstract: A system for recognizing emotions from videos by studying facial expressions, hand gestures and body postures is presented. A stochastic context-free grammar (SCFG) containing 8 combinations of hand gestures and body postures for each emotion is used, and we show that increasing the number of combinations in the SCFG improves the system's generalization to new hand gesture and body posture combinations. We also show that hand gestures and body postures improve the emotion recognition rate by up to 5% for Anger, Sadness and Fear compared to a standard facial emotion recognition system, while for Happiness, Surprise and Disgust no significant improvement was observed.

Keywords: multimedia, video signal processing, affective computing, face recognition, gesture recognition.
I. INTRODUCTION

Recognizing the affective states of individuals is an important ability for humans in interpersonal interactions. On one hand, affective states modulate the way we communicate and react; on the other hand, they offer clues about the emotional state of others and optimize our interaction with them. Hence, for a computer to be able to understand and communicate with individuals, it has to be able to recognize and even simulate emotions. Various studies have aimed at building such systems, but most of them treat only one of the three components: body postures, hand gestures or facial expressions.
In terms of body postures, Shibata and Kijima [1] made use of pressure sensors on chairs and body accelerometers to verify the link between body gestures and postures and facial expression factors such as arousal and pleasantness. Their research showed that body gestures introduce another factor, called dominance, expressed by the arms and legs, while still retaining the other two factors. Similarly, Radeta and Maiocchi [2] used capture sensors to analyze low-level features of the skeletal postural joints of the human body in relation to the 7 basic emotions, uncovering patterns between body postures and emotions that can be exploited via machine learning techniques. Body postures were combined with facial expressions in [3], where the researchers defined two types of action units: facial action units and body action units. The results showed that facial expressions combined with body features offer better accuracy than standard facial emotion recognition.

Similar studies were conducted for hand gestures and movements. Kipp and Martin [4] showed that gestural features like motion direction, palm orientation, hand shape and handedness are associated with the pleasure and arousal factors specific to facial expressions. Moreover, Glowinski et al. [5] used visual tracking techniques for hands from frontal and lateral views to show that the quality of gestures improves the recognition of emotions.

With these in mind, our paper aims at building a system for emotion recognition that studies facial expressions alongside body postures and hand gestures, and at showing the improvements such a system brings compared to standard facial emotion recognition (FER) systems.
II. THEORETICAL MODEL

Our research is based on the fact that there is a close link between the emotional dimensions found in facial expressions and those determined from body postures [1] and hand gestures [4]. Moreover, the research done in [6] showed that facial expressions, body movements and postures, as well as hand gestures, are linked at a subconscious level in the human mind when expressing emotions; hence, studying them together is important if an emotion recognition system aims at human-like performance.

For facial expression recognition, we make use of the Facial Action Coding System (FACS) developed by Ekman and Friesen [7], based on the idea that each emotion determines a muscular activity on the face that can be analyzed in order to determine the true emotional state of the subject, as these muscular activities cannot be mimicked. FACS defines a set of 46 action units (AUs) that have the characteristic of being additive (one action unit triggers another action unit and together they form an Action Unit Cluster (AUC)) or non-additive (the AU is independent of other AUs).

Concerning hand gestures, our paper is based on the research done in [8], which divides gestures into three large categories: deictic gestures (hand or finger pointing to an object), mimetic gestures (accepting / declining actions) and iconic gestures (defining attributes like shape or size). We can also view gestures as a set of permutations generated by the actions of the hand and arm, and two types of systems can be defined: single hand gesture-based recognition systems (SHGRS, easy to implement but lacking generalization) and multiple hand gesture-based recognition systems (MHGRS, harder to implement, but able to generalize to new gestures). In this paper we build a SHGRS that also includes information related to body postures.

In our specific case, body postures refer to the subject's body being inclined forward or backwards, to the left or to the right, or rotated towards the left or right. These body postures are combined with hand gestures, and a stochastic context-free grammar (SCFG) [9] is defined in order to describe hand gesture and body posture combinations, with the advantage that it can be extended to other gestures and postures.
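For illustration only, a minimal sketch of how such a grammar can be represented is given below; the non-terminal, terminal symbols and probabilities are hypothetical and do not reproduce the combinations learned from our training data.

// Minimal sketch of an SCFG over hand-gesture / body-posture symbols.
// The symbols and probabilities are illustrative, not the grammar actually
// derived from the training database.
#include <string>
#include <vector>
#include <iostream>

struct Production {
    std::string lhs;                 // non-terminal, e.g. an emotion label
    std::vector<std::string> rhs;    // sequence of gesture / posture terminals
    double probability;              // probabilities of one lhs sum to 1
};

int main() {
    std::vector<Production> grammar = {
        // two hypothetical combinations describing the emotion "ANGER"
        {"ANGER", {"body_leaning_forward", "hand_raised", "fist_closed"}, 0.6},
        {"ANGER", {"body_rotated_left", "hand_pointing"}, 0.4},
    };
    for (const auto& p : grammar) {
        std::cout << p.lhs << " ->";
        for (const auto& s : p.rhs) std::cout << ' ' << s;
        std::cout << "  [p=" << p.probability << "]\n";
    }
    return 0;
}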



III. PROPOSED ARCHITECTURE

A. Facial Expression Recognition (FER) system

As previously mentioned, FACS defines 46 AUs, out of which, in this paper, we use only the first 27, as they are the most specific to face-related expressions. We implement a module similar to the one in our previous work [10], dividing the face into 5 components (each containing a set of AUs that can only be additive inside the component): Brow (Haar-cascade classifier [11]), Eye (Haar-cascade classifier [11]), Cheek (Viola-Jones face detector classifier for cheek AU classification [12]), Lips (lip feature point tracking method [13]) and Wrinkles (shortest distance classifier [14]). Fig. 1 shows this module.

Fig. 1. Facial Emotion Recognition (FER) System


For each frame, the head orientation is determined with a k-nearest neighbors algorithm [15] and, based on it, the face is detected and segmented using the Viola-Jones face detection algorithm [16]. The detected face is then segmented into the 5 components using the same Viola-Jones algorithm. For each segment, the previously mentioned classifiers determine, in an inter-subject methodology, the corresponding AUs, which are fed to a FER Neural Network (FER-NN) that takes the final decision regarding the recognized emotion. The FER-NN receives 27 features (one for each of the 27 classified AUs) as input nodes and has a single hidden layer with 55 neurons and 6 output nodes representing the 6 basic emotions.
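A minimal sketch of the FER-NN forward pass with this 27-55-6 topology is shown below; the weights are random placeholders (the real values are learned by backpropagation on our database), so the listing only illustrates the topology, not the trained classifier.

// Sketch of the FER-NN forward pass (27-55-6 topology). Weights are random
// placeholders; in the real system they are learned by backpropagation.
#include <vector>
#include <cmath>
#include <cstdlib>
#include <iostream>

static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One fully connected layer: out[j] = sigmoid(bias[j] + sum_i in[i] * w[j][i])
static std::vector<double> layer(const std::vector<double>& in,
                                 const std::vector<std::vector<double>>& w,
                                 const std::vector<double>& bias) {
    std::vector<double> out(w.size());
    for (size_t j = 0; j < w.size(); ++j) {
        double s = bias[j];
        for (size_t i = 0; i < in.size(); ++i) s += w[j][i] * in[i];
        out[j] = sigmoid(s);
    }
    return out;
}

static std::vector<std::vector<double>> randomWeights(size_t rows, size_t cols) {
    std::vector<std::vector<double>> w(rows, std::vector<double>(cols));
    for (auto& row : w)
        for (auto& v : row) v = (std::rand() / (double)RAND_MAX - 0.5) * 0.1;
    return w;
}

int main() {
    std::vector<double> au(27, 0.0);     // 27 AU activations from the classifiers
    au[3] = au[6] = 1.0;                 // hypothetical detected AUs
    auto w1 = randomWeights(55, 27);     // input -> hidden
    auto w2 = randomWeights(6, 55);      // hidden -> output
    std::vector<double> b1(55, 0.0), b2(6, 0.0);
    auto hidden = layer(au, w1, b1);
    auto out = layer(hidden, w2, b2);    // 6 scores, one per basic emotion
    for (double v : out) std::cout << v << ' ';
    std::cout << '\n';
    return 0;
}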
B. Hand Gesture and Body Posture recognition system

As previously mentioned, a SHGRS was implemented that also includes information related to body postures. The hand gesture recognition system has three basic components: hand segmentation, hand tracking and gesture recognition. The hand itself has two main components: the static hand pose and hand posture (the static hand, independent of the movement of the body) and the hand gesture (the dynamic movement of the hand). Moreover, the hand gesture has two components: global hand motion and local finger motion. Global hand motion refers to changes in the orientation or position of the hand and is analyzed in relation with the global body motion, as they are mutually related. The global body motion analyzes the changes in the upper body posture, referring to the inclination of the body to the left or right, forward or backwards, as well as the rotation of the body towards the left or right. Local finger motion refers to the movement of the fingers while the position of the hand remains unchanged.
Fig. 2 shows the overall architecture of the Gesture Emotion Recognition (GER) system. The first layer contains an AdaBoost block trained to determine whether a frame contains body or hand features, flagging the frame as negative if not. Frames that pass this filter are sent in parallel to Haar-cascade classifiers [11] that return body, hand and finger features. These are fed to three neural networks (NN): the Local Finger Motion Neural Network (LFM-NN), the Global Body Motion Neural Network (GBM-NN) and the Global Hand Motion Neural Network (GHM-NN). The output of each NN is sent to the Gesture Detection Neural Network (GD-NN), which provides the recognized emotion.

Fig. 2. Gesture Emotion Recognition (GER) System
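As an indication of how the Haar-cascade stage can be invoked with OpenCV (the library used in our implementation), the sketch below loads two cascades and discards frames without detections; the cascade file names are placeholders and hand/body cascades are assumed to be available, so this is only indicative of the mechanism, not our exact configuration.

// Indicative use of OpenCV Haar cascades for the first GER layer.
// The cascade XML paths are placeholders; a hand cascade is assumed to have
// been trained separately, which is an assumption of this sketch.
#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::CascadeClassifier bodyCascade, handCascade;
    if (!bodyCascade.load("haarcascade_upperbody.xml") ||
        !handCascade.load("hand_cascade.xml")) {           // placeholder files
        std::cerr << "could not load cascades\n";
        return 1;
    }
    cv::VideoCapture cap(0);
    cv::Mat frame, gray;
    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(gray, gray);
        std::vector<cv::Rect> bodies, hands;
        bodyCascade.detectMultiScale(gray, bodies, 1.1, 3);
        handCascade.detectMultiScale(gray, hands, 1.1, 3);
        if (bodies.empty() && hands.empty())
            continue;                      // frame flagged as negative
        // ... extract Haar-like features from the detected regions and
        //     feed them to the LFM-NN, GBM-NN and GHM-NN ...
    }
    return 0;
}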
As this is a pattern recognition task in a bottom-up architecture, the neural networks considered are feed-forward multilayer perceptrons. All three networks have 16 input nodes, corresponding to 16 body-hand-finger features pre-determined for our training set, and one hidden layer (34 hidden nodes for the GBM-NN, 32 for the GHM-NN and 20 for the LFM-NN). As previously mentioned, we make use of a stochastic context-free grammar (SCFG) containing a set of hand gesture and body posture combinations that were pre-determined from the training database as Haar-like feature combinations related to specific emotions. The GBM-NN, GHM-NN and LFM-NN are trained using backpropagation, and only when all of them are trained and the error rate is low enough is the final GD-NN trained, also using backpropagation. The GD-NN has 18 input nodes (the outputs of the three neural networks), one hidden layer with 32 neurons and 6 output nodes corresponding to the 6 basic emotions.
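The GD-NN input is therefore a simple concatenation of the three 6-dimensional outputs; the sketch below illustrates only this 3 x 6 = 18 assembly, with dummy stand-ins for the trained networks.

// Sketch of how the GD-NN input is assembled from the three motion networks.
// dummyForward stands in for the trained LFM-NN, GBM-NN, GHM-NN and GD-NN;
// only the 3 x 6 -> 18 wiring is the point here.
#include <vector>
#include <iostream>

using Vec = std::vector<double>;

Vec dummyForward(const Vec& in, size_t outputs) {   // placeholder network
    return Vec(outputs, in.empty() ? 0.0 : in[0]);
}

int main() {
    Vec features(16, 0.5);                 // 16 body-hand-finger features
    Vec gdInput;                           // will hold 18 values (3 x 6)
    for (int net = 0; net < 3; ++net) {    // LFM-NN, GBM-NN, GHM-NN
        Vec out = dummyForward(features, 6);
        gdInput.insert(gdInput.end(), out.begin(), out.end());
    }
    Vec emotionScores = dummyForward(gdInput, 6);    // GD-NN: 18 -> 6
    for (double s : emotionScores) std::cout << s << ' ';
    std::cout << '\n';
    return 0;
}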
The module functions in two stages:
- training stage: the Haar-like features are processed and saved in the SCFG grammar, and are also used to train the LFM-NN, GBM-NN, GHM-NN and GD-NN through backpropagation;
- testing stage: the Haar-like features determined from the video testing sample are processed using the SCFG grammar by the GBM-NN, GHM-NN and LFM-NN, and the hand gesture and body posture combination and its corresponding emotion are determined.
C. Overall Architecture

Fig. 3 presents the overall architecture of the system. The video input signal with the recorded subject is fed in parallel to the two modules on different threads. C++ and the OpenCV library were used for developing the classifier algorithms and the neural networks. One last Haar-cascade classifier, the Emotion Recognition Classifier (ERC), analyzes the output of the two modules and provides the final recognized emotion.

Fig. 3. Proposed model - Overview
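A simplified sketch of this parallel arrangement is given below; runFER, runGER and combineERC are hypothetical stand-ins (the actual ERC is the Haar-cascade classifier described above, not the simple score sum used in this stub).

// Sketch of running the FER and GER modules on separate threads and combining
// their outputs. The three functions are stubs standing in for the real modules.
#include <thread>
#include <vector>
#include <string>
#include <iostream>

using Scores = std::vector<double>;   // 6 emotion scores

Scores runFER(const std::string& videoPath) { return Scores(6, 0.5); }  // stub
Scores runGER(const std::string& videoPath) { return Scores(6, 0.5); }  // stub
int combineERC(const Scores& fer, const Scores& ger) {                  // stub
    int best = 0;
    for (int e = 1; e < 6; ++e)
        if (fer[e] + ger[e] > fer[best] + ger[best]) best = e;
    return best;                      // index of the recognized emotion
}

int main() {
    std::string video = "subject01.avi";        // hypothetical input
    Scores fer, ger;
    std::thread tFer([&] { fer = runFER(video); });
    std::thread tGer([&] { ger = runGER(video); });
    tFer.join();
    tGer.join();
    std::cout << "recognized emotion index: " << combineERC(fer, ger) << '\n';
    return 0;
}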
IV. EXPERIMENTAL RESULTS

In order to test the system we built our own database containing recordings of the facial expressions, hand gestures and body postures of 64 subjects while watching 6 emotion-inducing videos, one for each of the 6 basic emotions (fear, anger, surprise, happiness, sadness, disgust). The 64 subjects participating in the experiment are 32 males and 32 females, with ages between 18 and 35. The protocol used was sitting, with the hands in view, and only the upper body was recorded. An example is shown in Fig. 4.

Fig. 4. Emotional reactions - upper body recording: (a) Anger; (b) Sadness; (c) Happiness; (d) Disgust
The videos used for inducing the 6 basic emotions are taken from the discrete LIRIS-ACCEDE database [17]. Because our system needs longer videos with enough frames for training, we combined similar 8-12 second videos pertaining to the same emotion from the LIRIS-ACCEDE database into one 1-minute inducing video for each emotion.

All the neural networks and classifiers were trained with frames from these recordings in an inter-subject methodology. First, the FER classifiers were trained on the Cohn-Kanade database [18] and tested on the MMI database [19] in order to achieve over 90% accuracy in a cross-database test for each of the 27 AUs. With the classifiers trained, we proceeded with training the FER-NN on recordings from our own database so that it outputs the emotion that was supposed to be induced by the video watched by the subject. The FER-NN was trained on video samples from 63 of the 64 subjects in an inter-subject methodology with a leave-one-out approach. The next step was to determine the Haar-like features for hand gestures and body postures from the recorded frames of the same 63 subjects and use them to construct the SCFG. Then the GHM-NN, GBM-NN, LFM-NN and GD-NN were trained in a similar manner to the FER-NN, so that for the recorded reaction of each subject the networks output the emotion induced by that video. After all these modules were trained on 63 subjects, the entire system was tested on the remaining subject in an inter-subject methodology with a leave-one-out approach. 64 tests were conducted for studying the inter-subject variability and the recognition rates.
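The leave-one-out protocol over the 64 subjects can be summarized by the loop below; testOn is a hypothetical stand-in for the full training and testing pipeline described above.

// Sketch of the inter-subject leave-one-out protocol over 64 subjects.
// testOn stands in for training all networks on 63 subjects and testing on
// the remaining one, as described in the text.
#include <vector>
#include <iostream>

bool testOn(int subject, const std::vector<int>& trainingSet) {
    (void)subject; (void)trainingSet;
    return true;                       // stub: returns whether the emotion matched
}

int main() {
    const int numSubjects = 64;
    int correct = 0;
    for (int leftOut = 0; leftOut < numSubjects; ++leftOut) {
        std::vector<int> trainingSet;              // the other 63 subjects
        for (int s = 0; s < numSubjects; ++s)
            if (s != leftOut) trainingSet.push_back(s);
        // ... train FER-NN, LFM-NN, GBM-NN, GHM-NN and GD-NN on trainingSet ...
        if (testOn(leftOut, trainingSet)) ++correct;
    }
    std::cout << "recognition rate: " << 100.0 * correct / numSubjects << "%\n";
    return 0;
}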
First we tested the GER system alone to verify its accuracy. We initially included only 2 combinations of hand gestures and body postures for each emotion in the SCFG, and then increased this number to 4 and 6 to verify its degree of generalization. The results are detailed in Table 1. We can see that in the controlled scenario (trained and tested on the same combinations), as the number of combinations in the SCFG increases, the accuracy increases from 65.1% (SCFG with 2 combinations) to 75% (SCFG with 8 combinations). Also, in the uncontrolled scenario (trained on one set of combinations, tested on the other set), we observed a similar increase from 54.5% (SCFG with 2 combinations, tested on the remaining 6) to 60.4% (trained on 6 combinations, tested on the remaining 2). This indicates that as the number of hand gesture and body posture combinations in the SCFG increases, the system's degree of generalization increases.

TABLE 1: EMOTION RECOGNITION RATES FOR DIFFERENT NUMBERS OF COMBINATIONS IN THE SCFG GRAMMAR.

Combinations in SCFG per emotion   Recognition rate (%), controlled scenario   Recognition rate (%), random scenario
2                                  65.1                                        54.5
4                                  67.5                                        56.3
6                                  70.0                                        60.4
8                                  75.0                                        -
With the SCFG containing all 8 combinations, we performed a new inter-subject leave-one-out validation test, analyzing which of the 6 emotions is recognized by the GER system with the highest accuracy. Table 2 shows that the highest sensitivity is recorded for Fear (86%), followed by Anger and Happiness (77%), although the specificity in these cases is not particularly good.

TABLE 2: TRUE POSITIVE AND TRUE NEGATIVE RATES FOR THE GER SYSTEM.

Emotion     True positive rate (%)   True negative rate (%)
Anger       77                       64
Happiness   77                       65
Sadness     73                       68
Surprise    73                       70
Disgust     64                       72
Fear        86                       62
The lowest sensitivity is obtained for Disgust, showing that hand gesture and body posture combinations do not successfully express this type of emotion. The inter-subject coefficient of variance (calculated as the ratio between the standard deviation and the mean value) is also high, reaching 14.5%.

Next we tested the combined FER-GER system and compared it with the standard FER system, using the same inter-subject methodology and leave-one-out approach for cross-validation. The results are detailed in Table 3.
When using facial expressions together with body postures and hand gestures, the recognition rates increased for Anger (by up to 6%), Sadness (4%) and Fear (4.5%). This shows that the GER system introduces new emotional features compared to the facial-expression ones. For Happiness and Surprise the differences were not impressive, mainly because the combinations considered in the SCFG do not bring anything new compared to facial expressions. For Disgust, the recognition rate was not improved at all, which was expected considering the low GER recognition rates for Disgust. Moreover, the inter-subject coefficient of variance decreased to 6.2%.

TABLE 3: RECOGNITION RATES FOR THE STANDARD FER AND FOR THE FER-GER SYSTEMS.

Emotion     Recognition rate (%), standard FER   Recognition rate (%), FER + GER
Anger       81.6                                 87.8
Happiness   88.5                                 89.2
Sadness     80.5                                 84.4
Surprise    88.4                                 89.0
Disgust     76.5                                 76.7
Fear        85.3                                 89.7
Compared with the state-of-the-art, the system showed higher recognition accuracies than those reported in current research. Comparable results are detailed in Table 4.

TABLE 4: COMPARISON WITH THE STATE-OF-THE-ART.

Research                            Recognition rate (%)
Koelstra and Pantic - 2010 [20]     70.25
Valstar and Pantic - 2012 [12]      72
Gavrilescu - 2014 [10]              83.5
This work                           86.4
The system was implemented on a testbed with a 3.4 GHz i7 processor and 8 GB of RAM. The average time needed to compute an emotion from a 1-minute video was 20 seconds.
V. CONCLUSION

An emotion recognition system based on facial expressions, hand gestures and body postures was presented. We showed that the GER system can determine Anger, Happiness and Fear with accuracies of over 75% and that increasing the number of hand gesture and body posture combinations in the SCFG increases the system's degree of generalization to new combinations. Tested together with the FER system, we showed that for Anger, Sadness and Fear the GER system improves the standard FER by up to 5%, suggesting that hand gestures and body postures contain emotional information that cannot be acquired from facial expressions. For Happiness and Surprise the increase in recognition rate is not high, hence hand gestures and body postures for these emotions do not add new features compared to the facial ones. As a drawback, the recognition rates for Disgust were low, hence we need to consider another approach for this emotion. Moreover, no other way of measuring emotion was included in our system, which might have decreased the classification accuracy; considering an additional measure of emotion, alongside enriching the SCFG, could therefore increase the accuracy, and this is the subject of our future research.

REFERENCES

[1] Shibata, T., Kijima, Y., "Emotion recognition modeling of sitting postures by using pressure sensors and accelerometers", 21st International Conference on Pattern Recognition, pp. 1124-1127, Tsukuba, 2012.
[2] Radeta, M., Maiocchi, M., "Towards automatic and unobtrusive recognition of primary-process emotions in body postures", 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), pp. 695-700, Geneva, 2013.
[3] Gunes, H., Piccardi, M., "Fusing face and body gesture for machine recognition of emotions", IEEE International Workshop on Robot and Human Interactive Communication, pp. 306-311, Nashville, August 2005.
[4] Kipp, M., Martin, J.-C., "Gesture and emotion: can basic gestural form features discriminate emotions?", 3rd International Conference on Affective Computing and Intelligent Interaction, pp. 1-8, Amsterdam, September 2009.
[5] Glowinski, D., Dael, N., Camurri, A., Volpe, G., Mortillaro, M., Scherer, K., "Toward a Minimal Representation of Affective Gestures", IEEE Transactions on Affective Computing, 2(2), pp. 106-118, April 2011.
[6] Meeren, H. K. M., van Heijnsbergen, C. C. R. J., de Gelder, B., "Rapid perceptual integration of facial expression and emotional body language", Proceedings of the National Academy of Sciences of the USA, 102(45), pp. 16518-16523, November 2005.
[7] Ekman, P., Friesen, W. V., Facial Action Coding System: Investigator's Guide, Consulting Psychologists Press, CA, 1978.
[8] Shanis, J. M., Hedge, A., "Comparison of mouse, touchpad and multitouch input technologies", Proceedings of the Human Factors and Ergonomics Society, pp. 746-750, Denver, October 2003.
[9] Chen, Q., Georganas, N. D., Petriu, E. M., "Hand gesture recognition using Haar-like features and stochastic context-free grammar", IEEE Transactions on Instrumentation and Measurement, 55(8), pp. 1562-1571, April 2008.
[10] Gavrilescu, M., "Proposed architecture of a fully integrated Modular Neural Network-based Facial Emotion Recognition system based on Facial Action Coding System (FACS)", 10th International Conference on Communications, pp. 1-6, Bucharest, May 2014.
[11] Wilson, P. I., Fernandez, J., "Facial feature detection using Haar classifiers", Journal of Computing Sciences in Colleges, 21(4), pp. 127-133, April 2006.
[12] Valstar, M., Pantic, M., "Fully Automatic Facial Action Unit Detection and Temporal Phases of Facial Actions", IEEE Transactions on Systems, Man, and Cybernetics, 42(1), pp. 28-43, February 2012.
[13] Lien, J. J.-J., Kanade, T., Cohn, J. F., Li, C. C., "Detection, Tracking, and Classification of Action Units in Facial Expression", Journal of Robotics and Autonomous Systems, 31(3), pp. 131-146, 1999.
[14] Pantic, M., Tomc, M., Rothkrantz, L. J. M., "A hybrid approach to mouth features detection", Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1188-1193, Tucson, Arizona, October 2001.
[15] Jiang, F., Ekenel, H. K., Shi, B. E., "Efficient and robust integration of face detection and head pose estimation", 21st International Conference on Pattern Recognition (ICPR), pp. 1578-1581, Tsukuba, 11-15 November 2012.
[16] Viola, P., Jones, M., "Robust real-time face detection", IEEE International Conference on Computer Vision, vol. 2, pp. 747, Vancouver, 2001.
[17] Baveye, Y., Dellandrea, E., Chamaret, C., Chen, L., "LIRIS-ACCEDE: A Video Database for Affective Content Analysis", IEEE Transactions on Affective Computing, 6(1), pp. 43-55, 2015.
[18] Kanade, T., Cohn, J. F., Tian, Y., "Comprehensive database for facial expression analysis", 4th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 46-53, Grenoble, 2000.
[19] Pantic, M., Valstar, M. F., Rademaker, R., Maat, L., "Web-based database for facial expression analysis", IEEE International Conference on Multimedia and Expo, pp. 317-321, Amsterdam, July 2005.
[20] Koelstra, S., Pantic, M., Patras, I., "A dynamic texture based approach to recognition of facial actions and their temporal models", IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(11), pp. 1940-1954, November 2010.
