
Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks

Guided by: Ms. Anjusree V K

Done by: Reshma Sudarsan
S7 CSE Gamma
Roll no: 33
1
Video-Based Emotion Recognition
using CNN-RNN and C3D Hybrid
Networks

 Authors: Yin Fan, Xiangju Lu, Dian Li, Yuanliu Liu

 Year of Publication: 2016

 Published in: 18th ACM International Conference on


Multimodal Interaction
Tokyo, Japan
November 12-16, 2016

2
CONTENTS

 Introduction
 Overview of the system

 Convolutional Neural Network

 Recurrent Neural Network

 LSTM

 C3D

 Experiment

 Conclusion

 References

3
INTRODUCTION

 Emotions play a major role in human interaction.

 Automatic emotion recognition is a challenging task.

 The EmotiW challenge (Emotion Recognition in the Wild)
has been held since 2013.

 The paper proposes video-based emotion recognition using a
hybrid of CNN-RNN and 3D convolutional networks (C3D),
combined with an audio module.

4
OVERVIEW OF THE SYSTEM

5
CONVOLUTIONAL NEURAL NETWORK

 Uses: image classification, face recognition, etc.

 Processes the features of an image and classifies it.

 The steps involved are: convolution, max pooling,
flattening, and full connection.

6
CONTD…

 The convolution operation is performed to extract the relevant features.

 The spatial size of the image is reduced.
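As a minimal sketch (not the paper's code), a "valid" 2D convolution of an image with a kernel can be written in plain NumPy; note how the resulting feature map is smaller than the input:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: slide the kernel over the image and
    take the sum of element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # simple vertical-edge detector
fmap = conv2d(image, edge_kernel)
print(fmap.shape)  # (4, 4): the feature map is smaller than the input
```

The 5x5 input shrinks to a 4x4 feature map because the 2x2 kernel only fits in 4x4 positions.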

7
CONTD…

 Different kernels are applied to extract different important features.

 Each kernel produces a different feature map.

8
CONTD…

ReLU (Rectified Linear Unit)

 The rectifier function is applied to increase non-linearity.
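A minimal illustration: ReLU simply replaces negative activations with zero, which is what introduces the non-linearity:

```python
import numpy as np

def relu(x):
    # Rectifier: max(0, x) element-wise; negative values are clipped to zero
    return np.maximum(0, x)

fmap = np.array([[-2.0, 3.0], [0.5, -0.1]])
print(relu(fmap))  # negatives become 0, positives pass through unchanged
```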

9
CONTD…
 Max pooling provides spatial invariance.
 The image size is reduced.
 It gives the neural network the flexibility to find distorted features.
 Important features are preserved.
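A 2x2 max pooling with stride 2 can be sketched in NumPy with a reshape trick; the output keeps the strongest response in each window, so features are preserved while the size halves:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2. Assumes even height and width."""
    h, w = x.shape
    # Group pixels into non-overlapping 2x2 blocks, then take each block's max
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1.0, 2.0, 5.0, 6.0],
                 [3.0, 4.0, 7.0, 8.0],
                 [9.0, 1.0, 2.0, 1.0],
                 [0.0, 2.0, 0.0, 3.0]])
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4. 8.] [9. 3.]]: each value is the max of one 2x2 block
```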

10
CONTD…

 Flattening converts the pooled feature maps into a vector
that is given as input to the neural network.

11
CONTD…

 Every neuron in one layer is connected to every neuron in
the next layer.
 The flattened vector goes through a fully connected layer to
classify the image.
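As an illustration (the sizes and random weights here are made up), the pooled maps are flattened into a vector and passed through a fully connected layer, which is just a matrix multiply plus a bias:

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.standard_normal((2, 2, 3))  # e.g. 3 pooled feature maps of size 2x2
x = pooled.flatten()                     # flattening: a 12-dimensional input vector

n_classes = 7                            # e.g. the 7 emotion classes
W = rng.standard_normal((n_classes, x.size))  # every input connects to every output
b = np.zeros(n_classes)

logits = W @ x + b                       # fully connected layer
print(logits.shape)  # (7,): one raw score per class
```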

12
CONTD…

The softmax function is applied so that the predicted class
probabilities sum to 1.
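A numerically stable softmax sketch; by construction the outputs are positive and sum to 1:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())  # 1.0
```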

13
RECURRENT NEURAL NETWORK

 RNNs are widely used in translation, handwriting
recognition, speech recognition, etc.
 An RNN predicts the next output of a sequence by taking
information from the previous steps and combining it
with the input of the current step.

14
CONTD…

 Given an input sequence (x1, x2, ..., xn), an RNN computes
the output sequence (y1, y2, ..., yn) via the following equations:

ht = g(Wxh xt + Whh ht-1 + bh)
yt = Why ht + by

where g is the hidden activation function, such as the sigmoid or
hyperbolic tangent, and ht is the hidden state at time t.
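The recurrence described above can be sketched directly in NumPy (the dimensions and random weights are illustrative, with g = tanh):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 3

# Parameters of a vanilla RNN, randomly initialized for illustration
Wxh = rng.standard_normal((d_hid, d_in)) * 0.1
Whh = rng.standard_normal((d_hid, d_hid)) * 0.1
Why = rng.standard_normal((d_out, d_hid)) * 0.1
bh, by = np.zeros(d_hid), np.zeros(d_out)

def rnn_forward(xs):
    """Compute (y1, ..., yn) from (x1, ..., xn)."""
    h = np.zeros(d_hid)
    ys = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # hidden state update
        ys.append(Why @ h + by)              # output at this time step
    return ys

seq = [rng.standard_normal(d_in) for _ in range(5)]
ys = rnn_forward(seq)
print(len(ys), ys[0].shape)  # 5 (3,)
```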

15
CONTD…

 RNNs act like short-term memory.

 RNNs have difficulty learning long-term
dependencies due to the vanishing and exploding
gradient problems.

16
LSTM : A SPECIAL RNN

 Long Short-Term Memory RNN.

 An LSTM learns from experience.

 Used when there are long time lags between important events.

 LSTM mitigates the vanishing and exploding
gradient problems.

17
CONTD…

An LSTM unit is composed of a cell, an input gate,
an output gate and a forget gate.

The gates determine when the input is significant enough to
remember, when the unit should continue to remember or
forget the value, and when the unit should output the value.

18
CONTD…

A simple LSTM block with input, output and forget gates.

Forget gate:
ft = σ(Wf · [ht-1, xt] + bf)

where
Wf = weight matrix
ht-1 = output from the previous time step
xt = new input
bf = bias
19
CONTD…
Input gate:
it = σ(Wi · [ht-1, xt] + bi)
C̃t = tanh(WC · [ht-1, xt] + bC)

Updating the memory cell:
Ct = ft * Ct-1 + it * C̃t

Output gate:
ot = σ(Wo · [ht-1, xt] + bo)
ht = ot * tanh(Ct)
20
CONTD…

Sigmoid Function and Hyperbolic Tangent Function (tanh)

 The sigmoid function squashes its output into the range 0 to 1.

 The range of the tanh function is -1 to 1.

21
3D CONVOLUTIONAL NEURAL
NETWORKS(C3D)

• C3D preserves the temporal information of the input
signals by adding a time dimension.
• It models appearance and motion simultaneously.
• C3D is generic, compact, simple, and efficient.
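A minimal sketch of the idea (not the paper's implementation): a 3D convolution slides the kernel over time as well as space, so the output keeps a (reduced) temporal dimension instead of collapsing the frames:

```python
import numpy as np

def conv3d(volume, kernel):
    """'Valid' 3D convolution over a (time, height, width) volume."""
    kt, kh, kw = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(volume[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

clip = np.ones((16, 6, 6))          # e.g. 16 frames of tiny 6x6 "face" crops
kernel = np.ones((3, 3, 3)) / 27.0  # 3x3x3 averaging kernel (C3D uses 3x3x3 kernels)
fmap = conv3d(clip, kernel)
print(fmap.shape)  # (14, 4, 4): the time dimension survives (16 -> 14)
```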

22
SUPPORT VECTOR MACHINE

 Support vector machines are supervised learning
models used for classification and regression analysis.

 An SVM constructs hyperplanes which are used for
classification.

 Many separating hyperplanes are possible; the SVM chooses
the one with the maximum margin.

 A hyperplane is a subspace whose dimension is one less
than that of the feature space.
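The geometry can be illustrated in NumPy: a hyperplane w·x + b = 0 splits the space (in 2D it is a line, one dimension less than the feature space), points are classified by the sign of w·x + b, and their distance to the hyperplane is |w·x + b| / ||w||. The values of w and b below are made up:

```python
import numpy as np

w = np.array([1.0, -1.0])  # normal vector of the hyperplane (illustrative)
b = 0.0

def classify(x):
    # The sign of the signed distance decides the class
    return 1 if w @ x + b >= 0 else -1

def distance(x):
    # Distance of a point from the hyperplane w.x + b = 0
    return abs(w @ x + b) / np.linalg.norm(w)

print(classify(np.array([2.0, 0.0])))  # 1
print(classify(np.array([0.0, 2.0])))  # -1
print(distance(np.array([2.0, 0.0])))  # ~1.414
```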

23
CONTD…

Hyperplanes

24
CONTD…

SVM: the maximum-margin hyperplane

25
EXPERIMENT

 The CNN-RNN, C3D and audio SVM models were trained
separately, and their prediction scores were combined
into a final score.

 Challenge: video-based emotion recognition on the AFEW
6.0 dataset.

 Data preprocessing: faces are extracted from the video
frames and aligned.
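The fusion of the separately trained models can be sketched as a weighted sum of their per-class prediction scores. The scores and fusion weights below are made up for illustration; they are not the values used in the paper:

```python
import numpy as np

# Made-up per-class scores (7 emotion classes) from three separate models
cnn_rnn_scores = np.array([0.10, 0.50, 0.05, 0.10, 0.10, 0.10, 0.05])
c3d_scores     = np.array([0.20, 0.40, 0.10, 0.05, 0.10, 0.10, 0.05])
audio_scores   = np.array([0.15, 0.30, 0.20, 0.05, 0.10, 0.10, 0.10])

# Weighted sum of the model scores (weights here are illustrative)
weights = [0.5, 0.3, 0.2]
final = (weights[0] * cnn_rnn_scores
         + weights[1] * c3d_scores
         + weights[2] * audio_scores)

predicted_class = int(np.argmax(final))
print(predicted_class)  # 1: the class all three models score highest
```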

26
CONTD…

For the CNN-RNN network used in the paper:

The CNN features of the faces are taken from the fc6 layer of
the VGG16-Face model, fine-tuned on the FER2013 facial
emotion database.

The features of 16 sequential faces are given as input to the LSTM.

27
CONTD…

For the C3D network used in the paper:

 A sequence of 16 consecutive faces from each video
clip is chosen as the input.

 The C3D net has 8 convolution layers, 5 max-pooling layers, and 2
fully connected layers, followed by a softmax output
layer.

C3D Architecture

28
CONTD…

Accuracies when different numbers of CNN-RNN models
and C3D models are merged together:

29
CONTD…

Confusion matrices on AFEW6.0 dataset:

30
CONTD…

31
CONCLUSION

 Although CNN-RNN and C3D can each recognize
emotions from video separately, combining them
gave better accuracy.

 Overall accuracy of 59.02%.

 Winner of the EmotiW 2016 challenge.

32
REFERENCES
 [1] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. 2015. Learning spatiotemporal
features with 3d convolutional networks. In 2015 IEEE International Conference on Computer
Vision (ICCV) .4489-4497. IEEE.
 [2] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S.,
Saenko, K., & Darrell, T. 2015. Long-term recurrent convolutional networks for visual
recognition and description. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2625-2634.
 [3] Yao, A., Shao, J., Ma, N. and Chen, Y. 2015. Capturing AU-Aware Facial Features and
Their Latent Relations for Emotion Recognition in the Wild. ACM ICMI.
 [4] Eyben, F., Wöllmer, M., & Schuller, B. (2010, October). Opensmile: the Munich versatile
and fast open-source audio feature extractor. In Proceedings of the 18th ACM International
Conference on Multimedia. 1459-1462. ACM.
 [5] Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R., and Pal, C. 2015. Recurrent
neural networks for emotion recognition in video. In Proceedings of the 2015 ACM on
International Conference on Multimodal Interaction. 467-474. ACM.
 [6] Dhall, A., Goecke, R., Lucey, S. and Gedeon, T. 2012. Collecting large, richly annotated
facial-expression databases from movies. IEEE Multimedia.
 [7] Liu, M., Wang, R., Li, S., Shan, S., Huang, Z. and Chen, X. 2014. Combining Multiple
Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild. ACM ICMI.

33
REFERENCES

 [8] Dhall, A., Goecke, R. and Gedeon, T. 2015. Automatic Group Happiness Intensity
Analysis. IEEE Transaction on Affective Computing.
 [9] Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç, Memisevic, R.
and Mirza, M. 2013. Combining modality specific deep neural networks for emotion
recognition in video. In Proceedings of the 15th ACM on International conference on
multimodal interaction. 543-550. ACM.
 [10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S.
Guadarrama, and T. Darrell. 2014. Caffe: Convolutional architecture for fast feature
embedding. In ACM MM.
 [11] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D. and
Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition.1-9.
 [12] He, K., Zhang, X., Ren, S. and Sun, J. 2015. Deep residual learning for image
recognition. arXiv preprint arXiv:1512.03385.
 [13] Simonyan, K., & Zisserman, A. 2014. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556.

34
REFERENCES

 [14] Parkhi, O. M., Vedaldi, A., & Zisserman, A. 2015. Deep face recognition. In
British Machine Vision Conference (Vol. 1, No. 3, p. 6).
 [15] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K. and Li, F. F. 2009. Imagenet: A
large-scale hierarchical image database. In Computer Vision and Pattern
Recognition. CVPR. 248-255. IEEE.
 [16] Carrier, P. L., Courville, A., Goodfellow, I. J., Mirza, M. and Bengio, Y. 2013. FER-
2013 face database. Technical report, 1365, Université de Montréal.
 [17] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H. and
Schmidhuber, J. 2009. A novel connectionist system for unconstrained handwriting
recognition. IEEE transactions on pattern analysis and machine intelligence, 31(5),
855-868.
 [18] Sak, H., Senior, A. W. and Beaufays, F. 2014. Long short-term memory recurrent
neural network architectures for large scale acoustic modeling. In INTERSPEECH.
338-342.
 [19] Kim, B. K., Dong, S. Y., Roh, J., Kim, G. and Lee, S. Y. 2016. Fusing Aligned and
Non-Aligned Face Information for Automatic Affect Recognition in the Wild: A Deep
Learning Approach. In Computer Vision and Pattern Recognition. CVPR.

35
REFERENCES

 [20] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R. and Li, F.F. 2014.
Large-scale Video Classification with Convolutional Neural Networks.
 [21] Ng, J., Hausknecht, M., Vijayanarasimhan, S., Monga, R., Vinyals, O. and Toderici,
G. 2015. Beyond Short Snippets: Deep Networks for Video Classification. In
Computer Vision and Pattern Recognition. CVPR. 4694-4702. IEEE.
 [22] Sharma, S., Kiros, R. and Salakhutdinov, R. 2016. Action Recognition using Visual
Attention. Workshop track - ICLR.
 [23] Kaya, H., Gürpinar, F., Afshar, S. and Salah, A. A. 2015. Contrasting and
Combining Least Squares Based Learners for Emotion Recognition in the Wild. In
Proceedings of the 2015 ACM on International Conference on Multimodal Interaction.
459-466. ACM.
 [24] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T. and
Saenko, K. 2015. Sequence to sequence: video to text. In Proceedings of the IEEE
International Conference on Computer Vision. 4534-4542.
 [25] Pan, P., Xu, Z., Yang, Y., Wu, F. and Zhuang, Y. 2015. Hierarchical Recurrent
Neural Encoder for Video Representation with Application to Captioning. arXiv
preprint arXiv:1511.03476.

36
REFERENCES

 [26] Graves, A., Mohamed, A. R. and Hinton, G. 2013. Speech recognition with deep
recurrent neural networks. In 2013 IEEE international conference on acoustics,
speech and signal processing. 6645-6649. IEEE.
 [27] Dhall, A., Goecke, R., Joshi, J., Hoey, J. and Gedeon, T. 2016. EmotiW 2016:
Video and Group-level Emotion Recognition Challenges, ACM ICMI 2016.
 [28] Jianguo, L., Tao, W. and Yimin, Z. 2011. Face Detection using SURF Cascade.
In ICCV Computer Vision Workshops.
 [29] Fernández, S., Graves, A., Schmidhuber, J. 2007. An application of recurrent
neural networks to discriminative keyword spotting. In International Conference on
Artificial Neural Networks. 220-229. Springer Berlin Heidelberg.

37
