
Video-Based Emotion Recognition using CNN-RNN and C3D Hybrid Networks

Guided by: Ms. Anjusree V K

Done by: Reshma Sudarsan
S7 CSE Gamma
Roll no: 33
1
Video-Based Emotion Recognition
using CNN-RNN and C3D Hybrid
Networks

 Authors: Yin Fan, Xiangju Lu, Dian Li, Yuanliu Liu

 Year of Publication: 2016

 Published in: 18th ACM International Conference on


Multimodal Interaction
Tokyo, Japan
November 12-16, 2016

2
CONTENTS

 Introduction
 Overview of the system

 Convolutional Neural Network

 Recurrent Neural Network

 LSTM

 C3D

 Experiment

 Conclusion

 References

3
INTRODUCTION

 Emotions play a major role in human interaction.

 Automatic emotion recognition is a challenging task.

 The EmotiW challenge (Emotion Recognition in the Wild)
has been held since 2013.

 The paper proposes video-based emotion recognition using a
hybrid of CNN-RNN and 3D convolutional networks (C3D),
combined with an audio module.

4
OVERVIEW OF THE SYSTEM

5
CONVOLUTIONAL NEURAL NETWORK

 Uses: image classification, face recognition, etc.

 Processes the features of an image and classifies it.

 The steps involved are: convolution, max pooling,
flattening, and full connection.

6
CONTD…

 The convolution operation is performed to extract the relevant features.

 The spatial size of the image is reduced.
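As a minimal sketch (not the paper's code), a "valid" 2D convolution of an image with a kernel can be written in plain NumPy; note how the resulting feature map is smaller than the input:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: slide the kernel over the image and
    take the sum of element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # simple vertical-edge detector
fmap = conv2d(image, edge_kernel)
print(fmap.shape)  # (4, 4): the feature map is smaller than the input
```

The 5x5 input shrinks to a 4x4 feature map because the 2x2 kernel only fits in 4x4 positions.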

7
CONTD…

 Different kernels are applied to extract different important features.

 Each kernel produces a different feature map.

8
CONTD…

ReLU (Rectified Linear Unit)

 The rectifier function is applied to increase non-linearity.
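A minimal illustration: ReLU simply replaces negative activations with zero, which is what introduces the non-linearity:

```python
import numpy as np

def relu(x):
    # Rectifier: max(0, x) element-wise; negative values are clipped to zero
    return np.maximum(0, x)

fmap = np.array([[-2.0, 3.0], [0.5, -0.1]])
print(relu(fmap))  # negatives become 0, positives pass through unchanged
```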

9
CONTD…
 Max pooling provides spatial invariance.
 The image size is reduced.
 It gives the neural network the flexibility to find distorted features.
 Important features are preserved.
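A 2x2 max pooling with stride 2 can be sketched in NumPy with a reshape trick; the output keeps the strongest response in each window, so features are preserved while the size halves:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2. Assumes even height and width."""
    h, w = x.shape
    # Group pixels into non-overlapping 2x2 blocks, then take each block's max
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1.0, 2.0, 5.0, 6.0],
                 [3.0, 4.0, 7.0, 8.0],
                 [9.0, 1.0, 2.0, 1.0],
                 [0.0, 2.0, 0.0, 3.0]])
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4. 8.] [9. 3.]]: each value is the max of one 2x2 block
```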

10
CONTD…

 Flattening converts the pooled feature maps into a vector
that is given as input to the neural network.

11
CONTD…

 Every neuron in one layer is connected to every neuron in
the next layer.
 The flattened vector goes through a fully connected layer to
classify the image.
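As an illustration (the sizes and random weights here are made up), the pooled maps are flattened into a vector and passed through a fully connected layer, which is just a matrix multiply plus a bias:

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.standard_normal((2, 2, 3))  # e.g. 3 pooled feature maps of size 2x2
x = pooled.flatten()                     # flattening: a 12-dimensional input vector

n_classes = 7                            # e.g. the 7 emotion classes
W = rng.standard_normal((n_classes, x.size))  # every input connects to every output
b = np.zeros(n_classes)

logits = W @ x + b                       # fully connected layer
print(logits.shape)  # (7,): one raw score per class
```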

12
CONTD…

The softmax function is applied so that the predicted class
probabilities sum to 1.
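A numerically stable softmax sketch; by construction the outputs are positive and sum to 1:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())  # 1.0
```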

13
RECURRENT NEURAL NETWORK

 RNNs are widely used in translation, handwriting
recognition, speech recognition, etc.
 An RNN predicts the next output of a sequence by taking
information from the previous steps and combining it
with the input of the current step.

14
CONTD…

 Given an input sequence (x1, x2, ..., xn), an RNN computes
the output sequence (y1, y2, ..., yn) via the following equations:

ht = g(Wxh xt + Whh ht-1 + bh)
yt = Why ht + by

where g is the hidden activation function, such as the sigmoid or
hyperbolic tangent, and ht is the hidden state at time t.
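The recurrence described above can be sketched directly in NumPy (the dimensions and random weights are illustrative, with g = tanh):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 3

# Parameters of a vanilla RNN, randomly initialized for illustration
Wxh = rng.standard_normal((d_hid, d_in)) * 0.1
Whh = rng.standard_normal((d_hid, d_hid)) * 0.1
Why = rng.standard_normal((d_out, d_hid)) * 0.1
bh, by = np.zeros(d_hid), np.zeros(d_out)

def rnn_forward(xs):
    """Compute (y1, ..., yn) from (x1, ..., xn)."""
    h = np.zeros(d_hid)
    ys = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)  # hidden state update
        ys.append(Why @ h + by)              # output at this time step
    return ys

seq = [rng.standard_normal(d_in) for _ in range(5)]
ys = rnn_forward(seq)
print(len(ys), ys[0].shape)  # 5 (3,)
```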

15
CONTD…

 RNNs act like short-term memory.

 RNNs have difficulty learning long-term
dependencies due to the vanishing and exploding
gradient problems.

16
LSTM : A SPECIAL RNN

 Long Short-Term Memory RNN.

 An LSTM learns from experience.

 Used when there are long time lags between important events.

 LSTM mitigates the vanishing and exploding
gradient problems.

17
CONTD…

An LSTM unit is composed of a cell, an input gate,
an output gate and a forget gate.

The gates determine when the input is significant enough to
remember, when the unit should continue to remember or
forget the value, and when the unit should output the value.

18
CONTD…

A simple LSTM block with input, output and forget gates.

Forget gate:
ft = σ(Wf · [ht-1, xt] + bf)

where
Wf = weight matrix
ht-1 = output from the previous time step
xt = new input
bf = bias
19
CONTD…
Input gate:
it = σ(Wi · [ht-1, xt] + bi)
C̃t = tanh(WC · [ht-1, xt] + bC)

Updating the memory cell:
Ct = ft * Ct-1 + it * C̃t

Output gate:
ot = σ(Wo · [ht-1, xt] + bo)
ht = ot * tanh(Ct)
20
CONTD…

Sigmoid Function and Hyperbolic Tangent Function (tanh)

 The sigmoid function squashes its output into the range 0 to 1.

 The range of the tanh function is -1 to 1.

21
3D CONVOLUTIONAL NEURAL
NETWORKS(C3D)

• C3D preserves the temporal information of the input
signals by adding a time dimension.
• It models appearance and motion simultaneously.
• C3D is generic, compact, simple, and efficient.
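A minimal sketch of the idea (not the paper's implementation): a 3D convolution slides the kernel over time as well as space, so the output keeps a (reduced) temporal dimension instead of collapsing the frames:

```python
import numpy as np

def conv3d(volume, kernel):
    """'Valid' 3D convolution over a (time, height, width) volume."""
    kt, kh, kw = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(volume[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

clip = np.ones((16, 6, 6))          # e.g. 16 frames of tiny 6x6 "face" crops
kernel = np.ones((3, 3, 3)) / 27.0  # 3x3x3 averaging kernel (C3D uses 3x3x3 kernels)
fmap = conv3d(clip, kernel)
print(fmap.shape)  # (14, 4, 4): the time dimension survives (16 -> 14)
```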

22
SUPPORT VECTOR MACHINE

 Support vector machines are supervised learning
models used for classification and regression analysis.

 An SVM constructs hyperplanes which are used for
classification.

 Many separating hyperplanes are possible; the SVM chooses
the one with the maximum margin.

 A hyperplane is a subspace whose dimension is one less
than that of the feature space.
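The geometry can be illustrated in NumPy: a hyperplane w·x + b = 0 splits the space (in 2D it is a line, one dimension less than the feature space), points are classified by the sign of w·x + b, and their distance to the hyperplane is |w·x + b| / ||w||. The values of w and b below are made up:

```python
import numpy as np

w = np.array([1.0, -1.0])  # normal vector of the hyperplane (illustrative)
b = 0.0

def classify(x):
    # The sign of the signed distance decides the class
    return 1 if w @ x + b >= 0 else -1

def distance(x):
    # Distance of a point from the hyperplane w.x + b = 0
    return abs(w @ x + b) / np.linalg.norm(w)

print(classify(np.array([2.0, 0.0])))  # 1
print(classify(np.array([0.0, 2.0])))  # -1
print(distance(np.array([2.0, 0.0])))  # ~1.414
```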

23
CONTD…

Hyperplanes

24
CONTD…

SVM: the maximum-margin hyperplane

25
EXPERIMENT

 The CNN-RNN, C3D and audio SVM models were trained
separately, and their prediction scores were combined
into a final score.

 Challenge: video-based emotion recognition on the AFEW
6.0 dataset.

 Data preprocessing: faces are extracted from the video
frames and aligned.
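The fusion of the separately trained models can be sketched as a weighted sum of their per-class prediction scores. The scores and fusion weights below are made up for illustration; they are not the values used in the paper:

```python
import numpy as np

# Made-up per-class scores (7 emotion classes) from three separate models
cnn_rnn_scores = np.array([0.10, 0.50, 0.05, 0.10, 0.10, 0.10, 0.05])
c3d_scores     = np.array([0.20, 0.40, 0.10, 0.05, 0.10, 0.10, 0.05])
audio_scores   = np.array([0.15, 0.30, 0.20, 0.05, 0.10, 0.10, 0.10])

# Weighted sum of the model scores (weights here are illustrative)
weights = [0.5, 0.3, 0.2]
final = (weights[0] * cnn_rnn_scores
         + weights[1] * c3d_scores
         + weights[2] * audio_scores)

predicted_class = int(np.argmax(final))
print(predicted_class)  # 1: the class all three models score highest
```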

26
CONTD…

For the CNN-RNN network used in the paper:

The CNN features of the faces are taken from the fc6 layer of
the VGG16-Face model, fine-tuned on the FER2013 facial
emotion database.

The features of 16 sequential faces are given as input to the LSTM.

27
CONTD…

For the C3D network used in the paper:

 A sequence of 16 consecutive faces from each video
clip is chosen as the input.

 The C3D net has 8 convolution layers, 5 max-pooling layers, and 2
fully connected layers, followed by a softmax output
layer.

C3D Architecture

28
CONTD…

Accuracies when different numbers of CNN-RNN models
and C3D models are merged together:

29
CONTD…

Confusion matrices on AFEW6.0 dataset:

30
CONTD…

31
CONCLUSION

 Although CNN-RNN and C3D can each recognize
emotions from video separately, combining them
gave better accuracy.

 Overall accuracy of 59.02%.

 Winner of the EmotiW 2016 challenge.

32
REFERENCES
 [1] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. 2015. Learning spatiotemporal
features with 3d convolutional networks. In 2015 IEEE International Conference on Computer
Vision (ICCV) .4489-4497. IEEE.
 [2] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S.,
Saenko, K., & Darrell, T. 2015. Long-term recurrent convolutional networks for visual
recognition and description. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2625-2634.
 [3] Yao, A., Shao, J., Ma, N. and Chen, Y. 2015. Capturing AU-Aware Facial Features and
Their Latent Relations for Emotion Recognition in the Wild. ACM ICMI.
 [4] Eyben, F., Wöllmer, M., & Schuller, B. (2010, October). Opensmile: the Munich versatile
and fast open-source audio feature extractor. In Proceedings of the 18th ACM International
Conference on Multimedia. 1459-1462. ACM.
 [5] Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R., and Pal, C. 2015. Recurrent
neural networks for emotion recognition in video. In Proceedings of the 2015 ACM on
International Conference on Multimodal Interaction. 467-474. ACM.
 [6] Dhall, A., Goecke, R., Lucey, S. and Gedeon, T. 2012. Collecting large, richly annotated
facial-expression databases from movies. IEEE Multimedia.
 [7] Liu, M., Wang, R., Li, S., Shan, S., Huang, Z. and Chen, X. 2014. Combining Multiple
Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild. ACM ICMI.

33
REFERENCES

 [8] Dhall, A., Goecke, R. and Gedeon, T. 2015. Automatic Group Happiness Intensity
Analysis. IEEE Transaction on Affective Computing.
 [9] Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç, Memisevic, R.
and Mirza, M. 2013. Combining modality specific deep neural networks for emotion
recognition in video. In Proceedings of the 15th ACM on International conference on
multimodal interaction. 543-550. ACM.
 [10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S.
Guadarrama, and T. Darrell. 2014. Caffe: Convolutional architecture for fast feature
embedding. In ACM MM.
 [11] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D. and
Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition.1-9.
 [12] He, K., Zhang, X., Ren, S. and Sun, J. 2015. Deep residual learning for image
recognition. arXiv preprint arXiv:1512.03385.
 [13] Simonyan, K., & Zisserman, A. 2014. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556.

34
REFERENCES

 [14] Parkhi, O. M., Vedaldi, A., & Zisserman, A. 2015. Deep face recognition. In
British Machine Vision Conference (Vol. 1, No. 3, p. 6).
 [15] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K. and Li, F. F. 2009. Imagenet: A
large-scale hierarchical image database. In Computer Vision and Pattern
Recognition. CVPR. 248-255. IEEE.
 [16] Carrier, P. L., Courville, A., Goodfellow, I. J., Mirza, M. and Bengio, Y. 2013. FER-
2013 face database. Technical report, 1365, Université de Montréal.
 [17] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H. and
Schmidhuber, J. 2009. A novel connectionist system for unconstrained handwriting
recognition. IEEE transactions on pattern analysis and machine intelligence, 31(5),
855-868.
 [18] Sak, H., Senior, A. W. and Beaufays, F. 2014. Long short-term memory recurrent
neural network architectures for large scale acoustic modeling. In INTERSPEECH.
338-342.
 [19] Kim, B. K., Dong, S. Y., Roh, J., Kim, G. and Lee, S. Y. 2016. Fusing Aligned and
Non-Aligned Face Information for Automatic Affect Recognition in the Wild: A Deep
Learning Approach. In Computer Vision and Pattern Recognition. CVPR.

35
REFERENCES

 [20] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R. and Li, F.F. 2014.
Large-scale Video Classification with Convolutional Neural Networks.
 [21] Ng, J., Hausknecht, M., Vijayanarasimhan, S., Monga, R., Vinyals, O. and Toderici,
G. 2015. Beyond Short Snippets: Deep Networks for Video Classification. In
Computer Vision and Pattern Recognition. CVPR. 4694-4702. IEEE.
 [22] Sharma, S., Kiros, R. and Salakhutdinov, R. 2016. Action Recognition using Visual
Attention. Workshop track - ICLR.
 [23] Kaya, H., Gürpinar, F., Afshar, S. and Salah, A. A. 2015. Contrasting and
Combining Least Squares Based Learners for Emotion Recognition in the Wild. In
Proceedings of the 2015 ACM on International Conference on Multimodal Interaction.
459-466. ACM.
 [24] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T. and
Saenko, K. 2015. Sequence to sequence: video to text. In Proceedings of the IEEE
International Conference on Computer Vision. 4534-4542.
 [25] Pan, P., Xu, Z., Yang, Y., Wu, F. and Zhuang, Y. 2015. Hierarchical Recurrent
Neural Encoder for Video Representation with Application to Captioning. arXiv
preprint arXiv:1511.03476.

36
REFERENCES

 [26] Graves, A., Mohamed, A. R. and Hinton, G. 2013. Speech recognition with deep
recurrent neural networks. In 2013 IEEE international conference on acoustics,
speech and signal processing. 6645-6649. IEEE.
 [27] Dhall, A., Goecke, R., Joshi, J., Hoey, J. and Gedeon, T. 2016. EmotiW 2016:
Video and Group-level Emotion Recognition Challenges, ACM ICMI 2016.
 [28] Jianguo, L., Tao, W. and Yimin, Z. 2011. Face Detection using SURF Cascade.
In ICCV Computer Vision Workshops.
 [29] Fernández, S., Graves, A., Schmidhuber, J. 2007. An application of recurrent
neural networks to discriminative keyword spotting. In International Conference on
Artificial Neural Networks. 220-229. Springer Berlin Heidelberg.

37
