
Indian Institute of Technology Kanpur

Facial Emotion Recognition using Deep Learning

Ankit Awasthi (Y8084)

CS 676: Computer Vision

Supervisor:
Dr. Amitabha Mukerjee, Department of Computer Science and Engineering, IIT Kanpur
Dr. P. Guha, TCS Labs, Delhi, India
ABSTRACT

Facial emotion recognition is one of the most important cognitive functions that our brain
performs quite efficiently. State-of-the-art facial emotion recognition techniques are mostly
performance driven and do not consider the cognitive relevance of the model. This project
is an attempt to look at the task of emotion recognition using deep belief networks, which
are cognitively very appealing and at the same time have been shown to perform very well
for digit recognition (Hinton et al. 2006). We look at the effects of varying the number of
hidden layers and hidden units on the performance of the model and attempt to develop
insights into the features learnt by the model. We also observe that, in line with various
psychological findings, our model finds lower spatial frequency information more useful for
recognizing facial expressions than higher spatial frequency information.

Contents

1 Introduction
2 Motivation
3 Restricted Boltzmann Machine
4 Deep Belief Networks
5 JAFFE Dataset
6 Results
  6.1 First Hidden Layer Features
  6.2 Effect of Number of Layers
  6.3 Effect of Number of Hidden Units
  6.4 Effect of Image Size
7 Discussion and Future Work
8 References

1 Introduction
Facial expressions are important cues for non-verbal communication among human beings.
This is only possible because humans are able to recognize emotions quite accurately and
efficiently. An automatic facial emotion recognition system is an important component of
human-machine interaction. Apart from the commercial uses of such a system, it might be
useful to incorporate some cues from the biological system into the model and to use the
model to develop further insights into the cognitive processing performed by our brain.

State-of-the-art approaches in facial emotion recognition use Active Appearance Models
(AAMs), FACS labels, or some other sophisticated feature extraction scheme. AAMs can
be learned from a set of training images and can be fitted to a new face to generate landmark
positions, which can further be used to design features. Thus, in an automatic setting, the
availability of landmark points on face images is either assumed or obtained by fitting the
model. FACS attempts to decompose human emotions in terms of Action Units (AUs),
which correspond to specific muscle movements. The FACS coding system is used in
psychology and animation to classify facial expressions in a consistent and systematic
manner, but as of now FACS labels can only be given by experts or trained individuals.

One problem with ad hoc feature extraction schemes is that we need to design a separate
feature extraction mechanism for each visual task to be performed. Moreover, it is known
that only some of the filters in the retina are hardcoded, while the other units in V1, V2,
and higher areas of visual processing are learned. Hubel and Wiesel showed that irreversible
damage was produced in kittens by sufficient visual deprivation during the so-called
"critical period". Therefore, it makes much more sense to have a generic scheme for learning
which transformations of the input space may lead to good features for a particular task.

There is ample evidence that our visual processing architecture is organized into different
levels. Each level transforms the input in a manner that facilitates the visual task to be
performed. Another appealing feature of deep learning models is that features or
sub-features can be shared. Computationally, it has also been shown that insufficiently deep
architectures can be exponentially inefficient. Deep learning was revolutionized by Hinton
et al. [1] when they came up with a very efficient method for training multilayer neural
networks.

2 Motivation
Deep learning methods have performed very well on the MNIST digit recognition dataset [1].
Our setting is very similar to the task of digit recognition: corresponding to the digit labels
we have emotion labels. But emotion recognition is much more complicated, because digit
images are much simpler than face images depicting various expressions. Moreover, the
variability in the images due to different identities hampers the performance. Human
accuracy in facial expression recognition is not as good as in digit recognition, and is also
aided by other modes of information such as context, prior experience, and speech.

3 Restricted Boltzmann Machine


The restricted Boltzmann machine (RBM) is a two-layer, undirected graphical model in
which there are no lateral connections. One layer of nodes is called the visible layer v, and
the other is called the hidden layer h. Each of these nodes is a stochastic binary unit, and
each configuration of visible and hidden nodes is characterized by an energy given by the
following function:

$$E(v,h) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i$$

Probabilistically, this is interpreted as follows:

$$P(v,h) = \frac{\exp(-E(v,h))}{Z}$$

If the visible units are real-valued, the energy function is defined as follows:

$$E(v,h) = \frac{1}{2}\sum_i v_i^2 - \sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i$$

The hidden nodes are conditionally independent given the visible layer and vice versa.
In particular, the conditional probabilities are as follows

$$P(v|h;\theta) = \prod_i p(v_i|h), \qquad P(h|v;\theta) = \prod_j p(h_j|v)$$


For a binary visible layer:

$$p(h_j = 1|v) = \mathrm{sigmoid}\Big(\sum_i W_{ij} v_i + b_j\Big), \qquad p(v_i = 1|h) = \mathrm{sigmoid}\Big(\sum_j W_{ij} h_j + c_i\Big)$$

For a real-valued visible layer, we have:

$$p(h_j = 1|v) = \mathrm{sigmoid}\Big(\sum_i W_{ij} v_i + b_j\Big), \qquad p(v_i|h) = \mathcal{N}\Big(\sum_j W_{ij} h_j + c_i,\ \sigma^2\Big)$$

The parameters of the RBM can be learned by maximizing the log-likelihood of the training
data using gradient ascent. However, the exact gradient of the log-likelihood is intractable,
so contrastive divergence (CD), which works fairly well in practice, is used to approximate it:

exact gradient: $\frac{\partial \log p(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty$

CD approximation: $\frac{\partial \log p(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_n$
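As a concrete illustration, the conditionals and the CD update above can be sketched in NumPy. This is a minimal binary-binary RBM with CD-n, not the implementation used in the project; the class and all names in it are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary-binary RBM: weights W, hidden biases b, visible biases c."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_hidden)   # hidden biases
        self.c = np.zeros(n_visible)  # visible biases
        self.lr = lr

    def sample_h(self, v):
        # p(h_j = 1 | v) = sigmoid(sum_i W_ij v_i + b_j)
        p = sigmoid(v @ self.W + self.b)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):
        # p(v_i = 1 | h) = sigmoid(sum_j W_ij h_j + c_i)
        p = sigmoid(h @ self.W.T + self.c)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd_update(self, v0, n=1):
        # positive statistics: <v_i h_j>_0
        ph0, h = self.sample_h(v0)
        # run the Gibbs chain n steps for the negative statistics: <v_i h_j>_n
        vk = v0
        for _ in range(n):
            pv, vk = self.sample_v(h)
            ph, h = self.sample_h(vk)
        m = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - vk.T @ ph) / m
        self.b += self.lr * (ph0 - ph).mean(axis=0)
        self.c += self.lr * (v0 - vk).mean(axis=0)
```

The update rule is the CD approximation above: the difference between data-driven and reconstruction-driven correlations, averaged over the mini-batch.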

4 Deep Belief Networks


RBMs are interesting chiefly because they can be efficiently stacked layer by layer to form
a deep network. First an RBM is trained on the visible layer. Once trained, its weights are
frozen and its hidden layer activations act as the input for the next RBM. Thus a DBN with
any number of layers can be formed by stacking RBMs as described above. It has also been
shown [1] that increasing the number of layers improves the variational lower bound on the
probability of the training data. The RBM acts as the fundamental unit of the whole DBN.
There are other models that can be used instead of RBMs, such as sparse autoencoders and
denoising autoencoders; details about the sparse autoencoder can be found in [5]. Once the
layer-by-layer model has been trained, a final supervised fine-tuning step adjusts the
weights to improve performance on the particular task at hand.
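The greedy layer-wise stacking procedure can be sketched as follows, assuming binary units and CD-1 pretraining and omitting the supervised fine-tuning step; this is a simplified illustration, and the function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train one binary RBM with CD-1; return its weights and hidden biases."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden biases
    c = np.zeros(n_visible)  # visible biases
    m = data.shape[0]
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + b)                       # <v h>_0 statistics
        h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden state
        pv1 = sigmoid(h0 @ W.T + c)                       # reconstruction
        ph1 = sigmoid(pv1 @ W + b)                        # <v h>_1 statistics
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / m
        b += lr * (ph0 - ph1).mean(axis=0)
        c += lr * (data - pv1).mean(axis=0)
    return W, b

def train_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: each RBM's hidden activations
    become the training input of the next RBM."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        layers.append((W, b))      # freeze this layer's weights
        x = sigmoid(x @ W + b)     # feed activations upward
    return layers
```

For example, `train_dbn(images, [500, 100])` would pretrain a two-hidden-layer DBN, after which a supervised fine-tuning pass (e.g. backpropagation with the emotion labels) would adjust all weights jointly.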

5 JAFFE Dataset
The Japanese Female Facial Expression (JAFFE) database contains 213 images of 7 facial
expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models.
Each image has been rated on 6 emotion adjectives by 60 Japanese subjects. Some of the
emotions (e.g. fear) have been reported to not have been expressed very well, but in this
project we work with all six emotions rather than a reduced set.

6 Results
In this project, we experimented with many different settings of the model hyperparameters
to find out how they affect the performance. A few variants of the conventional DBN were
tried, such as sparse DBNs and stacking sparse autoencoders, but they did not show any
improvement, so the corresponding results are not reported. In all the results, the models
were trained using 150 training images and tested on the remaining 63 images.

Figure 1: features learned by the first layer of DBN, image size: 24 x 24, hidden layer: 50
units

Deep belief networks typically require large amounts of data, but in our case we have only
213 images; the results may therefore change significantly if a larger dataset is used. The
experiments were performed at three resolutions: 100 × 100, 50 × 50, and 25 × 25. The
results of the various experiments are stated below and discussed in the next section.

6.1 First Hidden Layer Features


Figures 1, 2, and 3 show the features learned by the first hidden layer. It can be observed
that when the images are large, the features learned are not of good quality. Moreover,
the number of epochs required to get any meaningful features increases with the image size.
Although there is no quantitative way of discriminating between these features other than
the recognition task itself (which is also an indirect method), visually the features for
smaller image sizes appear better than those for larger image sizes. Projecting higher-layer
features onto the input space is a non-trivial task and has not been dealt with here.

6.2 Effect of Number of Layers


Figure 4 shows how the performance varies over the number of epochs of the supervised
fine-tuning step for the 24 × 24 image size. Figures 5 and 6 show the performance for image
sizes 50 × 50 and 100 × 100 respectively. As shown in the figures, increasing the number of
layers resulted
Figure 2: features learned by the first layer of DBN, image size: 50 × 50, hidden layer: 100
units

Figure 3: features learned by the first layer of DBN, image size: 100 × 100, hidden layer: 500
units

Figure 4: Performance of DBNs on 24 × 24 images against number of epochs of supervised
fine-tuning

in a slight improvement. It was also observed that increasing the number of layers further
deteriorates the performance; results for such cases are not reported. One possible
explanation is that adding layers increases the number of parameters to be learned, and
with the small dataset that we have it is difficult to learn many parameters.

6.3 Effect of Number of Hidden Units


This part was not exhaustively investigated through experiments, but in general the results
reported are for the best configuration found for the specified number of hidden layers.
One important observation was that it was important to have a sufficient reduction in
information from the visible layer to the first hidden layer; in other words, the number of
hidden units in the first layer should be significantly less than the number of visible units.
This forces the model to learn the important features of the image.

6.4 Effect of Image Size


As shown in Figures 4, 5, and 6, the performance improves when we move from
high-resolution images to low-resolution images. This complies with psychological findings
that lower spatial frequency bands are favoured for facial expression recognition, and speaks
to the cognitive grounding of our model. However, given such a small dataset, we refrain
from making any claims; the observed phenomenon may be solely due to the fewer
parameters to train in the case of smaller images.
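The report does not describe how the lower-resolution images were produced. As a rough illustration of why lower resolution retains mostly low spatial frequencies, block averaging acts as a crude low-pass filter before subsampling; the function below is our own sketch, not the project's preprocessing code.

```python
import numpy as np

def downsample(img, factor):
    """Block-average downsampling: averaging each factor x factor block
    acts as a low-pass filter, so low spatial frequencies survive while
    high-frequency detail is discarded."""
    h, w = img.shape
    h, w = h - h % factor, w - w % factor  # crop so dimensions divide evenly
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))
```

Downsampling a 100 × 100 face image with `factor=4` this way yields a 25 × 25 image in which fine texture has been averaged away but coarse facial structure remains.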

Figure 5: Performance of DBNs on 50 × 50 images against number of epochs of supervised
fine-tuning

Figure 6: Performance of DBNs on 100 × 100 images against number of epochs of supervised
fine-tuning

7 Discussion and Future Work
The accuracy of state-of-the-art facial emotion recognition systems is much better than that
achieved in this project. Considering that the algorithm takes raw images rather than
landmark points or FACS labels as input, it performs fairly well. The dataset used in the
project was quite small and prohibits any general claim about the success or failure of deep
learning methods. It is expected that with a larger dataset the accuracy of the algorithm
would improve and better features would be learned; this comprises a major portion of our
future work on this project.

Observing the features, one may say that the algorithm is able to extract some meaningful
features. In the absence of any principled way of discriminating between the receptive fields
learned by the model, it is difficult to argue about the 'goodness' or 'badness' of a feature
other than by evaluating the classification accuracy that the feature facilitates.
As observed, increasing the number of hidden layers resulted in a slight improvement in
classification, but further increases deteriorated the results. The number of hidden units in
each layer was one of the hyperparameters that was not satisfactorily investigated, but an
important and somewhat counter-intuitive observation was that the number of hidden units
in the first layer should be less than the number of visible units; in other words, there
should be a significant reduction in the amount of information from the visible layer to the
first hidden layer. This is appealing because something very similar happens in our visual
system, where a lot of information is thrown out in successive layers of processing. The
effect is to force the hidden units to learn the most important features. Led by this
observation, we thought that sparsity constraints might lead to even better features and
accuracy, but it turned out that there was no improvement. Again, this might be attributed
to the small dataset we are working with.

One of the important results coming out of this project is the observation that
low-resolution images gave better classification accuracy than higher-resolution images.
Various psychological experiments on human beings suggest that we make use of the mid
spatial frequency band for recognizing emotions rather than the high spatial frequency
band. Although we do not present any quantitative analysis of spatial frequency versus
classification accuracy here, the few experiments that we performed suggest that lower
spatial frequency information is more useful for recognizing emotions, which speaks for the
cognitive relevance of the model. In our future work we would like to work on quantitative
ways of evaluating the cognitive importance of features, which would help argue for DBNs
as a very good model of our visual system.

8 References

[1] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A Fast Learning Algorithm for
Deep Belief Nets. Neural Computation, 18:1527-1554, 2006.

[2] J.M. Susskind, G.E. Hinton, J.R. Movellan, and A.K. Anderson. Generating facial
expressions with deep belief nets. In Affective Computing, Emotion Modelling, Synthesis
and Recognition, pages 421-440, 2009.

[3] Michael J. Lyons, Shigeru Akamatsu, Miyuki Kamachi, and Jiro Gyoba. Coding Facial
Expressions with Gabor Wavelets. In Proceedings, Third IEEE International Conference on
Automatic Face and Gesture Recognition, pages 200-205, April 14-16, 1998.

[4] Geoffrey E. Hinton. A Practical Guide to Training Restricted Boltzmann Machines.
Technical Report, 2010.

[5] Andrew Ng. Sparse autoencoder (lecture notes), 2010.

[6] Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. Sparse deep belief net model
for visual area V2. In NIPS, 2007.
