Speech Communication
journal homepage: www.elsevier.com/locate/specom
Article info

Article history:
Received 26 July 2016
Revised 20 May 2017
Accepted 21 June 2017
Available online 12 July 2017

Keywords:
Domain adaptation
Speech emotion recognition
Neural network

Abstract

Conventional approaches for Speech Emotion Recognition (SER) usually assume that the feature distributions between training and test set are identical. However, this assumption does not hold in many real scenarios. Although many Domain Adaptation (DA) methods have been proposed to solve this problem, the conventional emotion discriminative information is ignored. In this paper, we propose a DA based method called Emotion-discriminative and Domain-invariant Feature Learning Method (EDFLM) for SER, in which both the domain divergence and emotion discrimination are considered to learn emotion-discriminative and domain-invariant features by using emotion label constraint and domain label constraint. Furthermore, to disentangle the emotion-related factors from the emotion-unrelated factors, we introduce an orthogonal term to encourage the input to be disentangled into two blocks: emotion-related and emotion-unrelated features. Our method can learn emotion-discriminative and domain-invariant features through a back propagation network which uses the acoustic features of INTERSPEECH 2009 Emotion Challenge as the input rather than raw speech signals. Experiments on the INTERSPEECH 2009 Emotion Challenge two-class task show that the performance of our method is superior to other state-of-the-art methods.
© 2017 Published by Elsevier B.V.
http://dx.doi.org/10.1016/j.specom.2017.06.006
0167-6393/© 2017 Published by Elsevier B.V.
2 Q. Mao et al. / Speech Communication 93 (2017) 1–10
In SER, conventional methods aim to learn emotion-specific features that are robust to nuisance factors such as speaker variation and environment distortion (Ding et al., 2012; Mao et al., 2014). A DA method for SER should provide an equally powerful feature representation, but using the emotion label predictor alone cannot achieve an excellent emotion-discriminative feature representation.

In this paper, we propose a new DA method called Emotion-discriminative and Domain-invariant Feature Learning Method (EDFLM) for SER. In this method, we first introduce an orthogonal term to disentangle the emotion-related factors from the emotion-unrelated factors. Then our model learns a hierarchical non-linear transformation of the emotion-related features through a Back Propagation (BP) network. On top of the feature extraction, two predictors are imposed on the high-level emotion feature representation: an emotion label predictor and a domain label predictor. Thus the emotion discrimination and domain invariance of the features are encouraged. In summary, our main contributions in this work are as follows:

(1) By introducing an orthogonal term into the objective function of the BP network at the beginning, we can extract affect-salient features for SER by disentangling emotions from other factors such as speaker variance and noise. Specifically, the learned features in the first layer are divided into two blocks, related to emotion and to the other remaining factors, respectively. The emotion-related features, which are discriminative and robust, are input into the second layer of the BP network, leading to great performance improvement on SER.
(2) Emotion discrimination and domain invariance are combined into the deep architecture to learn emotion-discriminative and domain-invariant features for SER. This is achieved by jointly optimizing the underlying features as well as two classifiers operating on these features: (i) the emotion label predictor that predicts the emotion label and (ii) the domain label predictor that classifies the source and the target domains.

The rest of this paper is organized as follows. We introduce the related work in Section 2. Section 3 presents our method in detail. Section 4 describes the SER benchmark databases and the original acoustic features, and reports our experimental results. Conclusions are discussed in Section 5.

2. Related work

In SER, how to extract the most discriminative features has been an important research area. Meanwhile, due to the powerful feature learning ability of deep learning, more and more effort is dedicated to speech emotion feature learning using deep neural networks. But most of these works are based on the same assumption: the training and test sets have the same feature distribution. This assumption does not hold in many practical scenarios. Recently, many DA methods have been proposed to deal with this problem. In the following, we first review some multi-task learning and deep learning methods for SER, and then discuss the DA methods. Finally, we introduce some DA methods for cross-domain SER.

2.1. Multi-task learning

There are very few multi-task learning algorithms for SER, but many multi-task learning algorithms have been proposed for multi-pose Facial Expression Recognition (FER). Multi-pose FER learns the facial expression discriminative information by sharing the pool of features or the manifold under various poses instead of training models and tuning parameters for each pose separately. The Discriminative Shared Gaussian Process Latent Variable Model (DS-GPLVM) proposed in Eleftheriadis et al. (2015) is one of the state-of-the-art multi-task learning methods for multi-pose FER. DS-GPLVM first learns a discriminative manifold shared by multiple views of a facial expression, and then facial expression classification is performed in the expression manifold, either in the view-invariant manner (using only a single view of the expression) or in the multi-view manner (using multiple views of the expression). In addition, the supervised Latent Dirichlet Process model (Blei and McAuliffe, 2008) is also a multi-task learning approach.

2.2. Deep learning for SER

Recently, deep learning based methods have shown the ability to learn hierarchical non-linear transformations of the input, and they have been successfully applied in the field of affective computing. SER also benefits a lot from deep learning based models. Schmidt and Kim (2011) sought to employ regression-based Deep Belief Networks (DBNs) to learn emotion-based features directly from magnitude spectra for music emotion recognition. In Le and Provost (2013), several hybrid classifiers for affective computing were evaluated, in which DBNs were used to estimate the emotion probabilities and Hidden Markov Models (HMMs) captured the temporal property of the emotion. Han et al. (2014) utilized Deep Neural Networks (DNNs) to produce an emotion state probability distribution for each speech segment and then constructed utterance-level features; finally an Extreme Learning Machine (ELM) was used to recognize the utterance-level emotions. Tian et al. (2015) compared knowledge-inspired disfluency and non-verbal vocalization features in emotional speech with statistical features describing acoustic characteristics of the samples by using Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs), which outperformed SVM given enough samples. In Fayek et al. (2016), a DNN was employed to learn a mapping from Fourier-transform based filter banks to emotion classes by using soft labels produced from multiple annotators to model the subjectiveness in emotion recognition, which improved performance compared with using ground truth labels obtained by majority voting between the same annotators.

2.3. Domain adaptation

Although deep learning methods have a powerful feature learning ability, most of them are sensitive to small perturbations in the input features. So it is necessary to extend deep learning methods to cross-domain scenarios. DA aims to reduce the distribution discrepancy between different domains. With the increasing attention paid to DA, a large number of DA methods have been proposed over the past decades, which can be broadly classified into two groups: instance-based DA and feature-based DA.

For the first group, different weights are calculated for the source domain samples, so that the weighted source domain samples can be approximately deemed to follow the same distribution as the target domain, or so that the most related source domain samples can be selected according to the weights. For instance, Dai et al. (2007) utilized a small part of the labeled target domain samples to update the weights of the source domain samples, so that the most important part of the source domain was known. Hassan et al. (2013) applied three methods to determine the weights of the source domain samples, and then incorporated these weights into the SVM classifiers. Kan et al. (2014) learned the targetization coefficients in a common subspace by using the sparse target domain neighbors, and the coefficients targetized the source domain samples into the target domain.

For the second group, the original input features are usually transformed into a new space, and in this space the new feature representations of samples from different domains can be considered to have the same distribution. In other words, the learned features have the property of domain invariance. Many approaches have been proposed. Glorot et al. (2011) proposed to combine the samples from different domains as the input to Stacked Denoising Auto-encoders (SDA), so as to learn a robust feature representation across domains. Chopra et al. (2013) considered the feature distribution shift between different domains, and constructed intermediate databases by combining parts of the source and target domain samples, so that all the databases seemed to form a path from the source to the target. Each database was used to learn a feature extraction using deep networks, and finally all the path feature representations were combined as a new feature representation.

The above-mentioned methods did not encode the domain label information into the feature learning. Meanwhile, the divergence across different domains was not explicitly reduced. To approach this problem, several approaches have been proposed. Zhuang et al. (2015) introduced the label information in feature learning, and used Kullback–Leibler (KL) divergence to minimize the difference of domain distributions. Liu et al. (2015) considered domain supervision and emotion supervision, and modified the SDAs by incorporating these two terms. Ganin and Lempitsky (2015) also introduced label and domain information to learn discriminative and domain-invariant features. A gradient reversal layer was used to update the parameters with gradient descent for feed-forward models.

2.4. Domain adaptation for cross-corpus SER

For SER, many DA-based methods have also been proposed. Deng et al. (2013) proposed to learn a sparse autoencoder-based feature mapping rule with access to a small part of the labeled target domain samples, and then the source domain samples reconstructed by this mapping were used to train a classifier. Deng et al. (2014a) used both the source and target domain samples as input, modified the basic autoencoder, and proposed a Shared-Hidden-Layer Autoencoder (SHLA) approach to learn common feature representations shared across different domains. Deng et al. (2014b) introduced an adaptive denoising autoencoder method, in which prior knowledge learned from the target domain was used to regularize the training on the source domain. In addition, Abdelwahab and Busso (2015) investigated several important factors in semi-supervised DA: the number of labeled target domain samples, speaker diversity in the labeled set, the performance between the spontaneous and naturalistic samples, and the best approach to adapt the models.

Current works on DA for SER consider only the divergence of feature distribution, and ignore the emotion-discriminative information in SER. In this paper, we propose a new DA method for SER, in which domain discrepancy is minimized and emotion discrimination is encouraged to learn emotion-discriminative and domain-invariant features for SER. Furthermore, we introduce an orthogonal term into the loss function to disentangle the emotion-related factors from all of the features. Therefore, our model can learn a high-level emotion-discriminative and domain-invariant feature representation of the emotion-related features through a BP network.

3. Proposed methodology

3.1. Problem formalization

Assume $\mathcal{D}_s = \{x_i^{(s)}, y_i^{(s)}\}|_{i=1}^{n_s}$ denotes the source domain dataset, in which $x_i^{(s)}$ is the $i$th utterance-level feature vector of the audio files and $y_i^{(s)}$ is the corresponding emotion label. $\mathcal{D}_t = \{x_i^{(t)}, y_i^{(t)}\}|_{i=1}^{n_t}$ denotes the target domain dataset, in which $x_i^{(t)}$ is also a feature vector of the $i$th utterance and $y_i^{(t)}$ is the corresponding emotion label. The number of samples in the source and the target domain datasets is denoted as $n_s$ and $n_t$, respectively. Here, the source and target domains share the same feature and label space, i.e., each feature vector $x \in \mathbb{R}^k$ and the corresponding emotion label $y \in \{1, 2, \ldots, r\}$ ($r$ is the number of emotion labels), but the feature vectors of these two domains are from two different distributions. Let $\mathcal{X} = \{x \mid x \in \{\mathcal{D}_s \cup \mathcal{D}_t\}\}$ be all the samples, i.e., feature vectors and their emotion labels from the source and target domains. It is worth noting that the emotion labels of the target domain are not used at training time. Let $\mathcal{D} = \{d_i\}|_{i=1}^{n_s+n_t}$ represent the set of domain labels, where $d_i = 1$ if $(x_i, y_i) \in \mathcal{D}_s$, and $d_i = 0$ if $(x_i, y_i) \in \mathcal{D}_t$. Let $\mathcal{Y} = \{y \mid y \in \{y_i^{(s)}|_{i=1}^{n_s}\}\}$ be the set of the emotion labels of the source samples. At training time, we have access to $\mathcal{X}$, $\mathcal{D}$, and $\mathcal{Y}$. Our aim is to predict the emotion labels of samples in the target domain.

3.2. Emotion-discriminative and domain-invariant feature learning

In this section, we present our transfer feature learning method using a BP network in SER. The architecture of our method is shown in Fig. 1, which has three parts: feature extractor, emotion predictor and domain predictor. Specifically, we use the 384 attributes of the speech signal extracted by the open source toolkit openEAR (Eyben et al., 2009) as the input of our model. The input attributes are then divided into two parts, emotion-related features and emotion-unrelated features, using the disentangling penalty during the feature extraction. The emotion-related features are further input into the higher layer to learn a high-level feature representation and to predict emotions and domains. The prediction error and the reconstruction error are used to adjust the parameters of the entire network with back-propagation. The iterations continue until convergence. In our method, a gradient reversal layer between the feature extractor and the domain label predictor is introduced to search for the saddle point between emotion and domain. This gradient reversal layer multiplies the gradient by a certain negative constant during the backpropagation based training. Otherwise, the training proceeds in a standard way and minimizes the emotion prediction loss (for source examples) and the domain classification loss (for all samples). Gradient reversal ensures that the feature distributions over the two domains are similar, thus resulting in domain-invariant features, the detail of which is described in the next section.

Specifically, in the feature extraction part, it is assumed that there are $N + 1$ layers and that the features extracted by the feature extractor form the feature vector $f$ shown in Fig. 1. $k^{(n)}$ denotes the number of nodes in the $n$th layer, and $h^{(n)}(x)$ represents the feature representation in the $n$th layer, $n \in [0, 1, \ldots, N]$. It is assumed that $h^{(0)}(x) = x$; then the output of $x$ in the $n$th layer is denoted as:

$$h^{(n)}(x) = \varphi\left((W^{(n)})^{T} h^{(n-1)}(x) + b^{(n)}\right), \tag{1}$$

where $W^{(n)} \in \mathbb{R}^{k^{(n)} \times k^{(n-1)}}$ is a weight matrix, and $b^{(n)} \in \mathbb{R}^{k^{(n)}}$ is a bias vector. The function $\varphi$ is a nonlinear activation function such as the widely used sigmoid function. Parameters in feature extraction are $\theta_f = \{W^{(i)}, b^{(i)}\}_{i=1}^{N}$.

To reduce the influence of the emotion-unrelated factors (such as speakers, contents and environment variations), here we introduce an orthogonal term to disentangle the emotion-related factors from the other factors during feature extraction. For input $x$, we map it into two distinct blocks of features: one that encodes the discriminative factors of the input, $h^{(e)}(x) = h^{(1)}(x) = \varphi\left((W^{(1)})^{T} x + b^{(1)}\right)$, and one that encodes all other factors, $h^{(o)}(x) = \varphi(w^{T} x + c)$.
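As a rough illustration of Eq. (1) and of the two first-layer blocks $h^{(e)}(x)$ and $h^{(o)}(x)$, the following NumPy sketch runs a forward pass on one random input. It is only a sketch under assumptions that are ours, not the paper's: the function names are hypothetical, the weights are random placeholders, two layers of 100 units are used for illustration, and each weight matrix is stored with shape (inputs, outputs) so that the transpose in Eq. (1) can be applied directly.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, used here as the nonlinearity phi."""
    return 1.0 / (1.0 + np.exp(-z))

def first_layer_blocks(x, W1, b1, w, c):
    """Map x to the emotion-related block h_e and the block h_o of other factors."""
    h_e = sigmoid(W1.T @ x + b1)   # h^(e)(x) = h^(1)(x)
    h_o = sigmoid(w.T @ x + c)     # h^(o)(x)
    return h_e, h_o

def feature_extractor(x, weights, biases):
    """Apply Eq. (1) layer by layer, starting from h^(0)(x) = x."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W.T @ h + b)
    return h

# Hypothetical sizes: 384 input attributes, two layers of 100 units each.
rng = np.random.default_rng(0)
k0, k1, k2 = 384, 100, 100
x = rng.standard_normal(k0)
# Assumption: matrices are stored as (inputs, outputs) so that W.T @ h is defined.
W1 = 0.01 * rng.standard_normal((k0, k1)); b1 = np.zeros(k1)
W2 = 0.01 * rng.standard_normal((k1, k2)); b2 = np.zeros(k2)
w = 0.01 * rng.standard_normal((k0, k1)); c = np.zeros(k1)

h_e, h_o = first_layer_blocks(x, W1, b1, w, c)
f = feature_extractor(x, [W1, W2], [b1, b2])   # high-level feature vector f
```

Only the emotion-related block feeds the second layer in this sketch, which mirrors the description above; the other block would only take part in the reconstruction and orthogonal terms.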
Fig. 1. The framework of EDFLM. Three important parts are included: feature extractor, emotion label predictor (dashed box), and domain label predictor (solid line box).
A gradient reversal layer is inserted between feature extraction layer and domain predict layer and it multiplies the gradient by a certain negative constant during the
backpropagation based training. Otherwise, the training proceeds in a standard way and minimizes the emotion prediction loss (for source examples) and the domain
classification loss (for all samples). Gradient reversal ensures that the feature distributions over source and target domains are made similar, thus leading to domain-invariant
features. To further get the emotion-discriminative features, we map the original input into two blocks: emotion-related and emotion-unrelated features (dot-dash line box).
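The behaviour of the gradient reversal layer described in the caption can be sketched in a few lines of NumPy. This is a minimal illustration with a hand-written forward and backward pass; the class name and the toy vectors are hypothetical, and the value lam=2.0 is only an example constant.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales the gradient by -lam in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, features):
        # Features pass unchanged to the domain label predictor.
        return features

    def backward(self, grad_from_domain_predictor):
        # The gradient flowing back to the feature extractor is reversed and scaled.
        return -self.lam * grad_from_domain_predictor

grl = GradientReversal(lam=2.0)
f = np.array([0.2, -0.5, 1.0])          # toy feature vector
out = grl.forward(f)
grad_in = np.array([0.1, 0.3, -0.4])    # toy gradient of the domain loss w.r.t. out
grad_out = grl.backward(grad_in)
```

With the sign flipped, a gradient descent step on the feature extractor increases the domain loss while the domain predictor itself still decreases it, which is what pushes the features towards domain invariance.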
Parameters are $\theta_e = \{W^{(1)}, b^{(1)}\}$ and $\theta_o = \{w, c\}$. Meanwhile, both feature blocks are trained to cooperate to reconstruct their common input $x$. We consider the mean squared error as the loss function, and the reconstruction loss term $L_{recon}$ is then formalized as:

$$L_{recon}(\theta_e, \theta_o) = \frac{1}{n_s + n_t} \sum_{x \in \mathcal{X}} L_{recon}\big(x; g(h(x; \theta_e, \theta_o))\big) = \frac{1}{n_s + n_t} \sum_{x \in \mathcal{X}} \big\| x - g(h(x; \theta_e, \theta_o)) \big\|^2, \tag{2}$$

where $L_{recon}$ is the reconstruction error, i.e., the squared error between the input $x$ and its reconstructed output $g(h(x; \theta_e, \theta_o))$. The [...]

[...] domain prediction $L_d$ are denoted as:

$$L_y(\theta_f, \theta_y) = \frac{1}{n_s} \sum_{x \in \mathcal{D}_s} L_y\big(G_y(h(x; \theta_f); \theta_y); y\big) = -\frac{1}{n_s} \sum_{x \in \mathcal{D}_s} \sum_{i=1}^{r} 1\{y = i\} \log \frac{e^{\theta_{y_i}^{T} h^{(N)}(x)}}{\sum_{j=1}^{r} e^{\theta_{y_j}^{T} h^{(N)}(x)}}, \tag{4}$$

$$L_d(\theta_f, \theta_d) = \frac{1}{n_s + n_t} \sum_{x \in \mathcal{X}} L_d\big(G_d(h(x; \theta_f); \theta_d); d\big)$$
where $\alpha$, $\lambda$ and $\beta$ weigh the contribution of the emotion prediction, the domain prediction and the orthogonal term, respectively. Here, $-\lambda$ ensures that the feature distributions over the two domains are similar.

We estimate the parameters of the entire network together by using the samples in $\mathcal{X}$, which includes all the samples in the source domain and the target domain. During training, the emotion prediction loss and the orthogonal loss are computed on $\mathcal{D}_s$, which is the source domain dataset with emotion labels; for the other samples in $\mathcal{X}$ these losses are not computed. The domain prediction loss and the reconstruction loss are computed on $\mathcal{X}$. The backpropagation algorithm is used to compute the partial derivatives of all parameters, and then all parameters are updated jointly. That is to say, $\theta_f$, $\theta_o$, and $\theta_d$ are trained using dataset $\mathcal{X}$, while $\theta_y$ is trained using $\mathcal{D}_s$.

3.3. Optimization with backpropagation

In order to minimize the third term (the loss of the domain label) and optimize Eq. (6) with SGD, a saddle point of Eq. (6) is thus introduced:

Ganin and Lempitsky (2015) demonstrated that SGD can be adapted for the search of the saddle point. This can be accomplished by introducing a special gradient reversal layer (GRL) between the feature extractor and the domain label predictor. During the forward propagation, the GRL does nothing. But in the backward propagation, the gradient is multiplied by a negative number $-\lambda$.

All the weights and biases can be updated using the gradient descent algorithm as follows:

$$
\begin{aligned}
W^{(1)} &\leftarrow W^{(1)} - \mu\left(\frac{\partial L_{recon}}{\partial W^{(1)}} + \alpha\frac{\partial L_y}{\partial W^{(1)}} - \lambda\frac{\partial L_d}{\partial W^{(1)}} + \beta\frac{\partial L_{orth}}{\partial W^{(1)}}\right),\\
b^{(1)} &\leftarrow b^{(1)} - \mu\left(\frac{\partial L_{recon}}{\partial b^{(1)}} + \alpha\frac{\partial L_y}{\partial b^{(1)}} - \lambda\frac{\partial L_d}{\partial b^{(1)}} + \beta\frac{\partial L_{orth}}{\partial b^{(1)}}\right),\\
W^{(m)} &\leftarrow W^{(m)} - \mu\left(\alpha\frac{\partial L_y}{\partial W^{(m)}} - \lambda\frac{\partial L_d}{\partial W^{(m)}}\right),\\
b^{(m)} &\leftarrow b^{(m)} - \mu\left(\alpha\frac{\partial L_y}{\partial b^{(m)}} - \lambda\frac{\partial L_d}{\partial b^{(m)}}\right),\\
w &\leftarrow w - \mu\left(\frac{\partial L_{recon}}{\partial w} + \beta\frac{\partial L_{orth}}{\partial w}\right),\\
c &\leftarrow c - \mu\left(\frac{\partial L_{recon}}{\partial c} + \beta\frac{\partial L_{orth}}{\partial c}\right),\\
\theta_d &\leftarrow \theta_d - \mu\,\frac{\partial L_d}{\partial \theta_d},\\
\theta_y &\leftarrow \theta_y - \mu\,\alpha\frac{\partial L_y}{\partial \theta_y},
\end{aligned}
\tag{8}
$$

where $m = 2, \ldots, N$ and $\mu$ is the learning rate. $\alpha$ and $\beta$ weigh the contribution of the emotion predictor loss and the orthogonal term, respectively, and $\lambda$ is the parameter of the gradient reversal layer, which also weighs the contribution of the domain predictor loss. The details of the proposed algorithm are summarized in Algorithm 1, and the framework of the model is shown in Fig. 1.

Algorithm 1 Emotion-discriminative and Domain-invariant Features Learning Model (EDFLM).
Input: Source domain samples $\mathcal{D}_s$, target domain samples $\mathcal{D}_t$.
Output: Weights and biases in feature extraction $\theta_f = \{W^{(i)}, b^{(i)}\}_{i=1}^{N}$.
1: Do forward propagation for data points in a batch;
   Compute the loss of the reconstruction term $L_{recon}$;
   Compute the loss of the orthogonal term $L_{orth}$;
   Compute the loss of the emotion label predictor $L_y$;
   Compute the loss of the domain label predictor $L_d$;
2: Compute the partial derivatives of all variables in Eq. (8);
3: Update the variables according to Eq. (8);
4: Iteratively perform Steps 1-3 until the algorithm converges;
5: Compute the high-level features $h^{(N)}(x)$, and use them to train the classifier as discussed in Section 3.4.

3.4. Classifier construction

After all the parameters are optimized at training time, we use the feature $h^{(N)}(x)$, i.e., feature $f$ in Fig. 1, to train the classifier. Here, we choose the widely used SVMs as the classifier. The classifier is trained with the labeled source domain samples and is applied to the target domain test samples for emotion label prediction.

We evaluate our method in the INTERSPEECH 2009 Emotion Challenge (EC) two-class task, in which the spontaneous FAU Aibo Emotion Corpus (FAU AEC) is used. FAU AEC contains 9 h of German speech of 51 children interacting with Sony's pet robot Aibo at two different schools, Ohm and Mont. We treat FAU AEC as the target domain database, in which the 9959 samples of school Ohm are used as target domain training samples and the 8257 samples of Mont are used for target testing.

Additionally, we choose two publicly available databases, the Airplane Behavior Corpus (ABC) (Schuller et al., 2007) and the database of German emotional speech (Emo-DB) (Burkhardt et al., 2005), as the source sets. They are highly different from the target set FAU AEC in terms of age, type of emotion, recording situation, and degree of spontaneity.

Scherer (2005) gives the emotion distribution in the Arousal-Valence dimensional emotion space (A-V space) according to psychology. For comparability with FAU AEC, the diverse emotion classes in the databases FAU AEC, ABC and Emo-DB are first mapped onto the valence axis of the A-V space (Scherer, 2005). The emotions located on the negative side of the valence axis are then labeled "negative", since they denote unpleasant emotions, whereas the emotions located on the positive side of the valence axis are labeled "positive", since they denote pleasant emotions. The detailed mapping strategy is shown in Table 1 (Schuller et al., 2010; Deng et al., 2014a). Besides, Table 2 summarizes the three datasets and shows the differences among them, including age, language, speech, emotion, etc.

4.2. Acoustic features

For the raw input feature representation, we keep in line with the INTERSPEECH 2009 EC (Schuller et al., 2009) and use its baseline feature set. It consists of 12 functionals applied to 2 × 16 acoustic Low-Level Descriptors (LLDs), including their first order delta regression coefficients, as shown in Table 3. Therefore, the feature vector per chunk contains 16 × 2 × 12 = 384 attributes. To ensure reproducibility, the open source toolkit openEAR (Eyben et al., 2009) is utilized to extract the 384 attributes.
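The count 16 × 2 × 12 = 384 can be made concrete with a small NumPy sketch that turns a matrix of frame-level LLD contours into one utterance-level vector. Everything specific here is an assumption for illustration: the 12 statistics are stand-ins chosen by us (Table 3 is the authoritative list), `np.gradient` is a simple substitute for the delta regression coefficients, the function names are hypothetical, and random numbers replace real LLDs. It does not reproduce the openEAR extraction.

```python
import numpy as np

def functionals(contour):
    """Return 12 illustrative statistics of a 1-D contour (stand-ins, see lead-in)."""
    t = np.arange(len(contour))
    slope, offset = np.polyfit(t, contour, 1)            # linear regression coefficients
    fit_mse = np.mean((slope * t + offset - contour) ** 2)
    mean, std = contour.mean(), contour.std()
    centered = contour - mean
    skew = np.mean(centered ** 3) / (std ** 3 + 1e-12)
    kurt = np.mean(centered ** 4) / (std ** 4 + 1e-12)
    return np.array([
        mean, std, skew, kurt,
        contour.min(), contour.max(), contour.max() - contour.min(),
        np.argmin(contour) / len(contour), np.argmax(contour) / len(contour),
        slope, offset, fit_mse,
    ])

def utterance_vector(llds):
    """llds: (n_frames, 16) array of LLD contours -> one 384-dimensional vector."""
    deltas = np.gradient(llds, axis=0)                   # stand-in for delta coefficients
    contours = np.concatenate([llds, deltas], axis=1)    # (n_frames, 32)
    return np.concatenate(
        [functionals(contours[:, j]) for j in range(contours.shape[1])]
    )

rng = np.random.default_rng(0)
llds = rng.standard_normal((200, 16))   # placeholder for 200 frames of 16 LLDs
vec = utterance_vector(llds)            # 32 contours x 12 functionals
```

Whatever the utterance length, the result has a fixed size, which is why such functional-based vectors can be fed directly to the BP network.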
Table 1
Emotion categories mapping onto negative and positive valence for three databases.
Corpus    Negative valence                           Positive valence
FAU AEC   angry, touchy, emphatic, reprimanding      motherese, neutral, joyful, rest
ABC       aggressive, intoxicated, nervous, tired    cheerful, neutral, rest
Emo-DB    anger, boredom, disgust, fear, sadness     joy, neutral
Table 2
Summary of the three chosen datasets.
Corpus   Age       Language  Speech    Emotion  #Valence (−)  #Valence (+)  #All    h:mm  #m  #f  Rec     Rate (kHz)
FAU AEC  children  German    variable  natural  5823          12,393        18,216  9:20  21  30  normal  16
ABC      adults    German    fixed     acted    213           217           430     1:15  4   4   studio  16
Emo-DB   adults    German    fixed     acted    352           142           494     0:22  5   5   studio  16
Age (adults or children). Number of utterances per binary valence (#Valence, Negative (−), Positive (+)), and overall number of utterances (#All). Total audio time (h:mm). Number of female (#f) and male (#m) subjects. Recording conditions (Rec, studio/normal). Sampling rate (Rate).
Table 4
Source set  $k^{(n)}$  $\alpha$  $\beta$  $\lambda$
ABC         100        1         2        2
Emo-DB      100        1         1        1

4.3. Parameter selection & experimental setup

In this paper, we choose one third of the samples in the target dataset as the validation set. The parameters, i.e., the number of hidden units $k^{(n)} \in \{200, 100, 50\}$, the hyper-parameters $\alpha, \beta \in \{0.1, 0.5, 1, 2, 3\}$, and the coefficient $\lambda \in \{0.1, 0.5, 1, 2, 3\}$, are tuned over all combinations of these values using the validation set. The selected values of the parameters are listed in Table 4, and we use these parameters to conduct our experiments in the following. Furthermore, we initialize $W^{(i)}$ as a matrix with ones on the main diagonal and zeros elsewhere, and set the bias $b^{(i)}$ to 0 for each layer of the network. The dimension of $h^{(e)}(x)$ is the same as the dimension of $h^{(o)}(x)$, which is smaller than the input dimension. In addition, we train the models in mini-batches. Specifically, in order to make each batch during training include samples from both the source and target domains, we split $\mathcal{D}_s$ and $\mathcal{D}_t$ into the same number of mini-batches, then a mini-batch is selected randomly from $\mathcal{D}_s$ and from $\mathcal{D}_t$, respectively, to make up one batch to train the models. It is worth noting that the emotion labels of the target domain samples are not used during training. Our models are trained for 50 epochs and each epoch includes 25 iterations of backpropagation, i.e., 25 batches. That is to say, we use 1250 iterations to train each model. In this paper, a linear SVM is used to classify the emotion. When training SVMs, we balance the training samples of the positive and negative classes by the Synthetic Minority Over-sampling TEchnique (SMOTE) (Chawla et al., 2002). Grid search is performed in the set $\{2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}, 2^{4}, 2^{5}\}$ for $C$ of the linear SVM, and the value of the parameter ($C = 2^{0} = 1$) that gives the best results on the validation set is chosen as the one used for cross-validation.

• Cross Training (CT): In this method, the training and the test samples come from different datasets and the feature distributions of these samples are different. Specifically, the source domain database (ABC or Emo-DB) is used to train an SVM, and Mont of the target domain dataset is used to test. This method is used to evaluate the performance without transferring in an inter-corpus scenario.
• Matched Training (MT): In this method, the training samples come from Ohm of the target domain dataset. A number of samples are randomly picked (repeated ten times) from Ohm to train the SVM, and the samples in Mont are used to test. In order to compare with CT, the number of picked training samples is the same as that of ABC or Emo-DB, respectively. This method is used to evaluate the performance without transferring in an intra-corpus scenario.
• SHLA (Deng et al., 2014a): The transfer learning method Shared-Hidden-Layer Autoencoder (SHLA) of Deng et al. (2014a) is used to learn features from the source and target training datasets, then the source domain set is used to train the SVM.
• AUDA (Deng et al., 2014b): The Autoencoder-based Unsupervised Domain Adaptation (AUDA) method of Deng et al. (2014b) is used to learn features from the source and target training datasets, then the source domain set is used to train the SVM.
• SAFTL (Deng et al., 2013): The Sparse Autoencoder-based Feature Transfer Learning (SAFTL) method of Deng et al. (2013) is used to learn features from the target training dataset, then the reconstructed source domain set is used to train the SVM.

4.4. Visualization

In order to provide an intuitive understanding of the features learned by our method, we visualize the distributions of the features in the input layer and the hidden layer (without orthogonal term or without domain prediction term) on the source set (ABC) and the target set,
Table 5
UAR comparison when $L_{orth}$ is computed on $\mathcal{D}_s$ and $\mathcal{X}$, with the target testing set being Mont and the source domain dataset being ABC and Emo-DB respectively. The UAR of EDFLM2 is reported in %, and the highest one is highlighted in bold.
which are shown in Fig. 2(a), (b) and (c) respectively. In Fig. 2, red denotes the features of the source set ABC and blue denotes the features of the target set. Here, the features are projected by Principal Component Analysis (PCA), and we choose the first two dimensions, which have the biggest contributions, to visualize.

Fig. 2(a) clearly shows that the distributions of the different domains in the input layer are different. As can be seen in Fig. 2(b) and (c), the discrepancy between the two domains is alleviated with the hidden features learned under emotion supervision. Furthermore, in Fig. 2(d), with the emotion-discriminative features encouraged by the orthogonal term, the discrepancy nearly vanishes.
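The projection used for this kind of plot can be sketched with NumPy alone. The sketch below is an illustration under our own assumptions rather than the exact procedure behind Fig. 2: PCA is fitted jointly on the stacked source and target features, the function name `pca_2d` is hypothetical, and random placeholders stand in for the learned features.

```python
import numpy as np

def pca_2d(features):
    """Project rows of `features` onto their first two principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
source = rng.standard_normal((430, 100))         # placeholder for source-set features
target = rng.standard_normal((500, 100)) + 0.5   # placeholder for target-set features

# Assumption: one PCA is fitted on both domains so they share the same 2-D axes.
proj = pca_2d(np.vstack([source, target]))
source_2d, target_2d = proj[:430], proj[430:]
```

Scattering `source_2d` in red and `target_2d` in blue then gives a plot in the style described above; a shrinking gap between the two clouds indicates a smaller domain discrepancy.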
Table 6
The differences among EDFLM and noor_EDFLM methods corresponding to different network depth and with/without
domain prediction term, and their UAR results with the source set being ABC and Emo-DB respectively. The UAR is
reported in %, and the highest one is highlighted in bold.
Layers Method Emotion label Domain lagbel Orthogonal term Source set
for the source set ABC, the deep one EDFLM2 achieves 65.62%, while the shallow one EDFLM1 only achieves 64.04%. Meanwhile, in comparison with the shallow one noor_EDFLM1 (61.47%), a large improvement has been achieved by the deep one noor_EDFLM2 (64.36%). The reason may be that the deep one learns two hierarchical non-linear transformations of the input, while the shallow one learns only one non-linear transformation. For Emo-DB, we reach similar conclusions: EDFLM2 achieves 61.63%, while EDFLM1 only achieves 59.91%.

Furthermore, we can also see from Table 6 that the variant with the domain prediction term performs better than the one without it. Specifically, for ABC, EDFLM2 reaches 65.62% while EDFLM2 (λ = 0) obtains 65.23%, and noor_EDFLM2 achieves 64.36% while noor_EDFLM2 (λ = 0) gets 63.86%. This is mainly because the domain information promotes knowledge transfer between different domains, which yields domain-invariant features.

Performance of the emotion predictor and the domain predictor in Fig. 1: This part reports the UAR of the emotion predictor and the domain predictor on the target testing dataset (Mont) with the source set ABC. The detailed information is listed in Table 7. It clearly shows that the domain predictor gets a relatively higher UAR than the emotion predictor, since all the training samples of the domain predictor have labels.

Table 7
Performance of the emotion predictor and the domain predictor on the target testing set Mont, with the source set being ABC. The UAR is reported in %.
Target database   Emotion predictor   Domain predictor
Mont              60.29               88.21

Comparison with other methods with and without transfer: To further evaluate the performance of our method, we compare it with the methods without transfer (CT and MT) and other transfer methods: SHLA (Deng et al., 2014a), AUDA (Deng et al., 2014b) and SAFTL (Deng et al., 2013), using the basic 384 acoustic features as the input.

Table 8 shows the average UAR comparisons for several methods (MT, CT, SHLA (Deng et al., 2014a), AUDA (Deng et al., 2014b), SAFTL (Deng et al., 2013), noor_EDFLM2, EDFLM2), with the source set being ABC and Emo-DB respectively. The training and testing samples used by different methods and whether each method transfers knowledge between source and target domains are also shown. Overall, our EDFLM2 achieves the best performance compared with the other methods.

More specifically, for ABC, the MT method achieves 60.57%, while the CT method only achieves 56.03%. Meanwhile, the SHLA, AUDA and SAFTL methods reach 63.45%, 63.92% and 56.67% respectively. EDFLM2 boosts the UAR to 65.62%, while noor_EDFLM2 achieves 64.36%. Our method thus achieves the highest average UAR (65.62%) with ABC as the source set.

With Emo-DB as the source set, the average UAR obtained by the MT method increases to 61.20%, because the larger size of Emo-DB leads to more samples being selected from the Ohm dataset. The CT method only obtains an average UAR around chance level (51.01%). The SHLA, AUDA and SAFTL methods get 56.52%, 56.90% and 55.58%, and noor_EDFLM2 achieves 60.60%. EDFLM2 also reaches the highest average UAR (61.63%) on Emo-DB.

Table 8 also clearly shows that, on both datasets ABC and Emo-DB, noor_EDFLM2 and EDFLM2 get a higher average UAR than MT and CT. The reason is that noor_EDFLM2 and EDFLM2 are unsupervised transfer learning methods that can make full use of the information in unlabeled samples to train the model, while MT and CT only use the information in labeled samples, which are very limited in ABC and Emo-DB. Moreover, the method proposed in this paper outperforms the other three well-established transfer learning methods (SHLA, AUDA and SAFTL).

In order to evaluate the statistical significance of the improvements of our method compared with the other methods AUDA, SAFTL, SHLA, MT and CT, we have conducted a one-sided t-test between our method EDFLM2 and the other five methods on the two source datasets ABC and Emo-DB. The results of the t-test are shown in Table 9. From Table 9, we can see that the significance levels of our method EDFLM2 compared with AUDA, SAFTL, SHLA and CT are less than 0.001 on both source datasets ABC and Emo-DB. This indicates that the improvements of EDFLM2 over AUDA, SAFTL, SHLA and CT are significant. Furthermore, the significance level of EDFLM2 compared with MT obtained on ABC is still less than 0.001, which also shows a high level of statistical significance. However, the significance level obtained on Emo-DB is greater than 0.250. This is mainly because the number of samples in Emo-DB is greater than that in ABC, which leads to more samples being selected by MT from the Ohm dataset of the target domain.

Although Schuller et al. (2011) report a UAR close to 71% by using majority voting in the two-class problem, which is higher than that of our method (61.47% (N = 1)), the two results are not comparable, since the experiments are conducted under different experimental settings. More specifically, the method in Schuller et al. (2011) used 9959 labeled samples to train the model, whereas the EDFLM2 proposed in our paper is unsupervised with respect to the target domain and does not need any labeled target samples.

5. Conclusion

In this paper, we propose a domain adaptation based method for speech emotion recognition, in which both the domain divergence and emotion discrimination are considered. A hierarchical non-linear transformation of the input is learned through a back
propagation network model. To achieve emotion-discriminative and domain-invariant features, on top of the feature extraction, two predictors are imposed on the features: an emotion label predictor and a domain label predictor. To further disentangle the emotion-related factors from the emotion-unrelated factors, we introduce an orthogonal term to encourage the input to be disentangled into two blocks: emotion-discriminative and emotion-unrelated features. The method is evaluated on the INTERSPEECH 2009 Emotion Challenge two-class task. Experimental results with two publicly available corpora show that our approach can further enhance the performance compared with conventional approaches.

Table 8
UAR comparison for different methods, with the source set being ABC and Emo-DB respectively. The UAR is reported in %, and the highest one is highlighted in bold. The training and testing samples used by various methods and whether each method transfers knowledge between source and target domains are also shown in the table.

Table 9
The t-test results of EDFLM2 compared with AUDA, SAFTL, SHLA, MT and CT on two source datasets ABC and Emo-DB. The significance level α is reported.
          AUDA     SAFTL    SHLA     MT       CT
ABC       <0.001   <0.001   <0.001   <0.001   <0.001
Emo-DB    <0.001   <0.001   <0.001   >0.250   <0.001

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 61272211, No. 61672267 and No. 61502208), the Six Talent Peaks Foundation of Jiangsu Province (No. DZXX-027), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR, No. 201700022) and the general Financial Grant from the China Postdoctoral Science Foundation (No. 2015M570413).

References

Abdelwahab, M., Busso, C., 2015. Supervised domain adaptation for emotion recognition from speech. In: Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5058–5063.
Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends® Mach. Learn. 2 (1), 1–127.
Bengio, Y., 2012. Deep learning of representations for unsupervised and transfer learning. Unsupervised Transf. Learn. Chall. Mach. Learn. 7, 19.
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 19, 153.
Blei, D., McAuliffe, J., 2008. Supervised topic models. Adv. Neural Inf. Process. Syst. 20, 121–128.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B., 2005. A database of German emotional speech. In: Proceedings of the 2005 Conference of the International Speech Communication Association (INTERSPEECH), 5, pp. 1517–1520.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357.
Chopra, S., Balakrishnan, S., Gopalan, R., 2013. DLID: deep learning for domain adaptation by interpolating between domains. In: Proceedings of the 2013 ICML Workshop on Challenges in Representation Learning, pp. 1–8.
Coates, A., Ng, A.Y., Lee, H., 2011. An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the 2011 International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 215–223.
Dai, W., Yang, Q., Xue, G., Yu, Y., 2007. Boosting for transfer learning. In: Proceedings of the 2007 International Conference on Machine Learning (ICML), pp. 193–200.
Daumé III, H., Marcu, D., 2006. Domain adaptation for statistical classifiers. J. Artif. Intell. Res. 26, 101–126.
Deng, J., Xia, R., Zhang, Z., Liu, Y., Schuller, B., 2014a. Introducing shared-hidden-layer autoencoders for transfer learning and their application in acoustic emotion recognition. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4818–4822.
Deng, J., Zhang, Z., Eyben, F., Schuller, B., 2014b. Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Process. Lett. 21 (9), 1068–1072.
Deng, J., Zhang, Z., Marchi, E., Schuller, B., 2013. Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: Proceedings of the 2013 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 511–516.
Ding, N., Sethu, V., Epps, J., Ambikairajah, E., 2012. Speaker variability in emotion recognition-an adaptation based approach. In: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5101–5104.
Eleftheriadis, S., Rudovic, O., Pantic, M., 2015. Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition. IEEE Trans. Image Process. 24 (1), 189–204.
Eyben, F., Wollmer, M., Schuller, B., 2009. OpenEAR-introducing the Munich open-source emotion and affect recognition toolkit. In: Proceedings of the 2009 International Conference on Affective Computing and Intelligent Interaction and Workshops (ACIIW), pp. 1–6.
Fayek, H.M., Lech, M., Cavedon, L., 2016. Modeling subjectiveness in emotion recognition with deep neural networks: ensembles vs. soft labels. In: Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), pp. 566–570.
Ganin, Y., Lempitsky, V., 2015. Unsupervised domain adaptation by backpropagation. In: Proceedings of the 2015 International Conference on Machine Learning (ICML), pp. 1180–1189.
Glorot, X., Bordes, A., Bengio, Y., 2011. Domain adaptation for large-scale sentiment classification: a deep learning approach. In: Proceedings of the 2011 International Conference on Machine Learning (ICML), pp. 513–520.
Han, K., Yu, D., Tashev, I., 2014. Speech emotion recognition using deep neural network and extreme learning machine. In: Proceedings of the 2014 Conference of the International Speech Communication Association (INTERSPEECH), pp. 223–227.
Hassan, A., Damper, R., Niranjan, M., 2013. On acoustic emotion recognition: compensating for covariate shift. IEEE Trans. Audio Speech Lang. Process. 21 (7), 1458–1468.
Kan, M., Wu, J., Shan, S., Chen, X., 2014. Domain adaptation for face recognition: targetize source domain bridged by common subspace. Int. J. Comput. Vis. 109 (1–2), 94–109.
Le, D., Provost, E.M., 2013. Emotion recognition from spontaneous speech using hidden Markov models with deep belief networks. In: Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 216–221.
Liu, B., Huang, M., Sun, J., Zhu, X., 2015. Incorporating domain and sentiment supervision in representation learning for domain adaptation. In: Proceedings of the 2015 International Joint Conference on Artificial Intelligence (IJCAI), pp. 1277–1283.
Mao, Q., Dong, M., Huang, Z., Zhan, Y., 2014. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimedia 16 (8), 2203–2213.
Pan, S., Yang, Q., 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 (10), 1345–1359.
Scherer, K.R., 2005. What are emotions? And how can they be measured? Soc. Sci. Inf. 44 (4), 695–729.
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. 61, 85–117.
Schmidt, E.M., Kim, Y.E., 2011. Learning emotion-based acoustic features with deep belief networks. In: Proceedings of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 65–68.
Schuller, B., Arsic, D., Rigoll, G., Wimmer, M., Radig, B., 2007. Audiovisual behavior modeling by combined feature spaces. In: Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2, pp. II–733.
Schuller, B., Batliner, A., Steidl, S., Seppi, D., 2011. Recognizing realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun. 53 (9–10), 1062–1087.
Schuller, B., Steidl, S., Batliner, A., 2009. The INTERSPEECH 2009 emotion challenge. In: Proceedings of the 2009 Conference of the International Speech Communication Association (INTERSPEECH), 2009, pp. 312–315.
Schuller, B., Vlasenko, B., Eyben, F., Wollmer, M., Stuhlsatz, A., Wendemuth, A., Rigoll, G., 2010. Cross-corpus acoustic emotion recognition: variances and strategies. IEEE Trans. Affect. Comput. 1 (2), 119–131.
Tian, L., Moore, J.D., Lai, C., 2015. Emotion recognition in spontaneous and acted dialogues. In: Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 698–704.
Zhuang, F., Cheng, X., Luo, P., Pan, S.J., He, Q., 2015. Supervised representation learning: transfer learning with deep autoencoders. In: Proceedings of the 2015 International Joint Conference on Artificial Intelligence (IJCAI), pp. 4119–4125.
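
Supplementary sketch. The results above are reported as unweighted average recall (UAR) and compared with a one-sided t-test. The following minimal Python sketch illustrates both quantities under stated assumptions; it is not the authors' evaluation code. The function names, the toy labels and the toy scores are invented for illustration, and the paired form of the test statistic is an assumption, since the text does not say whether the test was paired. The first part also shows why a classifier that ignores the minority class scores only chance level (50%) in a two-class task.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls; each class counts equally, whatever its size."""
    hits = defaultdict(int)    # correctly recognised samples per class
    totals = defaultdict(int)  # number of samples per class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return mean(hits[c] / totals[c] for c in totals)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the one-sided hypothesis mean(scores_a) > mean(scores_b).

    Assumes the scores are paired (same runs or folds for both methods).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Toy, invented two-class data: 8 samples of one class, 2 of the other.
y_true = ["NEG"] * 8 + ["IDL"] * 2
always_neg = ["NEG"] * 10  # a degenerate classifier that ignores the minority class

# Plain accuracy would be 0.8 here, but the UAR is only chance level.
print(unweighted_average_recall(y_true, always_neg))  # 0.5

# Toy, invented per-run scores for two hypothetical methods.
print(paired_t_statistic([3, 5, 7], [1, 2, 3]))
```

A large positive statistic supports the claim that the first method is better; turning it into a significance level requires the Student t distribution with n - 1 degrees of freedom, which is left out here to keep the sketch within the standard library.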