Вы находитесь на странице: 1из 10

Speech Communication 93 (2017) 1–10

Contents lists available at ScienceDirect

Speech Communication
journal homepage: www.elsevier.com/locate/specom

Learning emotion-discriminative and domain-invariant features for


domain adaptation in speech emotion recognition
Qirong Mao∗, Guopeng Xu, Wentao Xue, Jianping Gou, Yongzhao Zhan
Department of Computer Science and Communication Engineering, Jiangsu University, China

a r t i c l e i n f o a b s t r a c t

Article history: Conventional approaches for Speech Emotion Recognition (SER) usually assume that the feature distri-
Received 26 July 2016 butions between training and test set are identical. However, this assumption does not hold in many
Revised 20 May 2017
real scenarios. Although many Domain Adaptation (DA) methods have been proposed to solve this prob-
Accepted 21 June 2017
lem, the conventional emotion discriminative information is ignored. In this paper, we propose a DA
Available online 12 July 2017
based method called Emotion-discriminative and Domain-invariant Feature Learning Method (EDFLM) for
Keywords: SER, in which both the domain divergence and emotion discrimination are considered to learn emotion-
Domain adaptation discriminative and domain-invariant features by using emotion label constraint and domain label con-
Speech emotion recognition straint. Furthermore, to disentangle the emotion-related factors from the emotion-unrelated factors, we
Neural network introduce an orthogonal term to encourage the input to be disentangled into two blocks: emotion-related
and emotion-unrelated features. Our method can learn emotion-discriminative and domain-invariant fea-
tures through a back propagation network which uses the acoustic features of INTERSPEECH 2009 Emo-
tion Challenge as the input rather than raw speech signals. Experiments on the INTERSPEECH 2009 Emo-
tion Challenge two-class task show that the performance of our method is superior to other state-of-the-
arts methods.
© 2017 Published by Elsevier B.V.

1. Introduction and unsupervised DA. Unsupervised DA is more challenging and


is more in line with the practical situations. So in this paper, we
The problem of automatically predicting the emotional states in mainly deal with unsupervised DA for Speech Emotion Recognition
speech emotion recognition has been the subject of increasing at- (SER).
tention among the speech community. Many conventional state-of- Recently, deep leaning has achieved state-of-the-art perfor-
the-art speech emotion recognition methods usually assume that mance on many machine learning tasks (Bengio, 2009). The suc-
the features of the training and test samples are drawn from the cess of deep learning mainly contributes to the ability of extracting
same distribution. This assumption does not hold in many real abstract hierarchical non-linear features of the input (Schmidhuber,
world applications. This is mainly because the speech signals from 2015; Bengio et al., 2007; Coates et al., 2011). Meanwhile, deep
different domains are highly dissimilar in terms of speakers, type learning has shown to suit well to DA (Bengio, 2012).
of emotion, recording situation and degree of spontaneity. A clas- Although previous deep-learning based DA methods aim to
sifier just trained on a specific corpus and then applied directly to learn a more powerful feature representation to reduce the dis-
another corpus cannot be expected to have excellent performance. crepancy between the source and target domains, most of them
Domain adaptation (DA), proposed by Daumé III and Marcu do not take the label information into account in training time. For
(2006), has proven to be efficient for this problem. DA is one spe- SER, we are eager to learn an emotion-discriminative and domain-
cial type of transfer learning problem. The feature distributions of invariant feature representation. So, to meet the above two re-
samples in source and target domains are different, but the tasks of quirements at the same time, we should find a tradeoff between
the source and the target remain the same (Daumé III and Marcu, them. Thus seeking for a saddle point of the loss function is an ur-
2006; Pan and Yang, 2010). Based on whether the target domain gent demand. Ganin and Lempitsky (2015) propose a feasible ap-
data is partially unlabeled or completely unlabeled, DA techniques proach to find this saddle point by introducing a gradient reversal
are commonly classified into two categories: semi-supervised DA layer (GRL). It is simple to implement such a layer in any feed-
forward models, and the parameters update can be done using the

standard stochastic gradient descent (SGD).
Corresponding author.
E-mail address: mao_qr@ujs.edu.cn (Q. Mao).

http://dx.doi.org/10.1016/j.specom.2017.06.006
0167-6393/© 2017 Published by Elsevier B.V.
2 Q. Mao et al. / Speech Communication 93 (2017) 1–10

In SER, conventional methods aim to learn the emotion-specific tive Shared Gaussian Process Latent Variable Model (DS-GPLVM)
features that are robust to the nuisance factors, such as speaker proposed in Eleftheriadis et al. (2015) is one of the state-of-the-
variation and environment distortion (Ding et al., 2012; Mao et al., art multi-task learning methods for multi-pose FER, in which DS-
2014). So we really hope to get such a powerful feature represen- GPLVM first learns a discriminative manifold shared by multiple
tation in the DA method for SER. Just using the emotion label pre- views of a facial expression, and then facial expression classifica-
dictor cannot achieve an excellent emotion-discriminative feature tion is performed in the expression manifold, either in the view-
representation. invariant manner (using only a single view of the expression) or in
In this paper, we propose a new DA method called Emotion- the multi-view manner (using multiple views of the expression). In
discriminative and Domain-invariant Feature Learning Method (ED- addition, the supervised Latent Dirichlet Process model (Blei and
FLM) for SER. In this method, we first introduce an orthogonal McAuliffe, 2008) is also a multi-task learning approach.
term to disentangle the emotion-related factors and the emotion-
unrelated factors. Then our model learns a hierarchical non-linear 2.2. Deep learning for SER
transformation of the emotion-related features through a Back
Propagation (BP) network. On top of the feature extraction, two Recently, deep learning based methods seem to have the abil-
predictors are imposed on the high level emotion feature represen- ity to learn hierarchical non-linear transformation of the input, and
tation: emotion label predictor and domain label predictor. Thus they have been successfully applied in the field of affective com-
the emotion-discrimination and domain-invariance of the features puting. SER also benefits a lot from deep learning based mod-
are encouraged. In summary, our main contributions in this work els. Schmidt and Kim (2011) sought to employ regression-based
are as follows: Deep Belief Networks (DBNs) to learn emotion-based features di-
rectly from magnitude spectra for music emotion recognition. In
(1) By introducing an orthogonal term into the objective func-
Le and Provost (2013), several hybrid classifiers for affective com-
tion of BP network at the beginning, we can extract affect-
puting were evaluated, in which DBNs were used to estimate the
salient features for SER by disentangling emotions from
emotion probabilities and Hidden Markov Models (HMMs) cap-
other factors such as speaker variance and noise. Specifically,
tured the temporal property of the emotion. Han et al. (2014) uti-
the learned features in the first layer are divided into two
lized Deep Neural Networks (DNNs) to produce an emotion state
blocks, related to emotion and other remaining factors, re-
probability distribution for each speech segment and then con-
spectively. The emotion-related features are input into the
structed utterance-level features, and finally an Extreme Learning
second layer of BP network, which is discriminative and ro-
Machine (ELM) was used to recognize the utterance-level emo-
bust, leading to great performance improvement on SER.
tions. Tian et al. (2015) compared knowledge-inspired disfluency
(2) Emotion discrimination and domain invariance are
and non-verbal vocalization features in emotional speech with sta-
combined into the deep architecture to learn emotion-
tistical features describing acoustic characteristics of the samples
discriminative and domain-invariant features for SER. This
by using Long Short-Term Memory Recurrent Neural Networks
is achieved by jointly optimizing the underlying features as
(LSTM-RNNs) which outperformed SVM given enough samples. In
well as two classifiers operating on these features: (i) the
Fayek et al. (2016), a DNN was employed to learn a mapping from
emotion label predictor that predicts the emotion label and
Fourier-transform based filter banks to emotion classes by using
(ii) the domain label predictor that classifies the source and
soft labels produced from multiple annotators to model the sub-
the target domains.
jectiveness in emotion recognition, which improved performance
The rest of this paper is organized as follows. We introduce the compared with using ground truth labels obtained by majority vot-
related work in Section 2. Section 3 presents our method in details. ing between the same annotators.
Section 4 describes SER benchmark databases, original acoustic fea-
tures, and reports our experimental results. Conclusions are dis- 2.3. Domain adaption
cussed in Section 5.
Although deep learning methods have a powerful feature learn-
2. Related work ing ability, most of them are sensitive to small perturbations in the
input features. So it is necessary to expand deep learning methods
In SER, how to extract the most discriminative features has to cross domain scenarios. DA aims to reduce the distribution dis-
been an important research area. Meanwhile, due to the power- crepancy between different domains. With the increasing attention
ful ability of feature learning in deep learning, more and more of DA, a large number of DA methods have been proposed over
efforts are dedicated into speech emotion feature learning using the past decades, which can be broadly classified into two groups:
deep neural networks. But most of these works are based on the instance-based DA and feature-based DA.
same assumption: the training and test set have the same feature For the first one, different weights to the old source domain
distribution. This assumption does not hold in many practical sce- samples are calculated, so that the weighted source domain sam-
narios. Recently, many DA methods are proposed to deal with this ples can be approximately deemed as holding the same distribu-
problem. In the following, we first review some multi-task learn- tion to the target domain, or the most related source domain sam-
ing and deep learning methods for SER, and then discuss the DA ples would be selected according to the weights. For instance, Dai
methods. Finally, we introduce some DA methods for cross-domain et al. (2007) utilized a small part of the target domain labeled sam-
SER. ples to update the weights of the source domain samples, so that
the most important part of the source domain was known. Hassan
2.1. Multi-task learning et al. (2013) applied three methods to determine the weights of the
source domain samples, and then incorporated these weights into
There are very few multi-task learning algorithms for SER, but the SVM classifiers. Kan et al. (2014) learned the targetization coef-
many multi-task learning algorithms are proposed for multi-pose ficients in a common subspace by using the sparse target domain
Face Expression Recognition (FER). Multi-pose FER learns the fa- neighbors, and the coefficients targetized the source domain sam-
cial expression discriminative information by sharing the pool of ples into the target domain.
features or manifold under various poses instead of training mod- For the second one, the original input features are usually
els and tuning parameters for each pose separately. Discrimina- transformed into a new space, and in this space the new fea-
Q. Mao et al. / Speech Communication 93 (2017) 1–10 3

ture representation of different domain samples can be consid- audio files and yi(s ) is the corresponding emotion label. D t =
ered to have the same distribution. In other words, the learned
{xi(t ) , yi(t ) }|ni=1
t
denotes the target domain dataset and xi(t ) is also
features have the property of domain-invariance. Many approaches
have been proposed. Glorot et al. (2011) proposed to combine the a feature vector of the ith utterance and yi(t ) is the corresponding
samples from different domains as the input to Stack Denoising emotion label. The number of samples in the source and the tar-
Auto-encoders (SDA), so as to learn a robust feature representation get domain datasets is denoted as ns and nt , respectively. Here, the
across domains. Chopra et al. (2013) considered the feature distri- source and target domains share the same feature and label space,
bution shift between different domains, and constructed interme- i.e., each feature vector x ∈ Rk and the corresponding emotion label
diate databases by combining parts of the source and target do- y ∈ {1, 2, . . . , r} (r is the number of emotion labels), but the feature
mains samples, so that all the databases seemed to form a path vectors of these two domains are from two different distributions.
from the source to target. Each database was used to learn a fea- Let X = {x|x ∈ {D s ∪ D t }} be all the samples, i.e., feature vectors
ture extraction using deep networks, and finally all the path fea- and their emotion labels from the source and target domains. It is
ture representations were combined as a new feature representa- worth noting that the emotion labels of the target domain are not
ns +nt
tion. used in the training time. Let D = {di }|i=1 represent the set of
Above-mentioned methods did not seem to encode the domain domain labels, where di = 1 if xi , yi ∈ D s , and di = 0 if xi , yi ∈ D t .
label information into the feature learning. Meanwhile, divergence Let Y = {y|y ∈ {yi(s ) |ni=1 s
}} be the set of the emotion labels of the
across different domains did not explicitly reduce. To approach source samples. At training time, we have access to X , D , and Y .
this problem, several approaches have been proposed. Zhuang et al. Our aim is to predict the emotion labels of samples in target do-
(2015) introduced the label information in feature learning, and main.
used Kullback–Leibler (KL) divergence to minimize the difference
3.2. Emotion-discriminative and domain-invariant feature learning
of domain distributions. Liu et al. (2015) considered domain super-
vision and emotion supervision, and modified the SDAs incorporat-
In this section, we present our transfer feature learning method
ing these two terms. Ganin and Lempitsky (2015) also introduced
using a BP network in SER. The architecture of our method is
label and domain information to learn discriminative and domain-
shown in Fig. 1, which has three parts: feature extractor, emotion
invariant features. A gradient reversal layer had been used to up-
predictor and domain predictor. Specifically, we use the 384 at-
date the parameters with gradient descent for feed-forward mod-
tributes extracted by the open source toolkit openEAR (Eyben et al.,
els.
2009) of the speech signal as the input of our model. And then,
the input attributes will be divided into two parts: emotion-related
2.4. Domain adaption for cross-corpus SER features and emotion-unrelated features using the disentangling
penalty during the feature extraction. The emotion-related features
For SER, many DA-based methods are also proposed. Deng are further input into the higher layer to learn high-level feature
et al. (2013) proposed to learn a sparse autoencoder-based feature representation and predict emotions and domains. The prediction
mapping rule with the access to a small part of the target do- error and the reconstruction error will be used to adjust the pa-
main labeled samples, and then the source domain samples re- rameters of the entire network with back-propagation. The iter-
constructed by this mapping were used to train a classifier. Deng ations finish until convergence. In our method, gradient reversal
et al. (2014a) have used both the source and target domain samples layer between the feature extractor and domain label predictor is
as input, modified the basic autoencoder ,and proposed a Shared- introduced to search the saddle point between emotion and do-
Hidden-Layer Autoencoder (SHLA) approach to learn common fea- main. This gradient reversal layer multiplies the gradient by a cer-
ture representations shared across different domains. Also, paper tain negative constant during the backpropagation based training.
Deng et al. (2014b) introduced an adaptive denoising autoencoder Otherwise, the training proceeds in a standard way and minimizes
method, in which prior knowledge learned from the target domain the emotion prediction loss (for source examples) and the domain
was used to regularize the training on the source domain. In ad- classification loss (for all samples). Gradient reversal ensures that
dition, Abdelwahab and Busso (2015) investigated several impor- the feature distributions over the two domains are similar, thus re-
tant factors in semi-supervised DA: the number of target domain sulting in the domain-invariant features, the detail of which is de-
labeled samples, speaker diversity in the labeled set, the perfor- scribed in the next section.
mance between the spontaneous and naturalistic samples, and the Specifically, in the part of feature extraction, it is assumed that
best approach to adapt the models. there are N + 1 layers and the features extracted from the feature
Current works about DA for SER just consider the divergence extractor form the feature vector f shown in Fig. 1. And k(n) denotes
of feature distribution, and ignore the emotion-discriminative in- the number of the nodes in the nth layer, and h(n) (x) represents
formation in SER. In this paper, we propose a new DA method the feature representation in the nth layer, n ∈ [0, 1, . . . , N]. It is
for SER, in which domain discrepancy is minimized and emotion- assumed that h(0 ) (x ) = x, then the output of x in the nth layer is
discrimination is encouraged to learn emotion-discriminative and denoted as:
domain-invariant features for SER. Furthermore, we introduce an  T 
orthogonal term into the loss function to disentangle the emotion- h(n ) (x ) = ϕ (W (n ) ) h(n−1) (x ) + b(n ) , (1)
related factors from all of the features. Therefore, our model can where W (n ) ∈ Rk(n )×k(n−1 ) is a weight matrix, and b(n) ∈ Rk(n) is a
learn a high level emotion-discriminative and domain-invariant bias vector. The function ϕ is a nonlinear activation function such
feature representation of the emotion-related features through a as widely used sigmoid function. Parameters in feature extraction
BP network. are θ f = {W (i ) , b(i ) }N
i=1
.
To reduce the influence of the emotion-unrelated factors (such
3. Proposed methodology as speakers, contents and environment variations), here we intro-
duce an orthogonal term to disentangle the emotion-related factors
3.1. Problem formalization from the other factors during feature extraction. For input x, we
map it into two distinct blocks of features: one that encodes the

Assume D s = {xi(s ) , yi(s ) }|ni=1
T
s
denotes the source domain dataset, discriminative factors of input, h(e ) (x ) = h(1 ) (x ) = ϕ (W (1 ) ) x +

in which xi(s ) is the ith utterance-level feature vector of the b(1 ) , and one that encodes all other factors, h(o) (x ) = ϕ (wT x + c ).
4 Q. Mao et al. / Speech Communication 93 (2017) 1–10

Fig. 1. The framework of EDFLM. Three important parts are included: feature extractor, emotion label predictor (dashed box), and domain label predictor (solid line box).
A gradient reversal layer is inserted between feature extraction layer and domain predict layer and it multiplies the gradient by a certain negative constant during the
backpropagation based training. Otherwise, the training proceeds in a standard way and minimizes the emotion prediction loss (for source examples) and the domain
classification loss (for all samples). Gradient reversal ensures that the feature distributions over source and target domains are made similar, thus leading to domain-invariant
features. To further get the emotion-discriminative features, we map the original input into two blocks: emotion-related and emotion-unrelated features (dot-dash line box).

Parameters are θe = {W (1 ) , b(1 ) } and θo = {w, c}. Meanwhile, both main prediction Ld are denoted as:
feature blocks are trained to cooperate to reconstruct their com-
1 
mon input x. We consider the mean squared error as the loss func- L y ( θ f , θy ) = L y ( G y ( h ( x ; θ f ) ; θy ) ; y )
ns
tion and the reconstruction loss term Lrecon is then formalized as: x∈D s

1  1 
r T (N )
eθyi h (x )
Lrecon (θe , θo ) = Lrecon (x; g(h(x; θe , θo ))) =− 1{y = i} log r , (4)
ns + nt ns  θ T h (N ) ( x )
x∈X x∈D s i=1 e yj
1   j=1
= x − g(h(x; θe , θo ))2 , (2)
ns + nt
x∈X
1 
where Lrecon is the reconstruction error, i.e., the squared error be- L d ( θ f , θd ) = L d ( G d ( h ( x ; θ f ) ; θd ) ; d )
ns + nt
tween the input x and its reconstructed output g(h(x; θ e , θ o )). The x∈X

function g is the decoder function to reconstruct x based on the 1  1 T (N )


eθdi h (x )
feature vector h(x; θ e , θ o ), which is a nonlinear activation func- =− 1{d = i} log , (5)
ns + nt 
1
eθdj h (x )
T (N )
tion such as widely used sigmoid function. θ e and θ o are the pa- x∈X i=0

rameters corresponding to emotion-related features and emotion- j=0

unrelated features (θ e is part of θ f ). (N )


where θy = {θy1 , . . . , θyc }, θyi ∈ R1×k are the parameters of emo-
For disentangling features, by asking each sensitivity vector (N )
∂ hi(e ) (x ) tion label predictor, and θd = {θd0 , θd1 }, θdi ∈ R1×k are the param-
∂x of the ith emotion-discriminative feature hi(e ) (x ) to prefer eters of domain label predictor. θ f is the parameter of network of
∂ h(jo) (x ) feature extractor. Gy and Gd map the feature into emotion label
being orthogonal to every sensitivity vector ∂x of the jth non-
(o) and domain label respectively, and Ly and Ld measure the loss of
discriminative feature h j (x ), these two blocks h (x) and h (x)
(e) (o)
emotion label prediction and domain label prediction for a single
can be encouraged to represent different directions of variation in training sample.
the input x. In our study, that is to say, the emotion-related factors Putting all the components of the loss function together, we get
can be efficiently disentangled from all other factors. The loss of the loss function of the whole network as follows:
the orthogonal term is then formalized as: 
1
L ( θ f , θy , θd , θo ) = Lrecon (x; g(h(x; θe , θo )))
1 
 ∂ h(e) (x ) T ∂ h(o) (x ) 2 ns + nt
x∈X
Lorth (θe , θo ) = i j
. (3)
ns ∂ x ∂ x 1 
x∈D i, j
s +α L y ( G y ( h ( x ; θ f ) ; θy ) ; y )
ns
x∈D s
After we get the high level feature representation h(N) (x) of the
1 
input x, we want to use it to predict the emotion label and do- −λ L d ( G d ( h ( x ; θ f ) ; θd ) ; d )
main label. During the training, only the source domain samples ns + nt
x∈X
in D s are used for emotion label prediction. All the source and tar-
+β Lorth (θe , θo )
get samples in X are used for domain label prediction. Softmax re-
gression is used to predict emotion label and domain label. More = Lrecon (θe , θo ) + α Ly (θ f , θy )
formally, the loss of the emotion prediction Ly and that of the do- −λLd (θ f , θd ) + β Lorth (θe , θo ), (6)
Q. Mao et al. / Speech Communication 93 (2017) 1–10 5

where α , λ and β weigh the contribution of the emotion predic- Algorithm 1 Emotion-discriminative and Domain-invariant Fea-
tion, the domain prediction and the orthogonal term respectively. tures Learning Model (EDFLM).
Here, −λ ensures that the feature distributions over the two do-
Input: Source domain samples D s , target domain samples D t .
mains are similar.
Output: Weights and biases in feature extraction θ f =
We estimate the parameters in the entire network together by
using the samples in X which includes all the samples in the
{W (i) , b(i) }Ni=1 .
source domain and the target domain. During the training, the 1: Do forward propagation for data points in a batch;
computations of the emotion prediction loss and the orthogonal Compute the loss of the reconstruction term Lrecon ;
loss are performed on D s which is the source domain dataset with Compute the loss of the orthogonal term Lorth ;
emotion labels, and for other samples in X the losses above are Compute the loss of emotion label predictor Ly ;
not computed. The domain prediction loss and the reconstruction Compute the loss of domain label predictor Ld ;
loss are computed on X . The backpropagation algorithm is used to 2: Compute the partial derivatives of all variables in Eq. (8);
compute the partial derivatives of all parameters, then all param- 3: Update the variables according to Eq. (8);
eters are updated jointly. That is to say, θ f , θ o , and θ d are trained 4: Iteratively perform Step1-3 until the algorithm converges;
using dataset X , while θ y is trained using D s . 5: Compute the high level features h(N ) (x ), and use them to train
the classifier as discussed in Section 3.4.
3.3. Optimization with backpropagation

In order to minimize the third term (the loss of the domain trained with the source domain labeled samples and is applied to
label) and optimize Eq. (6) with SGD, a saddle point of Eq. (6) is the target domain test samples for emotion label prediction.
thus introduced:

(θˆf , θˆy , θˆo ) = arg min L(θ f , θy , θo, θˆd ), 4. Experiments


θ f , θy , θo
(7)
(θˆd ) = arg maxL(θˆf , θˆy , θˆo, θd ). 4.1. Database
θd

Ganin and Lempitsky (2015) demonstrated that SGD can be We evaluate our method in the INTERSPEECH 2009 Emotion
adapted for the search of the saddle point. This can be accom- Challenge (EC) two-class task, in which the spontaneous FAU Aibo
plished by introducing a special gradient reversal layer between Emotion Corpus (FAU AEC) is used. FAU AEC contains 9 h of Ger-
the feature extractor and domain label predictor. During the for- man speech of 51 children interacting with Sony’s pet robot Aibo
ward propagation, GRL does nothing. But in the backward propa- at two different schools, Ohm and Mont. We treat FAU AEC as tar-
gation, the gradient is multiplied by a negative number −λ. get domain database, in which the 9959 samples of school Ohm
All the weights and biases can be updated using the gradient are used as target domain training samples, the 8257 samples of
descent algorithm as follows: Mont for target testing.
Additionally, we choose two publicly available databases, the
∂ Lrecon ∂ Ly ∂ Ld ∂ Lorth
W (1 ) ← W (1 ) − μ ( +α −λ +β ), Airplane Behavior Corpus (ABC) (Schuller et al., 2007) and the
∂ W (1 ) ∂ W (1 ) ∂ W (1 ) ∂ W (1 ) database of German emotional speech (Emo-DB) (Burkhardt et al.,
∂ Lrecon ∂ Ly ∂L ∂L
b(1) ← b(1) − μ( + α (1) − λ (d1) + β orth ), 2005), as the source sets. They are highly different from the tar-
∂b ( 1 ) ∂b ∂b ∂ b(1) get set FAU AEC in terms of age, type of emotion and recording
∂ Ly ∂ Ld situation, and degree of spontaneity.
W (m ) ← W (m ) − μ ( α −λ ),
∂ W (m ) ∂ W (m ) Paper (Scherer, 2005) gives the emotion distribution in the
∂ Ly ∂ Ld Arousal-Valence dimensional emotion space (A-V space) accord-
b(m ) ← b(m ) − μ(α (m ) − λ (m ), ing to psychology. For the comparability with FAU AEC, the di-
∂b ∂b )
∂ Lrecon ∂L verse emotion classes in the databases FAU AEC, ABC and Emo-DB
w ← w − μ( + β orth ), are first mapped onto the valence axis of the A-V space (Scherer,
∂w ∂w
2005), and then make the emotions located at the negative side of
∂ Lrecon ∂ Lorth
c ← c − μ( +β ), the valence axis of the A-V space with the label of “negative” since
∂c ∂c they denote the unpleasant emotions. On the contrary, the emo-
∂L
θd ← θd − μ ( d ) , tions located at the positive side of the valence axis of the dimen-
∂ θd sional emotion space are labeled as “positive” since they denote
∂L the pleasant emotions. The detailed mapping strategy is shown in
θy ← θy − μ ( α y ) ,
∂ θy Table 1 (Schuller et al., 2010; Deng et al., 2014a). Besides, Table 2
(8) summarizes the three datasets and shows the difference among
them, including age, language, speech, emotion, etc.
where m = 2, . . . , N and μ is the learning rate. For α and β , they
weigh the contribution of the emotion predictor loss and the or-
thogonal term, respectively, and λ is the parameter of gradient re- 4.2. Acoustic features
versal layer, which also weighs the contribution of domain predic-
tor loss. The details of the proposed algorithm are summarized in For the raw input feature representation, we keep in line with
Algorithm 1, and the framework of the model is shown in Fig. 1. the INTERSPEECH 2009 EC (Schuller et al., 2009) and use a base-
line feature set. It consists of 12 functionals applied to 2 × 16
3.4. Classifier construction acoustic Low-Level Descriptors (LLDs) including their first order
delta regression coefficients as shown in Table 3. Therefore, the fea-
After all the parameters are optimized in training time, we get ture vector per chunk contains 16 × 2 × 12 = 384 attributes. To en-
the feature h(N) (x), i.e., feature f in Fig. 1 to train the classifier. Here, sure reproducibility, the open source toolkit openEAR (Eyben et al.,
we choose the widely used SVMs as the classifier. The classifier is 2009) is utilized to extract 384 attributes.
6 Q. Mao et al. / Speech Communication 93 (2017) 1–10

Table 1
Emotion categories mapping onto negative and positive valence for three databases.

Corpus Negative Positive

FAU AEC angry, touchy, emphatic, reprimanding motherese, neutral, joyful, rest
ABC aggressive, intoxicated, nervous, tired cheerful, neutral, rest
Emo-DB anger, boredom, disgust, fear, sadness joy, neutral

Table 2
Summary of the three chosen datasets.

Corpus Age Language Speech Emotion #Valence #All h:mm #m #f Rec Rate

− + kHz

FAU AEC children German variable natural 5823 12,393 18,216 9:20 21 30 normal 16
ABC adults German fixed acted 213 217 430 1:15 4 4 studio 16
Emo-DB adults German fixed acted 352 142 494 0:22 5 5 studio 16

Age (adults or children). Number of utterances per binary valence (#Valence, Negative ( − ), Positive ( + )), and overall number of
utterances (#All). Total audio time. Number of female (#f) and male (#m) subjects. Recording conditions (studio/normal). Sampling
Rate.

Table 3 Due to the unbalance of different classes, we choose the un-


Overview of the standardized feature set provided by the INTERSPEECH
weighted average recall (UAR), namely the mean accuracy over the
2009 EC.
accuracy of each class. Furthermore, to avoid ‘lucky’ or ‘unlucky’
LLDs (16 × 2) Functionals (12) selection, the reported performance in UAR is the average over 10
() ZCR mean runs.
() RMS Energy standard deviation In order to evaluate the performance of our method, we first vi-
() F0 kurtosis, skewness sualize the distribution of features in input layer and hidden layer
() HNR extremes: value, rel, position, range
(with and without orthogonal term) between source set and target
() MFCC 1–12 linear regression: offset, slope, mean square error
set. Then we analyze the effect of three factors for our method: the
depth of the model, with and without orthogonal term, and with
Table 4
and without domain information. Finally, we evaluate our method
The final selected parameters for EDFLM2 method with the
source set being ABC and Emo-DB respectively.
based on UAR and compare it with the following methods:

Source Set k(n ) α β λ • Cross Training (CT): In this method, the training and the test
ABC 100 1 2 2 samples come from different datasets and the feature distribu-
Emo-DB 100 1 1 1 tions of these samples are different. Specifically, the source do-
main database (ABC or Emo-DB) is used to train an SVM, and
Mont of target domain dataset is used to test. This method is
4.3. Parameter selection & experimental setup used to evaluate the performance without transferring in an
inter-corpus scenario.
In this paper, we choose one third of samples in the target • Matched Training (MT): In this method, the training samples
dataset as the validation set. The parameters, the number of hid- come from Ohm of target domain dataset. A number of samples
den units k(n) ∈ {20 0, 10 0, 50}, hyper-parameters α , β ∈ {0.1, 0.5, are randomly picked (repeated ten times) from Ohm to train
1, 2, 3}, and coefficient λ ∈ {0.1, 0.5, 1, 2, 3}, are tuned by per- the SVM, and the samples in Mont are used to test. In order to
mutation and combination using the validation set. The selected compare with CT, the number of picked training samples is the
values of parameters are listed in Table 4, and we will use these same with that of ABC or Emo-DB, respectively. This method
parameters to conduct our experiments in the following. Further- is used to evaluate the performance without transferring in an
more, we initialize W(i) as a matrix with ones on the main diag- intra-corpus scenario.
onal and zeros elsewhere, and set the bias b(i) as 0 for each layer • SHLA (Deng et al., 2014a): The transfer learning method Shared-
of the network. The dimension of h(e) (x) is the same with the di- Hidden-Layer Autoencoder (SHLA) in paper (Deng et al., 2014a)
mension of h(o) (x) which is smaller than the input dimension. In is used to learn features by source and target training datasets,
addition, we train the models in a batch way. Specifically, in or- then the source domain set is used to train the SVM.
der to make each batch during training include samples in source • AUDA (Deng et al., 2014b): The Autoencoder-based Unsuper-
and target domains, we split the D s and D t into the same num- vised Domain Adaptation (AUDA) method in paper (Deng et al.,
ber of mini-batches, then a mini-batch is selected randomly from 2014b) is used to learn features by source and target training
D s and D t respectively to make up of one batch to train the mod- datasets, then the source domain set is used to train the SVM.
els. It is worth noting that the emotion labels of the target do- • SAFTL (Deng et al., 2013): The Sparse Autoencoder-based Fea-
main samples are not used during training. Our models are trained ture Transfer Learning (SAFTL) method in paper (Deng et al.,
by 50 epochs and each epoch includes 25 iterations of backprop- 2013) is used to learn features by target training dataset, then
agation, i.e., 25 batches. That is to say, we use 1250 iterations to the reconstructed source domain set is used to train the SVM.
train each model. In this paper, linear SVM is used to classify the
emotion. When training SVMs, we balance the training samples of 4.4. Visualization
positive and negative classes by Synthetic Minority Over-sampling
TEchnique (SMOTE) (Chawla et al., 2002). Grid search is performed In order to provide intuitive understanding of the learned fea-
in the set of {2−1 , 20 , 21 , 22 , 23 , 24 , 25 } for C of linear-SVM, and the tures by our method, we visualize the distributions of the features
value of parameter (C = 20 = 1) that gives the best results on the in input layer and hidden layer (without orthogonal term or with-
validation set is chosen as the one used for cross-validation. out domain prediction term) on source set (ABC) and target set,
Q. Mao et al. / Speech Communication 93 (2017) 1–10 7

Table 5
UAR comparison when Lorth is computed on D s and X , with
the target testing set being Mont and the source domain
dataset being ABC and Emo-DB respectively. The UAR of
EDFLM2 is reported in %, and the highest one is highlighted
in bold.

Source set Lorth on D s Lorth on X

ABC 65.62 65.50


Emo-DB 61.63 61.27

which are shown in Fig. 2(a), (b) and (c) respectively. In Fig. 2 red
denotes the features in source set ABC and blue denotes the fea-
tures of target set. Here, the features are projected by Principal
Component Analysis (PCA), and we choose the first two dimen-
sions having the biggest contributions to visualize.
Fig. 2(a) clearly shows that the distributions between differ-
ent domains in input layer are different. As it can be seen in
Fig. 2(b) and (c), the discrepancy between two domains is allevi-
ated with the learned hidden features by the emotion supervision.
Furthermore, in Fig. 2(d), with the emotion-discriminative features
inspired by orthogonal term, the discrepancy nearly vanishes.

4.5. Performance evaluation

Performance comparison when the orthogonal penalty term


is computed on the source domain dataset and on the en-
tire dataset: To verify the reasonality of computing Lorth on the
source domain dataset D s , we compare the experimental results
of EDFLM2 (our EDFLM model with two feature representation lay-
ers) when Lorth is computed on D s and the entire dataset X , re-
spectively. The UAR results reported in % are listed in Table 5. From
the table we can see that Lorth computed on the entire dataset X
does not give better UAR results than that on the source domain
dataset D s no matter which dataset (ABC or Emo-DB) is selected
as D s . Furthermore, the training will spend much more time when
Lorth is computed on the entire dataset X . Therefore, the computa-
tion of Lorth is performed on the source domain dataset D s instead
of on X in our experiments.
The effect of depth, orthogonal term and domain informa-
tion: In order to analyze the effect of depth, orthogonal term
and domain information in our method, we analyze the results
with different depth of network, with/without orthogonal term
and with/without domain prediction term in Eq. (6). The method
without orthogonal term is denoted as noor_EDFLM. For the
depth of the model, we analyze the shallow model (N = 1) and
the deep model (N = 2). We denote the shallow one of EDFLM
and noor_EDFLM method as EDFLM1 and noor_EDFLM1 respec-
tively, the deep one as EDFLM2 and noor_EDFLM2 . Furthermore,
the methods which do not contain domain prediction term and
with and without orthogonal term are denoted as EDFLM (λ = 0 )
and noor_EDFLM (λ = 0 ), respectively. The differences among the
above methods and the average UAR results of all these methods
with respect to the source set being ABC and Emo-DB are shown
in Table 6.
The results in Table 6 clearly show that the methods with the
orthogonal term outperform the corresponding method having the
same condition without this term. Specifically, on ABC, EDFLM2
achieves 65.62% while noor_EDFLM2 only obtains 64.36%. This is
mainly because the orthogonal term disentangles the emotion-
Fig. 2. The distributions of the first two dimensions from PCA projections of input
discriminative factors from all features. The emotion-discriminative layer, hidden layer (without orthogonal term), hidden layer (without domain pre-
features are more important than the domain-related features. And diction term) and hidden layer (EDFLM) between source set ABC (red) and target
on the source set Emo-DB, we get the similar conclusion (EDFLM2 set (blue). (For interpretation of the references to color in this figure legend, the
reader is referred to the web version of this article.)
61.63% versus noor_EDFLM2 60.60%).
Our experiments also show that the deep one performs better
than the shallow one under the same conditions. More specifically,
8 Q. Mao et al. / Speech Communication 93 (2017) 1–10

Table 6
The differences among EDFLM and noor_EDFLM methods corresponding to different network depth and with/without
domain prediction term, and their UAR results with the source set being ABC and Emo-DB respectively. The UAR is
reported in %, and the highest one is highlighted in bold.

Layers Method Emotion label Domain lagbel Orthogonal term Source set

ABC (%) Emo-DB (%)

N=1 noor_EDFLM1 yes yes no 61.47 59.39


noor_EDFLM1 (λ = 0 ) yes no no 61.40 59.31
EDFLM1 (λ = 0 ) yes no yes 63.87 59.45
EDFLM1 yes yes yes 64.04 59.91
N=2 noor_EDFLM2 yes yes no 64.36 60.60
noor_EDFLM2 (λ = 0 ) yes no no 63.86 60.45
EDFLM2 (λ = 0 ) yes no yes 65.23 61.32
EDFLM2 yes yes yes 65.62 61.63

Table 7 achieves 64.36%. Our method achieves the highest average UAR
Performance of the emotion predictor and the domain predic-
(65.62%) on ABC database.
tor on the target testing set Mont, with the source set being
ABC. The UAR is reported in %. Comparing with ABC, Emo-DB’s average UAR obtained by the
MT method increases to 61.20% because of the larger size of Emo-
Target Database Emotion predictor Domain predictor
DB leading to more samples selected from the Ohm dataset. The
Mont 60.29 88.21 CT method only obtains average UAR around chance level (51.01%).
The SHLA, AUDA and SAFTL methods get 56.52%, 56.90% and
55.58%, and noor_EDFLM2 achieves 60.60%. EDFLM2 also reaches
the highest average UAR (61.63%) on Emo-DB.
Table 8 also clearly shows that, on both datasets ABC and Emo-
for the source set ABC, the deep one EDFLM2 achieves 65.62%, DB, noor_EDFLM2 and EDFLM2 get better average accuracy than
while the shallow one EDFLM1 only achieves 64.04%. Meanwhile, MT and CT, since noor_EDFLM2 and EDFLM2 are unsupervised
in comparison with the shallow one noor_EDFLM1 (61.47%), a large transfer learning methods which can make full use of the infor-
improvement has been achieved by the deep one noor_EDFLM2 mation of unlabeled samples to train the model while MT and CT
(64.36%). The reason may be that the deep one learns two hierar- only use the information of labeled samples which are very limited
chical non-linear transformations of the input, while the shallow in ABC and Emo-DB. Moreover, our method proposed in this paper
one learns only one non-linear transformation. For Emo-DB, we outperforms other three well-established transfer learning meth-
have similar conclusions. EDFLM2 achieves 61.63%, while EDFLM1 ods (SHLA, AUDA and SAFTL).
only achieves 59.91%. In order to evaluate the statistical significance of the improve-
Furthermore, we can also see from Table 6 that the one with ments of our method compared with other methods AUDA, SAFTL,
domain prediction term performs better than that without domain SHLA, MT and CT, we have conducted the one-sided t-test experi-
prediction term. Specifically, for ABC, EDFLM2 reaches 65.62% while ment by using our method EDFLM2 and other five methods on two
EDFLM2 (λ = 0 ) obtains 65.23%, and noor_EDFLM2 achieves 64.36% source datasets ABC and Emo-DB. The results of t-test are shown
while noor_EDFLM2 (λ = 0 ) gets 63.86%. This is mainly because in Table 9. From Table 9, we can see that the values of significance
the domain information prompts the knowledge transfer between level of our method EDFLM2 compared with AUDA, SAFTL, SHLA
different domains. This makes domain-invariant features obtained. and CT are less than 0.001 on two source datasets ABC and Emo-
Performance of the emotion predictor and the domain pre- DB. It indicates that the improvements of our method EDFLM2
dictor in Fig. 1: This section gives UAR of the emotion predictor are significant compared with AUDA, SAFTL, SHLA and CT. Fur-
and the domain predictor on the target testing dataset (Mont) with thermore, the significance level of EDFLM2 compared with MT ob-
the source set ABC. The detailed information is listed in Table 7. tained on ABC is still less than 0.001, which also shows a high level
It clearly shows that the domain predictor gets a relatively higher of statistical significance. However, the significance level obtained
UAR than the emotion predictor since all the training samples of on Emo-DB is greater than 0.250. It is mainly because the sam-
the domain predictor have labels. ple number in Emo-DB is greater than that in ABC which leads to
Comparison with other methods with and without transfer: more samples selected by MT from the Ohm dataset of the target
To further evaluate the performance of our method, we compare it domain.
with the methods without transfer (CT and MT) and other transfer Although paper (Schuller et al., 2011) gets UAR close to 71%
methods: SHLA (Deng et al., 2014a), AUDA (Deng et al., 2014b) and which is higher than our method (61.47% (N = 1)) by using ma-
SAFTL (Deng et al., 2013), using the basic 384 acoustic features as jority voting in the two class problem, they are not comparable
the input. since the experiments of these two methods are conducted under
Table 8 shows the average UAR comparisons for several methods different experiment settings. More specifically, the method in pa-
(MT, CT, SHLA (Deng et al., 2014a), AUDA (Deng et al., 2014b), SAFTL per (Schuller et al., 2011) used 9959 labeled samples to train the
(Deng et al., 2013), noor_EDFLM2 , EDFLM2 ), with the source set be- model. However, EDFLM2 proposed in our paper does not need any
ing ABC and Emo-DB respectively. The training and testing sam- labeled samples of target domain set by using the unsupervised
ples used by different methods and whether each method trans- way.
fers knowledge between source and target domains are also shown
clearly. Overall, our EDFLM2 achieves the best performance com- 5. Conclusion
pared with other methods.
More specifically, for ABC, the MT method achieves 60.57%, In this paper, we propose a domain adaptation based method
while the CT method only achieves 56.03%. Meanwhile, the SHLA, for speech emotion recognition, in which both the domain diver-
AUDA and SAFTL methods reach 63.45%, 63.92% and 56.67% respec- gence and emotion discrimination are considered. A hierarchical
tively. EDFLM2 can boost the UAR to 65.62%, while noor_EDFLM2 non-linear transformation of the input is learned through a back
Q. Mao et al. / Speech Communication 93 (2017) 1–10 9

Table 8
UAR comparison for different methods, with the source set being ABC and Emo-DB respectively. The UAR is reported in %, and the highest
one is highlighted in bold. The training and testing samples used by various methods and whether each method transfers knowledge between
source and target domains are also shown in the table.

Method Training Testing Transfer Source set

ABC (%) Emo-DB (%)

CT ABC/Emo-DB for SVM training Mont no 56.03 51.01


MT Ohm (same with ABC/Emo-DB in sample number) for SVM Mont no 60.57 61.20
training
SHLA (Deng et al., 2014a) ABC/Emo-DB and unlabeled Ohm for shared feature Mont yes 63.45 56.52
learning and ABC/Emo-DB for SVM training
AUDA (Deng et al., 2014b) ABC/Emo-DB and Unlabeled Ohm for DAE and ADAE Mont yes 63.92 56.90
training and ABC/Emo-DB for SVM training
SAFTL (Deng et al., 2013) 50 labeled samples (25 for positive, 25 for negative) from Mont yes 56.67 55.58
Ohm to train two class-specific SAEs and ABC/Emo-DB
reconstructed by SAEs for SVM training
noor_EDFLM2 ABC/Emo-DB and Unlabeled Ohm for feature learning and Mont yes 64.36 60.60
ABC/Emo-DB for SVM training
EDFLM2 ABC/Emo-DB and Unlabeled Ohm for feature learning and Mont yes 65.62 61.63
ABC/Emo-DB for SVM training

Table 9 Chopra, S., Balakrishnan, S., Gopalan, R., 2013. DLID: deep learning for domain adap-
The t-test results of EDFLM2 compared with AUDA, SAFTL, SHLA, tation by interpolating between domains. In: Proceedings of the 2013 ICML
MT and CT on two source datasets ABC and Emo-DB. The signifi- Workshop on Challenges in Representation Learning, pp. 1–8.
cance level α is reported. Coates, A., Ng, A.Y., Lee, H., 2011. An analysis of single-layer networks in unsuper-
vised feature learning. In: Proceedings of the 2011 International Conference on
AUDA SAFTL SHLA MT CT Artificial Intelligence and Statistics (AISTATS), pp. 215–223.
Dai, W., Yang, Q., Xue, G., Yu, Y., 2007. Boosting for transfer learning. In: Proceedings
ABC <0.001 <0.001 <0.001 <0.001 <0.001
of the 2007 International Conference on Machine learning (ICML), pp. 193–200.
Emo-DB <0.001 <0.001 <0.001 >0.250 <0.001
Daumé III, H., Marcu, D., 2006. Domain adaptation for statistical classifiers. J. Artif.
Intell. Res. 26, 101–126.
Deng, J., Xia, R., Zhang, Z., Liu, Y., Schuller, B., 2014a. Introducing shared-hid-
den-layer autoencoders for transfer learning and their application in acoustic
propagation network model. To achieve emotion-discriminative emotion recognition. In: Proceedings of the 2014 IEEE International Conference
and domain-invariant features, on top of the feature extrac- on Acoustics, Speech and Signal Processing (ICASSP), pp. 4818–4822.
Deng, J., Zhang, Z., Eyben, F., Schuller, B., 2014b. Autoencoder-based unsupervised
tion, two predictors are imposed on the features: emotion la- domain adaptation for speech emotion recognition. IEEE Signal Process. Lett. 21
bel predictor and domain label predictor. To further disentan- (9), 1068–1072.
gle the emotion-related factors from the emotion-unrelated fac- Deng, J., Zhang, Z., Marchi, E., Schuller, B., 2013. Sparse autoencoder-based feature
transfer learning for speech emotion recognition. In: Proceedings of the 2013 In-
tors, we introduce an orthogonal term to encourage the input ternational Conference on Affective Computing and Intelligent Interaction (ACII),
to be disentangled into two blocks: emotion-discriminative and pp. 511–516.
emotion-unrelated features. The method is evaluated on the IN- Ding, N., Sethu, V., Epps, J., Ambikairajah, E., 2012. Speaker variability in emotion
recognition-an adaptation based approach. In: Proceedings of the 2012 IEEE
TERSPEECH 2009 Emotion Challenge two-class task. Experimental
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
results with two public available corpora show that our approach pp. 5101–5104.
can further enhance the performance compared with conventional Eleftheriadis, S., Rudovic, O., Pantic, M., 2015. Discriminative shared Gaussian pro-
cesses for multiview and view-invariant facial expression recognition. IEEE
approaches.
Trans. Image Process. 24 (1), 189–204.
Eyben, F., Wollmer, M., Schuller, B., 2009. OpenEAR-introducing the Munich open–
source emotion and affect recognition toolkit. In: Proceedings of the 2009 In-
Acknowledgment ternational Conference on Affective Computing and Intelligent Interaction and
Workshops (ACIIW), pp. 1–6.
Fayek, H.M., Lech, M., Cavedon, L., 2016. Modeling subjectiveness in emotion recog-
This work is supported by the National Natural Science Founda-
nition with deep neural networks: ensembles vs. soft labels. In: Proceed-
tion of China (No. 61272211, No. 61672267 and No. 61502208), the ings of the 2016 International Joint Conference on Neural Networks (IJCNN),
Six Talent Peaks Foundation of Jiangsu Province (No.DZXX-027), the pp. 566–570.
Ganin, Y., Lempitsky, V., 2015. Unsupervised domain adaptation by backpropaga-
Open Project Program of the National Laboratory of Pattern Recog-
tion. In: Proceedings of the 2015 International Conference on Machine Learning
nition (NLPR, No. 20170 0 022) and the general Financial Grant from (ICML), pp. 1180–1189.
the China Postdoctoral Science Foundation (No. 2015M570413). Glorot, X., Bordes, A., Bengio, Y., 2011. Domain adaptation for large-scale sentiment
classification: a deep learning approach. In: Proceedings of the 2011 Interna-
tional Conference on Machine Learning (ICML), pp. 513–520.
References Han, K., Yu, D., Tashev, I., 2014. Speech emotion recognition using deep neural
network and extreme learning machine. In: Proceedings of the 2014 Confer-
ence of the International Speech Communication Association (INTERSPEECH),
Abdelwahab, M., Busso, C., 2015. Supervised domain adaptation for emotion recog-
pp. 223–227.
nition from speech. In: Proceedings of the 2015 IEEE International Conference
Hassan, A., Damper, R., Niranjan, M., 2013. On acoustic emotion recognition: com-
on Acoustics, Speech and Signal Processing (ICASSP), pp. 5058–5063.
pensating for covariate shift. IEEE Trans. Audio Speech Lang. Process. 21 (7),
Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends ® Mach. Learn.
1458–1468.
2 (1), 1–127.
Kan, M., Wu, J., Shan, S., Chen, X., 2014. Domain adaptation for face recognition:
Bengio, Y., 2012. Deep learning of representations for unsupervised and transfer
targetize source domain bridged by common subspace. Int. J. Comput. Vis. 109
learning. Unsupervised Transf. Learn. Chall. Mach. Learn. 7, 19.
(1–2), 94–109.
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise training
Le, D., Provost, E.M., 2013. Emotion recognition from spontaneous speech using hid-
of deep networks. Adv. Neural Inf. Process. Syst. 19, 153.
den Markov models with deep belief networks. In: Proceedings of the 2013
Blei, D., McAuliffe, J., 2008. Supervised topic models. Adv. Neural Inf. Process. Syst.
IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU),
20, 121–128.
pp. 216–221.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B., 2005. A database of
Liu, B., Huang, M., Sun, J., Zhu, X., 2015. Incorporating domain and sentiment su-
German emotional speech. In: Proceedings of the 2005 Conference of the Inter-
pervision in representation learning for domain adaptation. In: Proceedings
national Speech Communication Association (INTERSPEECH), 5, pp. 1517–1520.
of the 2015 International Joint Conference on Artificial Intelligence (IJCAI),
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: synthetic mi-
pp. 1277–1283.
nority over-sampling technique. J. Artif. Intell. Res. 16, 321–357.
10 Q. Mao et al. / Speech Communication 93 (2017) 1–10

Mao, Q., Dong, M., Huang, Z., Zhan, Y., 2014. Learning salient features for speech Schuller, B., Batliner, A., Steidl, S., Seppi, D., 2011. Recognizing realistic emotions and
emotion recognition using convolutional neural networks. IEEE Trans. Multime- affect in speech: state of the art and lessons learnt from the first challenge.
dia 16 (8), 2203–2213. Speech Commun. 53 (9–10), 1062–1087.
Pan, S., Yang, Q., 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. Schuller, B., Steidl, S., Batliner, A., 20 09. The INTERSPEECH 20 09 emotion challenge.
22 (10), 1345–1359. In: Proceedings of the 2009 Conference of the International Speech Communi-
Scherer, K.R., 2005. What are emotions? And how can they be measured? Soc. Sci. cation Association (INTERSPEECH), 2009, pp. 312–315.
Inf. 44 (4), 695–729. Schuller, B., Vlasenko, B., Eyben, F., Wollmer, M., Stuhlsatz, A., Wendemuth, A.,
Schmidhuber, J., 2015. Deep learning in neural networks: an overview. Neural Netw. Rigoll, G., 2010. Cross-corpus acoustic emotion recognition: variances and
61, 85–117. strategies. IEEE Trans. Affect. Comput. 1 (2), 119–131.
Schmidt, E.M., Kim, Y.E., 2011. Learning emotion-based acoustic features with deep Tian, L., Moore, J.D., Lai, C., 2015. Emotion recognition in spontaneous and acted di-
belief networks. In: Proceedings of the 2011 IEEE Workshop on Applications of alogues. In: Proceedings of the 2015 International Conference on Affective Com-
Signal Processing to Audio and Acoustics (WASPAA), pp. 65–68. puting and Intelligent Interaction (ACII), pp. 698–704.
Schuller, B., Arsic, D., Rigoll, G., Wimmer, M., Radig, B., 2007. Audiovisual behav- Zhuang, F., Cheng, X., Luo, P., Pan, S.J., He, Q., 2015. Supervised representation learn-
ior modeling by combined feature spaces. In: Proceedings of the 2007 IEEE In- ing: transfer learning with deep autoencoders. In: Proceedings of the 2015 In-
ternational Conference on Acoustics, Speech and Signal Processing (ICASSP), 2, ternational Joint Conference on Artificial Intelligence (IJCAI), pp. 4119–4125.
pp. II–733.

Вам также может понравиться