Information Fusion 53 (2020) 209–221

Information Fusion
journal homepage: www.elsevier.com/locate/inffus

Full Length Article

A snapshot research and implementation of multimodal information fusion for data-driven emotion recognition
Yingying Jiang a, Wei Li a, M. Shamim Hossain b,∗, Min Chen a,∗, Abdulhameed Alelaiwi b,
Muneer Al-Hammadi c
a School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
b Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
c Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia

Keywords: Artificial intelligence, Multimodal information fusion, Data-driven emotion recognition

Abstract

With the rapid development of artificial intelligence and mobile Internet, new requirements for human-computer interaction have been put forward. The personalized emotional interaction service is a new trend in the human-computer interaction field. As a basis of emotional interaction, emotion recognition has also seen many new advances with the development of artificial intelligence. Current research on emotion recognition mostly focuses on single-modal recognition such as expression recognition, speech recognition, limb recognition, and physiological signal recognition. However, the limited emotional information in a single modality and its vulnerability to various external factors lead to lower accuracy of emotion recognition. Therefore, multimodal information fusion for data-driven emotion recognition has been attracting the attention of researchers in the affective computing field. This paper reviews the development background and hot spots of data-driven multimodal emotion information fusion. Considering the real-time mental health monitoring system, the current development of multimodal emotion datasets, multimodal feature extraction (including EEG, speech, expression, and text features), and multimodal fusion strategies and recognition methods are discussed and summarized in detail. The main objective of this work is to present a clear explanation of the scientific problems and future research directions in the field of multimodal information fusion for data-driven emotion recognition.

∗ Corresponding authors.
E-mail addresses: yingyingjiang@hust.edu.cn (Y. Jiang), mshossain@ksu.edu.sa (M.S. Hossain), minchen@ieee.org (M. Chen).

Received 18 February 2019; Received in revised form 6 June 2019; Accepted 9 June 2019
Available online 13 June 2019
1566-2535/© 2019 Elsevier B.V. All rights reserved.

1. Introduction

With the rapid development of mobile Internet and artificial intelligence technology, more and more communication has turned to human-machine communication. Also, the demand for an AI-based machine to recognize the user’s emotions and give the corresponding feedback is becoming stronger. People expect interactive machines to have abilities of observation and understanding, and abundant emotion, similar to human beings, thus putting forward new requirements for human-computer interaction [1]. However, the existing human-computer interaction mode of many service robots is mechanical and monotonous, relying only on keyword matching and background search, which is not intelligent enough and lacks an understanding of the semantic context [2,3]. Therefore, we need to add emotional elements and intentional elements, and use affective computing technology to achieve emotional interaction. Emotional interaction has become the main trend in human-computer interaction in the advanced information age. Besides, emotional interaction makes human-computer interaction more intelligent. Namely, it makes human-machine interaction as natural, cordial, vivid, emotional, and warm as human-human interaction, thus realizing a deep human-computer interaction mode and understanding. Emotion recognition also plays an important role in human-computer emotional interaction. Emotion recognition enables machines to perceive human emotional states and develop the ability of empathy. In the United States and Europe, many powerful laboratories have established special research groups to research and develop emotional systems and have received sponsorship and support from some leading companies in that field. For instance, the famous Emotional Computing Team of the MIT Media Laboratory developed an emotional computing system, which collects data by using biosensors and a camera capable of recording facial expressions; the collected data is then processed by the so-called “Emotional Assistant” adjustment program to recognize human emotions [4]. Further, the Softbank company in Japan launched an emotional escort robot named Pepper, which can identify user emotions by analyzing facial expressions [5]. Inner-scope, a neuroscience company, can predict whether a movie will make a splash by observing the highlights that make the audiences’ brains highly active [6]. In [7], the authors propose a novel smart cushion system for detecting the user’s stress state. In [8], the authors propose a novel emotional cognitive system, which can
analyze and predict postpartum depression based on prenatal data. In [9], the authors propose a creative gaming system to help users improvise. However, current research on emotion recognition mostly focuses on single-modal recognition such as expression recognition, speech recognition, limb recognition, and physiological signal recognition. The limited emotional information in a single modality and its vulnerability to various external factors lead to lower accuracy of emotion recognition (e.g., the facial expression is easily occluded, and speech is vulnerable to interference from surrounding noise).

Emotion denotes a subjective attitude of the human nervous system toward external relations. The brain first sends instructions for the corresponding feedback, which influences the human facial expression, the frequency and speed of the voice, and body language, and also influences human organs such as the heart, arms, legs, brain, etc. [10]. Therefore, considering a certain complementarity among the emotion data of different modalities, researchers have started to use facial expressions, blinks, gestures, and other psychophysical signals in emotion recognition research. For example, in [11], the authors use three physiological signals, namely EDA, PPG, and EMG, together to identify human emotions. Multimodal information fusion for data-driven emotion recognition has been attracting the attention of researchers in the affective computing field. Compared to single-modal emotion recognition, multimodal information fusion for data-driven emotion recognition has higher accuracy. Multimodal emotion data fusion and recognition was first proposed by Bigun and Duc in 1997 [12]. They fused facial and voice data and put forward a statistical method based on Bayesian theory. In recent years, artificial intelligence and multi-sensor data fusion technology [13,14] has been developing rapidly. Therefore, great progress has been achieved in research on multimodal emotion data fusion and recognition. Multimodal emotion recognition has abundant and wide application prospects. Besides, it helps to provide some useful functions to the empty-nest elderly and children [15]. By capturing human emotion, psychological comfort can be provided to the empty-nest elderly and children, helping to solve their psychological problems and lighten the load on psychologists. Through dialogue, a machine equipped with mature artificial intelligence considers the patient’s emotion and helps to alleviate disease.

To help those who are interested in emotion recognition to understand multimodal emotion recognition comprehensively, we need to present a comprehensive and systematic survey. A few review papers about multimodal emotion recognition already exist. For example, the survey paper [1] analyzed the major trends and system-level factors correlated to the effects of multimodal emotion recognition, and paper [16] reviewed multi-sensor fusion. The review [17] mainly discussed the development history of affective computing, multimodal emotion datasets, methods for multimodal feature extraction (such as visual, voice, and textual features), multimodal fusion technology, and applicable APIs. However, the physiological signal (such as the EEG modality) was not considered. Due to the great improvement in the development of deep learning and some other AI technologies, many findings in multimodal emotion recognition have been obtained in the last two years. The aforementioned papers do not fully cover multimodal emotion recognition with AI technology. Also, we consider that physiological changes are controlled by the autonomic nervous system and the endocrine system, and are hardly controlled by subjective ideas, so emotion recognition based on physiological signals is objective [18]. Thus, with the change in a human’s subtle physiological state (such as EEG and electrodermal activity), the specific fluctuations in human emotions can be observed and the corresponding emotional change can be recognized. For instance, when people become nervous under pressure or excited because of an evil motive, the sympathetic nerve will cause the relevant somatic reactions, such as heartbeat acceleration, blood pressure increase, breath acceleration, body temperature rise, and even muscle or skin tremble [19–21]. Compared with emotion recognition based on face recognition and movement, recognition based on EEG data or other physiological signals has higher credibility because it is natural and cannot be disguised or changed artificially. Besides, due to great achievements in the field of dry electrodes and wearable technology, emotion analysis based on EEG data obtained in a real environment (not limited to the laboratory environment) is more available [22,23].

Compared with the related literature [1,16,17], this paper mainly focuses on data-driven multimodal emotion data fusion and recognition with AI technology. Considering the real-time emotion health monitoring system, the progress in key technologies related to the dataset, feature extraction, feature fusion, and classification in the multimodal emotion recognition field is analyzed and summarized. This paper aims to comprehensively explain data-driven multimodal emotion information fusion and help readers clearly understand the scientific problems and future research directions in that field.

2. Motivation example of multimodal information fusion

Fig. 1 shows the real-time emotion health surveillance system used in this paper. In this system, the following tasks are completed: the collection of multimodal emotion signals, labeling and selection of the unlabeled emotion dataset on the edge cloud, multimodal emotion data fusion, recognition, and analysis by an AI algorithm on the remote cloud, and the emotional feedback or decision-making control of the intelligent emotional interaction robot. A real-time personalized psychological health guardian can thus be offered to users.

Fig. 1. An example of the real-time emotion healthcare system.

2.1. Multimodal emotion data collection layer

Designing a high-efficiency emotion data collection method is difficult in emotion recognition and interaction. The collected
multimodal emotion data include the EEG data, voice data, expression data, and social contact information of users extracted from their smartphones. According to research findings in neuroscience and cognitive science, the production of emotion is highly correlated with the physiological activity of the cerebral cortex. This offers the theoretical basis for recognizing a user’s emotion by studying the activities of the cerebral cortex (the EEG data) [24]. The collection of EEG data relies on wearable technology, which has been developing rapidly. A 22-channel brain-wearable device designed independently is used in this work. An ADS1299-8 chip from TI is used for EEG signal collection, and a CH559L is used as the master chip. A comfortable, convenient, and non-intrusive brain-wearable device can thus be designed to collect the user’s EEG data in real time with high efficiency [25]. A MIC module and a camera module are configured in an intelligent emotional interaction robot. They are used to collect the user’s voice data and expression data, respectively, on site in real time. The user’s voice, intonation, and expression can largely indicate the user’s psychological activity. For users who are good at concealing their emotions, the EEG data is very useful because it cannot be disguised or changed. Besides, thanks to the development of mobile networks and intelligent terminals, a large number of users tend to announce their daily ideas and record their emotions on social networks [26]. The textual data that users publish on social networks reflects their emotional changes within a period. The above four types of emotion data are analyzed and modeled in this work to achieve higher emotion recognition accuracy.

2.2. Unlabeled emotion data labeling layer

The unlabeled learning algorithm deployed on the edge cloud can distinguish the validity of multimodal emotion data, upload important data, and filter redundant data [27,28]. The remote cloud needs to recognize the user’s emotion rapidly and accurately. Moreover, an intelligent robot with the ability of emotional interaction needs to make a decision as soon as possible and provide the corresponding emotional feedback to the user. The system collects the user’s multimodal emotion data in large quantities. It is a great challenge to reduce interaction delay and improve emotion cognition while maintaining the system’s intelligence [29]. Therefore, the unlabeled learning algorithm introduced in [30] is used in this work. Instead of uploading a large amount of original multimodal emotion data directly to the remote cloud, we consider what effect the data will have on the dataset after it is added, in order to determine whether the unlabeled data is discarded or retained. Only the multimodal emotion data that can enhance the accuracy of emotion cognition is uploaded to the remote cloud. In that way, the amount of uploaded data is decreased, and the intelligence of the remote cloud is maintained.

2.3. Multimodal emotion data fusion and cognition layer

After labeling and filtering multimodal emotion data on the edge cloud, the data are uploaded to the remote cloud, and an AI algorithm is used to cognize and analyze the fusion of multimodal data [31]. The fusion of multimodal emotion data can help to obtain more comprehensive emotion features of the user, enhance the robustness of the emotion service system, and guarantee that the system works effectively when some emotion data are missing. For multimodal emotion data fusion, two key problems must be solved: (1) how to effectively explore the relevance between different modalities and the emotion data that describe different modalities, and (2) how to fuse the emotion features or cognition results based on different modalities. Concretely, for the EEG data, preprocessing removes the artifacts, and the DBN (Deep Belief Network) algorithm is used for feature extraction. For voice data, after the Mel spectrogram is obtained by preprocessing, the AlexNet DCNN (Deep Convolutional Neural Network) is used to extract the emotion features. For facial expression data, the original facial expression images can be directly input into the VGGNet DCNN to extract the features of facial expression. For textual data from the social network, a CNN (Convolutional Neural Network) is used to learn unstructured textual emotion data and extract the textual features. Finally, a feature fusion layer and a softmax classifier are designed in the neural network. The connection parameters are determined by supervised learning. Then the multimodal emotion data can be fused and cognized. Fig. 6 shows the accuracy and loss over different epochs.

2.4. Emotional feedback layer

Based on the fusion result of multimodal emotion data obtained from the remote cloud, the system can accurately cognize the user’s emotions [32]. After cognizing sorrowful and depressed emotions in a user, the intelligent robot can provide the corresponding emotional treatment through emotional interaction and comfort the user. For instance, the robot may play some music, say consoling words, hug the user, and make the user feel the robot’s empathy. Also, the system can send information on a user’s emotion to the mobile phones of the user’s family members and friends after the emotion is cognized. For depressed users, the symptoms can be found as early as possible, and medical suggestions can be given. The real-time emotional health monitoring system gives personalized, intelligent, and humanized emotional feedback to individual users according to their characters and emotional status.

In real-time emotional health monitoring, the key problem is how to fuse the multimodal emotion data to increase the emotion cognition accuracy and offer an accurate real-time emotion service to users. For the multimodal emotion cognition model deployed on the remote cloud, it is necessary to train a complete model in advance using existing datasets. Besides, the model deployed on the remote cloud must generalize well to the multimodal data collected by several sensors in real time. A complete model trained with open multimodal datasets is therefore very important to the real-time emotional health monitoring system. Accordingly, the newest multimodal datasets, multimodal feature extraction, feature fusion, and emotion classifiers are discussed below.

3. Datasets

In most procedures for collecting multimodal data for emotion cognition, the people under test are induced by videos or other means to generate certain emotions. When the wanted emotion is generated, the corresponding data is labeled and recorded. The recorded data is stored as a dataset called the induced or acted dataset. In some works, the spontaneous emotions of the tested users, which are not stimulated by external factors, are recorded. Such a dataset is called a spontaneous dataset. Besides the older datasets surveyed in [17], such as HUMAINE [33], Belfast [34], SEMAINE [35], IEMOCAP [36], and eNTERFACE [37], many new multimodal datasets have been built as the number of user modalities has increased. Therefore, this paper introduces six new multimodal datasets presented in recent years. Table 1 shows the comparison of these six datasets.

AFEW: The dataset was collected by Abhinav et al. [38]. The authors collected temporal videos from movies to depict real-world emotion expression as much as possible. The dataset mainly includes the audio and visual modalities. Two annotators annotated the movie clips with the help of a recommender system proposed in [38]. The dataset has 330 subjects in total and seven main emotion labels, namely sadness, happiness, disgust, anger, fear, surprise, and the neutral class.

RECOLA: This dataset was put forward by Fabien et al. in 2013 [39]. The main modalities of this dataset are audio, visual, ECG, and EDA. It is a spontaneous dataset based on remote cooperative tasks. In addition, the emotion of the participants is manipulated and balanced in dyads. There were 46 people included in the tests, of which 27 were females and 19
males, and the emotional labels were assigned by six annotators according to two-dimensional coordinates.

BAUM-1: This dataset was put forward by Sara et al. [40] in 2017. It includes an acted dataset and a spontaneous dataset, denoted as BAUM-1a and BAUM-1s, respectively. The two main modalities are facial expression and voice. The emotion labels include happiness, anger, sadness, disgust, fear, surprise, boredom, and contempt. The acted dataset was collected by asking the people under test to utter several sentences for the corresponding scenes of eight imagined emotional labels. For the spontaneous dataset, on the other hand, a series of pictures and short videos stimulated the people under test to generate the target emotions. Two cameras were installed at specific positions to capture the facial expression and voice of the people under test and record their emotion. There were 31 people in total under test, including 13 females and 18 males aged 19–65. The total time for watching stimulating videos or pictures and conversing was 50 min for each participant. The recorded videos were then cut into clips by using video processing technology. The acted dataset contains 273 clips, while the spontaneous dataset contains 1184 clips. Each clip is annotated by five annotators, with scores ranging from 0 to 5. The higher the score is, the stronger the corresponding emotional state is. A voting method is applied to the five annotations of each clip, and the final label is the one with the largest number of votes.

EMOEEG: This dataset was put forward by Anne-Claire et al. [41] at the 25th European Signal Processing Conference (EUSIPCO) in 2017. The main modality of the dataset is physiological signals, including EEG, EOG, EMG, ECG, EDA, etc. The Affectiva bracelet was used to record the skin conductance and temperature of the people during the test. There were eight people in total included in the test, of which 5 were males and 3 were females. The emotions were stimulated by images and videos. Each image lasted for 25.5 s and each video lasted for 28 s, and there were 25 image blocks, 50 videos, and 11 sessions. The emotional tagging was performed by self-assessment.

CMU-MOSEI: This dataset was built by Amir et al. [42] of CMU in 2018. It is the largest multimodal dataset at present, and it includes three modalities: text, video, and audio. The dataset contains 23,453 labeled videos from 1000 distinct speakers and covers 250 hot topics. Each video contains a manual transcription that is aligned with the audio at the phoneme level. Three judges from the Amazon Mechanical Turk platform annotated the videos by crowdsourcing. The Ekman criterion was used to classify emotions, namely happiness, sadness, anger, fear, disgust, and surprise.

WESAD: This dataset was put forward by Philip et al. [43] at the ICMI conference in 2018. It is a new open multimodal dataset aimed at wearable stress and emotion identification. This multimodal dataset contains the physiological data and motion data collected by devices placed on the participants’ wrist and chest. There were 15 people included in the test, of which 12 were males and 3 were females, with an average age of 27.5 ± 2.4. The dataset consists of physiological data (ECG, EDA, EMG, RESP, and TEMP) and motion data (ACC). These high-resolution data were acquired by the device worn on the subjects’ chest at a sampling rate of 700 Hz. The subjects then filled in self-reports to complete the emotional labeling. The self-reports represented the subjective experience during the emotional stimulus. The dataset contains three emotional states: neutral, stress, and amusement.

Table 1
Comparison of multimodal emotion datasets.

Dataset | Modality | Subjects | Annotators | Emotion label
AFEW | Audio, visual | 330 | 2 | Happiness, sadness, anger, fear, disgust, surprise and neutral
RECOLA | Audio, visual, ECG, EDA | 46 | 6 | Arousal and valence
BAUM-1 | Audio, visual | 31 | 5 annotators for each clip | Happiness, anger, sadness, disgust, fear, surprise, boredom and contempt
EMOEEG | EEG, EOG, EMG, ECG, EDA | 8 | Self-assessment | Valence and arousal
CMU-MOSEI | Text, visual, audio | 1000 | 3 crowdsourced judges | Happiness, sadness, anger, fear, disgust, surprise
WESAD | ECG, EDA, EMG, RESP, TEMP and ACC | 15 | Self-assessment | Neutral, stress, amusement

4. Feature extraction

Different people express the same emotion in different ways. Some people prefer to show their emotion with language, so their audio data contains more emotional clues [44]. On the other hand, some people tend to use facial expression to show their emotion. Also, some people are good at concealing and hiding their emotion, but their physiological features cannot be disguised or faked. The main physiological features explored in this paper are the EEG features. Because of the great improvements in social networks, people tend to post their daily life and emotional status on social networks. Therefore, multimodal emotion data fusion and cognition have begun to play an increasingly significant part in emotion recognition. Extraction of emotional features in different modalities is the key step of emotion cognition. The feature extraction technologies for different modalities have been developed in recent years.

4.1. EEG features

Because the EEG signal is very weak, it is easily corrupted by noise during the data collection process. Therefore, some preprocessing is needed to remove the artifacts from the EEG signals, including the electro-oculogram signal, the electromyographic signal, etc. At present, the most frequently used artifact removal methods are filtering and independent component analysis [45].

According to the review in [46], the EEG emotion features are usually divided into frequency-domain features, time-domain features, and time-frequency features. Common time-domain features include the event-related potential (ERP), statistical features of the signal (such as the average value, power, standard deviation, first-order deviation, normalized first-order deviation, and so on), the non-stationary index (NSI), the fractal dimension (FD), and higher order crossings (HOC). The frequency-domain features mainly include band power and higher order spectra (HOS). The EEG is generally decomposed into several frequency bands. Table 2 shows the ranges of the commonly used frequency bands and the corresponding decomposition levels. Frequency-domain features such as differential entropy (DE) and power spectral density (PSD) are extracted from each frequency band. Lastly, common time-frequency features include the Hilbert–Huang spectrum (HHS) and the discrete wavelet transform (DWT).

Table 2
EEG band range and decomposition level in [46].

Bandwidth | Frequency band | Decomposition level
1–4 Hz | δ | A6
4–8 Hz | θ | D6
8–10 Hz | slow α |
8–12 Hz | α |
12–30 Hz | β | D4
30–40 Hz | γ | D3

Adrian et al. [47] combined wavelet transform features, frequency-domain features, and time-domain features as the feature input for EEG emotion cognition. The extracted features included the wavelet coefficients, maximum frequency amplitude, standard deviation, power, and mean value. Kairui Guo et al. [48] proposed to combine
Fig. 2. The feature extractors of BiDANN in [52].

the time domain features and DWT features to build a new characteristic series sufficiently, a Long Short Term Memory (LSTM) framework was
variable and then to combine the SVM and HMM as classifiers for the built to learn the contextual features and transfer the input to another
EEG emotion cognition. Concretely, first the relative wavelet energy was space. Space was more effective and had components at a higher grade.
extracted, and then a relative wavelet entropy was extracted, and they By repeating the LSTM module, a series of hidden statuses completely
were combined finally. The authors select standard deviation and DWT representing input series were obtained; also, the feature space of the
coefficients and multiply them to build a new characteristic variable. EEG was obtained.
In the recent latest years, due to the development in deep learning Song et al. [53] presented a novel dynamic graph convolution neu-
field, better performance has been achieved than by using the tradi- ral network, which can dynamically learn the intrinsic relationship of
tional machine learning methods for disposing of the problems of the various EEG channels through the adjacency matrix of the graph, thus
EEG emotional features extraction and cognition. Zheng et al. [49] put discriminating the extraction of EEG emotional characteristics in dif-
forward the differential entropy features obtained from multiple EEG ferent channels Specifically, the characteristics obtained from multiple
channels, and feed these features to DBN network, then train the net- EEG bands are used as input to the graph, and the channel of EEG is
work and extract more advanced emotional features. And because EEG equivalent to the nodes in DGCNN graph. After the graph filtering, a
data have high low-frequency energy in high-frequency energy, differ- 1 × 1 convolutional layer was employed for learning the features of dis-
ential entropy has the balance ability to distinguish low-frequency EEG crimination among different frequency domains.
from high-frequency EEG. Specifically, the author uses a 1s long non- Lin et al. [54] proposed the Conditional Transfer Learning (cTL) for
overlapping Hanning window and a short-time Fourier transform with EEG emotion cognition. The model stimulated positive transfer of every
a sampling point of 512 to extract the characteristics of each sub-band from the original EEG signal, and then obtain the differential entropy feature. In addition, the DBN and a Hidden Markov Model were combined as auxiliary methods to obtain more reliable emotion transition states.

Mei et al. [50] and Tripathi et al. [51] proposed applying a CNN to extract EEG emotional features and recognize emotions. This work shows that important information related to the emotional state is contained in the functional connection matrix, and the proposed model can extract the relevant features representing different emotions for learning.

Li et al. [52] proposed a novel neural network model, BiDANN, for extracting and recognizing EEG emotional features, as shown in Fig. 2. The main idea of BiDANN is to map the EEG signals of the two brain hemispheres into discriminative feature spaces separately, so that the data can be classified easily. The distribution shift between the training set and the test set, as well as the asymmetry of the brain hemispheres, is taken into account in the model. The model contains two feature extractors, which learn the dynamic features of each brain hemisphere respectively; the original EEG data are mapped to a deep feature space with more discriminative emotion information. As for the feature extractor of a single brain hemisphere, to use the time dependence in

individual (improving subject specificity without increasing the amount of labeled data). The cTL first evaluates whether an individual is suitable for positive transfer, and then selectively uses data from other individuals with comparable feature spaces. For the original EEG signal of 30 channels and 30 s in every experiment, the short-time Fourier transform was first applied with a 1-s Hamming window and 50% overlap to transform the signal into the frequency domain. The differential laterality was used to reflect the EEG spectral dynamics of emotional reactions in terms of hemispheric spectral asymmetry. Then, the ReliefF method was used to select the features.

Choong et al. [55] proposed using Detrended Fluctuation Analysis (DFA) to detect the temporal correlation and complexity of the EEG signal features. When calculating DFA for feature extraction in each epoch, the minimum window size is 4 and the maximum window size is 76. The maximum window size is 1/10 of the epoch length, and the window size is incremented by four. Therefore, 19 windows of different sizes were analyzed in each stage.

Generally speaking, dimensionality reduction and selection of the EEG features are needed after the features are extracted. At present, Independent Component Analysis (ICA) [56], Principal Component Analysis (PCA) [57], and Common Spatial Pattern (CSP) [58] are commonly used for that purpose.
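The DFA window schedule attributed to Choong et al. [55] can be checked with a few lines of code. The following is a minimal sketch of first-order DFA in plain Python, not the implementation used in [55]. The epoch length of 760 samples is an assumption inferred from the statement that the maximum window (76) is 1/10 of the epoch length, the function names are hypothetical, and the signal is synthetic white noise standing in for one EEG epoch.

```python
import math
import random


def dfa_window_sizes(epoch_len, min_size=4, step=4):
    """Window sizes from min_size up to epoch_len // 10, in steps of `step`."""
    max_size = epoch_len // 10
    return list(range(min_size, max_size + 1, step))


def _detrended_sq_error(segment):
    """Sum of squared residuals after removing a least-squares line."""
    n = len(segment)
    x_mean = (n - 1) / 2.0
    y_mean = sum(segment) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(segment))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return sum((y - (slope * i + intercept)) ** 2 for i, y in enumerate(segment))


def dfa_fluctuation(signal, window):
    """Root-mean-square fluctuation F(n) of the integrated, detrended signal."""
    mean = sum(signal) / len(signal)
    profile, running = [], 0.0
    for value in signal:
        running += value - mean      # cumulative sum of the mean-removed signal
        profile.append(running)
    n_segments = len(profile) // window
    total = sum(
        _detrended_sq_error(profile[k * window:(k + 1) * window])
        for k in range(n_segments)
    )
    return math.sqrt(total / (n_segments * window))


def dfa_exponent(signal, windows):
    """Slope of log F(n) versus log n, i.e. the DFA scaling exponent."""
    xs = [math.log(w) for w in windows]
    ys = [math.log(dfa_fluctuation(signal, w)) for w in windows]
    x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Assumed epoch length: 760 samples, so that 760 // 10 == 76.
windows = dfa_window_sizes(epoch_len=760)   # 19 windows: 4, 8, ..., 76

random.seed(0)
epoch = [random.gauss(0.0, 1.0) for _ in range(760)]  # synthetic stand-in signal
alpha = dfa_exponent(epoch, windows)
print(len(windows), windows[0], windows[-1])
print(round(alpha, 3))
```

The scaling exponent `alpha` is the kind of scalar that could serve as a per-epoch, per-channel feature; how [55] assembles its final feature vector is not specified here.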


4.2. Visual features

In traditional machine learning methods, facial expression features are extracted manually. In general, there are three groups of methods for visual feature extraction: geometric methods based on the features of facial organs and the positions of prominent facial points, pixel methods based on facial texture features, and hybrid methods. Typical geometric methods include the Point Distribution Model (PDM) and the Active Shape Model (ASM); pixel methods include the Gabor wavelet, the optical flow method, the scale-invariant feature transform (SIFT), local binary patterns (LBP), linear discriminant analysis (LDA), etc.; hybrid methods include the Active Appearance Model (AAM), etc. [59–61]. When a traditional machine learning method is used to extract facial expression features, it is not easy to design and build useful and effective features by hand, because this requires considerable domain knowledge and effort. However, in recent years, with the rise of deep learning, many deep learning algorithms, such as the convolutional neural network (CNN), have achieved good results in the field of visual recognition [62,63].

According to the survey presented in [64], deep learning tries to capture high-level features through several nonlinear transformations and a layered architecture. For learning appearance-based emotional features, common deep learning models include the CNN, DBN, deep autoencoder (DAE), recurrent neural network (RNN), and so on; among them, the CNN is used the most. In recent years, many researchers have used CNNs to learn and recognize appearance-based emotional features.

In [65], the authors used a CNN architecture based on transfer learning. First, two CNN architectures of different depth, AlexNet and VGG-CNN-M-2018, were pre-trained on the ImageNet dataset. In the first fine-tuning stage, the FER-2013 facial expression dataset was used; in the second stage, the EmotiW dataset was used, which adapts the trained network weights to the SFEW dataset. In the FaceNet2ExpNet proposed in [66], the authors designed a two-stage training algorithm and proposed a new distribution function for modeling the high-level hidden neurons of the expression network. First, the convolutional layers are pre-trained and regularized with the face net to obtain the expression network. Then, fully connected layers are appended to the network obtained in the previous step to extract the overall facial emotion features. In [67], the authors not only demonstrated good CNN performance but also introduced a method to explain which parts of the face influence the CNN prediction. Visualizing the spatial patterns that maximally excite different neurons in the convolutional layers makes it possible to analyze the network qualitatively and shows their similarity to facial action units (FAUs). In [68], several deep CNNs were trained as committee members, and their decisions were combined. There were two strategies in the model: (1) to obtain diverse decisions from the deep CNNs, the network structure and input normalization were varied at the start of training; (2) to obtain a better-performing committee in terms of structure and decisions, a hierarchical committee structure with exponentially weighted decision fusion was built. In [69], several CNNs were fused for learning the facial expression features. In [70], the proposed method contained a face detection module based on three existing technologies and a classification module with several deep CNNs, where each CNN model is first trained on the FER2013 dataset and then fine-tuned on the SFEW2.0 dataset to further learn facial emotional features. In [71], the authors propose a deep neural network with two models that learn facial emotion features together. One network extracts appearance features from the original sequence of consecutive face images, and the other extracts temporal geometric features from temporal facial landmark points. The two models are then combined by joint fine-tuning. In [72], the authors propose a new action unit selection method to select the feature maps of a pretrained CNN; they analyzed all the feature maps from the 5th layer of AlexNet and their relation to action units (AUs), and assessed their importance in facial expression recognition by feature ablation experiments. In [73] and [74], a CNN was used to extract 3D facial expression features. In [75] and [76], several CNNs were aggregated to learn the local and holistic characteristics of facial expressions respectively. In [77], the authors combine an RNN and a CNN into a joint network architecture, which extracts the temporal sequence features of dynamic facial expressions and the static spatial features respectively, as shown in Fig. 3. In [78–81], the authors also combined a CNN and an RNN to learn the expression features, achieving good results.

In [82], the authors proposed a visual/video emotion recognition method based on transfer learning. In [83], a new Boosted Deep Belief Network (BDBN) was proposed, which executes three training stages iteratively in a unified loop framework. The joint fine-tuning in the BDBN framework was used to improve the discriminative ability of the features and to strengthen their relative importance to the strong classifier. In [84], according to the sparsity of biological correlation and the cyclic characteristics of the network, an S-DSRN network is proposed to learn the emotional features of human

Fig. 3. The Spatial-Temporal Networks for facial expression feature extraction in [77].


faces, and the network can improve the stability of recognition results. In the proposed DSRN, the sparsity of features was obtained through loss learning instead of additional penalty terms, which are usually designed by hand for sparse data representation. In [85], the authors use a generative-contrastive network to learn facial emotional features. The network consists of three parts: a generative network, a contrastive network and a discriminative network. The generative network generates reference pictures, and the contrastive network compares the real pictures with the generated pictures to obtain the contrastive features.

4.3. Audio features

At present, the acoustic features used for voice emotion recognition are divided into prosodic features, spectrum-based features and voice quality features [86]. All of these features are usually extracted at the frame level. Concretely, the prosodic features include duration, pitch, and energy; the spectrum-based features typically include OSALPC (Unilateral Autocorrelation Linear Predictor Coefficient), MFCC (Mel Frequency Cepstral Coefficient), OSALPCC (OSALPC Based on Cepstral), LPC (Linear Predictor Coefficient), LPCC (Linear Predictor Cepstral Coefficient), LFPC (Logarithmic Frequency Power Coefficient), and so on [87]. The above three types of acoustic features can be fused to obtain mixed features. They are low-level features extracted at the frame level. For instance, in [88], the pitch, formants, zero crossing, MFCC, and their statistical parameters were mixed to form the acoustic features. In [89], the authors use a cross-correlation model to calculate the correlation between the audio samples to be predicted and the audio samples with known labels, and then give the results of emotion recognition. The authors choose six distinct features, namely volume, zero crossing rate, energy, MFCC, spectral centroid and formant, as the emotional features for analysis.

As with the extraction of facial expression features, the application of deep learning to automatically extracting and learning voice emotion features has achieved good results. When deep learning is used for voice emotion recognition, the low-level features of each frame are usually extracted with traditional machine learning methods, and the high-level features are then extracted automatically with deep learning. In [90], the authors first divide the entire utterance into a series of segments and then obtain the primary audio features of each segment. The feature vector of each segment contains MFCC, pitch, and delta features with temporal characteristics. Then, a DNN is used to build the utterance-level features from segment-level probability distributions. In [91], the authors pooled the last hidden layer and encoded each utterance as a fixed-length vector. The feature encoding is trained with an utterance-level classifier in order to classify better. In [92], the emotional features of voice were learnt by combining a low-level feature extractor based on the Gaussian mixture model (GMM) with a high-level feature extractor based on a DNN.

Many researchers used a CNN to extract the high-level features of voice. In [93–95], the authors first calculate low-level spectra from each frame and then extract the high-level features from independent frames, using the spectrograms as the CNN input. In [96], the authors propose a new speech signal processing model that is inspired by the retina and the convex lens imaging principle. By varying the distance between the spectrogram and the convex lens, spectrograms of different sizes, and hence different training data, were obtained. Then, a CNN (AlexNet) was used to obtain the audio emotion features. In [97], the combination of phoneme and spectrogram features was used as the input of a multichannel CNN, and good results were achieved on the IEMOCAP dataset. In [98], a CNN was combined with an autoencoder to learn the discriminative voice features that influence emotion recognition. This model had two stages. In the first stage, a variant of the sparse autoencoder (SAE) with a reconstruction penalty was used, and unlabeled samples were used to learn the local invariant features (LIFs). Next, the LIFs were used as the input of a feature extractor for salient discriminative feature analysis (SDFA). Then a new objective function is used to optimize the network and calculate the loss value. The emotional features learned with this function are clearly distinct from one another, and they are orthogonal and discriminative for speech emotion classification. However, the temporal sequence in speech is neglected in these CNN-based methods of learning emotional features.

To solve the above-mentioned temporal sequence problem of voice, many researchers used an RNN or an LSTM to extract and automatically learn the high-level features. In [99], the authors use the bidirectional long short-term memory (BLSTM) model to extract temporal dynamic emotion features of speech. In [100], an RNN was employed to learn short-frame acoustic features relevant to emotion and to aggregate them into a compact utterance-level representation. In addition, a new feature pooling strategy was proposed, which focuses on the specific regions of the voice signal that contain the emotion. In [101], the authors first extract 238 low-level descriptor (LLD) features from the original speech signal and then feed them into the CTC-RNN network proposed in the paper. The network can automatically align the emotional segments of speech with the emotional labels, rather than with the non-emotional labels. In [102], the authors propose a novel pooling method based on the modulation spectrum, which can alleviate the influence of background noise on emotional feature extraction. Then, a Multilayer Perceptron (MLP) and an LSTM were used to obtain high-level features. In [103], the authors combined a CNN and an LSTM. Vocal tract length perturbation was first used for data augmentation, and the CNN was employed to obtain high-level audio emotional features from the spectrograms, while the Bi-LSTM was used to aggregate the long-term dependencies.

All the above methods for extracting the low-level features depend on handcrafted features. To extract the features from the raw speech waveform automatically, some researchers combined a CNN with an LSTM [104,105]. In [106], the authors propose an end-to-end network architecture consisting of a CNN plus two LSTM layers. The CNN is used to extract emotional features from the original speech signal, and the obtained features are then fed into the LSTM network to further learn the context features of the speech, thereby obtaining a complete speech emotion feature.

4.4. Text features

Textual feature extraction denotes the grammatical and semantic analysis of text. By splitting the sentences, removing redundant information and stop words, and segmenting and tagging words, the emotional words that express the emotional tendency of the text can be extracted. The traditional extraction of textual emotional features mainly depends on rule-based techniques, the Bag of Words (BoW) and some statistical methods [17]. In [107], the authors not only used the BoW to extract the textual emotional features but also proposed a new feature representation method named the eVector. However, after the textual emotional features have been extracted, feature selection is needed. Feature selection refers to selecting the most suitable and effective features from the extracted text features for further analysis. The frequently used methods for feature selection are the word frequency method (WF), document frequency method (DF), mutual information method (MI), information gain method, chi-square test method (CHI), etc. [108,109].

Recently, scholars have tried to obtain feature representations from text data automatically. With the development of deep learning, new deep learning methods have attracted more and more attention in the field of textual feature extraction [110]. In [111], the authors train a CNN model to extract the emotional features of the text. The input of the model is a vector of 306 dimensions per word. The convolution kernels of the CNN extract features from semantically related word vectors, and the convolutional layers do so in a hierarchical manner. In [112], the


authors use a multi-resolution CNN to identify the emotion of text. The CNN comprises several parallel convolutions with kernels of different sizes, so that textual information is exploited at different granularities. In [113], the authors first extract the semantic word vectors with the word2vec method. At the same time, the authors map the emotional words in the text to the emotional space according to an affective lexicon to obtain the emotional word vectors, and then obtain the bottleneck features of the emotional word vectors with an autoencoder. Next, the bottleneck features of the semantic word vectors and the emotional word vectors are fused to obtain the primary text features. Finally, an LSTM is used to learn higher-level text features.

5. Multimodal fusion and classification

The challenge of fusing multimodal emotion data lies in recognizing and analyzing emotion after combining heterogeneous emotion data modalities from different sources and on different time scales [114]. The fusion of multimodal emotion data can offer more reference information for emotion decisions, which raises the accuracy of the overall decision. At present, there are mainly two kinds of multimodal fusion methods: feature-level fusion and decision-level fusion. In some studies, model-level fusion is also introduced [115]. However, compared with the former two fusion methods, model-level fusion is used relatively rarely. Therefore, this paper mainly introduces feature-level fusion and decision-level fusion in the context of the development of AI.

5.1. Feature-level fusion

At present, feature-level fusion is the most commonly used method in multimodal emotion recognition. It combines the features extracted from each modality into a new feature vector in some way. This new feature vector tends to have a high dimension, so a dimensionality reduction method is applied, and finally a classifier is used to identify the emotion. Feature-level fusion exploits the mutual relations between different modalities. However, the differences between the emotional features of different modalities are not considered. Also, time synchronization of different modalities is difficult to achieve. As the number of modalities increases, it becomes even more difficult to learn the correlation among modalities [17]. The traditional fusion methods include concatenation, the outer product, etc.

In [116], the data of different modalities were fed to hidden units and temporal pooling units respectively, and the features were then fused in a multimodal fusion layer. Next, the fused features were fed into an LSTM network for training. Finally, a linear regression layer was used for emotion recognition. In [117], 1280-dimensional audio features were extracted from the video, while 2048-dimensional visual features were extracted from the images. Then, the two kinds of features were concatenated into a 3328-dimensional feature vector, which was fed to a 2-layer LSTM network for training and recognition; good results were achieved on the RECOLA dataset. In [111], the audio, textual and image features were directly concatenated for feature fusion after the features had been extracted and selected. Then, an MKL classifier was used for emotion recognition. In [118], a new method for integrating visual and audio features based on bilinear pooling theory was proposed; a DBN was used to train the fused features; finally, a softmax classifier was employed to recognize emotions. In [119], a new feature fusion method based on the Relational Tensor Network, which fuses text, audio, and image, was proposed. The validity of this method was verified on the CMU-MOSEI dataset. In [120], the authors use a self-attention mechanism to fuse audio features and text features into new emotional features. In [121] and [122], the authors use a DBN to fuse the speech emotional features and the facial emotional features, and then train it to learn the new emotional features after the fusion of the two modalities; finally, an SVM was used for emotion classification and recognition, as shown in Fig. 4. In [123], the authors use extreme learning machines (ELM) to fuse the features of speech and video at the feature level. Specifically, the Mel spectrogram of the speech is fed to a 2D CNN and the key frames of the video are fed to a 3D CNN to extract the features of the fully connected layer, and the features are then merged by two consecutive ELMs. The first ELM contains 100 hidden units and the second ELM contains 250 hidden units. The output of the second ELM is then passed to softmax and SVM classifiers for emotion recognition to obtain the final result.

5.2. Decision-level fusion

In the general process of decision-level fusion, a separate classifier is used for each of the EEG, audio, textual and facial expression features. Then, an algebraic combination rule is adopted to combine the single-modality emotion recognition results and improve the accuracy of the final result. Compared with feature-level fusion, decision-level fusion emphasizes the differences between the features of different modalities, and the most suitable classifier can be chosen for each modality. However, the correlation between features is not considered. Also, the learning process is long and time-consuming [17].

In [124], the weighted product rule was adopted to fuse the audio and image recognition results at the decision level. Specifically, an SVM is used to classify each feature; the weights in the fusion network are then multiplied by the class probabilities obtained for each feature, and the values belonging to the same class are added together. Finally, the label with the greatest probability is selected. In [125], the EEG and facial expression were classified separately, and the sum rule and product rule were used to fuse the recognition results. In [126], a quality adaptive multimodal fusion scheme was used to integrate ECG, EEG, GSR and facial expression data at the decision level. The final output is obtained by fusing the results of each modality according to the decision-level fusion method, as shown in Fig. 5. Moreover, the authors released their dataset, named QAMAF, online. In [127], the authors use decision-level fusion to integrate various network models. The paper first extracts audio features and continuous image sequences from video. After basic preprocessing of the images, the recognition results are obtained with a CNN-RNN and a C3D network respectively. At the same time, an SVM is used to recognize the preprocessed audio data. The predicted scores obtained from the different models are blended by weighted summation.

5.3. Findings and discussion

The studies on feature-level fusion and decision-level fusion covered in this paper are summarized in Table 3. We find that in current multimodal emotion data fusion, feature-level fusion is used more than decision-level fusion. The feature extraction for each modality is mostly based on AI algorithms such as deep learning. Then, for the high-dimensional features after fusion, pooling or other dimensionality reduction methods such as PCA are used to select features and reduce the feature dimensions. A deep learning method is then used for training. Finally, a softmax classifier or a simple linear classifier such as an SVM is used for emotion classification. In decision-level fusion strategies, the sum rule and the product rule are the major trend. In addition, for emotion labels, discrete basic emotions (happiness, anger, sadness, disgust, fear, and surprise) are used more than dimensional methods (arousal and valence).

In Section 2, we proposed a motivating example of multimodal information fusion for data-driven emotion recognition. Based on the feature extraction methods for each modality and the multimodal fusion strategies mentioned above, we have conducted an experiment for the motivating example. In this experiment, we use the DBN algorithm to extract EEG features, an AlexNet DCNN to extract high-level features from the Mel spectrum of speech, a VGGNet DCNN to extract features from


Fig. 4. Feature layer fusion in [121].

Fig. 5. Decision layer fusion presented in [126].


Table 3
Comparison of accuracy with different fusion methods.

Fusion method    Dataset      Reference               Modality                  Average accuracy

Feature-level    RECOLA       Chao et al. [116]       Audio, visual, ECG, EDA   0.667
Feature-level    RECOLA       Tzirakis et al. [117]   Audio, visual             0.760
Feature-level    IEMOCAP      Poria et al. [111]      Audio, visual, text       0.776
Feature-level    IEMOCAP      Hazarika et al. [120]   Audio, text               0.721
Feature-level    eNTERFACE    Nguyen et al. [118]     Audio, visual             0.9085
Feature-level    eNTERFACE    Zhang et al. [121]      Audio, visual             0.8597
Feature-level    CMU-MOSEI    Sahay et al. [119]      Audio, visual, text       0.4917
Feature-level    BAUM-1s      Zhang et al. [121]      Audio, visual             0.5457
Decision-level   AFEW         Sun et al. [124]        Audio, visual             0.512
Decision-level   AFEW         Fan et al. [127]        Audio, visual             0.5902
Decision-level   QAMAF        Gupta et al. [126]      EEG, ECG, GSR, visual     0.59

Fig. 6. The experiment result of emotion recognition: (a) Accuracy of emotion recognition with the number of iteration epochs; (b) Loss function of emotion
recognition with the number of iteration epochs.

user's facial expressions, and a CNN to extract features from the text content of social networks. Then the feature-level fusion strategy proposed in [121] is used to fuse the features of each modality. Finally, a softmax classifier is used to classify the emotions. Fig. 6 shows the accuracy and the loss function over the training epochs. As the number of iterations increases, the accuracy on the training set and the validation set increases, the value of the loss function decreases, and the accuracy finally converges to about 90%, which indicates good recognition performance.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.inffus.2019.06.019.
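Returning to the fusion strategies of Section 5, the two families can be illustrated with a small sketch in plain Python. It shows feature-level fusion by concatenation, using the 1280 and 2048 dimensions reported for [117], and decision-level fusion with generic textbook forms of the weighted sum rule and weighted product rule. These are not the exact formulations of [117], [124], [125] or [127]; the function names, class labels, weights and probability values are all made up for illustration.

```python
def concat_fusion(*feature_vectors):
    """Feature-level fusion: concatenate per-modality feature vectors."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def weighted_sum_rule(prob_lists, weights):
    """Decision-level fusion: weighted sum of per-modality class probabilities."""
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists))
        for c in range(n_classes)
    ]


def weighted_product_rule(prob_lists, weights):
    """Decision-level fusion: weighted product, renormalized to sum to 1."""
    n_classes = len(prob_lists[0])
    scores = []
    for c in range(n_classes):
        score = 1.0
        for w, probs in zip(weights, prob_lists):
            score *= probs[c] ** w
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Feature-level fusion: placeholder vectors with the dimensions given for [117].
audio_features = [0.0] * 1280
visual_features = [0.0] * 2048
fused = concat_fusion(audio_features, visual_features)
print(len(fused))  # 3328

# Decision-level fusion: hypothetical classifier outputs for three classes.
labels = ["happiness", "anger", "sadness"]   # illustrative label set
audio_probs = [0.6, 0.3, 0.1]                # made-up audio classifier output
visual_probs = [0.2, 0.5, 0.3]               # made-up visual classifier output
weights = [0.4, 0.6]                         # made-up modality weights

sum_scores = weighted_sum_rule([audio_probs, visual_probs], weights)
prod_scores = weighted_product_rule([audio_probs, visual_probs], weights)
print(labels[argmax(sum_scores)], labels[argmax(prod_scores)])
```

In this toy case both rules select the same label even though the audio classifier alone would pick a different one, because the visual modality carries the larger weight. In practice the weights would be tuned on validation data.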
6. Conclusion

Taking real-time emotional health monitoring systems as an example, this paper comprehensively reviews and summarizes the relevant key technologies in the field of multimodal information fusion for data-driven emotion recognition. Open datasets, existing extraction techniques for EEG, audio, visual and textual features, feature-level fusion, decision-level fusion and classification are discussed in detail. These discussions aim to provide a comprehensive overview and a big picture of this exciting and active research area.

Acknowledgment

The authors extend their appreciation to the Deanship of Scientific Research at King Saud University, Riyadh, Saudi Arabia for funding this work through the research group project no. RGP-318.

References

[1] S.K. D’mello, J. Kory, A review and meta-analysis of multimodal affect detection systems, ACM Comput. Surv. (CSUR) 47 (3) (2015) 43.
[2] M. Chen, P. Zhou, D. Wu, L. Hu, M. Mehedi Hassan, H. Atif Alamri, Ai-skin: skin disease recognition based on self-learning and wide data collection through a closed loop framework, arXiv:1906.01895 (2019).
[3] Y. Qian, Y. Zhang, X. Ma, H. Yu, L. Peng, Ears: emotion-aware recommender system based on hybrid information fusion, Inf. Fusion 46 (2019) 141–146.
[4] C.L. Breazeal, Designing Sociable Robots, MIT press, 2004.
[5] S. Robotics, Pepper, Softbank Robot. (2016).
[6] T. Bartelme, Meet carl marci: a doctor who wants to measure your emotions, Physician Exec. 38 (1) (2012) 10.
[7] R. Gravina, Q. Li, Emotion-relevant activity recognition based on smart cushion using multi-sensor fusion, Inf. Fusion 48 (2019) 1–10.
[8] M.W. Moreira, J.J. Rodrigues, N. Kumar, K. Saleem, I.V. Illin, Postpartum depression prediction through pregnancy data analysis for emotion-aware smart systems, Inf. Fusion 47 (2019) 23–31.
[9] M. Chen, Y. Jiang, Y. Cao, Y.A. Zomaya, Creativebioman: brain and body wearable computing based creative gaming system, arXiv:1906.01801 (2019).
[10] C.E. Izard, Human Emotions, Springer Science & Business Media, 2013.
[11] M.M. Hassan, M.G.R. Alam, M.Z. Uddin, S. Huda, A. Almogren, G. Fortino, Human emotion recognition using deep belief network architecture, Inf. Fusion 51 (2019) 10–18.
[12] B. Duc, E.S. Bigün, J. Bigün, G. Maître, S. Fischer, Fusion of audio and video information for multi modal person authentication, Pattern Recognit. Lett. 18 (9) (1997) 835–843.


[13] Z. Wang, D. Wu, R. Gravina, G. Fortino, Y. Jiang, K. Tang, Kernel fusion based extreme learning machine for cross-location activity recognition, Inf. Fusion 37 (2017) 1–9.
[14] G. Fortino, S. Galzarano, R. Gravina, W. Li, A framework for collaborative computing and multi-sensor data fusion in body sensor networks, Inf. Fusion 22 (2015) 50–70.
[15] M. Chen, Y. Ma, J. Song, C.-F. Lai, B. Hu, Smart clothing: connecting human with clouds and big data for sustainable health monitoring, Mobile Networks Appl. 21 (5) (2016) 825–845.
[16] R. Gravina, P. Alinia, H. Ghasemzadeh, G. Fortino, Multi-sensor fusion in body sensor networks: state-of-the-art and research challenges, Inf. Fusion 35 (2017) 68–80.
[17] S. Poria, E. Cambria, R. Bajpai, A. Hussain, A review of affective computing: from unimodal analysis to multimodal fusion, Inf. Fusion 37 (2017) 98–125.
[18] Z. Yin, M. Zhao, Y. Wang, J. Yang, J. Zhang, Recognition of emotions using multimodal physiological signals and an ensemble deep learning model, Comput. Methods Programs Biomed. 140 (2017) 93–110.
[19] M. Khezri, M. Firoozabadi, A.R. Sharafat, Reliable emotion recognition system based on dynamic adaptive fusion of forehead biopotentials and physiological signals, Comput. Methods Programs Biomed. 122 (2) (2015) 149–164.
[20] F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, M. Pantic, Av+ ec 2015: The first affect recognition challenge bridging across audio, video, and physiological data, in: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, ACM, 2015, pp. 3–8.
[21] M.K. Abadi, R. Subramanian, S.M. Kia, P. Avesani, I. Patras, N. Sebe, Decaf: meg-based multimodal database for decoding affective physiological responses, IEEE Trans. Affect. Comput. 6 (3) (2015) 209–222.
[22] W.-L. Zheng, B.-L. Lu, Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks, IEEE Trans. Auton. Ment. Dev. 7 (3) (2015) 162–175.
[23] H. Ghasemzadeh, P. Panuccio, S. Trovato, G. Fortino, R. Jafari, Power-aware activity monitoring using distributed wearable sensors, IEEE Trans. Hum. Mach. Syst. 44 (4) (2014) 537–544.
[24] T. Dalgleish, The emotional brain, Nat. Rev. Neurosci. 5 (7) (2004) 583.
[25] M. Chen, J. Zhou, G. Tao, J. Yang, L. Hu, Wearable affective robot, IEEE Access 6 (2018) 64766–64776.
[26] G. Muhammad, M.F. Alhamid, User emotion recognition from a larger pool of social network data using active learning, Multimed. Tools Appl. 76 (8) (2017) 10881–10892.
[27] M. Chen, W. Li, G. Fortino, Y. Hao, L. Hu, I. Humar, A dynamic service migration mechanism in edge cognitive computing, ACM Trans. Internet Technol. (TOIT) 19 (2) (2019) 30.
[28] G. Fortino, R. Giannantonio, R. Gravina, P. Kuryloski, R. Jafari, Enabling effective programming and flexible management of efficient body sensor network applications, IEEE Trans. Hum. Mach. Syst. 43 (1) (2013) 115–133.
[42] A.B. Zadeh, P.P. Liang, S. Poria, E. Cambria, L.-P. Morency, Multimodal language analysis in the wild: cmu-mosei dataset and interpretable dynamic fusion graph, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1, 2018, pp. 2236–2246.
[43] P. Schmidt, A. Reiss, R. Duerichen, C. Marberger, K. Van Laerhoven, Introducing wesad, a multimodal dataset for wearable stress and affect detection, in: Proceedings of the 2018 on International Conference on Multimodal Interaction, ACM, 2018, pp. 400–408.
[44] M. Chen, F. Herrera, K. Hwang, Cognitive computing: architecture, technologies and intelligent applications, IEEE Access 6 (2018) 19774–19783.
[45] J. Preethi, M. Sreeshakthy, A. Dhilipan, A survey on eeg based emotion analysis using various feature extraction techniques, Int. J. Sci. Eng. Technol. Res. (IJSETR) 3 (11) (2014).
[46] R. Jenke, A. Peer, M. Buss, Feature extraction and selection for emotion recognition from eeg, IEEE Trans. Affect. Comput. 5 (3) (2014) 327–339.
[47] A.Q.-X. Ang, Y.Q. Yeong, W. Wee, Emotion classification from eeg signals using time-frequency-dwt features and ann, J. Comput. Commun. 5 (03) (2017) 75.
[48] K. Guo, H. Candra, H. Yu, H. Li, H.T. Nguyen, S.W. Su, Eeg-based emotion classification using innovative features and combined svm and hmm classifier, in: Engineering in Medicine and Biology Society (EMBC), 2017 39th Annual International Conference of the IEEE, IEEE, 2017, pp. 489–492.
[49] W.-L. Zheng, J.-Y. Zhu, Y. Peng, B.-L. Lu, Eeg-based emotion classification using deep belief networks, in: Multimedia and Expo (ICME), 2014 IEEE International Conference on, IEEE, 2014, pp. 1–6.
[50] H. Mei, X. Xu, Eeg-based emotion classification using convolutional neural network, in: Security, Pattern Analysis, and Cybernetics (SPAC), 2017 International Conference on, IEEE, 2017, pp. 130–135.
[51] S. Tripathi, S. Acharya, R.D. Sharma, S. Mittal, S. Bhattacharya, Using deep and convolutional neural networks for accurate emotion classification on deap dataset, in: AAAI, 2017, pp. 4746–4752.
[52] Y. Li, W. Zheng, Z. Cui, T. Zhang, Y. Zong, A novel neural network model based on cerebral hemispheric asymmetry for eeg emotion recognition, in: IJCAI, 2018, pp. 1561–1567.
[53] T. Song, W. Zheng, P. Song, Z. Cui, Eeg emotion recognition using dynamical graph convolutional neural networks, IEEE Trans. Affect. Comput. (2018).
[54] Y.-P. Lin, T.-P. Jung, Improving eeg-based emotion classification using conditional transfer learning, Front. Hum. Neurosci. 11 (2017) 334.
[55] W. Choong, W. Khairunizam, M. Omar, M. Murugappan, A. Abdullah, H. Ali, S. Bong, Eeg-based emotion assessment using detrended flunctuation analysis (dfa), J. Telecommun. Electron. Comput. Eng. (JTEC) 10 (1–13) (2018) 105–109.
[56] A. Hyvärinen, J. Karhunen, E. Oja, Independent component analysis, 46, John Wiley & Sons, 2004.
[57] I. Jolliffe, Principal component analysis, Springer, 2011.
[58] K.K. Ang, Z.Y. Chin, H. Zhang, C. Guan, Filter bank common spatial pattern (fbcsp) in brain-computer interface, in: 2008 IEEE International Joint Conference on Neu-
[29] X. Chen, Y. Zhao, Y. Li, QoE-aware wireless video communications for emo- ral Networks (IEEE World Congress on Computational Intelligence), IEEE, 2008,
tion-aware intelligent systems: a multi-layered collaboration approach, Inf. Fusion pp. 2390–2397.
47 (2019) 1–9. [59] B. Fasel, J. Luettin, Automatic facial expression analysis: a survey, Pattern Recognit.
[30] M. Chen, Y. Hao, K. Lin, Z. Yuan, L. Hu, Label-less learning for traffic control in an 36 (1) (2003) 259–275.
edge network, IEEE Netw. 32 (6) (2018) 8–14. [60] G. Sandbach, S. Zafeiriou, M. Pantic, L. Yin, Static and dynamic 3d facial expression
[31] G. Smart, N. Deligiannis, R. Surace, V. Loscri, G. Fortino, Y. Andreopoulos, Decen- recognition: a comprehensive survey, Image Vis. Comput. 30 (10) (2012) 683–697.
tralized time-synchronized channel swapping for ad hoc wireless networks, IEEE [61] C.A. Corneanu, M.O. Simón, J.F. Cohn, S.E. Guerrero, Survey on rgb, 3d, thermal,
Trans. Veh. Technol. 65 (10) (2016) 8538–8553. and multimodal approaches for facial expression recognition: history, trends, and
[32] G. Fortino, D. Parisi, V. Pirrone, G. Di Fatta, Bodycloud: a saas approach for com- affect-related applications, IEEE Trans. Pattern Anal. Mach. Intell. 38 (8) (2016)
munity body sensor networks, Future Gen. Comput. Syst. 35 (2014) 62–79. 1548–1568.
[33] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. Mcrorie, J.-C. Mar- [62] M. Moghimi, S.J. Belongie, M.J. Saberian, J. Yang, N. Vasconcelos, L.-J. Li, Boosted
tin, L. Devillers, S. Abrilian, A. Batliner, et al., The humaine database: Addressing convolutional neural networks, BMVC, 2016.
the collection and annotation of naturalistic and induced emotional data, in: Inter- [63] M. Chen, X. Shi, Y. Zhang, D. Wu, M. Guizani, Deep features learning for medical
national conference on affective computing and intelligent interaction, Springer, image analysis with convolutional autoencoder neural network, IEEE Trans. Big
2007, pp. 488–500. Data (2017).
[34] E. Douglas-Cowie, R. Cowie, M. Schröder, A new emotion database: considerations, [64] S. Li, W. Deng, Deep facial expression recognition: a survey, arXiv:1804.08348
sources and scope, ISCA tutorial and research workshop (ITRW) on speech and (2018).
emotion, 2000. [65] H.-W. Ng, V.D. Nguyen, V. Vonikakis, S. Winkler, Deep learning for emotion recog-
[35] G. McKeown, M. Valstar, R. Cowie, M. Pantic, M. Schroder, The semaine database: nition on small datasets using transfer learning, in: Proceedings of the 2015 ACM
annotated multimodal records of emotionally colored conversations between a per- on international conference on multimodal interaction, ACM, 2015, pp. 443–449.
son and a limited agent, IEEE Trans. Affect. Comput. 3 (1) (2012) 5–17. [66] H. Ding, S.K. Zhou, R. Chellappa, Facenet2expnet: regularizing a deep face recog-
[36] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, nition net for expression recognition, in: Automatic Face & Gesture Recognition
S.S. Narayanan, Iemocap: interactive emotional dyadic motion capture database, (FG 2017), 2017 12th IEEE International Conference on, IEEE, 2017, pp. 118–126.
Lang. Resour. Eval. 42 (4) (2008) 335. [67] P. Khorrami, T. Paine, T. Huang, Do deep neural networks learn facial action units
[37] O. Martin, I. Kotsia, B. Macq, I. Pitas, The enterface’05 audio-visual emotion when doing expression recognition? in: Proceedings of the IEEE International Con-
database, in: 22nd International Conference on Data Engineering Workshops ference on Computer Vision Workshops, 2015, pp. 19–27.
(ICDEW’06), IEEE, 2006, p. 8. [68] B.-K. Kim, J. Roh, S.-Y. Dong, S.-Y. Lee, Hierarchical committee of deep convolu-
[38] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, et al., Collecting large, richly anno- tional neural networks for robust facial expression recognition, J. Multimodal User
tated facial-expression databases from movies, IEEE Multimedia 19 (3) (2012) 34– Interfaces 10 (2) (2016) 173–189.
41. [69] G. Wen, Z. Hou, H. Li, D. Li, L. Jiang, E. Xun, Ensemble of deep neural networks
[39] F. Ringeval, A. Sonderegger, J. Sauer, D. Lalanne, Introducing the recola multi- with probability-based fusion for facial expression recognition, Cognit. Comput. 9
modal corpus of remote collaborative and affective interactions, in: Automatic Face (5) (2017) 597–610.
and Gesture Recognition (FG), 2013 10th IEEE International Conference and Work- [70] Z. Yu, C. Zhang, Image based static facial expression recognition with multiple deep
shops on, IEEE, 2013, pp. 1–8. network learning, in: Proceedings of the 2015 ACM on International Conference on
[40] S. Zhalehpour, O. Onder, Z. Akhtar, C.E. Erdem, Baum-1: a spontaneous audio-vi- Multimodal Interaction, ACM, 2015, pp. 435–442.
sual face database of affective and mental states, IEEE Trans. Affect. Comput. 8 (3) [71] H. Jung, S. Lee, J. Yim, S. Park, J. Kim, Joint fine-tuning in deep neural networks for
(2017) 300–313. facial expression recognition, in: Proceedings of the IEEE International Conference
[41] A.-C. Conneau, A. Hajlaoui, M. Chetouani, S. Essid, Emoeeg: a new multimodal on Computer Vision, 2015, pp. 2983–2991.
dataset for dynamic eeg-based emotion recognition with audiovisual elicitation, [72] Y. Zhou, B.E. Shi, Action unit selective feature maps in deep networks for facial ex-
in: Signal Processing Conference (EUSIPCO), 2017 25th European, IEEE, 2017, pression recognition, in: Neural Networks (IJCNN), 2017 International Joint Con-
pp. 738–742. ference on, IEEE, 2017, pp. 2031–2038.


[73] H. Li, J. Sun, Z. Xu, L. Chen, Multimodal 2d+ 3d facial expression recognition with [101] X. Chen, W. Han, H. Ruan, J. Liu, H. Li, D. Jiang, Sequence-to-sequence mod-
deep fusion convolutional neural network, IEEE Trans. Multimedia 19 (12) (2017) elling for categorical speech emotion recognition using recurrent neural network,
2816–2831. in: 2018 First Asian Conference on Affective Computing and Intelligent Interaction
[74] A. Jan, H. Ding, H. Meng, L. Chen, H. Li, Accurate facial parts localization and (ACII Asia), IEEE, 2018, pp. 1–6.
deep learning for 3d facial expression recognition, in: Automatic Face & Gesture [102] A.R. Avila, J. Monteiro, D. O’Shaughneussy, T.H. Falk, Speech emotion recognition
Recognition (FG 2018), 2018 13th IEEE International Conference on, IEEE, 2018, on mobile devices based on modulation spectral feature pooling and deep neural
pp. 466–472. networks, in: 2017 IEEE International Symposium on Signal Processing and Infor-
[75] S. Xie, H. Hu, Facial expression recognition using hierarchical features with deep mation Technology (ISSPIT), IEEE, 2017, pp. 360–365.
comprehensive multipatches aggregation convolutional neural networks, IEEE [103] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, B. Schmauch, Cnn+ lstm archi-
Trans. Multimedia 21 (1) (2019) 211–220. tecture for speech emotion recognition with data augmentation, in: Proc. Workshop
[76] Y. Fan, J.C. Lam, V.O. Li, Multi-region ensemble convolutional neural network on Speech, Music and Mind 2018, 2018, pp. 21–25.
for facial expression recognition, in: International Conference on Artificial Neural [104] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M.A. Nicolaou, B. Schuller,
Networks, Springer, 2018, pp. 84–94. S. Zafeiriou, Adieu features? End-to-end speech emotion recognition using a
[77] K. Zhang, Y. Huang, Y. Du, L. Wang, Facial expression recognition based on deep deep convolutional recurrent network, in: Acoustics, Speech and Signal Process-
evolutional spatial-temporal networks, IEEE Trans. Image Process. 26 (9) (2017) ing (ICASSP), 2016 IEEE International Conference on, IEEE, 2016, pp. 5200–
4193–4203. 5204.
[78] P. Rodriguez, G. Cucurull, J. Gonzalez, J.M. Gonfaus, K. Nasrollahi, T.B. Moes- [105] W. Lim, D. Jang, T. Lee, Speech emotion recognition using convolutional and re-
lund, F.X. Roca, Deep pain: exploiting long short-term memory networks for facial current neural networks, in: Signal and information processing association annual
expression classification, IEEE Trans. Cybern. (99) (2017) 1–11. summit and conference (APSIPA), 2016 Asia-Pacific, IEEE, 2016, pp. 1–4.
[79] B. Hasani, M.H. Mahoor, Facial expression recognition using enhanced deep 3d [106] P. Tzirakis, J. Zhang, B.W. Schuller, End-to-end speech emotion recognition using
convolutional neural networks, in: Computer Vision and Pattern Recognition Work- deep neural networks, in: 2018 IEEE International Conference on Acoustics, Speech
shops (CVPRW), 2017 IEEE Conference on, IEEE, 2017, pp. 2278–2288. and Signal Processing (ICASSP), IEEE, 2018, pp. 5089–5093.
[80] N. Jain, S. Kumar, A. Kumar, P. Shamsolmoali, M. Zareapoor, Hybrid deep neural [107] Q. Jin, C. Li, S. Chen, H. Wu, Speech emotion recognition with acoustic and lexical
networks for face emotion recognition, Pattern Recognit. Lett. (2018). features, in: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE Interna-
[81] D.K. Jain, Z. Zhang, K. Huang, Multi angle optimal pattern-based deep learning for tional Conference on, IEEE, 2015, pp. 4749–4753.
automatic facial expression recognition, Pattern Recognit. Lett. (2017). [108] S.N. Shivhare, S. Khethawat, Emotion detection from text, arXiv:1205.4944 (2012).
[82] Y. Hao, J. Yang, M. Chen, M.S. Hossain, M.F. Alhamid, Emotion-aware video qoe [109] D. Ghazi, D. Inkpen, S. Szpakowicz, Hierarchical approach to emotion recogni-
assessment via transfer learning, IEEE Multimedia (2018). tion and classification in texts, in: Canadian Conference on Artificial Intelligence,
[83] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep Springer, 2010, pp. 40–50.
belief network, in: Proceedings of the IEEE Conference on Computer Vision and [110] B. Kratzwald, S. Ilić, M. Kraus, S. Feuerriegel, H. Prendinger, Deep learning for af-
Pattern Recognition, 2014, pp. 1805–1812. fective computing: text-based emotion recognition in decision support, Decis. Sup-
[84] M. Alam, L.S. Vidyaratne, K.M. Iftekharuddin, Sparse simultaneous recurrent deep port Syst. 115 (2018) 24–35.
learning for robust facial expression recognition, IEEE Trans. Neural Netw. Learn. [111] S. Poria, I. Chaturvedi, E. Cambria, A. Hussain, Convolutional mkl based multi-
Syst. (2018). modal emotion recognition and sentiment analysis, in: Data Mining (ICDM), 2016
[85] Y. Kim, B. Yoo, Y. Kwak, C. Choi, J. Kim, Deep generative-contrastive networks for IEEE 16th International Conference on, IEEE, 2016, pp. 439–448.
facial expression recognition, 2017 arXiv:1703.07140. [112] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, N. Dehak, Deep neural
[86] W. Han, H. Li, H. Ruan, L. Ma, Review on speech emotion recognition, J.Software networks for emotion recognition combining audio and transcripts, in: Proc. Inter-
25 (1) (2014) 37–50. speech 2018, 2018, pp. 247–251.
[87] W. Dai, D. Han, Y. Dai, D. Xu, Emotion recognition and affective computing on [113] M.-H. Su, C.-H. Wu, K.-Y. Huang, Q.-B. Hong, Lstm-based text emotion recognition
vocal social media, Inf. Manag. 52 (7) (2015) 777–788. using semantic and emotional word vectors, in: 2018 First Asian Conference on
[88] P.P. Dahake, K. Shaw, P. Malathi, Speaker dependent speech emotion recogni- Affective Computing and Intelligent Interaction (ACII Asia), IEEE, 2018, pp. 1–6.
tion using mfcc and support vector machine, in: Automatic Control and Dynamic [114] S. Poria, E. Cambria, N. Howard, G.-B. Huang, A. Hussain, Fusing audio, visual and
Optimization Techniques (ICACDOT), International Conference on, IEEE, 2016, textual clues for sentiment analysis from multimodal content, Neurocomputing 174
pp. 1080–1084. (2016) 50–59.
[89] J. Chatterjee, V. Mukesh, H.-H. Hsu, G. Vyas, Z. Liu, Speech emotion recognition [115] V. Vielzeuf, S. Pateux, F. Jurie, Temporal multimodal fusion for video emotion
using cross-correlation and acoustic features, in: 2018 IEEE 16th Intl Conf on De- classification in the wild, in: Proceedings of the 19th ACM International Conference
pendable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelli- on Multimodal Interaction, ACM, 2017, pp. 569–576.
gence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and [116] L. Chao, J. Tao, M. Yang, Y. Li, Z. Wen, Long short term memory recurrent neural
Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), network based multimodal dimensional emotion recognition, in: Proceedings of
IEEE, 2018, pp. 243–249. the 5th International Workshop on Audio/Visual Emotion Challenge, ACM, 2015,
[90] K. Han, D. Yu, I. Tashev, Speech emotion recognition using deep neural network pp. 65–72.
and extreme learning machine, Fifteenth annual conference of the international [117] P. Tzirakis, G. Trigeorgis, M.A. Nicolaou, B.W. Schuller, S. Zafeiriou, End-to-end
speech communication association, 2014. multimodal emotion recognition using deep neural networks, IEEE J. Sel. Top. Sig-
[91] Z.-Q. Wang, I. Tashev, Learning utterance-level representations for speech emotion nal Process. 11 (8) (2017) 1301–1309.
and age/gender recognition using deep neural networks, in: Acoustics, Speech and [118] D. Nguyen, K. Nguyen, S. Sridharan, D. Dean, C. Fookes, Deep spatio-temporal
Signal Processing (ICASSP), 2017 IEEE International Conference on, IEEE, 2017, feature fusion with compact bilinear pooling for multimodal emotion recognition,
pp. 5150–5154. Comput. Vision Image Understanding 174 (2018) 33–42.
[92] I.J. Tashev, Z.-Q. Wang, K. Godin, Speech emotion recognition based on gaussian [119] S. Sahay, S.H. Kumar, R. Xia, J. Huang, L. Nachman, Multimodal relational tensor
mixture models and deep neural networks, in: Information Theory and Applications network for sentiment and emotion classification, arXiv:1806.02923 (2018).
Workshop (ITA), 2017, IEEE, 2017, pp. 1–4. [120] D. Hazarika, S. Gorantla, S. Poria, R. Zimmermann, Self-attentive feature-level
[93] S. Parthasarathy, I. Tashev, Convolutional neural network techniques for speech fusion for multimodal emotion detection, in: 2018 IEEE Conference on Mul-
emotion recognition, in: 2018 16th International Workshop on Acoustic Signal En- timedia Information Processing and Retrieval (MIPR), IEEE, 2018, pp. 196–
hancement (IWAENC), IEEE, 2018, pp. 121–125. 201.
[94] A.M. Badshah, J. Ahmad, N. Rahim, S.W. Baik, Speech emotion recognition from [121] S. Zhang, S. Zhang, T. Huang, W. Gao, Q. Tian, Learning affective features with a
spectrograms with deep convolutional neural network, in: Platform Technology hybrid deep model for audio–visual emotion recognition, IEEE Trans. Circuits Syst.
and Service (PlatCon), 2017 International Conference on, IEEE, 2017, pp. 1–5. Video Technol. 28 (10) (2018) 3030–3043.
[95] L. Zheng, Q. Li, H. Ban, S. Liu, Speech emotion recognition based on convolution [122] Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, A. Košir, Audio-visual emotion fu-
neural network combined with random forest, in: 2018 Chinese Control And Deci- sion (AVEF): a deep efficient weighted approach, Inf. Fusion 46 (2019) 184–
sion Conference (CCDC), IEEE, 2018, pp. 4143–4147. 192.
[96] Y. Niu, D. Zou, Y. Niu, Z. He, H. Tan, Improvement on speech emotion recognition [123] M.S. Hossain, G. Muhammad, Emotion recognition using deep learning approach
based on deep convolutional neural networks, in: Proceedings of the 2018 Interna- from audio–visual emotional big data, Inf. Fusion 49 (2019) 69–78.
tional Conference on Computing and Artificial Intelligence, ACM, 2018, pp. 13–18. [124] B. Sun, L. Li, G. Zhou, X. Wu, J. He, L. Yu, D. Li, Q. Wei, Combining multimodal fea-
[97] P. Yenigalla, A. Kumar, S. Tripathi, C. Singh, S. Kar, J. Vepa, Speech emotion recog- tures within a fusion network for emotion recognition in the wild, in: Proceedings
nition using spectrogram & phoneme embedding, in: Proc. Interspeech 2018, 2018, of the 2015 ACM on International Conference on Multimodal Interaction, ACM,
pp. 3688–3692. 2015, pp. 497–502.
[98] S. Zhang, S. Zhang, T. Huang, W. Gao, Speech emotion recognition using deep [125] Y. Huang, J. Yang, P. Liao, J. Pan, Fusion of facial expressions and eeg for multi-
convolutional neural network and discriminant temporal pyramid matching, IEEE modal emotion recognition, Comput. Intell. Neurosci. 2017 (2017).
Trans. Multimedia 20 (6) (2018) 1576–1590. [126] R. Gupta, M. Khomami Abadi, J.A. Cárdenes Cabré, F. Morreale, T.H. Falk, N. Sebe,
[99] J. Lee, I. Tashev, High-level feature representation using recurrent neural network A quality adaptive multimodal affect recognition system for user-centric multime-
for speech emotion recognition (2015). dia indexing, in: Proceedings of the 2016 ACM on international conference on mul-
[100] S. Mirsamadi, E. Barsoum, C. Zhang, Automatic speech emotion recognition us- timedia retrieval, ACM, 2016, pp. 317–320.
ing recurrent neural networks with local attention, in: Acoustics, Speech and [127] Y. Fan, X. Lu, D. Li, Y. Liu, Video-based emotion recognition using cnn-rnn and
Signal Processing (ICASSP), 2017 IEEE International Conference on, IEEE, 2017, c3d hybrid networks, in: Proceedings of the 18th ACM International Conference
pp. 2227–2231. on Multimodal Interaction, ACM, 2016, pp. 445–450.


Yingying Jiang (e-mail: yingyingjiang@hust.edu.cn) received her bachelor's degree from the School of Information and Safety Engineering, Zhongnan University of Economics and Law (ZUEL), in June 2017. She is currently a Ph.D. candidate at the Embedded and Pervasive Computing (EPIC) Lab in the School of Computer Science and Technology, HUST. Her research interests include healthcare big data, cognitive learning, etc.

Wei Li (e-mail: weili_epic@hust.edu.cn) is a Ph.D. candidate at the School of Computer Science and Technology, Huazhong University of Science and Technology, China. Her research interests include software-defined IoT, deep learning, etc.

M. Shamim Hossain (mshossain@ksu.edu.sa) is a Professor at the Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. He is also an adjunct professor at the School of Electrical Engineering and Computer Science, University of Ottawa, Canada. His research interests include cloud networking, smart environment (smart city, smart health), social media, IoT, edge computing and multimedia for health care, deep learning approach to multimedia processing, and multimedia big data. He has authored and coauthored more than 200 publications including refereed journals, conference papers, books, and book chapters. Recently, his publication was recognized as an ESI Highly Cited Paper. He is a recipient of a number of awards, including the Best Conference Paper Award; the 2016 ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) Nicolas D. Georganas Best Paper Award; the Research Quality Award, King Saud University; and the Research in Excellence Award from the College of Computer and Information Sciences (CCIS), King Saud University (three times in a row). He is on the Editorial Boards of IEEE Wireless Communications, IEEE Access, the Journal of Network and Computer Applications (Elsevier), Computers and Electrical Engineering (Elsevier), Human-Centric Computing and Information Sciences (Springer), the Games for Health Journal, and the International Journal of Multimedia Tools and Applications (Springer). He also serves as a Lead Guest Editor for IEEE Network.

Min Chen (minchen2012@hust.edu.cn) has been a full professor in the School of Computer Science and Technology at Huazhong University of Science and Technology (HUST) since Feb. 2012. He is the director of the Embedded and Pervasive Computing (EPIC) Lab at HUST. His Google Scholar citations have reached 15,050+, with an h-index of 60 and an i10-index of 188. His top paper has been cited 1750+ times. He was selected as a Highly Cited Researcher in 2018. He received the IEEE Communications Society Fred W. Ellersick Prize in 2017. His research focuses on cognitive computing, 5G networks, embedded computing, wearable computing, big data analytics, robotics, machine learning, deep learning, emotion detection, IoT sensing, and mobile edge computing.

Abdulhameed Alelaiwi (aalelaiwi@ksu.edu.sa) is currently an Associate Professor in the Software Engineering Department, College of Computer and Information Sciences, King Saud University (KSU), Riyadh, Saudi Arabia. He received his Ph.D. degree in Software Engineering from the College of Engineering, Florida Institute of Technology-Melbourne, USA, in 2002. He is currently serving as the Vice Dean of the Research Chairs Program at KSU. He has published more than 70 research papers in ISI-indexed journals of international repute. His research interests include cloud computing, multimedia, Internet of things, big data, and mobile cloud.

Muneer Al-Hammadi (eng.muneer2008@gmail.com) is a Researcher and a Ph.D. candidate in the Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. His research interests include image and video processing, and deep learning.