Abstract: The key to speech emotion recognition is the extraction of speech emotion features. In this paper, a new network model (CNN-RF), which combines a convolutional neural network with a random forest, is proposed. First, the convolutional neural network is used as a feature extractor to extract speech emotion features from the normalized spectrogram, and a random forest classification algorithm is then used to classify those features. Experimental results show that the CNN-RF model is superior to the traditional CNN model. Second, the Record Sound command box of Nao is improved and the CNN-RF model is applied to the Nao robot. Finally, the Nao robot can "try to figure out" a human's psychology through speech emotion recognition and learn about people's happiness, anger, sadness and joy, achieving more intelligent human-computer interaction.
Key Words: Convolutional Neural Network, Speech Emotion Recognition, Random Forest, Spectrogram, Nao Robot
…automatic learning of features that are easy to distinguish from the original speech signals, and used these features to solve the feature extraction problem related to context. Compared with traditional speech emotion recognition methods based on signal processing, that method showed good prediction capacity on the RECOLA natural emotion database.

3 SPEECH EMOTION RECOGNITION BASED ON CNN-RF
The framework of speech emotion recognition is shown in Figure 1. A spectrogram is calculated from the emotion speech samples through framing, windowing, short-time Fourier transform (STFT) and power spectral density (PSD), and the normalized spectrogram is used as the input of the CNN. Speech emotion features are extracted by the CNN, and the output of the CNN Flatten layer is fed into the RF classifier as the eigenvector of each speech emotion sample. In the recognition stage, the test speech signals are transformed into spectrograms and then input into the CNN-RF model to recognize the type of speech emotion.
Fig 1. Framework of the speech emotion recognition system: emotion speech samples are framed, windowed and transformed (STFT, PSD) into spectrograms, normalized, passed to the CNN for feature extraction, and classified by the RF.
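The front end of this framework (framing, windowing, STFT, PSD and normalization) can be sketched in Python. This is a minimal sketch, not the authors' code: the frame length, overlap, rendering library and image orientation are not given in the paper, so scipy.signal.spectrogram with a Hamming window, matplotlib rendering and a 99-pixel-wide by 145-pixel-tall output are assumptions.

import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

def wav_to_spectrogram(wav_path, img_path, nperseg=512, noverlap=384):
    sr, samples = wavfile.read(wav_path)
    if samples.ndim > 1:
        samples = samples[:, 0]                 # keep a single channel
    # framing + Hamming windowing + STFT + power spectral density
    f, t, psd = spectrogram(samples.astype(np.float64), fs=sr,
                            window="hamming", nperseg=nperseg,
                            noverlap=noverlap, mode="psd")
    log_psd = 10.0 * np.log10(psd + 1e-10)      # move to a dB scale
    # normalize to [0, 1] before rendering
    norm = (log_psd - log_psd.min()) / (log_psd.max() - log_psd.min())
    # render as a colorful image matching the assumed 99*145 CNN input
    fig = plt.figure(figsize=(0.99, 1.45), dpi=100, frameon=False)
    ax = fig.add_axes([0, 0, 1, 1])
    ax.axis("off")
    ax.pcolormesh(t, f, norm, shading="auto")
    fig.savefig(img_path, dpi=100)
    plt.close(fig)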
3.1 CNN Feature Extraction

Speech emotion features are the basis of speech emotion recognition: the accuracy with which they are extracted from the original speech emotion samples directly influences the final recognition rate. A comparison diagram between CNN-RF and CNN is shown in Figure 2. In order to learn more global features from different angles, we set up multiple convolution kernels in different layers. The parameter settings of the CNN network model for feature extraction are as follows:

INPUT layer: Emotion speech samples of different time lengths were normalized into colorful spectrograms of 99*145 pixels with 3 channels in jpg format and used as the input.
C1 and C2 layers: Convolution layers; both use 16 convolution kernels of size 5*5. The feature image size changes from 99*145 to 95*141 in C1 and to 91*137 in C2, and each layer generates 16 feature images.
S3 layer: Pooling layer; it uses 2*2 maximum pooling to reduce the dimension of the data and generates 16 feature images of size 45*68.
C4 and C5 layers: Convolution layers; both use 16 convolution kernels of size 3*3. The 16 feature images change from 43*66 in C4 to 41*64 in C5.
S6 layer: Pooling layer; 2*2 maximum pooling is used and 16 feature images of size 20*32 are obtained.
F1 layer: Fully connected layer. This paper uses two fully connected layers: F1 has 128 units and F2 has 6 units (6 is the number of speech emotion categories), with softmax as the classifier.
D layer: Dropout layer. A dropout strategy is applied to F1 with the parameter set to 0.6, the value at which the training loss and the validation loss (val_loss) decreased best; its purpose is to improve the generalization ability of the model.

It can be seen from Fig. 2 that the similarity between the applied CNN feature extraction model and VGG16 and VGG19 [11] lies in the design of the convolution layers: the repeated stacking of convolution layers is conducive to fuller feature learning.

Fig 2. Comparison diagram between CNN and CNN-RF (layers C1, C2, S3, C4, C5, S6, F1 and D)
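Under the settings above, the feature extractor can be written down directly. The sketch below assumes Keras with ReLU hidden activations and the Adam optimizer (the paper names neither its framework nor these choices); valid convolutions and 2*2 max pooling reproduce the quoted feature map sizes.

from tensorflow.keras import layers, models

def build_cnn(input_shape=(99, 145, 3), num_emotions=6):
    model = models.Sequential([
        layers.Input(shape=input_shape),                   # 99*145, 3 channels
        layers.Conv2D(16, (5, 5), activation="relu"),      # C1 -> 95*141
        layers.Conv2D(16, (5, 5), activation="relu"),      # C2 -> 91*137
        layers.MaxPooling2D((2, 2)),                       # S3 -> 45*68
        layers.Conv2D(16, (3, 3), activation="relu"),      # C4 -> 43*66
        layers.Conv2D(16, (3, 3), activation="relu"),      # C5 -> 41*64
        layers.MaxPooling2D((2, 2)),                       # S6 -> 20*32
        layers.Flatten(name="flatten"),                    # eigenvector for RF
        layers.Dense(128, activation="relu", name="F1"),   # F1, 128 units
        layers.Dropout(0.6),                               # D layer, rate 0.6
        layers.Dense(num_emotions, activation="softmax", name="F2"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# after training, features for the RF are read from the Flatten-layer output
cnn = build_cnn()
feature_extractor = models.Model(cnn.input, cnn.get_layer("flatten").output)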
3.2 RF Classifier

RF [12] is used as the classifier after the CNN has finished feature extraction. The RF designed in this study contains 200 decision trees and uses the Gini index as the splitting criterion; apart from these two parameters, the default values of the RF classifier in the scikit-learn package are used. In the generation process of the RF, the following two random sampling methods are used to construct the decision trees, which effectively increases the generalization ability:

- Row sampling. From the original training dataset N, n (n<<N) samples are chosen randomly with replacement to generate each decision tree. The input samples of each tree are thus a proper subset of the original training dataset, which helps prevent overfitting.
- Column sampling. Suppose there are M features; each node chooses m (m<<M) of them to determine the best splitting point. This ensures complete splitting, or that each leaf node of the decision tree points to a single classification.

In addition, the classification result of the RF is determined by a majority vote over the output types of all decision trees, resulting in relatively high classification accuracy.
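A minimal sketch of this stage with scikit-learn, using only the two quoted parameters (200 trees, Gini index) and defaults elsewhere; the random matrices merely stand in for the CNN Flatten-layer outputs (16 feature maps of 20*32, i.e. 10240 dimensions).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# placeholder features standing in for the CNN Flatten-layer outputs
rng = np.random.default_rng(0)
train_features = rng.normal(size=(100, 16 * 20 * 32))   # 10240-dim eigenvectors
train_labels = rng.integers(0, 6, size=100)             # 6 emotion classes

rf = RandomForestClassifier(n_estimators=200, criterion="gini")
rf.fit(train_features, train_labels)
print(rf.predict(train_features[:5]))   # class = majority vote of the 200 trees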
Fig 3. Normalized spectrogram corresponding to speech signals
The CASIA database was divided into a training set and a test set in the proportion of 5:1, and classification performance is evaluated by accuracy (ACC). It can be seen from Fig. 2 that two fully connected layers (indicated by dotted lines) are connected after the Flatten layer of the CNN model; the first fully connected layer can be regarded as a convolution layer. The second fully connected layer is connected after F1 and D, and the softmax activation function is used to realize multi-class classification, with its parameter set to 6 for the 6 emotions. The classification results of the CNN model and the CNN-RF model are given in Table 1; the recognition accuracy of CNN-RF is 3.25% higher than that of the CNN model.

Table 1. Classification results of CNN and CNN-RF

Network Model    ACC
CNN              0.8143
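The 5:1 proportion corresponds to holding out one sixth of the samples. A sketch with scikit-learn follows, with placeholder arrays in place of the CASIA spectrograms; the stratified split and the sample count are assumptions, since the paper does not say how the split was drawn.

import numpy as np
from sklearn.model_selection import train_test_split

# placeholder arrays standing in for the spectrogram images and labels
X = np.zeros((600, 99, 145, 3), dtype=np.float32)
y = np.repeat(np.arange(6), 100)          # 6 emotion classes

# 5:1 split -> the test set is one sixth of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/6, stratify=y, random_state=0)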
Fig 4. Choregraphe interface

In this study, the Record Sound box was improved: the "ALAudioRecorder" application programming interface was used in place of the "ALAudioDevice" interface to generate a new Recorder box. It can be seen from the Choregraphe interface in Figure 4 that the improved box adds single-track (front microphone) .wav and four-track .ogg recording options.
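The behaviour of the improved box can also be reproduced from a Python script through the same "ALAudioRecorder" interface. This is a hedged sketch: the robot address, recording duration and the ordering of the channel vector are illustrative assumptions rather than values from the paper.

import time
from naoqi import ALProxy

NAO_IP, NAO_PORT = "192.168.1.10", 9559   # hypothetical robot address

recorder = ALProxy("ALAudioRecorder", NAO_IP, NAO_PORT)
# front microphone only -> single-track 16 kHz .wav; the channel vector
# is assumed to be ordered (front, rear, left, right)
recorder.startMicrophonesRecording(
    "/home/nao/recordings/Recorder.wav", "wav", 16000, (1, 0, 0, 0))
time.sleep(3)                             # capture three seconds of speech
recorder.stopMicrophonesRecording()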
In this study, the same text content, "Hello, Nao!", was used and recorded by both the original Record Sound box and the improved Recorder box. The two documents of different formats generated by the original Record Sound box were named RS1.wav and RS2.ogg, while the document generated by the improved Recorder box was named Recorder.wav. The three recordings in the internal memory of NAO are shown in Figure 5. All three audio documents were downloaded and imported into Audacity, where their names, numbers of tracks and sampling frequencies were examined (Figure 6). The first row is Recorder.wav, and the second to fifth rows are RS1.wav, indicating that RS1 is a four-track mixed recording; the audios collected by the left, right, front and back microphones of the robot are shown from top to bottom. The sixth row is RS2.ogg, a single-track audio; however, .ogg is a lossy audio compression format and can affect tone quality to some extent. In addition, the .wav format is preferred in order to ensure that the format of the audio collected by NAO is consistent with the CASIA database and to eliminate the interference of audio format on the application effect. Hence, Recorder.wav meets the above requirements.

Fig 6. Time-domain waveforms of Recorder, RS1 and RS2
The CNN-RF speech emotion recognition model is applied to the Nao robot in the three steps shown in Figure 7. The detailed process is as follows:
(1) Acquire emotional speech samples with NAO in real time: connect the NAO robot to a local computer, use the improved Recorder box to store the emotional speech samples in .wav format, and download them with the SFTP functions of Python;
(2) Draw the spectrogram;
(3) Classify the test samples with the trained and stored CNN-RF network model, and use the "ALTextToSpeech" application interface of the NAO robot to "speak out" the recognition result.

Fig 7. Speech emotion recognition of the NAO robot

Test results on the NAO robot are shown in Table 2. In the process of testing, different sentences were used to express the emotions. According to the tests, "surprise" and "fear" were the easiest to confuse, and the recognition rates of "fear" and "happy" were lower than the average level. The highest recognition rate was achieved for "neutral", reaching 90%.
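Step (3) might look as follows. The index-to-label mapping EMOTIONS is hypothetical (the paper does not give the class order), and feature_extractor and rf refer to the CNN feature extractor and random forest sketched in Section 3.

from naoqi import ALProxy

# hypothetical index-to-label mapping for the six emotion classes
EMOTIONS = ["angry", "fear", "happy", "neutral", "sad", "surprise"]

def speak_emotion(spectrogram_image, feature_extractor, rf, nao_ip, port=9559):
    # CNN Flatten-layer features -> RF majority vote -> spoken result
    features = feature_extractor.predict(spectrogram_image[None, ...])
    label = EMOTIONS[int(rf.predict(features)[0])]
    tts = ALProxy("ALTextToSpeech", nao_ip, port)
    tts.say("I think you sound " + label)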
Table 2. Test results on the NAO robot