
SIGNAL PROCESSING
Algorithms, Architectures, Arrangements, and Applications

SPA 2016
September 21-23rd, 2016, Poznań, POLAND
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS INC.

Classification of emotions from speech signal

Andrzej Majkowski, Marcin Kołodziej, Remigiusz J. Rak, Robert Korczyński


Institute of Theory of Electrical Engineering, Measurement and Information Systems
Warsaw University of Technology
Warsaw, Poland
andrzej.majkowski@ee.pw.edu.pl

Abstract-The article presents an analysis of the possibility of recognizing a speaker's emotions from the speech signal in the Polish language. In order to perform experiments, a database containing speech recordings with emotional content was created. On its basis, extraction of features from the speech signals was performed. The most important step was to determine which of the previously extracted features were the most suitable to distinguish emotions and with what accuracy the emotions could be classified. Two feature selection methods - Sequential Forward Search (SFS) and t-statistics - were examined. Emotion classification was implemented using k-Nearest Neighbor (k-NN), Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) classifiers. Classification was carried out for pairs of emotions. The best results were obtained for classifying neutral and fear (91.9%) and neutral and joy emotions (89.6%).

Keywords-emotion recognition, feature extraction, feature selection, speech processing, speech signal

I. INTRODUCTION

An automatic recognition of a speaker's emotions, based on the processing of sound recordings, finds numerous applications. It can be a great help in telephone customer service departments, enabling the analysis of the degree of clients' satisfaction or dissatisfaction. It can also be used in alarm systems (e.g. bank security systems for cashier calls with customers, police systems to prevent and combat crime). Analysis of emotions aims not only to recognize the emotional state of a speaker, but also to give the impression of emotions in artificially synthesized speech.

In recent years various studies have been done to find the features that are helpful in recognizing emotions [1-3]. A number of source features, vocal tract features and prosodic features are discussed in [4,5]. Features like sound pitch and energy, formant frequencies, and Mel-Frequency Cepstral Coefficients (MFCC) are utilized for the detection of several emotions with good accuracy in [6-8]. Researchers are continuously looking for new feature sets. Voice quality parameters like Harmonic Noise Ratio (HNR), shimmer and jitter are also being analyzed [9]. Several pattern recognition techniques, such as Support Vector Machine (SVM), Gaussian Mixture Model (GMM), Kernel Regression (KR), Linear Discriminant Analysis (LDA), k-Nearest Neighbor (k-NN) and other methods, were used for emotion recognition from speech [10-13]. The aim of the article is to classify emotions in the speech signal in the Polish language. As emotions are expressed differently in each mother tongue, the set of best features for emotion classification may also depend on the language.

II. MATERIAL

In order to conduct experiments, a database of speech recordings with emotions was created. For this purpose two sound sources were used. The first and the main source were fairy tales from the resources of a Polish Catholic Radio. Recordings were made by young people from middle school and high school. In the recordings there were several male and female voices. The second source were radio plays of Polish Radio, available on the Internet. The recordings, made by professional adult actors, included male and female voices. Most recordings also contained an undesirable background soundtrack (e.g. a creaking door, the sound of a car engine, singing birds). For the experiments, the fragments that did not contain such sounds were selected. The selected fragments of the speech signal were one to six seconds long. They were saved in .wav format with a sampling frequency of 44 kHz and 16-bit resolution. The prepared database, containing sound recordings expressing emotions, is described in Table 1. Three experts were involved in the process of assigning recordings to particular kinds of emotions. Recordings about which at least one expert had doubts were rejected.

TABLE I. DATABASE OF SOUND RECORDINGS WITH EMOTIONS

Emotion        Number of recordings
neutral        35
joy            26
sadness        55
fear           19
astonishment   54
anger          61

The database of recordings mainly includes fragments of plays for children, in which emotions are often exaggerated. On the one hand, this provides an opportunity to obtain greater variation in the values of features associated with emotions; on the other hand, the use of such features in an attempt to classify emotions in natural speech can give poorer results.

III. METHODS

The procedure of automatic recognition of emotions in speech consists of the following traditional steps: feature extraction, feature selection and classification.
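As an illustration of how such a database could be read in for further processing, a minimal loading sketch is given below. The folder layout (one sub-folder per emotion), the function name and the use of SciPy are assumptions for illustration, not part of the original experiment.

```python
from pathlib import Path

import numpy as np
from scipy.io import wavfile

# The six emotion classes from Table I.
EMOTIONS = ["neutral", "joy", "sadness", "fear", "astonishment", "anger"]

def load_database(root="recordings"):
    """Return a list of (signal, label) pairs for all .wav files in the database."""
    dataset = []
    for label, emotion in enumerate(EMOTIONS):
        for path in sorted(Path(root, emotion).glob("*.wav")):
            fs, x = wavfile.read(path)   # expected: 44 kHz, 16-bit PCM, 1-6 s long
            dataset.append((np.asarray(x), label))
    return dataset
```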

A. Feature extraction

Each feature extracted from the speech signal should provide as much useful information as possible. Due to the lack of knowledge about the usefulness of individual features in the classification process, we constructed feature vectors that contain the maximum number of them. At a later stage, the features were selected in terms of their usefulness in the process of emotion recognition in speech.

Before the calculation of features, the amplitude of the analyzed sound signal was scaled to the <-1, +1> range. Then one audio channel from the stereo track was isolated. Extraction of features required dividing the sound signal into frames. Based on literature studies, the time window length was assumed to be 0.03 second [6,14]. This gave about 34 frames for 1 second of sound recording. For the 44 kHz sampling frequency, each signal frame comprised 1320 samples.

As features, various parameters calculated for the sound signal and their statistical parameters, such as average, minimum and maximum values and variance, were taken into account:

- Energy properties of the speech signal (RMS value, energy E).
- Mel-Frequency Cepstral Coefficients (Mc(i), i = 1-12) [4].
- Delta coefficients calculated as derivatives of the cepstral coefficients (Dc), typically used in speech recognition systems.
- Density of zero crossings calculated in the time domain (GP0).
- The first local maximum of the speech signal spectrum (F0), called the laryngeal tone or fundamental frequency.
- The spectral moments defined in the frequency domain. The first-order normalized moment is interpreted as the center of gravity of the spectrum (SCW) and describes the distribution of frequencies in the spectrum (it defines whether the signal spectrum is dominated by low or high frequencies).
- Spectral flux (SF) - a measure of the rate of change of the power spectrum of the signal. It is calculated by comparing the power spectrum of the signal frame with the power spectrum of the previous frame. For each frequency point the difference between the value of the spectral line in the current window and the corresponding spectral line value in the previous window was calculated. The result is the sum of the squares of these differences (spectral flux is used, among others, to determine the timbre).
- The spectral roll-off coefficient (SRO) - specifies the frequency below which a given fraction p of the spectral energy is concentrated (p = 80%, 50%, 30%).

Table 2 presents all the described features (97) taken into account for each recording; a short computation sketch for a subset of them is given after the table.

TABLE II. THE FEATURES CALCULATED FOR SPEECH RECORDINGS

Feature     Description                                               Number of features
RMS         RMS value                                                 1
Eavg        mean value of energy                                      1
Emax        maximum value of energy                                   1
Emin        minimum value of energy                                   1
Evar        energy variance                                           1
RMSavg      mean of RMS value                                         1
RMSmax      maximum of RMS value                                      1
RMSmin      minimum of RMS value                                      1
RMSvar      variance of RMS value                                     1
lEavg       mean value of logarithm of energy                         1
lEvar       variance of logarithm of energy                           1
lEmax       maximum value of logarithm of energy                      1
lEmin       minimum value of logarithm of energy                      1
Dcavg       mean value of delta coefficient                           1
Dcvar       variance of delta coefficient                             1
Dcmax       maximum value of delta coefficient                        1
Dcmin       minimum value of delta coefficient                        1
Ddcavg      mean value of delta-delta coefficient                     1
Ddcvar      variance of delta-delta coefficient                       1
Ddcmax      maximum value of delta-delta coefficient                  1
Ddcmin      minimum value of delta-delta coefficient                  1
Mc(i)avg    mean value of the i-th Mel-Cepstral coefficient           12
Mc(i)var    variance of the i-th Mel-Cepstral coefficient             12
Mc(i)max    maximum value of the i-th Mel-Cepstral coefficient        12
Mc(i)min    minimum value of the i-th Mel-Cepstral coefficient        12
GP0avg      mean value of zero crossings density                      1
GP0var      variance of zero crossings density                        1
GP0max      maximum value of zero crossings density                   1
GP0min      minimum value of zero crossings density                   1
F0          fundamental frequency                                     1
F0avg       mean value of the fundamental frequency                   1
F0var       variance of the fundamental frequency                     1
F0max       maximum value of the fundamental frequency                1
F0min       minimum value of the fundamental frequency                1
SCWavg      mean value of the center of gravity of the spectrum       1
SCWvar      variance of the center of gravity of the spectrum         1
SCWmax      maximum value of the center of gravity of the spectrum    1
SCWmin      minimum value of the center of gravity of the spectrum    1
SFavg       mean value of spectral flux                               1
SFvar       variance of spectral flux                                 1
SFmax       maximum value of spectral flux                            1
SRO(p)avg   mean value of spectral roll-off for p threshold           3
SRO(p)var   variance of spectral roll-off for p threshold             3
SRO(p)max   maximum value of spectral roll-off for p threshold        3
SRO(p)min   minimum value of spectral roll-off for p threshold        3
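A minimal sketch of this per-recording processing is given below. It assumes a mono signal x already scaled to <-1, +1> and sampled at 44 kHz, uses plain NumPy, and covers only a subset of the features from Table II (the MFCC and delta coefficients would typically be taken from a dedicated speech-processing library); the function names are illustrative.

```python
import numpy as np

def frame_signal(x, fs=44000, win=0.03):
    """Split the signal into non-overlapping 0.03 s frames (1320 samples at 44 kHz)."""
    n = int(win * fs)
    n_frames = len(x) // n          # about 34 frames per second of recording
    return x[:n_frames * n].reshape(n_frames, n)

def frame_features(frames, fs=44000):
    rms = np.sqrt(np.mean(frames ** 2, axis=1))                   # RMS per frame
    energy = np.sum(frames ** 2, axis=1)                          # energy E per frame
    gp0 = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing density

    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2               # per-frame power spectrum
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    scw = np.sum(freqs * spec, axis=1) / np.sum(spec, axis=1)     # spectral center of gravity
    sf = np.sum(np.diff(spec, axis=0) ** 2, axis=1)               # spectral flux between frames

    def rolloff(p):
        # frequency below which the fraction p of the spectral energy is concentrated
        cum = np.cumsum(spec, axis=1)
        return freqs[np.argmax(cum >= p * cum[:, -1:], axis=1)]

    return [rms, energy, gp0, scw, sf, rolloff(0.8), rolloff(0.5), rolloff(0.3)]

def feature_vector(x, fs=44000):
    """Mean, variance, maximum and minimum of each per-frame feature."""
    frames = frame_signal(x, fs)
    stats = lambda v: [np.mean(v), np.var(v), np.max(v), np.min(v)]
    return np.hstack([stats(v) for v in frame_features(frames, fs)])
```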

B. Feature selection

The aim of the feature selection stage was to find the features that were best suited to recognizing emotional states in speech. Too large a number of features makes effective classification practically impossible. There are many algorithms of feature selection [15]. In further study we used two methods: Sequential Forward Search (SFS) and t-statistics.

The essence of the SFS algorithm is to enlarge (starting from one and ending with the desired number of features) the subset of features existing at a given stage of the procedure, and then to check the "quality" of the newly created set [16]. This procedure is repeated for all the unused features, and its purpose is to identify the features that best complement the current subset. The SFS method has linear complexity with respect to the number of target features. A key element of the SFS procedure is to assess the suitability of candidate features in terms of ensuring correct classification. To do this, we define a proper objective function. The objective function is used to check the achieved classification efficiency for the selected feature set. We used LDA and SVM classifiers. Feature selection results were similar in both cases, but the whole process was much slower in the SVM case.

It should be noted, however, that the choice of features is to some extent dependent on the classification method and on the division of the data into training and testing sets. For each pair of emotions the best feature set was selected individually. As an example, the objective function values (classification error) used for the SFS method for the sadness-anger and neutral-fear emotions are shown in Fig. 1 and 2. It can be seen that the useful number of features is in the range of 5 to 20. The objective function values for classifying other pairs of emotions have confirmed this relationship.

Figure 1. The values of the objective function (SFS) for sadness - anger

Figure 2. The values of the objective function (SFS) for neutral - fear

Using t-statistics, we seek a feature set that maximizes the separability of two classes [17]. This is a fast method, but it can only be used for two classes. The basis of the method is to rank the features according to the Welch t-statistic. The value of the Welch t-coefficient determines the degree of differentiation of the two considered classes. Two features selected by the t-statistics algorithm (Dcvar and lEavg), which make it relatively easy to distinguish neutral from fear emotions, are shown in Fig. 3. All features can be ranked in accordance with their t values.

Figure 3. A sample feature distribution for neutral (+) - fear (*) emotions

Figure 4. The usefulness of features for a group of two emotions sadness - anger (the t-statistic values)
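A rough sketch of the SFS procedure described above is given below. It assumes scikit-learn, a feature matrix X (recordings by features) and a label vector y; maximizing cross-validated accuracy is used as the objective, which is equivalent to minimizing the classification error plotted in Fig. 1 and 2.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def sfs(X, y, n_features=15):
    """Greedy forward selection of feature indices."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        # Try every unused feature and keep the one that best complements the subset.
        scores = [cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, selected + [f]], y, cv=10).mean()
                  for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```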

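The t-statistics ranking can be sketched in a few lines; SciPy's Welch test is assumed here, and the features are ordered by the absolute value of the t-coefficient, the most discriminative first.

```python
import numpy as np
from scipy.stats import ttest_ind

def rank_by_welch_t(X_class_a, X_class_b, n_features=15):
    """Return the indices of the features that best separate the two emotion classes."""
    t, _ = ttest_ind(X_class_a, X_class_b, axis=0, equal_var=False)  # Welch's t per feature
    return np.argsort(-np.abs(t))[:n_features]
```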
Figure 5. The usefulness of features for a group of two emotions sadness - anger (the t-statistic values)

Based on t-statistics, the best feature set was calculated individually for each pair of emotions. As an example, Figures 4 and 5 show the distributions of the Welch t-coefficient for the sadness-anger and neutral-fear emotions. For numbers of features from 20 to 97 the loss of quality is almost linear. The shape of the graph suggests that the best features, in the process of emotion recognition, are limited to the first 20. Other feature distributions confirm this relationship. In further experiments the classification of emotions in the speech signal, for both the SFS and t-statistics feature selection methods, was carried out for 5 and 15 features.

C. Classification

Each record of the speech signal was represented by a vector of 97 features. So for the 250 recordings collected in the database the feature matrix was of dimension 250×97. The recordings represent 6 groups of emotions (neutral, joy, sadness, fear, astonishment, anger). The classification of emotions in the speech signal was performed using three classifiers: k-NN, LDA and SVM. Emotion classes were analyzed in pairs. Features were selected for each pair independently using the t-statistics and SFS methods. To evaluate the classification error, a 10-fold cross-validation test was used. The entire set was divided into 10 subsets. The classifier was trained on the first 9 subsets and tested on the 10th. Then the training process included subsets 1-8 and 10, and the test was carried out on subset 9. This process was repeated for each of the 10 subsets.
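The pairwise evaluation protocol described above could be reproduced roughly as follows; scikit-learn is assumed, X is the 250×97 feature matrix, y holds the emotion labels, and the classifier settings (5 neighbours, linear discriminant and linear SVM) follow the column headers of Tables III and IV.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def pairwise_accuracy(X, y, class_a, class_b, feature_idx):
    """10-fold cross-validated accuracy (in %) for one pair of emotion classes."""
    mask = np.isin(y, [class_a, class_b])
    Xp, yp = X[mask][:, feature_idx], y[mask]
    classifiers = {
        "5-NN": KNeighborsClassifier(n_neighbors=5),
        "LDA": LinearDiscriminantAnalysis(),
        "SVM": SVC(kernel="linear"),
    }
    return {name: 100 * cross_val_score(clf, Xp, yp, cv=10).mean()
            for name, clf in classifiers.items()}
```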
IV. RESULTS

Table 3 presents the results of the emotion classification (in pairs) for the 5 and 15 best features obtained with the SFS method. The LDA classifier gave the best results for 5 features (average classification accuracy 80.5%), and SVM for 15 (78.7%). Interestingly, in this case the classification accuracy for 5 features was slightly better than for 15. The easiest pairs of emotions to classify were neutral - fear (91.9%, 5 features, LDA) and neutral - joy (89.6%, 5 features, LDA); the most difficult were neutral - sadness (59.6%, 5 features, LDA) and joy - astonishment (71.7%, 5 features, LDA).

TABLE III. THE CLASSIFICATION ACCURACY [%] FOR THE SFS METHOD
(classifiers: 5-NN, linear LDA, linear SVM)

Emotions                 |   5 best features    |   15 best features
                         | 5-NN   LDA    SVM    | 5-NN   LDA    SVM
neutral - anger          | 72.2   84.0   82.3   | 77.4   81.2   79.1
neutral - joy            | 79.8   89.6   87.4   | 75.4   82.5   82.5
neutral - sadness        | 64.1   59.6   58.5   | 63.7   60.7   64.1
neutral - fear           | 87.0   91.9   90.7   | 87.0   90.7   89.5
neutral - astonishment   | 72.6   78.2   72.7   | 64.0   73.4   72.6
joy - sadness            | 87.7   86.8   83.9   | 87.3   86.0   86.4
joy - fear               | 85.9   84.4   83.7   | 80.0   71.9   85.2
joy - astonishment       | 69.2   71.7   71.6   | 65.0   75.0   76.6
joy - anger              | 70.1   74.7   75.1   | 72.0   65.9   69.7
sadness - fear           | 82.9   87.8   86.9   | 81.5   86.9   87.4
sadness - astonishment   | 65.4   74.6   74.0   | 72.8   72.2   72.2
sadness - anger          | 75.3   81.9   82.7   | 80.2   80.5   79.8
fear - astonishment      | 77.6   84.0   80.3   | 81.3   80.8   82.2
fear - anger             | 80.0   82.9   82.1   | 81.7   83.3   82.5
astonishment - anger     | 64.3   75.9   75.6   | 61.4   69.0   71.3
MEAN                     | 75.6   80.5   79.2   | 75.4   77.3   78.7

TABLE IV. THE CLASSIFICATION ACCURACY [%] FOR THE T-STATISTICS METHOD
(classifiers: 5-NN, linear LDA, linear SVM)

Emotions                 |   5 features         |   15 features
                         | 5-NN   LDA    SVM    | 5-NN   LDA    SVM
neutral - anger          | 70.1   71.5   71.9   | 78.5   76.7   79.5
neutral - joy            | 74.3   81.4   81.4   | 82.0   83.6   77.6
neutral - sadness        | 57.7   64.8   68.2   | 64.1   62.6   66.6
neutral - fear           | 92.0   89.5   92.6   | 85.8   89.5   90.7
neutral - astonishment   | 62.5   68.5   70.8   | 69.7   68.5   68.9
joy - sadness            | 84.4   83.1   84.0   | 85.2   87.7   88.1
joy - fear               | 85.2   74.8   80.7   | 85.9   86.7   87.4
joy - astonishment       | 70.4   66.7   70.8   | 67.1   73.8   79.9
joy - anger              | 79.7   67.4   74.7   | 77.3   68.2   76.6
sadness - fear           | 84.2   86.9   84.7   | 85.1   86.0   88.3
sadness - astonishment   | 70.9   69.1   67.0   | 73.1   74.6   77.4
sadness - anger          | 77.6   77.0   77.6   | 77.3   79.3   79.6
fear - astonishment      | 79.0   79.4   79.0   | 74.4   74.4   79.9
fear - anger             | 81.2   83.8   78.8   | 82.1   81.7   84.2
astonishment - anger     | 61.2   68.1   67.8   | 67.3   67.5   76.2
MEAN                     | 75.4   75.5   76.7   | 77.0   77.4   80.1

Table 4 presents the results of the emotion classification for the 5 and 15 best features obtained with t-statistics. Here, for 5 features, the best results were obtained for the SVM classifier (average classification accuracy 76.7%), and the same holds for 15 features (80.1%). The results are slightly worse than for the SFS method. The pairs of emotions neutral - fear, sadness - fear, and neutral - joy were the easiest to classify, while neutral - sadness and astonishment - anger were the most difficult.

To reduce the initial number of 97 features, an additional experiment was conducted. We used Principal Component Analysis (PCA). The essence of the PCA method is to construct a new orthogonal space based only on those directions along which the set of samples has the largest dispersion. If the dispersion of samples is large, it may be because there is a difference between objects of different classes (for the combination of features which determines the new direction). We used PCA decomposition into 5 and 15 principal components (5 components are sufficient to contain 64% of the total variance, while 15 components contain 86%). In this case classification was carried out using the 5-NN and SVM classifiers. The results are shown in Table 5. The use of PCA generally slightly worsens the classification results. In this case, the best average classification results (73.3% for 5 components and 79.9% for 15 components) are worse than for the SFS method (80.5% for 5 features and 78.7% for 15 features) and for t-statistics (76.7% for 5 features and 80.1% for 15 features). The application of PCA in this case seems to be unjustified.
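A sketch of this PCA-based variant is given below; scikit-learn is assumed, and the feature standardization step is a common precaution added here for illustration, not something the article specifies.

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pca_accuracy(X, y, n_components=5):
    """10-fold cross-validated accuracy (in %) after projection onto principal components."""
    results = {}
    for name, clf in [("5-NN", KNeighborsClassifier(n_neighbors=5)),
                      ("SVM", SVC(kernel="linear"))]:
        model = make_pipeline(StandardScaler(),            # standardization: added assumption
                              PCA(n_components=n_components),
                              clf)
        results[name] = 100 * cross_val_score(model, X, y, cv=10).mean()
    return results
```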

TABLE V. THE CLASSIFICATION ACCURACY [%] FOR PCA

Emotions                 |  5 components   |  15 components
                         | 5-NN    SVM     | 5-NN    SVM
neutral - anger          | 75.4    77.1    | 72.9    79.8
neutral - joy            | 74.3    77.6    | 78.1    79.2
neutral - sadness        | 55.9    58.9    | 61.1    68.5
neutral - fear           | 79.0    86.3    | 79.0    90.1
neutral - astonishment   | 62.9    61.8    | 67.0    70.4
joy - sadness            | 83.6    83.1    | 83.6    88.1
joy - fear               | 82.9    77.0    | 85.2    85.1
joy - astonishment       | 72.9    67.1    | 76.2    78.3
joy - anger              | 75.8    68.6    | 70.9    77.0
sadness - fear           | 78.3    88.7    | 80.2    86.8
sadness - astonishment   | 66.4    67.6    | 68.2    79.2
sadness - anger          | 76.1    76.2    | 80.4    80.7
fear - astonishment      | 80.4    73.9    | 76.2    78.5
fear - anger             | 78.3    81.2    | 82.9    83.4
astonishment - anger     | 56.8    50.8    | 56.8    73.7
MEAN                     | 73.3    73.1    | 74.6    79.9

An experiment was also conducted in which all 6 emotions were classified. In this case 15 features were taken into consideration. The obtained classification accuracies (SVM classifier) were 38.1% for the SFS feature selection method and 36.7% for the t-statistics method.

V. DISCUSSION

Classification results for feature selection using t-statistics, especially for a smaller number of features, are a bit worse than in the case of SFS. It seems that this is because t-statistics often selects features that are correlated with each other.

For each pair of emotions a different set of the best features was selected. However, many of them were repeated, and the same features were often best suited to recognizing emotions. These features mostly included the signal energy and RMS, especially their peak values and variances, and the Mel-Frequency Cepstral Coefficients. The best features for the t-statistics method were: RMSvar, lEmax, Mc12max, Mc12min, Emax, Evar, RMSmax, Mc02max, Mc12var, RMS, Eavg, RMSavg, lEmin, Mc11min. The best features for the SFS method were: lEmax, RMS, Ddcavg, lEvar, RMSvar, Mc01avg, Eavg, lEavg, Mc03avg, Ddcmax, Mc02avg, Evar, RMSmin, SRO80min, Mc02min. Classification errors were in most cases the smallest when distinguishing neutral emotions from fear, and the largest when distinguishing neutral from sadness.

It is worth comparing the results with those obtained by other authors. Various combinations of features were tested in [14]. For four classes of emotions the authors obtained a recognition accuracy of 71.9% for an SVM classifier. Five emotions were classified using a one-against-all strategy for SVM in [6]; the achieved accuracy was about 80%. An approach to finding the most important sound features is presented in [7]. With a neural network as a classifier the precision was 68%, while for SVM it was 64%.

VI. CONCLUSIONS

In this article we examined the problem of automatic recognition of speaker emotions based on the recorded speech signal in the Polish language. The study was based on a specially created database containing recordings grouped by emotions. For the analysis, a quantitative description of the speech signal was proposed. The features obtained in this manner were then selected and tested for their suitability in recognizing emotions. As a result of the analysis, different sets of best features were selected. A lot of features were common to all the sets. These include the energy-related parameters, RMS, and the Mel-Frequency Cepstral Coefficients. On the other hand, we got a list of features that are not useful at all, such as the average density of zero crossings or the fundamental frequency.

REFERENCES
[1] T. L. Pao, C. H. Wang, and Y. J. Li, "A Study on the Search of the Most Discriminative Speech Features in the Speaker Dependent Speech Emotion Recognition," in 2012 Fifth International Symposium on Parallel Architectures, Algorithms and Programming, 2012, pp. 157-162.
[2] H. Atassi, A. Esposito, and Z. Smekal, "Analysis of high-level features for vocal emotion recognition," in 2011 34th International Conference on Telecommunications and Signal Processing (TSP), 2011, pp. 361-366.
[3] D. Kamińska, T. Sapiński, and A. Pelikant, "Comparison of perceptual features efficiency for automatic identification of emotional states from speech," in 2013 6th International Conference on Human System Interactions (HSI), 2013, pp. 210-213.
[4] S. G. Koolagudi and K. S. Rao, "Emotion recognition from speech: a review," Int. J. Speech Technol., vol. 15, no. 2, pp. 99-117, Jan. 2012.

[5] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Commun., vol. 48, no. 9, pp. 1162-1181, Sep. 2006.
[6] M. Srivastava and A. Agarwal, "Classification of emotions from speech using implicit features," in 2014 9th International Conference on Industrial and Information Systems (ICIIS), 2014, pp. 1-6.
[7] V. Kirandziska and N. Ackovska, "Finding important sound features for emotion evaluation classification," in 2013 IEEE EUROCON, 2013, pp. 1637-1644.
[8] A. Zelenik, B. Kotnik, Z. Kačič, and A. Chowdhury, "Novel expressive speech classification algorithm based on multi-level extraction techniques," in 2010 5th International Conference on Pervasive Computing and Applications (ICPCA), 2010, pp. 410-415.
[9] S. Lalitha, A. Madhavan, B. Bhushan, and S. Saketh, "Speech emotion recognition," in 2014 International Conference on Advances in Electronics, Computers and Communications (ICAECC), 2014, pp. 1-4.
[10] M. K. Sarker, K. M. R. Alam, and M. Arifuzzaman, "Emotion recognition from speech based on relevant feature and majority voting," in 2014 International Conference on Informatics, Electronics & Vision (ICIEV), 2014, pp. 1-5.
[11] K. V. K. Kishore and P. K. Satish, "Emotion recognition in speech using MFCC and wavelet features," in Advance Computing Conference (IACC), 2013 IEEE 3rd International, 2013, pp. 842-847.
[12] S. A. Rieger, R. Muraleedharan, and R. P. Ramachandran, "Speech based emotion recognition using spectral feature extraction and an ensemble of kNN classifiers," in 2014 9th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2014, pp. 589-593.
[13] E. Bozkurt, E. Erzin, C. E. Erdem, and A. T. Erdem, "Use of Line Spectral Frequencies for Emotion Recognition from Speech," in 2010 20th International Conference on Pattern Recognition (ICPR), 2010, pp. 3708-3711.
[14] S. Chen, Q. Jin, X. Li, G. Yang, and J. Xu, "Speech emotion classification using acoustic features," in 2014 9th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2014, pp. 579-583.
[15] M. Kalamani, S. Valarmathy, C. Poonkuzhali, and C. J. N, "Feature selection algorithms for automatic speech recognition," in 2014 International Conference on Computer Communication and Informatics (ICCCI), 2014, pp. 1-7.
[16] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Springer Science & Business Media, 2012.
[17] H. Liu, J. Li, and L. Wong, "A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns," Genome Inform. Int. Conf. Genome Inform., vol. 13, pp. 51-60, 2002.

