
Improving Spontaneous Children's Emotion Recognition by Acoustic Feature Selection and Feature-Level Fusion of Acoustic and Linguistic Parameters

Santiago Planet and Ignasi Iriondo


La Salle - Universitat Ramon Llull, C/Quatre Camins, 2, 08022 Barcelona, Spain {splanet,iriondo}@salle.url.edu

Abstract. This paper presents an approach to improve emotion recognition from spontaneous speech, using a wrapper method to reduce an acoustic set of features and feature-level fusion to merge them with a set of linguistic ones. The proposed system was evaluated with the FAU Aibo Corpus. We considered the same emotion set that was proposed in the Interspeech 2009 Emotion Challenge. The main contribution of this work is the improvement, with the reduced set of features, of the results obtained in this Challenge and of the combination of the best ones. We built this set by selecting 28 acoustic and 5 linguistic features from an original set of 389 parameters and concatenating the feature vectors.

Keywords: Emotion recognition, spontaneous speech, acoustic features, linguistic features, feature-level fusion, speaker independence

1 Introduction

The inclusion of speech in human-computer interaction (HCI) is increasing as a natural way to interact with user interfaces. This speech analysis should include paralinguistic information besides automatic speech recognition. The analysis of affective states in the input or the synthesis of expressive speech at the output could make applications more usable and friendly. In general, skills of emotional intelligence added to machine intelligence could make HCI more similar to human-human interactions [10]. The analysis and synthesis of emotion may be applied in a wide range of scenarios, e.g. the automatic generation of audiovisual content, virtual meetings or even automatic dialogue systems. Currently, there are many studies related to emotion recognition based on different approaches. However, most of them are based on corpora built from utterances recorded by actors under supervised conditions. This is no longer the prevailing trend because of its lack of realism [20]. Hence, there are many efforts to emulate real-life conditions in the emotion recognition research area. The first attempt to work with spontaneous speech utterances seems to be [16], where the authors collected and analysed infant-directed speech. However,


it is difficult to compare the results of different approaches when they use different data and different evaluation methods. Recently, the Interspeech 2009 Emotion Challenge [14] tried to solve these problems. It proposed a framework to generalise the research on this topic by offering a corpus of spontaneous speech. It also defined a training and a test subset in order to allow speaker independence during the analysis. The main goal of this paper is to present a study of emotion recognition from spontaneous speech under real-life conditions, improving the results of previous works. To emulate real-life conditions we used a spontaneous speech corpus that includes non-prototypical data with low emotional intensity, a very unbalanced distribution of emotional labels and a garbage class without a specific emotion definition. To improve the performance of the classifiers we used acoustic and linguistic features. Since the set of acoustic features was much larger than the set of linguistic ones, we processed the acoustic set with a wrapper approach to reduce it by selecting the most relevant features. A correct choice of the acoustic parameters can improve the classification results, as stated in [7]. In the next step we combined the acoustic and the linguistic parameters at the feature level before starting the classification experiment. This paper is structured as follows: Section 2 describes the corpus and its parameterisation. Section 3 describes the experiment and details the methodology, the creation of the selected set of features and the learning algorithms. Section 4 summarises and discusses the achieved results before concluding the paper (Section 5).

2 Corpus

In this work we used the FAU Aibo Corpus [18] as it was defined in [14]. In this Section we describe the corpus and its acoustic and linguistic parameterisation.

2.1 Description

The FAU Aibo Corpus collected audio recordings of German speech from the interaction of children from two schools playing with the Sony Aibo robot in a Wizard of Oz scenario. These audio recordings were divided into 18,216 chunks. To guarantee speaker independence, the chunks from one school were chosen to create one fold (fold 1), while the chunks from the other school were used to create a second fold (fold 2). Each parameterised chunk was considered an instance of the datasets that we used to train and test the classification schemes. The number of resulting instances was 9,959 for fold 1 and 8,257 for fold 2. The corpus was labelled with five categories: Anger (A), which included the angry (annoyed), touchy (irritated, a previous step of anger) and reprimanding (reproachful) affective states; Emphatic (E) (accentuated and often hyper-articulated speech, but without emotion); Neutral (N); Positive (P), which included the motherese (like infant-directed speech from the child to the robot) and joyful (the child enjoyed a situation) states; and Rest (R), a garbage class that


collected the affective states of surprise (in a positive sense), boredom (with a lack of interest in the interaction with the robot) and helplessness (doubtful, with disfluencies and pauses). The distribution of classes was highly unbalanced, as shown in Fig. 1. For a full description of this version of the corpus cf. [14].
Fig. 1. Number of instances per class for folds 1 and 2 of the FAU Aibo Corpus (Fold 1 / Fold 2): A 881 / 611, E 2,093 / 1,508, N 5,590 / 5,377, P 674 / 215, R 721 / 546.

2.2 Acoustic parameterisation

The acoustic analysis of the corpus consisted of calculating 16 low-level descriptors (LLDs) per chunk and their derivatives. These LLDs were: the zero-crossing rate (ZCR) analysed in the time signal, the root mean square (RMS) frame energy, the fundamental frequency (F0) normalised to 500 Hz, the harmonics-to-noise ratio (HNR) and 12 mel-frequency cepstral coefficients (MFCC). We calculated 12 functionals from these LLDs: the mean, the standard deviation, the kurtosis and the skewness, the value, range and position of the extremes, and two linear regression coefficients with their mean square error (MSE). To perform the parameterisation we used the openSMILE software, included in the openEAR toolkit release [1]. With this parameterisation, each instance of the datasets is associated with an acoustic feature vector of 16 × 2 × 12 = 384 elements.
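As an illustration of this step, the sketch below assembles such a chunk-level vector from frame-wise LLD contours in Python with NumPy. It is a minimal approximation of the functionals named above, not the actual openSMILE/openEAR configuration used in the paper, and the LLD extraction itself is assumed to be done elsewhere.

```python
import numpy as np

def functionals(contour):
    """12 chunk-level statistics of one frame-wise LLD contour (approximation)."""
    x = np.asarray(contour, dtype=float)
    t = np.arange(len(x))
    slope, offset = np.polyfit(t, x, 1)              # linear regression coefficients
    mse = np.mean((offset + slope * t - x) ** 2)     # regression mean square error
    std = x.std() + 1e-12
    return np.array([
        x.mean(), x.std(),
        ((x - x.mean()) ** 4).mean() / std ** 4,     # kurtosis
        ((x - x.mean()) ** 3).mean() / std ** 3,     # skewness
        x.max(), x.min(),                            # extreme values
        x.argmax() / len(x), x.argmin() / len(x),    # relative positions of extremes
        x.max() - x.min(),                           # range
        slope, offset, mse,
    ])

def chunk_vector(llds):
    """llds: 16 frame-wise contours of one chunk.
    Returns the 16 * 2 * 12 = 384-element acoustic feature vector
    (functionals of each LLD and of its first-order delta)."""
    feats = []
    for c in llds:
        c = np.asarray(c, dtype=float)
        feats.append(functionals(c))
        feats.append(functionals(np.diff(c, prepend=c[0])))  # delta contour
    return np.concatenate(feats)
```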

2.3 Linguistic parameterisation

The linguistic information is based on the words that the children used to communicate with the robot Aibo. The FAU Aibo Corpus provided the transcriptions of the utterances of both folds. We used the emotional salience proposed by [9] to convert the words of a chunk into emotion-related attributes. An emotionally salient word is a word that appears more often in one emotion category than in the others. From the list of salient words of a chunk (those that exceeded a salience threshold), an activation feature vector was computed following [9]. The dimension of this linguistic feature vector was 5 elements, one for each emotion.
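The following Python sketch gives a rough idea of how such an activation vector could be built from a chunk transcription. It loosely follows the emotional salience idea of [9]; the salience table (per-word scores and per-class weights) is assumed to have been estimated on the training fold beforehand, and all names and values are illustrative.

```python
import numpy as np

EMOTIONS = ["A", "E", "N", "P", "R"]

def activation_vector(words, salience, threshold=0.3):
    """words: tokenised transcription of one chunk.
    salience: dict word -> (salience score, per-class weights), estimated on training data.
    Returns the 5-element linguistic feature vector, one value per emotion class."""
    act = np.zeros(len(EMOTIONS))
    for w in words:
        if w in salience:
            score, weights = salience[w]
            if score > threshold:            # keep only emotionally salient words
                act += np.asarray(weights)   # accumulate per-class evidence
    return act

# Hypothetical usage with a toy salience table (German words from the corpus domain)
salience = {"fein": (0.8, [0.0, 0.1, 0.2, 1.5, 0.1]),   # "fine" -> mostly Positive
            "nein": (0.6, [1.2, 0.4, 0.1, 0.0, 0.2])}   # "no"   -> mostly Anger
print(activation_vector("aibo nein nein stopp".split(), salience))
```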


3 Experimentation

In this Section we explain the methodology of the experiment, the feature selection method, the preprocessing of the data and the learning algorithms studied.

3.1 Methodology

The acoustic feature vector contained a large amount of information. The inclusion of irrelevant attributes could deteriorate the performance of the classifiers used in the learning stage [19]. Moreover, if these data were merged with the linguistic features, the resulting vectors would be very unbalanced, because they would contain many more acoustic features than linguistic ones. Feature selection techniques are intended to create a subset of features by eliminating irrelevant input variables (i.e. variables with little predictive information), which can improve the resulting classifiers and yield a more generalizable model [5]. We used a wrapper method to select the best subset of acoustic features before merging them with the linguistic parameters, as explained in Section 3.2. After reducing the number of acoustic parameters, we used feature-level fusion to merge the acoustic and the linguistic information. As described in [17], a feature-level fusion scheme integrates unimodal features before learning concepts. Its main advantages are the use of only one learning stage and the exploitation of mutual information. We concatenated the acoustic and linguistic feature vectors to create a multimodal representation of each instance.

We evaluated the classification schemes in a 2-fold cross-validation manner: we used one of the schools for training and the other for testing, and vice versa, to guarantee speaker independence in the experiments. The measure used to compare the effectiveness of the classification approaches was the unweighted average recall (UAR). We chose this measure because the distribution of the classes in the FAU Aibo Corpus is very unbalanced (cf. Fig. 1). In most studies of emotion recognition the weighted average recall (WAR) is used instead, because the class distribution of the studied corpora is usually quite balanced. By considering UAR instead of WAR we aimed at the most even class-wise performance. This is meaningful in a natural interaction scenario because neutral interactions are the most usual ones and, in this case, detecting the interactions with emotional content is as important as detecting the neutral ones. Equation 1 shows that the recall for one class c was calculated as the proportion of correctly classified cases with respect to the number of instances of this class. Equation 2 shows the computation of the UAR of the classifier considering the recalls of each class c.

$$\mathrm{recall}_c = \frac{TP_c}{TP_c + FN_c} \qquad (1)$$

$$\mathrm{UAR} = \frac{1}{|C|} \sum_{c=1}^{|C|} \mathrm{recall}_c \qquad (2)$$

where $TP_c$ stands for True Positives of class c, $FN_c$ stands for False Negatives of class c and $|C|$ represents the number of classes.
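As a reference, UAR can be computed directly from the predicted and true labels. The sketch below is a straightforward implementation of Equations 1 and 2 (scikit-learn's recall_score with average='macro' yields the same value).

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred, classes=("A", "E", "N", "P", "R")):
    """Mean of the per-class recalls (Equations 1 and 2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for c in classes:
        mask = (y_true == c)
        if mask.any():                                   # skip classes absent from y_true
            recalls.append((y_pred[mask] == c).mean())   # TP_c / (TP_c + FN_c)
    return float(np.mean(recalls))
```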


3.2 Feature selection and dataset preprocessing

There are two strategies to select the best subset of features from a dataset: filter methods (based only on the characteristics of the data) and wrapper methods (which use a specific classifier to evaluate each subset) [3]. In this study we used a wrapper method with a Naïve-Bayes classification scheme to assess the goodness of the chosen subset. We chose the Naïve-Bayes algorithm because of its simplicity and because it obtained the best classification results with this corpus in our previous work [11]. The second key point when defining a wrapper feature selection method is the choice of the search strategy in the space of feature subsets. In our case we chose a greedy forward search, starting with no features and adding one at each iteration until the addition of a new element decreased the evaluation. To speed up this selection, we resampled the training dataset, reducing it by half and biasing it to a uniform class distribution. We used only the training set to select the subsets of features and evaluated them on all the training data. The dataset was thus reduced from 384 acoustic features to 28: 21 related to the MFCC parameters, 3 related to the RMS frame energy, 2 related to F0, 1 related to the HNR and 1 related to the ZCR. We concatenated these acoustic parameters with the 5 linguistic features, obtaining a feature vector of 33 elements per instance. A sketch of the selection and fusion steps is given below.

We also preprocessed the datasets used to train the classifiers by resampling them: we biased them to a uniform class distribution by means of resampling with replacement, duplicating the total number of instances. We did this for the training stage because the classification algorithms were intended to maximize the WAR instead of the UAR. Biasing the distribution to make it uniform improved the classification performance except in the case of the Naïve-Bayes algorithm; for this reason, we did not apply this resampling preprocessing to that classification scheme.
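The following scikit-learn sketch illustrates the greedy forward wrapper selection and the feature-level fusion by concatenation. The evaluation inside the wrapper (here a 2-fold cross-validated Gaussian Naive Bayes) and the stopping rule are simplified with respect to the procedure described above, the resampling step is omitted, and X_acoustic, X_linguistic and y are assumed to be given.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict

def greedy_forward_selection(X, y, max_features=None):
    """Wrapper selection: greedily add the feature that most improves the UAR
    of a Naive-Bayes classifier; stop when no addition improves the evaluation."""
    selected, best_uar = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining and (max_features is None or len(selected) < max_features):
        scored = []
        for f in remaining:
            pred = cross_val_predict(GaussianNB(), X[:, selected + [f]], y, cv=2)
            scored.append((recall_score(y, pred, average="macro"), f))
        uar, f = max(scored)
        if uar <= best_uar:        # stop when the evaluation stops improving
            break
        best_uar = uar
        selected.append(f)
        remaining.remove(f)
    return selected

# Feature-level fusion by concatenation: 28 selected acoustic features + 5 linguistic ones
# idx = greedy_forward_selection(X_acoustic, y)             # e.g. 28 indices
# X_fused = np.hstack([X_acoustic[:, idx], X_linguistic])   # 33-element vectors
```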

3.3 Experiment description

We evaluated three different classification approaches, using the implementations provided by the WEKA data mining toolkit [19]. The first learning algorithm was a Naïve-Bayes classifier. This algorithm was found to be the most relevant in [11] despite its simplicity, so it was used as a baseline for the experiment described here. To improve the performance of this classifier we applied, prior to classification, a supervised discretisation process based on Fayyad and Irani's Minimum Description Length (MDL) method [2]. The second classification approach was a support vector machine (SVM) scheme. We chose an SVM with a linear kernel, using sequential minimal optimisation learning [12] and pairwise multi-class discrimination [4] to allow the algorithm


to deal with a problem of five classes. The third classifier was a logistic model tree as described in [8], a model tree that uses logistic regression instead of linear regression at the leaves. This is named Simple Logistic in WEKA.
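For orientation, the snippet below sets up rough scikit-learn analogues of the three schemes. They are not the WEKA implementations used in the paper: the MDL discretisation for Naive Bayes is omitted and Simple Logistic is replaced by a plain logistic regression, so the results would differ.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

classifiers = {
    # Baseline; the WEKA version was preceded by supervised MDL discretisation [2]
    "Naive-Bayes": GaussianNB(),
    # Linear-kernel SVM; libsvm's SVC uses pairwise (one-vs-one) multi-class discrimination
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    # Stand-in for WEKA's Simple Logistic (logistic model tree family)
    "Simple Logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# 2-fold, speaker-independent evaluation: train on one school, test on the other
# for name, clf in classifiers.items():
#     clf.fit(X_fold1, y_fold1)
#     print(name, unweighted_average_recall(y_fold2, clf.predict(X_fold2)))
```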

4 Results

Table 1 shows the results of the experiment described above. UAR results appear grouped into three categories: using all the acoustic and linguistic features, using only the 28 selected acoustic features, and using these 28 acoustic features plus the 5 linguistic parameters. For each algorithm we show three results: the Fold 1 column indicates the results obtained when training the classifiers with school 1 and testing with school 2, the Fold 2 column is the opposite, and the third result is the mean of the two folds.
Table 1. UAR of the classifiers. The Fold 1 and Fold 2 columns indicate the UAR obtained when training with fold 1 and testing with fold 2, and vice versa, respectively. Results are grouped into three categories: all the features, a reduced set of 28 acoustic features, and a set of 28 acoustic plus 5 linguistic features.

Algorithm        | All features (389)    | Acoustic features (28) | Acoustic + linguistic (33)
                 | Fold 1  Fold 2  Mean  | Fold 1  Fold 2  Mean   | Fold 1  Fold 2  Mean
Naïve-Bayes      | 40.46   39.32   39.89 | 27.94   29.52   28.73  | 33.90   40.92   37.41
SVM              | 40.64   44.70   42.67 | 39.44   38.48   38.96  | 41.60   47.66   44.63
Simple Logistic  | 38.44   46.26   42.35 | 39.30   38.36   38.83  | 44.06   48.20   46.13

Results show that the selection of 28 acoustic features alone provided results below those obtained with all the features. However, the addition of the 5 linguistic parameters to the reduced dataset improved the performance of the classifiers. For example, in the case of the best classifier, the Simple Logistic algorithm, UAR was 46.13% considering the reduced acoustic dataset plus the linguistic features, and 42.35% considering all the features. In this case, a reduction from 389 to 33 features improved this classifier by 3.78% absolute (8.93% relative). This was not true for the Naïve-Bayes algorithm, which obtained its best performance considering all the features instead of the reduced dataset of 33 features (39.89% vs. 37.41%, respectively), in accordance with [13]. Considering the results with the reduced dataset of acoustic and linguistic features, the Simple Logistic algorithm (46.13%) improved on the Naïve-Bayes classifier (37.41%) by 8.72% absolute (23.31% relative). To compare these results with the experiments carried out by other authors in the same scenario, the Fold 1 column of Table 1 must be taken into account: it shows the performance of the classification algorithms when using school 1 for training and school 2 for testing, in the manner detailed in [14]. In [15] the authors compiled a list of results achieved by the individual participants in the Interspeech 2009 Emotion Challenge and of their fusion by a majority voting scheme.


[6] obtained the best result under the same conditions as this paper (41.65%). The fusion of the best 7 contributions to the Challenge achieved 44% UAR. The result obtained in this paper by means of the Simple Logistic classifier and only 33 features improved on the result of [6] by 2.41% absolute (5.79% relative) and on the result of the fusion in [15] by 0.06% absolute (0.14% relative). Our result is thus quite similar to that of [15], but it must be noted that the number of features involved was dramatically lower in our study, as was the complexity of the learning scheme.
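For clarity, the relative figures quoted above are the absolute differences divided by the reference value, e.g. for the comparison with [6]:

$$44.06\% - 41.65\% = 2.41\% \ \text{absolute}, \qquad \frac{2.41}{41.65} \approx 5.79\% \ \text{relative}.$$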

5 Conclusions

Emotion recognition studies have usually dealt with acted data, which implies a lack of realism. Although recent approaches use more realistic data, results are usually difficult to compare, as stated in [14]. We used the same conditions as [14] to carry out an experiment improving the latest results related to emotion recognition from spontaneous children's speech, working with realistic data in a multispeaker scenario. Results showed the importance of an appropriate feature selection. We improved the performance of the classifiers by working with the feature-level fusion, by vector concatenation, of 28 acoustic and 5 linguistic parameters instead of all of them. This represented an 8.93% relative improvement in the case of the best learning algorithm (Simple Logistic) over the same algorithm without feature selection (389 features). Comparing our result with the most recent one obtained in [15] by fusion of classifiers, the performances were similar (an improvement of 0.14% relative); however, our result was obtained with a smaller feature set and a simpler learning scheme. The linguistic modality proved to be an important source of information in this task. The result obtained by the Simple Logistic algorithm considering the acoustic and the linguistic features (46.13% UAR) was 7.3% absolute (18.8% relative) above the same learning scheme discarding the linguistic features (38.83% UAR). This improvement was also observed for the Naïve-Bayes (37.41% UAR vs. 28.73% UAR) and the SVM (44.63% UAR vs. 38.96% UAR) classifiers. Future work will deal with transcriptions of words obtained from an automatic speech recogniser (ASR). With these transcriptions we will be able to check the previous statements in a completely real scenario.

References
1. Eyben, F., Wöllmer, M., Schuller, B.: openEAR - Introducing the Munich Open-Source Emotion and Affect Recognition Toolkit. In: 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2009. pp. 576–581. Amsterdam, The Netherlands (2009)
2. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: 13th International Joint Conference on Artificial Intelligence. pp. 1022–1029 (1993)


3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
4. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. Annals of Statistics 26(2), 451–471 (1998)
5. Kim, Y., Street, N., Menczer, F.: Feature selection in data mining. In: Wang, J. (ed.) Data Mining: Opportunities and Challenges, pp. 80–105. Idea Group Publishing (2003)
6. Kockmann, M., Burget, L., Černocký, J.: Brno University of Technology System for Interspeech 2009 Emotion Challenge. In: 10th Annual Conference of the International Speech Communication Association. pp. 348–351. Brighton, UK (2009)
7. Kostoulas, T., Ganchev, T., Lazaridis, A., Fakotakis, N.: Enhancing emotion recognition from speech through feature selection. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) Text, Speech and Dialogue, LNCS, vol. 6231, pp. 338–344. Springer Berlin/Heidelberg (2010)
8. Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Machine Learning 59(1-2), 161–205 (May 2005)
9. Lee, C.M., Narayanan, S.S.: Towards detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing 13, 293–303 (2005)
10. Picard, R.W., Vyzas, E., Healey, J.: Toward machine emotional intelligence: Analysis of affective physiological state. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(10), 1175–1191 (October 2001)
11. Planet, S., Iriondo, I., Socoró, J.C., Monzo, C., Adell, J.: GTM-URL Contribution to the Interspeech 2009 Emotion Challenge. In: 10th Annual Conference of the International Speech Communication Association. pp. 316–319. Brighton, UK (2009)
12. Platt, J.: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods: Support Vector Learning. MIT Press (1998)
13. Rish, I.: An empirical study of the naive Bayes classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence 3(22), 41–46 (2001)
14. Schuller, B., Steidl, S., Batliner, A.: The Interspeech 2009 Emotion Challenge. In: 10th Annual Conference of the International Speech Communication Association. pp. 312–315. Brighton, UK (2009)
15. Schuller, B., Batliner, A., Steidl, S., Seppi, D.: Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, In Press, Corrected Proof (2011)
16. Slaney, M., McRoberts, G.: Baby Ears: a recognition system for affective vocalizations. In: 1998 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 985–988 (1998)
17. Snoek, C.G.M., Worring, M., Smeulders, A.W.M.: Early versus late fusion in semantic video analysis. In: 13th Annual ACM International Conference on Multimedia. pp. 399–402 (2005)
18. Steidl, S.: Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech. Logos Verlag (2009)
19. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, CA, 2nd edn. (June 2005)
20. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(1), 39–58 (January 2009)
