
Using the Lyapunov Exponent from Cepstral Coefficients for Automatic Emotion Recognition

Marius Dan Zbancioc¹,² and Monica Feraru¹
¹ Institute of Computer Science, Romanian Academy, Iaşi, România
² Technical University “Gheorghe Asachi” of Iaşi, România
zmarius@etti.tuiasi.ro, monica.feraru@iit.academiaromana-is.ro

Abstract— The main goal of this paper is to establish the relevance of nonlinear parameters (Lyapunov exponents) in the automatic classification of emotions for the Romanian language. The Largest Lyapunov Exponent (LLE) was computed from the MFCC mel-frequency cepstral coefficients and the LPCC linear prediction cepstral coefficients. The Support Vector Machine (SVM) classifier provides better results than the Weighted K-Nearest Neighbors (WKNN) classifier in emotion recognition for feature vectors that contain the LLE (around 75%). The emotion best recognized by the SVM classifier was the neutral tone, followed by sadness and fury, while the weakest recognized was joy. For feature vectors which include the LLE, the best results were obtained in combination with the LAR (Log Area Ratio) coefficients and the PARCOR (partial correlation) coefficients, respectively.

Keywords - Largest Lyapunov exponent, cepstral coefficients, automatic emotion recognition

I. INTRODUCTION

Recently, automatic emotion recognition has come to play an important role in the research area. Emotion recognition from speech has many applications in various areas, such as speaker recognition, improving speech synthesizers, learning a new language, health care, etc. In the analysis of speech emotion recognition, one of the important stages is how to extract and select the best speech features. The features commonly extracted are: the pitch, the formants, the energy, and some spectral features, for example Mel-Frequency Cepstrum Coefficients (MFCC), Linear Prediction Cepstrum Coefficients (LPCC), Log Frequency Power Coefficients (LFPC), Perceptual Linear Predictive (PLP) coefficients, and Mel-energy spectrum dynamic coefficients (MEDC).

The researchers in [1] used the SVM as classifier for emotion recognition and, as feature vectors, the pitch, the energy and the MFCC, LPCC and MEDC coefficients. They tried to classify three emotional states (happiness, sadness and neutral tone), and the emotional databases used were the Berlin Database of Emotional Speech and their self-built Chinese emotional database. The best accuracy rate (91.3% for the Chinese database and 95.1% for the Berlin database) was obtained by combining MFCC+MEDC+Energy.

In [2], MFCC, LPC (Linear Prediction Coefficients) and LPCC were used as the important speech features. The results obtained with the SVM classifier reached 84.2% and with the NN (neural network) 80.8%. The emotion best recognized by applying the SVM was anger, while happiness was confused with the sadness state and boredom with the neutral tone. Applying the NN, better performance was obtained for sadness and the neutral tone, and it “perfectly” distinguishes between happiness and boredom.

In order to classify the emotional state, in [3] the SVM classifier was used on features extracted from IITKGP-SESC. The extracted parameters were the prosodic features (the energy, the pitch and the formants) and the spectral features (MFCC and LPCC). They concluded that using both prosodic and spectral features gives a better recognition rate than using only one. The recognition rates of the combinations made on the German model were: energy+pitch (33.33%), MFCC+MEDC (86.66%), MFCC+MEDC+LPCC (86.66%), MFCC+MEDC+Energy+pitch (90%), MFCC+MEDC+Energy (91.30%).

In [4] the researchers stated that MFCC and LPCC “are used as the correlates of vocal tract information” in the emotion recognition process. They used the simulated emotional speech corpus (IITKGP-SESC) and the Berlin emotional speech database (Emo-DB). As classifiers they applied GMM (Gaussian Mixture Models) and AANNs (Auto-Associative Neural Networks). The results obtained show that using the LPCC gives better results (69%) than using the MFCC. The happy state is recognized “well by most of the features”. They concluded that this can be explained because the LPCC “represent the speech production characteristics, by analyzing all frequency components in a uniform manner”. Emotion recognition was better with GMM than with AANNs.

Regarding the classification of seven emotions from the emotional Berlin database, the researchers in [5] used SVM and the accuracy obtained was 68%. In [6] it was concluded that the combination of two or more features gives good accuracy. Using MFCC as features [7] and KNN as classifier, the accuracy rate obtained was 67%. Using MFCC and SVM, the accuracy rate was 65% in [8].

In order to classify five emotions (anger, happiness, sadness, surprise and the neutral state) using SVM and, as parameters, the MFCC, the periodicity histogram and the fluctuation pattern, recognition rates between 55.40% and 68% were obtained in [9] for the Danish Emotion Speech (DES) database. In paper [10] the accuracy rate achieved using HMM, in order to recognize five emotions, was 88.7%
for the emotions of the Mandarin database. They selected six parameters (LPCC, MFCC, LFPC, jitter, PLP and LPC) and, in order to obtain the best recognition of the emotions, they used single parameters and different combinations of these parameters. Using only MFCC and only LPCC the accuracy was 68.21% and 68.68% respectively; by combining these two parameters they obtained 68.97% recognition; by combining all six parameters the accuracy was 83.91%.

The paper is structured as follows. In the next section, some brief information regarding the emotional corpus SRoL is presented, and in section III the feature vector extraction algorithm is discussed. The analysis of the results obtained and the concluding remarks are presented in the last section.

II. SROL DATABASE – EMOTIONAL CORPUS

The SRoL project (Sounds of the Romanian Language) is a collection of sound files and of instruments for speech processing. It can be freely accessed at the webpage: www.etc.tuiasi.ro/sibm/romanian_spoken_language/index.html. The collection contains utterances from ordinary people who are not professional voices (like actors, for example). For the emotional corpus the speakers pronounced seven sentences with four emotional states: sadness, joy, fury and neutral tone.

Because this emotional database does not contain professional voices, it was necessary to validate the expressed emotions for each file. After validation, from a total of 396 files (each file contains several pronunciations of some sentences), 145 sound files with 545 sentences were kept. From every sentence a feature vector with 208 parameters was extracted. Only 203 parameters are useful in classification; the other 5 help to identify the sound file and the speaker.

The recorded sentences from the SRoL corpus are: S1. ”You came back to me / Ai venit iar la mine”, S2. “Yesterday evening / Aseară”, S3. “Who done that? / Cine a făcut asta?”, S4. “You will get the desired place / Îţi vei câştiga locul dorit”, S5. “My man done it / Omul meu îl lucră”, S6. “Anyway, you can get the desired place / Oricum îţi poţi câştiga locul dorit”, S7. “Mother is coming / Vine mama”.

TABLE I. THE FEATURE VECTORS

Index | Parameter
1, 2 | sentence name, number of appearances
3-5 | mean, standard deviation, median of F0
6-8, 9-11, 12-14, 15-17 | mean, standard deviation, median of the formants F1 (6-8), F2, ..., F4 (15-17)
18, 19, 20 | jitter extracted with 2 methods, shimmer
21-32, 33-44 | mean, standard deviation of the 12 MFCC mel-frequency cepstral coef.
45-56, 57-68 | derivative parameters ΔMFCC, ΔΔMFCC
69-89 | mean of 21 LPCC linear predictive cepstral coef.
90-104, 105 | LPC linear predictive coef., prediction error (105)
106-119, 120-133, 134-148 | PARCOR partial correlation coef., LAR log area ratio coef., AC autocorrelation coef.
149-160, 161-181 | Lyapunov from MFCC, respectively from LPCC
182-193 | Energy from 12 mel-frequency bands
194-205 | Lyapunov from energy
206, 207, 208 | ID of recording, sentence and speaker
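For readers who want to work with the layout of Table I programmatically, the index ranges can be collected in a small lookup structure. The sketch below (in Python) is only illustrative: the slice boundaries follow Table I, but the names FEATURE_LAYOUT and select_features are ours and are not part of the SRoL tooling.

```python
# Hypothetical index map of the 208-parameter feature vector from Table I.
# Slices use 1-based inclusive indices as in the paper; converted to 0-based for NumPy.
import numpy as np

FEATURE_LAYOUT = {
    "id_fields":      (1, 2),     # sentence name, number of appearances
    "F0_stats":       (3, 5),     # mean, stdev, median of F0
    "formant_stats":  (6, 17),    # mean, stdev, median of F1..F4
    "jitter_shimmer": (18, 20),
    "MFCC_mean":      (21, 32),
    "MFCC_stdev":     (33, 44),
    "delta_MFCC":     (45, 56),
    "delta2_MFCC":    (57, 68),
    "LPCC_mean":      (69, 89),
    "LPC":            (90, 105),  # includes the prediction error at index 105
    "PARCOR":         (106, 119),
    "LAR":            (120, 133),
    "AC":             (134, 148),
    "Lyap_MFCC":      (149, 160),
    "Lyap_LPCC":      (161, 181),
    "mel_energy":     (182, 193),
    "Lyap_energy":    (194, 205),
    "file_ids":       (206, 208),
}

def select_features(vector, *groups):
    """Concatenate the requested parameter groups from one 208-value vector."""
    parts = [vector[lo - 1:hi] for lo, hi in (FEATURE_LAYOUT[g] for g in groups)]
    return np.concatenate(parts)

# e.g. the Table II combination "120-133 coef LAR, 161-181 Lyap LPCC":
# x = select_features(vec, "LAR", "Lyap_LPCC")
```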
III. FEATURE VECTORS EXTRACTION

Because the emotion recognition is performed on feature vectors which contain prosodic information such as the fundamental frequency F0 and the formants F1-F4, the extraction of the other parameters was also performed only on the vowel phonemes (vowels and voiced consonants, depending on the context). In the following, when we refer to consonants we mean only the unvoiced/voiceless areas.

We mention this because, for emotion recognition, the amplitude of the sounds in the voice, the accent and the inflections of the voice are important. The pitch is present only in the vowel areas and not in the consonant areas. Only the areas which contain F0 were considered from the beginning as areas of interest for speech emotion recognition. If speaker recognition were the goal, then the information from the consonant (unvoiced) areas would also be necessary.

Recently, we proposed a robust method for the C/V segmentation based on predictive neural networks, described extensively in [12], [15]. After the optimization of the method the running times were significantly reduced, so that the segmentation can be made in real time.

The functional blocks of the application for feature vector extraction, which builds the testing and training sets of the classifiers, are described in the following figure (the diagram shows the preprocessing chain: C/V segmentation, framing, pre-emphasis and Hamming windowing, followed by the FFT / mel filter bank / logarithm + DCT path that yields the MFCC, ΔMFCC and ΔΔMFCC, and by the autocorrelation / Levinson-Durbin path that yields the LPC, PARCOR, LAR and LPCC coefficients).

Figure 1. Cepstral coefficients extraction block diagram

The preprocessing is made in four stages:

P1. The C/V segmentation – the voiced areas, where the signal is quasi-periodic, were extracted, and the consonant areas, where we have a noise-like signal and where the spectral energy is concentrated in the high-frequency area, were eliminated. The segmentation method based on the neural networks predicts / approximates the signal better in the periodic areas of the vowels than in the consonant areas, where the prediction error is significantly higher. The number of samples in each frame is N = 564, for a sampling frequency fs = 22.05 kHz.
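The P1 stage relies on the predictive neural networks described in [12], [15], which are not reproduced here. The sketch below only illustrates the underlying idea, that the normalized prediction error is low in quasi-periodic (voiced) frames and high in noise-like (unvoiced) frames, using a plain least-squares linear predictor as a stand-in for the neural network; the predictor order and the threshold are placeholders, not values from the paper.

```python
import numpy as np

def prediction_error_ratio(frame, order=8):
    """Normalized residual energy of a short-term linear predictor.

    Voiced (quasi-periodic) frames are predicted well, so the ratio is small;
    noise-like consonant frames are predicted poorly, so the ratio is close to 1.
    """
    # simple least-squares predictor over the previous `order` samples
    X = np.column_stack([frame[order - k - 1: len(frame) - k - 1] for k in range(order)])
    y = frame[order:]
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coefs
    return float(np.sum(residual ** 2) / (np.sum(y ** 2) + 1e-12))

def is_voiced(frame, threshold=0.2):
    # threshold is an arbitrary placeholder, not a value reported in the paper
    return prediction_error_ratio(frame) < threshold
```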
P2. Framing / windowing – from the vowel areas, analysis windows sw(t) / frames of length 25.6 ms were extracted. The overlapping of two consecutive frames is 60%, which means the windows are located at a distance of 10 ms from each other.
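A minimal sketch of the P2 framing stage is given below, assuming a mono voiced segment sampled at 22.05 kHz. The frame length of 564 samples (about 25.6 ms) and the 10 ms hop follow the values given above; the function name and the synthetic test signal are our own.

```python
import numpy as np

def frame_signal(voiced_signal, fs=22050, frame_len=564, hop_ms=10.0):
    """Split a voiced segment into overlapping analysis frames.

    frame_len = 564 samples is ~25.6 ms at 22.05 kHz; a 10 ms hop gives
    roughly 60% overlap between consecutive frames, as described above.
    """
    hop = int(round(fs * hop_ms / 1000.0))                   # ~220 samples
    n_frames = 1 + max(0, (len(voiced_signal) - frame_len) // hop)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = voiced_signal[i * hop : i * hop + frame_len]
    return frames

# Example: 0.5 s of a synthetic "vowel" (sum of harmonics of 120 Hz)
t = np.arange(int(0.5 * 22050)) / 22050.0
vowel = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 5))
frames = frame_signal(vowel)        # shape: (n_frames, 564)
```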
P3. Pre-emphasis – a filter which increases the amplitude of the high frequencies was applied. This filter is useful in our application because in the vowel regions the spectral energy lies predominantly in the low-frequency bands [0-2000] Hz. The filter coefficient α has the value 0.7.

$s_p[k] = s_w[k] - \alpha \cdot s_w[k-1], \quad k = 1, \dots, N$   (1)

P4. Hamming windowing – by multiplying the pre-emphasized window s_p with this tapering function, the samples located close to the frame borders are reduced in amplitude compared with the samples situated in the center of the frame, which receive higher weights. The rectangular window is not recommended because it introduces false information into the spectrum. The Hamming window moderates the undesirable effects on the DFT (Discrete Fourier Transform) produced by the discontinuities at the borders of the analysis window. The values of the parameters a and b are a = 0.54, b = 1 − a.

$h_w[k] = a - b \cdot \cos\left(2\pi \frac{k-1}{N-1}\right), \quad k = 1, \dots, N$   (2)

$\hat{s}_w[k] = s_p[k] \cdot h_w[k]$   (3)
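A compact sketch of stages P3 and P4 follows, assuming frames such as those produced by the framing sketch above. The coefficient α = 0.7 and the Hamming parameters a = 0.54, b = 1 − a follow the text; the function names are ours.

```python
import numpy as np

def pre_emphasis(frame, alpha=0.7):
    """Eq. (1): boost high frequencies, s_p[k] = s_w[k] - alpha * s_w[k-1]."""
    out = np.copy(frame)
    out[1:] = frame[1:] - alpha * frame[:-1]
    return out

def hamming(N, a=0.54):
    """Eq. (2): tapering window with maximum weight in the middle of the frame."""
    k = np.arange(N)
    return a - (1.0 - a) * np.cos(2.0 * np.pi * k / (N - 1))

def preprocess_frame(frame):
    """Eq. (3): pre-emphasized frame multiplied by the Hamming window."""
    return pre_emphasis(frame) * hamming(len(frame))

# windowed = np.array([preprocess_frame(f) for f in frames])   # 'frames' from the framing sketch
```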
After the preprocessing stages, the cepstral coefficients LPCC and MFCC can be extracted from the windows ŝw(t). For the extraction of the mel-frequency cepstral coefficients it is necessary to make the transition to the frequency domain. The processing steps are denoted M1-M4 and are described below:

M1. FFT – computes the spectrum S(f) for each frame. This spectrum is the convolution of the spectrum of the input signal with the spectrum of the Hamming window. The frequencies are linearly distributed in the interval [0, fs/2] in N/2+1 points.

M2. Mel-frequency bank filtering – the frequency scale is mapped onto a logarithmic scale. The mel scale is used in the perceptual analysis of the vocal signal and reflects the manner in which the human ear processes sounds. The conversion of normal frequencies into mel frequencies is made by the following formula:

$f_{mel} = 1127 \cdot \ln(1 + f / 700)$   (4)

Up to the corner frequency of 700 Hz the bandwidths of the triangular filters are approximately equal, but because above this value the bandwidths grow logarithmically, a normalization of the filters is required. This operation is detailed in [12, 13].

M3. Logarithm + DCT – the cepstrum computation requires the inverse Fourier transform of the logarithm of the squared spectrum of the signal. This explains the necessity of applying the logarithm. The MFCC values are obtained with the classical formula of the Discrete Cosine Transform:

$v_{MFCC}(c) = \sum_{j=1}^{n_f} \log_{10}(E_j) \cdot \cos\left(c \cdot (j - 0.5) \frac{\pi}{n_f}\right)$   (5)

where E_j represents the mel energies of the n_f = 40 triangular filters. A number of 13 coefficients are computed, but because the first coefficient v_MFCC(0) is the mean of the input set, only 12 values are kept (c = 1, ..., 12).

M4. Delta MFCC and delta-delta MFCC extraction – these represent the derivative of the cepstral vector. The derivative coefficients are computed from windows situated within a distance of 50 ms (the step between two consecutive windows is 10 ms).

$v_{\Delta MFCC}(c, k) = v_{MFCC}(c, k-2) - v_{MFCC}(c, k+2)$   (6)

The ΔΔMFCC are computed with the same derivative formula, but applied to the ΔMFCC.
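The sketch below strings together steps M1-M4 for a sequence of preprocessed frames. It is a simplified illustration: the mel mapping of Eq. (4) and nf = 40 triangular filters are used as in the text, but the FFT size of 1024 points is our own choice, the filter normalization of [12, 13] is omitted, and all function names are assumptions.

```python
import numpy as np

FS, NFFT, NF = 22050, 1024, 40            # sampling rate, FFT size, number of mel filters

def hz_to_mel(f):
    return 1127.0 * np.log(1.0 + f / 700.0)          # Eq. (4)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

def mel_filterbank(nf=NF, nfft=NFFT, fs=FS):
    """Triangular filters with edges equally spaced on the mel scale (no normalization)."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), nf + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fb = np.zeros((nf, nfft // 2 + 1))
    for j in range(nf):
        lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
        if mid > lo:
            fb[j, lo:mid] = np.linspace(0.0, 1.0, mid - lo, endpoint=False)
        if hi > mid:
            fb[j, mid:hi] = np.linspace(1.0, 0.0, hi - mid, endpoint=False)
    return fb

FB = mel_filterbank()

def mfcc(frame, n_coef=12):
    """M1-M3: FFT power spectrum -> mel energies -> log10 -> DCT, Eq. (5)."""
    spectrum = np.abs(np.fft.rfft(frame, NFFT)) ** 2          # M1
    E = np.maximum(FB @ spectrum, 1e-12)                      # M2, mel energies E_j
    c = np.arange(1, n_coef + 1)[:, None]                     # keep c = 1..12 (drop c = 0)
    j = np.arange(1, NF + 1)[None, :]
    return (np.log10(E) * np.cos(c * (j - 0.5) * np.pi / NF)).sum(axis=1)   # M3

def delta(coeffs):
    """M4, Eq. (6): v_delta(c, k) = v(c, k-2) - v(c, k+2), frames 10 ms apart."""
    padded = np.pad(coeffs, ((2, 2), (0, 0)), mode="edge")
    return padded[:-4] - padded[4:]

# mfcc_frames = np.array([mfcc(w) for w in windowed])   # 'windowed' from the previous sketch
# d_mfcc = delta(mfcc_frames)
# dd_mfcc = delta(d_mfcc)
```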
The extraction of the LPCC requires the following steps:

L1. Autocorrelation computation – with the classical formula:

$C_{ss}[i] = \sum_{k=0}^{N-1-i} \hat{s}_w[k] \cdot \hat{s}_w[k+i], \quad i = 0, \dots, n_{AC}$   (7)

L2. Levinson-Durbin method – applies a recursive algorithm in order to solve a prediction problem. Each autocorrelation coefficient is predicted using the other nAC = 14 values. The values obtained at the last iteration give the linear prediction coefficients LPC.

$C_{ss}[i] = \sum_{k=1}^{n_{AC}} \alpha_k \cdot C_{ss}[i-k], \quad i = 1, \dots, n_{AC}$   (8)

$v_{LPC}[0] = 1, \quad v_{LPC}[k] = -\alpha_k$   (9)
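A sketch of steps L1 and L2 is given below: the autocorrelation of Eq. (7) followed by the Levinson-Durbin recursion for the normal equations (8), returning the LPC coefficients in the sign convention of Eq. (9) together with the PARCOR (reflection) coefficients and the LAR values used in Table I. The order nAC = 14 follows the text; the LAR sign convention and the absence of error handling are our own simplifications.

```python
import numpy as np

def autocorrelation(frame, n_ac=14):
    """Eq. (7): C_ss[i] = sum_k s[k] * s[k+i], i = 0..n_ac."""
    return np.array([np.dot(frame[: len(frame) - i], frame[i:]) for i in range(n_ac + 1)])

def levinson_durbin(c, n_ac=14):
    """Solve Eq. (8) recursively; return LPC (Eq. 9), PARCOR and LAR coefficients."""
    a = np.zeros(n_ac + 1)
    a[0] = 1.0
    err = c[0]
    parcor = np.zeros(n_ac)
    for m in range(1, n_ac + 1):
        k = -(c[m] + np.dot(a[1:m], c[m - 1:0:-1])) / err    # reflection coefficient
        parcor[m - 1] = k
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        err *= (1.0 - k * k)
    lpc = a                                                  # v_LPC[0] = 1, v_LPC[k] = -alpha_k
    lar = np.log((1.0 - parcor) / (1.0 + parcor))            # log area ratios (sign convention varies)
    return lpc, parcor, lar, err

# lpc, parcor, lar, pred_err = levinson_durbin(autocorrelation(windowed[0]))
```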
L3. Log area ratio coefficients – are computed from vLPC. The Levinson-Durbin method and the computation of the PARCOR and LAR coefficients are described in more detail in [14].

L4. LPCC conversion – the computation of the cepstral coefficients LPCC is different for the first values, up to the order nAC, and for the higher orders, up to nLPCC = 21:

$v_{LPCC}[k] = v_{LPC}[k], \quad k < n_{AC}$   (10)

$v_{LPCC}[k] = \sum_{j=1}^{k-1} \frac{j}{k} \cdot v_{LPCC}[j] \cdot v_{LPC}[\mathrm{mod}(k-j,\, n_{AC})]$
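A sketch of step L4 follows. It assumes that Eq. (10) is applied for k < nAC and that the mod-indexing in the last formula is simply meant to keep vLPC within its valid order (terms with k − j > nAC drop out); whether this matches the authors' exact implementation is not clear from the text, so treat it as our reading. The values nAC = 14 and nLPCC = 21 follow the text.

```python
import numpy as np

def lpc_to_lpcc(lpc, n_ac=14, n_lpcc=21):
    """Step L4: Eq. (10) for the low orders, the cepstral recursion for k >= n_ac."""
    lpcc = np.zeros(n_lpcc + 1)
    lpcc[1:n_ac] = lpc[1:n_ac]                   # Eq. (10): v_LPCC[k] = v_LPC[k], k < n_ac
    for k in range(n_ac, n_lpcc + 1):
        acc = 0.0
        for j in range(1, k):
            if 0 < k - j <= n_ac:                # keep v_LPC within its valid order
                acc += (j / k) * lpcc[j] * lpc[k - j]
        lpcc[k] = acc
    return lpcc[1:]                              # 21 cepstral values, as in Table I

# lpcc = lpc_to_lpcc(lpc)                        # 'lpc' from the Levinson-Durbin sketch
```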
From this set of parameters we have computed the LLE – Largest Lyapunov Exponent, varying the embedding lag and the embedding dimension. The algorithm is detailed in [12].

Because the input series applied to the procedure for computing the nonlinear Lyapunov parameter are too short, we have used a method proposed by Rosenstein [11]. This algorithm was specially designed for small data sets and is fast and robust to changes and to noise.
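For orientation, a much-condensed sketch of a Rosenstein-style LLE estimate [11] on a short series, for instance the trajectory of one MFCC coefficient across the frames of an utterance, is given below. The embedding lag and dimension are the free parameters mentioned above; the default values and all names are ours, and this is not the implementation used for the paper (which is detailed in [12]).

```python
import numpy as np

def rosenstein_lle(series, emb_dim=4, lag=1, min_sep=5, max_t=15):
    """Rough Rosenstein-style estimate of the largest Lyapunov exponent.

    1. delay-embed the series, 2. find each point's nearest neighbour that is at
    least `min_sep` steps away in time, 3. follow the pairs for up to `max_t`
    steps and average log(divergence), 4. the LLE is the slope of that curve.
    """
    x = np.asarray(series, dtype=float)
    m = len(x) - (emb_dim - 1) * lag
    emb = np.array([x[i:i + (emb_dim - 1) * lag + 1:lag] for i in range(m)])

    # nearest neighbours with a temporal exclusion window
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    for i in range(m):
        lo, hi = max(0, i - min_sep), min(m, i + min_sep + 1)
        dist[i, lo:hi] = np.inf
    nn = np.argmin(dist, axis=1)

    # average logarithmic divergence of neighbouring trajectories
    horizon = min(max_t, m - 1)
    mean_log_div = []
    for t in range(horizon):
        d = [np.linalg.norm(emb[i + t] - emb[nn[i] + t])
             for i in range(m) if i + t < m and nn[i] + t < m]
        d = [v for v in d if v > 0]
        if d:
            mean_log_div.append(np.mean(np.log(d)))
    mean_log_div = np.array(mean_log_div)

    # slope of the divergence curve = estimated LLE (per frame step)
    t_axis = np.arange(len(mean_log_div))
    return np.polyfit(t_axis, mean_log_div, 1)[0]

# lle_mfcc1 = rosenstein_lle(mfcc_frames[:, 0])   # 'mfcc_frames' from the MFCC sketch
```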
IV. RESULTS AND CONCLUSIONS

The emotional corpus from SRoL contains at this moment around 400 files recorded from 25 speakers. In each file, 3-5 sentences are pronounced. The feature vectors were extracted only from the 145 files / 545 sentences left in the database after validation. The training data sets include 75% of the feature vectors, and the testing set contains 25%. The average error / global error is computed after 50 iterations; at each iteration the training and testing sets are reshuffled, keeping the 75-25% ratio.
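A minimal sketch of this evaluation protocol, using scikit-learn, is shown below. The 75/25 split, the 50 reshuffled iterations and the averaging of the global error follow the description above; the SVM kernel and hyperparameters, the feature scaling step and the variable names are assumptions, since the text does not specify them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def average_global_error(X, y, n_iter=50, test_size=0.25, seed=0):
    """Mean classification error (in %) over n_iter random 75/25 train/test splits."""
    errors = []
    for it in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + it)
        scaler = StandardScaler().fit(X_tr)
        clf = SVC(kernel="rbf", C=10.0, gamma="scale")     # assumed hyperparameters
        clf.fit(scaler.transform(X_tr), y_tr)
        errors.append(1.0 - clf.score(scaler.transform(X_te), y_te))
    return 100.0 * np.mean(errors)

# X: feature matrix (545 sentences x selected parameters), y: emotion labels
# e.g. the Table II row "120-133 coef LAR, 161-181 Lyap LPCC":
# err = average_global_error(np.hstack([lar_feats, lyap_lpcc_feats]), labels)
```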
In our previous research we have observed that the KNN classifier has good performance for cepstral coefficients, but does not offer a good classification of emotions when Lyapunov exponents are included in the feature vectors. We have verified this conclusion on the new training data sets. The smallest global error for the KNN classifier was around 50%, for {90-104 LPC coef., 161-181 Lyap. LPCC} and {21-32 MFCC, 149-160 Lyap. MFCC} respectively. Because the results for the KNN classifier were modest, we present in Table II only the results obtained with the SVM classifier:

TABLE II. SVM CLASSIFIER – EMOTION RECOGNITION ERROR WITH FEATURE VECTORS THAT CONTAIN LLE EXPONENTS

Parameters | Global error | Joy | Fury | Sadness | Neutral
149-160 Lyapunov from MFCC coef. | 26.77 | 32.16 | 30.07 | 22.39 | 15.67
161-181 Lyapunov from LPCC coef. | 27.33 | 32.48 | 30.30 | 23.31 | 17.14
194-205, 149-181 Lyap. Energy + Lyap. from MFCC and LPCC coef. | 29.46 | 36.39 | 31.88 | 24.96 | 17.44
194-205, 149-160 Lyap. Energy + MFCC | 27.40 | 33.88 | 29.70 | 22.99 | 16.04
194-205, 161-181 Lyap. Energy + LPCC | 28.54 | 35.19 | 30.30 | 24.59 | 17.52
21-32 MFCC, 149-160 Lyap. MFCC | 26.03 | 29.78 | 30.22 | 20.52 | 18.21
33-44 MFCC stdev, 149-160 Lyap. MFCC | 26.35 | 29.18 | 31.19 | 22.69 | 16.57
45-56 delta MFCC, 149-160 Lyap. MFCC | 26.91 | 33.66 | 28.58 | 22.61 | 16.04
182-193 Energy, 149-160 Lyap. MFCC | 26.17 | 33.43 | 28.21 | 20.45 | 15.60
90-104 coef. LPC, 161-181 Lyap. LPCC | 26.20 | 29.17 | 28.80 | 24.44 | 17.82
106-119 coef. PARCOR, 161-181 Lyap. LPCC | 24.51 | 26.77 | 26.54 | 22.71 | 18.72
120-133 coef. LAR, 161-181 Lyap. LPCC | 23.35 | 25.49 | 24.59 | 22.26 | 18.27
182-193 Energy, 161-181 Lyap. LPCC | 25.79 | 30.75 | 26.09 | 24.14 | 17.67

The SVM classifier provides a global error around 27% when the feature vectors contain only the LLE (Largest Lyapunov Exponent) extracted from the cepstral coefficients MFCC or LPCC. The emotion classification accuracy is improved when the feature vectors contain supplementary parameters. In this case the best recognition of 76.65% (error 23.35%) is obtained for Lyapunov from LPCC + LAR coefficients. A good recognition of 75.49% was also obtained for PARCOR + Lyapunov from LPCC.

The best results presented in this paper are comparable with the results obtained in our previous research, when the Lyapunov exponent was extracted from the energy of the mel-frequency bands and when the smallest global error, 23.59%, was obtained for {106-119 PARCOR + 194-205 Lyap. energy}. The neutral tone has the best recognition percentage, and the weakest recognized was the joy state. The emotion recognition accuracy obtained by us is comparable to those reported by others, with the mention that our database does not contain professional voices.

ACKNOWLEDGMENT

We acknowledge the Romanian Academy priority research grant “The cognitive systems and applications” and the entire research team of the project SRoL “Sounds of the Romanian Language”.

REFERENCES

[1] Y. Pan, P. Shen, L. Shen, “Feature Extraction and Selection in Speech Emotion Recognition”, vol. 2, pp. 64-69, 2012.
[2] T.-L. Pao, Y.-T. Chen, J.-H. Yeh, P.-J. Li, “Mandarin Emotional Speech Recognition Based on SVM and NN”, 18th Int. Conf. on Pattern Recognition, vol. 1, pp. 1096-1100, 2006.
[3] B.S. Yalamanchili, K.K. Anusha, K. Santhi, P. Sruthi, B. SwapnaMadhavi, “Non Linear Classification for Emotion Detection on Telugu Corpus”, Int. Journal of Computer Science and Information Technologies, vol. 5 (2), pp. 2443-2448, 2014.
[4] K.S. Rao and S.G. Koolagudi, “Robust Emotion Recognition using Spectral and Prosodic Features”, SpringerBriefs in Speech Technology, pp. 17-46, 2013.
[5] A. Milton, S. Sharmy Roy, S. Tamil Selvi, “SVM Scheme for Speech Emotion Recognition using MFCC Feature”, Int. Journal of Computer Applications, vol. 69, no. 9, 2013.
[6] S. Koolagudi and K. Sreenivasa Rao, “Emotion recognition from speech using source, system and prosodic features”, Int. J. Speech Tech., pp. 265-289, 2012.
[7] Y. Han, G. Wang, Y. Yang, “Speech Emotion Recognition Based on MFCC”, Journal of Chong Qing University of Posts and Telecommunication, Natural Science Edition, vol. 20(5), 2008.
[8] E. Albornoz, D. Milone and H. Rufiner, “Spoken emotion recognition using hierarchical classifiers”, Computer Speech and Language, vol. 25, pp. 556-570, 2011.
[9] V. Chavan, V.V. Gohokar, “Speech Emotion Recognition by using SVM-Classifier”, Int. Journal of Engineering and Advanced Technology - IJEAT, vol. 1, issue 5, 2012.
[10] T.-L. Pao, Y.-T. Chen, J.-H. Yeh, and W.-Y. Liao, “Detecting Emotions in Mandarin Speech”, Computational Linguistics and Chinese Language Processing, vol. 10, no. 3, pp. 347-362, 2005.
[11] M.T. Rosenstein, J.J. Collins and C.J. De Luca, “A practical method for calculating largest Lyapunov exponents from small data sets”, Physica D, 1993.
[12] M. Feraru and M. Zbancioc, “Emotion Recognition using Lyapunov Exponent of the Mel-Frequency Energy Bands”, Int. Conf. of Electronics, Computers and Artificial Intelligence - ECAI 2014, Bucureşti, România, in press.
[13] M. Zbancioc and M. Feraru, “A Study about MFCC Relevance in Emotion Classification for SROL Database”, Proc. of 4th Int. Symp. on Electrical and Electronics Engineering, ISEEE 2013, Galaţi, România, ISBN 978-1-4799-2441-7, IEEE Catalog Number CFP1393K-USB.
[14] M. Feraru and M. Zbancioc, “Emotion Recognition in Romanian Language using LPC Features”, The 4th IEEE Int. Conf. on E-Health and Bioengineering, Grigore T. Popa University of Medicine and Pharmacy, Iaşi, Romania, ISBN 978-1-4799-2372-4, pp. 1-4, 2013.
[15] M. Zbancioc and M. Feraru, “The automatic segmentation of the vocal signal using predictive neural network”, Int. Symposium on Signals, Circuits, and Systems - ISSCS 2013, ISBN 978-1-4799-3193-4, pp. 1-4.
