Proceedings of the 2nd International Conference on Current Trends in Engineering and Management ICCTEM-2014
17-19 July 2014, Mysore, Karnataka, India

INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)
ISSN 0976-6464 (Print)
ISSN 0976-6472 (Online)
Volume 5, Issue 8, August (2014), pp. 160-170
IAEME: http://www.iaeme.com/IJECET.asp
Journal Impact Factor (2014): 7.2836 (Calculated by GISI)
www.jifactor.com

EMOTIONAL ANALYSIS AND EVALUATION OF KANNADA SPEECH DATABASE

Pallavi J1, Geethashree A2, Dr. D J Ravi3

1 Student, Master of Technology, ECE, VVCE, Mysore, Karnataka, India
2 Asst. Professor, Dept. of ECE, VVCE, Mysore, Karnataka, India
3 Professor and HOD, Dept. of ECE, VVCE, Mysore, Karnataka, India

ABSTRACT
Emotion is an affective state of consciousness that involves feeling and plays a significant role in communication. It is therefore necessary to analyse and evaluate a speech database in order to build an effective emotion recognition system and an efficient man-machine interface. This paper presents and discusses the development of an emotional Kannada speech database, its analysis, and its evaluation using the Mean Opinion Score (MOS), PNN and k-NN.
Keywords: k-Nearest Neighbours (k-NN), Probabilistic Neural Network (PNN), Speech Corpus.
I. INTRODUCTION
Emotion plays an important role in day-to-day interpersonal human interactions, and recent findings suggest that emotion is integral to our rational and intelligent decisions. A successful solution to the challenging problem of recognising emotion from speech would enable a wide range of important applications: correct assessment of the emotional state of an individual could significantly improve the quality of emerging, natural-language-based human-computer interfaces [1, 3, 6]. Emotion also helps us relate to each other by expressing our feelings and providing feedback.
There have been many studies of emotional speech [3, 4, 7-10], but most of them concern English, Hindi and other languages; there is therefore a need to study these aspects for Kannada speech as well. Both prosody-related features [13] and spectral features need to be investigated for emotion recognition. In this work, 50-500 LPC coefficients are studied as spectral features, while the mean value of pitch (F0), intensity, sound pressure and Power Spectral Density (PSD) are studied as prosody-related features. The human capability to recognise emotion from speech is also studied and compared with machine classifiers.
This important aspect of human interaction needs to be considered in the design of human-machine interfaces.
Initially, a listening test of sample sentences was conducted to identify the speakers' emotions from auditory impressions, and the Mean Opinion Score (MOS) was collected. The speakers' emotions in the sample sentences were then identified with a probabilistic neural network (PNN) and a k-nearest neighbours (k-NN) classifier using LPC features, and the PRAAT software package was used to extract the pattern of acoustic parameters for the sample sentences [2].
II. EMOTIONAL DATABASE
Obtaining an emotional corpus is quite difficult in itself. Various methods have been used in the past, such as acted speech, speech obtained from movies or television shows, and speech recorded during event recall [2, 5, 6].
The database is composed of 4 different emotions (happy, sad, anger and fear) plus neutral, as uttered by two male Kannada actors, and consists of a total of 60 sentences containing a minimum of 3 to a maximum of 7 words. The first step was to record the voice for each word and sentence. All recordings were made in a recording studio at a sampling rate of 44100 Hz with a mono channel. The sentences used for statistical analysis are listed in Table 1.
Table 1: Sentences used in analysis

Sent. | English gloss of the Kannada sentence
S2 | Long live like the wind.
S3 | I am blessed, as I protected the lives of the elders.
S5 | I have fought with and experienced so many people like you.
S5 | Aravinda is my disciple.
S6 | I study during the night.
S7 | He might be a Brahmin; there is no doubt about it.
S8 | Father, who is that fellow who troubles us?
III. ANALYSIS
Pitch is strongly correlated with the fundamental frequency of the sound. It occupies a central place in the study of prosodic attributes, as it is the perceived fundamental frequency of the sound [3, 4, 8]. It differs from the actual fundamental frequency because of the overtones inherent in the sound.
Fig. 1 to Fig. 5 show the pitch and intensity of the different emotions for Sentence 6. Table 2 gives the mean pitch of the different emotions, and Fig. 6 shows the variation of mean pitch across emotions. Mean pitch is highest in fear and lowest in sadness compared to the other emotions.
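As an illustration only, the following sketch shows one way the mean pitch and mean intensity of a single utterance could be extracted with Praat from Python via the parselmouth package; the file name and the default analysis settings are assumptions, not values taken from the paper.

```python
import parselmouth
from parselmouth.praat import call

# Hypothetical file name; any of the recorded 44.1 kHz mono utterances would do.
snd = parselmouth.Sound("s6_fear.wav")

pitch = snd.to_pitch()                                   # Praat's default pitch analysis
mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")         # mean F0 over the whole utterance

intensity = snd.to_intensity()
mean_db = call(intensity, "Get mean", 0, 0, "energy")    # mean intensity in dB

print(f"mean pitch: {mean_f0:.2f} Hz, mean intensity: {mean_db:.2f} dB")
```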

Figure 1: Pitch and intensity of neutral sentence

Figure 2: Pitch and intensity of emotion (sad)

Figure 3: Pitch and intensity of emotion (fear)

Figure 4: Pitch and intensity of emotion (anger)

Figure 5: Pitch and intensity of emotion (happy)


Table 2: Mean pitch of sentences in different emotions (Hz)

Sent. | Neutral | Sadness | Fear | Anger | Happy
S1 | 129.12 | 119.71 | 209.53 | 189 | 140.4
S2 | 116.95 | 137.37 | 198.84 | 189 | 135.2
S3 | 123.33 | 131.45 | 195.83 | 210 | 176.3
S4 | 113.37 | 116.56 | 164.74 | 177 | 162.7
S5 | 125.55 | 156.28 | 226.61 | 195 | 172.5
S6 | 103.04 | 160.46 | 202.5 | 223 | 153.2
S7 | 108.97 | 124.59 | 192.17 | 174 | 127.7
S8 | 108.87 | 107.61 | 165.21 | 136 | 110
Table 3 gives the intensity of the different emotions and Fig. 7 shows its variation. Intensity is highest in anger and lowest in fear.

Figure 6: Mean pitch of 8 sentences in different emotions


Table 3: Intensity of different emotions (dB)

Sent. | Neutral | Sad | Fear | Anger | Happy
S1 | 85.64 | 84.94 | 88.78 | 90.88 | 90.39
S2 | 85.50 | 79.29 | 78.29 | 83.15 | 84.17
S3 | 87.33 | 84.82 | 87.70 | 89.17 | 90.51
S4 | 83.29 | 88.01 | 88.99 | 91.98 | 86.93
S5 | 86.39 | 86.98 | 89.16 | 91.30 | 90.61
S6 | 83.22 | 85.35 | 88.98 | 87.28 | 85.59
S7 | 88.92 | 86.48 | 88.00 | 92.16 | 85.74
S8 | 88.14 | 87.70 | 87.26 | 87.95 | 85.51

Figure 7: Intensity of different emotions


For analysis, the speech signal is decomposed into a number of frames, each of which may be voiced or unvoiced. While voiced frames carry the prosodic features, unvoiced frames carry excitation features along with the prosodic features, so it is also necessary to analyse the unvoiced frames.
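As a minimal sketch (not necessarily the paper's exact procedure), frames can be labelled voiced or unvoiced from short-time energy and zero-crossing rate; the frame sizes, thresholds and file name below are assumptions chosen for illustration.

```python
import numpy as np
import soundfile as sf

def unvoiced_percentage(path, frame_ms=25, hop_ms=10,
                        energy_thresh=0.02, zcr_thresh=0.25):
    """Percentage of unvoiced frames, using an energy + zero-crossing-rate rule."""
    x, fs = sf.read(path)
    if x.ndim > 1:
        x = x[:, 0]                                   # keep a single (mono) channel
    x = x / (np.max(np.abs(x)) + 1e-12)               # normalise amplitude
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_unvoiced = n_frames = 0
    for start in range(0, len(x) - frame, hop):
        f = x[start:start + frame]
        energy = np.mean(f ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
        n_frames += 1
        # low energy or a high zero-crossing rate is treated as unvoiced
        if energy < energy_thresh or zcr > zcr_thresh:
            n_unvoiced += 1
    return 100.0 * n_unvoiced / max(n_frames, 1)

print(unvoiced_percentage("s1_fear.wav"))             # hypothetical file name
```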
Table 4 contains the percentage of unvoiced frames in each sentence for all emotions. Fig. 8 shows that unvoiced frames are highest in fear and lowest in happy compared to the other emotions. The sound pressure influences the intensity, which in turn affects the power at each formant. The PSD of the different emotions is plotted in Fig. 9 and the sound pressure in Fig. 10. Irrespective of emotion, the lip radiation for a given sentence or utterance remains the same. The rate of vocal fold vibration changes for different emotions, causing a change in spectral tilt, which greatly influences the emotions. This indicates that not only the prosodic features but also the excitation source influences the emotions. Fig. 11 shows the vocal fold variance in different emotions.
Table 4: Percentage of unvoiced frames in different emotions

Sent. | Neutral | Sadness | Fear | Anger | Happy
S1 | 17.88% | 43.08% | 54.37% | 28.41% | 25.73%
S2 | 31.14% | 33.93% | 39.02% | 19.41% | 24.74%
S3 | 14.86% | 28.37% | 29.43% | 23.17% | 27.32%
S4 | 30.77% | 25.65% | 43.56% | 19.16% | 20.15%
S5 | 34.04% | 43.28% | 50.00% | 37.69% | 38.53%
S6 | 29.44% | 27.38% | 53.40% | 31.10% | 30.09%
S7 | 23.61% | 32.16% | 41.76% | 22.55% | 27.25%
S8 | 25.94% | 27.45% | 29.13% | 32.28% | 40.29%

Figure 8: Percentage of unvoiced frames in different emotions

Figure 9: PSD in different emotions


Figure 10: Pressure of sound in different emotions


From the analysis of individual parameters such as intensity, pitch, number of unvoiced frames, sound pressure, PSD and vocal fold influence, it is very difficult to characterise each emotion, and when the statistical variance of these values is considered it becomes even more difficult. It is therefore necessary to design an envelope representation that captures all of the above characteristics. This can be done using LPC, LSF, MFCC or LFCC; in this work we make use of LPC.

Figure 11: Vocal fold variance in different emotions

Figure 12: Spectrogram of the neutral sentence

Figure 13: Spectrogram of Emotion (sad)


Figure 14: Spectrogram of Emotion (Fear)

Figure 15: Spectrogram of Emotion (Anger)

Figure 16: Spectrogram of Emotion (Happy)


The effects of excitation, which cannot be seen in the prosodic analysis, can be seen in spectrogram analysis, which uses nonparametric methods for non-stationary signals.
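As an illustration of such a nonparametric (short-time Fourier) analysis, the following sketch plots a spectrogram with SciPy; the file name and STFT settings are assumptions, not the paper's values.

```python
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt
from scipy import signal

x, fs = sf.read("s6_anger.wav")                    # hypothetical file name
if x.ndim > 1:
    x = x[:, 0]

# Short-time Fourier transform based spectrogram (nonparametric analysis)
f, t, Sxx = signal.spectrogram(x, fs=fs, window="hann", nperseg=1024, noverlap=768)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram")
plt.colorbar(label="Power (dB)")
plt.show()
```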
IV. FEATURE EXTRACTION
The performance of an emotion classifier relies heavily on the quality of the speech data. LPC is a powerful speech-signal analysis technique: it determines the coefficients of a forward linear predictor by minimising the prediction error in the least-squares sense. It has applications in filter design and speech coding, since LPC provides a good approximation of the vocal tract spectral envelope. LPC finds the coefficients of a pth-order linear predictor (FIR filter) that predicts the current value of the real-valued time series x from its past samples.

Figure 17: Block diagram of LPC


Here p is the order of the prediction filter polynomial and a = [1, a(2), ..., a(p+1)] is the vector of predictor coefficients. If p is unspecified, a default of p = length(x) - 1 is used. If x is a matrix containing a separate signal in each column, a model estimate is returned for each column, together with a column vector of prediction error variances g. The order p must be less than or equal to the length of x.
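For illustration, a sketch of extracting LPC coefficients in Python with librosa is shown below; the file name is hypothetical, and the prediction order 50 is chosen only because orders 50 and 500 are the ones compared later in the paper.

```python
import librosa

# Load one utterance (hypothetical file name); recordings are 44.1 kHz mono.
x, fs = librosa.load("s3_happy.wav", sr=44100, mono=True)

p = 50                              # prediction order; 50 and 500 are compared later
a = librosa.lpc(x, order=p)         # a = [1, a(2), ..., a(p+1)], as described above
print(a.shape)                      # (p + 1,)
```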
LPC analyses the speech signal by estimating the formants, removing their effect from the signal, and estimating the intensity and frequency of the remaining buzz. This process is called inverse filtering, and what remains is called the residue. The excitation signal obtained from LPC analysis is mostly viewed as an error signal, and it contains higher-order relations: the strength of excitation, the characteristics of the glottal volume-velocity waveform, the shape of the glottal pulse and the variance of the vocal folds.
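A minimal sketch of this inverse-filtering step, assuming the librosa/SciPy stack and a hypothetical file name: the LPC coefficients define the inverse filter A(z), and passing the speech through A(z) yields the residual (excitation) signal.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

x, fs = librosa.load("s3_happy.wav", sr=44100, mono=True)   # hypothetical file name
a = librosa.lpc(x, order=50)

# Inverse filtering: A(z) = 1 + a(2) z^-1 + ... + a(p+1) z^-p applied to the speech
residual = lfilter(a, [1.0], x)

# Normalised prediction-error energy: a crude measure of what the model leaves unexplained
g = np.mean(residual ** 2) / np.mean(x ** 2)
print(f"normalised prediction error: {g:.4f}")
```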
V. EVALUATION
Evaluation was carried out in two ways.
Evaluation by listeners: A perception test was conducted and the Mean Opinion Score was taken. The main objective of the perception test is to validate the recorded voice for recognition of emotion. The test involved 25 people from various backgrounds. Sentences were played to the listeners in random order, and they were asked to identify the emotion expressed in each utterance, choosing from a list of the 4 emotions along with neutral. The MOS of the test was then calculated.
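As a minimal sketch (with made-up responses, not the paper's data), listener judgements can be tallied into a per-emotion confusion matrix of the kind shown later in Table 5:

```python
import numpy as np

EMOTIONS = ["neutral", "sadness", "fear", "anger", "happy"]

def confusion_matrix(acted, perceived):
    """Rows: acted emotion; columns: emotion reported by listeners (in percent)."""
    idx = {e: i for i, e in enumerate(EMOTIONS)}
    counts = np.zeros((len(EMOTIONS), len(EMOTIONS)))
    for a, p in zip(acted, perceived):
        counts[idx[a], idx[p]] += 1
    row_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return 100 * counts / row_totals

# Hypothetical (acted, perceived) pairs, one per listener per sentence
acted     = ["anger", "anger", "fear",    "fear", "happy"]
perceived = ["anger", "anger", "sadness", "fear", "happy"]
print(confusion_matrix(acted, perceived).round(1))
```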
Evaluation by classifiers:
Probabilistic neural network (PNN): A PNN is closely related to the Parzen-window Probability Density Function (PDF) estimator. A PNN consists of several sub-networks, each of which is a Parzen-window PDF estimator for one of the classes. The input nodes take the set of measurements. The second layer consists of Gaussian functions centred on the given training data points. The third layer averages the outputs of the second layer for each class, and the fourth layer performs a vote, selecting the largest value; the associated class label is then assigned.

Figure 18: PNN classifier

In general, a PNN for M classes is defined as

$y_j(\mathbf{x}) = \frac{1}{n_j} \sum_{i=1}^{n_j} \exp\!\left(-\frac{\lVert \mathbf{x}_{j,i} - \mathbf{x}\rVert^2}{2\sigma^2}\right)$    ---------(1)

where n_j denotes the number of data points in class j and σ is the spread (smoothing) parameter of the Gaussian kernel. The PNN assigns x to class k if y_k(x) > y_j(x) for all j in [1, M], j ≠ k; the term ||x_{j,i} - x||^2 is calculated as the sum of squared differences.
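A minimal NumPy sketch of Eq. (1), with made-up feature vectors standing in for the LPC features; the value of σ and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=1.0):
    """Eq. (1): class-wise Parzen-window average of Gaussian kernels; the input x
    is assigned to the class whose summation-layer output y_k(x) is largest."""
    classes = np.unique(train_y)
    scores = []
    for c in classes:
        Xc = train_X[train_y == c]                    # pattern layer for class c
        d2 = np.sum((Xc - x) ** 2, axis=1)            # ||x_{j,i} - x||^2
        scores.append(np.mean(np.exp(-d2 / (2 * sigma ** 2))))
    return classes[int(np.argmax(scores))]            # decision (voting) layer

# Hypothetical toy data: rows are 50-dimensional feature vectors (e.g. LPC coefficients)
rng = np.random.default_rng(0)
train_X = rng.standard_normal((20, 50))
train_y = np.array(["anger"] * 10 + ["sadness"] * 10)
print(pnn_classify(rng.standard_normal(50), train_X, train_y, sigma=2.0))
```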
k-Nearest Neighbours (k-NN): In pattern recognition, the k-nearest neighbours algorithm is a non-parametric method used for classification; its output depends on the value of k.
In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbours, the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbour.

Figure 19: Block diagram of emotion recognition


In k-NN regression, the output is the property value for the object. This value is the average
of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning,
where the function is only approximated locally and all computation is deferred until classification.
The k-NN algorithm is among the simplest of all machine learning algorithms.
For classification it can be useful to weight the contributions of the neighbours, so that nearer neighbours contribute more to the vote than more distant ones. A common weighting scheme, for example, gives each neighbour a weight of 1/d, where d is the distance to the neighbour.
The neighbors are taken from a set of objects for which the class (for k-NN classification) or
the object property value (for k-NN regression) is known. This can be thought of as the training set
for the algorithm, though no explicit training step is required.
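As a sketch (with made-up feature vectors in place of the LPC features), k-NN classification with the 1/d distance weighting described above can be run with scikit-learn; the data shapes and labels below are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: one 50-dimensional LPC feature vector per utterance
rng = np.random.default_rng(0)
X_train = rng.standard_normal((40, 50))
y_train = np.repeat(["neutral", "sadness", "fear", "anger", "happy"], 8)

# weights="distance" applies the 1/d weighting scheme mentioned above
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)

X_test = rng.standard_normal((3, 50))
print(knn.predict(X_test))
```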
VI. RESULTS AND DISCUSSION
EVALUATION OF EMOTION
Evaluation by people: The confusion matrix created after calculating the MOS is shown in Table 5. The most recognised emotion was anger (91%), while the least recognised emotion was fear (70%). From the table it can also be observed that fear is the most confusing emotion, being confused mainly with sadness. The average recognition rate was 81%, and the order of recognition of the emotions is anger > neutral > sadness > happy > fear.

Table 5: Confusion matrix of the perception test

Category | Neutral | Sadness | Fear | Anger | Happy
Neutral | 89% | 2% | 1% | 6% | 2%
Sadness | 4% | 78% | 11% | 4% | 3%
Fear | 3% | 18% | 70% | 7% | 2%
Anger | 5% | 1% | 1% | 91% | 2%
Happy | 10% | 1% | 1% | 11% | 77%
Evaluation by classifiers: The LPC coefficients are fed as input to both algorithms for classification of the emotions. The results obtained with the two methods are almost the same: as the number of coefficients and k increase, the accuracy in detecting emotions such as sadness and fear increases, but the ambiguity in detecting the other emotions (neutral, happy, anger) also increases. As the number of coefficients and k decrease, the accuracy in detecting neutral, happy and anger increases, while ambiguity remains between sad and fear.
Table 6: Confusion matrix of evaluation of emotions by k-NN and PNN

(a) LPC = 50, k = 1
Category | Neutral | Sadness | Fear | Anger | Happy
Neutral | 70% | 2% | 5% | 3% | 20%
Sadness | 30% | 11% | 6% | 30% | 23%
Fear | 35% | 10% | 5% | 25% | 25%
Anger | 12% | 5% | 8% | 65% | 10%
Happy | 5% | 2% | 5% | 20% | 68%

(b) LPC = 500, k = 5
Category | Neutral | Sadness | Fear | Anger | Happy
Neutral | 20% | 2% | 8% | 30% | 40%
Sadness | 6% | 69% | 20% | 5% | 0%
Fear | 2% | 11% | 68% | 19% | 0%
Anger | 30% | 5% | 8% | 22% | 35%
Happy | 20% | 25% | 5% | 30% | 20%

VII. CONCLUSION
In this paper, the prosodic and excitation features of Kannada speech have been analysed from spoken sentences for the important categories of emotion. It has been observed that the prosodic features (F0, A0, D), along with the excitation parameters (PSD, sound pressure and vocal fold variance), play a significant role in the expression of emotion. The database created to express the emotions has been evaluated, and both the prosodic and the excitation parameters have been used for training the PNN and k-NN classifiers. The results show that there is ambiguity in detecting emotions such as neutral, anger and happy versus sad and fear as the number of LPC coefficients and the value of k vary. This work can be enhanced using MFCC, LFCC and PFCC, and further studies should be conducted using a database created from natural conversations.
REFERENCES
[1] Takashi & Norman D. Cook, "Identifying Emotion in Speech Prosody Using Acoustical Cues of Harmony", INTERSPEECH, ISCA, 2004.
[2] Paul Boersma and David Weenink, "Praat: doing phonetics by computer", November 2009. [Online]. Available: http://www.fon.hum.uva.nl/praat/
[3] Sendlmeier, W. F., Kienast, M. and Paeschke, A., "F0 contours in Emotional Speech", Technische Universität Berlin, Proc. ICPhS, 1999.
[4] Mozziconacci, S. J. L. and Hermes, D. J., "Role of Intonational Patterns in Conveying Emotion in Speech", Proc. ICPhS, 1999.
[5] Kwon, O. W., Chan, K. L., Hao, J., et al., "Emotion Recognition by Speech Signals", Eurospeech, Geneva, Switzerland, 2003.
[6] Rong, J., Li, G. and Chen, Y.-P. P., "Acoustic feature selection for automatic emotion recognition from speech", Journal of Information Processing and Management, 2009.
[7] D. J. Ravi and Sudarshan Patil Kulkarni, "Kannada Text to Speech Synthesis Systems: Emotion Analysis", International Conference on Natural Language Processing (ICON-2009).
[8] Sushma Bahuguna and Y. P. Raiwani, "A Study of Acoustic Features Pattern of Emotion Expression for Hindi Speech", International Journal of Computer Engineering & Technology (IJCET); Measurement Science Review, Volume 10, No. 3, 2010.
[9] J. Přibil and A. Přibilová, "An Experiment with Evaluation of Emotional Speech Conversion by Spectrograms", Institute of Photonics and Electronics, Academy of Sciences CR, Prague, Czech Republic.
[10] Slobodan T. Jovičić, Zorka Kašić, Miodrag Đorđević and Mirjana Rajković, "Serbian emotional speech database: design, processing and evaluation", SPECOM 2004: 9th Conference on Speech and Computer, St. Petersburg, Russia, 20-22 September 2004.
[11] Shashidhar G. Koolagudi and Sreenivasa Rao Krothapalli, "Two stage emotion recognition based on speaking rate", Springer Science+Business Media, published online 11 December 2010.
[12] Shashidhar G. Koolagudi and K. Sreenivasa Rao, "Emotion recognition from speech: a review", Springer Science+Business Media, published online 4 January 2012.
[13] Syed Abbas Ali, Sitwat Zehra, Mohsin Khan and Faisal Wahab, "Development and Analysis of Speech Emotion Corpus using Prosodic Features for Cross Linguistics", International Journal of Scientific & Engineering Research, Vol. 4, Issue 1, January 2013, ISSN 2229-5518.
