IJECET
IAEME
Geethashree A., Dr. D. J. Ravi
ABSTRACT
Emotion is an affective state of consciousness that involves feeling and plays a
significant role in communication. It is therefore necessary to analyse and evaluate a speech
database in order to build an effective emotion recognition system and an efficient man-machine
interface. This paper presents and discusses the development of an emotional Kannada speech
database, its analysis, and its evaluation using the Mean Opinion Score (MOS), PNN and k-NN.
Keywords: k-Nearest Neighbours (k-NN), Probabilistic Neural Network (PNN), Speech Corpus.
I. INTRODUCTION
Emotion plays an important role in day-to-day interpersonal human interaction; it helps us
relate to each other by expressing our feelings and providing feedback. Recent findings suggest
that emotion is integral to our rational and intelligent decisions. A successful solution to the
challenging problem of recognising emotion from speech would enable a wide range of important
applications: correct assessment of the emotional state of an individual could significantly
improve the quality of emerging natural-language-based human-computer interfaces [1, 3, 6].
There have been many studies of emotional speech [3, 4, 7-10], but most of them concern
English, Hindi and other languages; these aspects also need to be studied for Kannada speech.
Both prosody-related features [13] and spectral features must be investigated for the evaluation
of emotion recognition. Here, 50-500 LPC coefficients are used as spectral features, while the
mean pitch (F0), intensity, sound pressure and Power Spectral Density (PSD) are studied as
prosody-related features. The human capability to recognise emotion from speech was also
studied and compared with machine classifiers.
Proceedings of the 2nd International Conference on Current Trends in Engineering and Management ICCTEM-2014
17-19 July 2014, Mysore, Karnataka, India

This important aspect of human interaction needs to be considered in the design of human-machine
interfaces. Initially, a listening test on sample sentences was conducted to identify the speaker's
emotion from auditory impressions, and the Mean Opinion Score was collected. The speaker's
emotion in the sample sentences was then identified with a probabilistic neural network (PNN) and
k-nearest neighbours (k-NN) using LPC coefficients, and the PRAAT software package was
subsequently used to extract the pattern of acoustic parameters for the sample sentences [2].
II. EMOTIONAL DATABASE
Obtaining an emotional corpus is quite difficult in itself. Various methods have been used in
the past, such as acted speech, speech obtained from movies or television shows, and speech
recorded during event recall [2, 5, 6].
The database is composed of 4 different emotions (happy, sad, anger and fear) plus neutral,
as uttered by two male Kannada actors, and consists of a total of 60 sentences of 3 to 7 words
each. The first step was to record the voice for each word and sentence. All recordings were made
in a recording studio at a sampling rate of 44100 Hz with a mono channel. The sentences used for
statistical analysis are listed in Table 1.
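As a minimal sketch of verifying that a recorded utterance conforms to the corpus format described above (44100 Hz, mono), the standard-library wave module can be used; the file name here is hypothetical and a tiny conforming WAV is generated only so the check can be demonstrated.

```python
# Sketch: verifying that a recording matches the corpus format (44100 Hz, mono).
# The file name is hypothetical; any WAV from the corpus could be checked this way.
import struct
import wave

def check_corpus_format(path, expected_rate=44100, expected_channels=1):
    """Return True if the WAV file matches the recording setup used for the corpus."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == expected_rate
                and w.getnchannels() == expected_channels)

# Create a tiny mono 44.1 kHz file so the check can be demonstrated end to end.
with wave.open("utterance_s1.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                       # 16-bit samples
    w.setframerate(44100)
    w.writeframes(struct.pack("<100h", *([0] * 100)))

print(check_corpus_format("utterance_s1.wav"))   # True for a conforming file
```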
Table 1: Sentences used in analysis (the Kannada script was lost in extraction; only the
English glosses survive, listed against the sentence numbers as extracted)

S2: (Long live like the wind)
S3: (I am blessed, as I protected the lives of the elders)
S5: (I have fought with and experienced so many people like you)
S5: (Aravinda is my disciple)
S6: (I study during the night time)
S7: (He might be a Brahmin; there is no doubt about it)
S8: (Father, who is that fellow who troubles us?)
III. ANALYSIS
Pitch is strongly correlated with the fundamental frequency of the sound. It occupies a central
place in the study of prosodic attributes as it is the perceived fundamental frequency of the sound [3,
4 & 8]. It differs from the actual fundamental frequency due to overtones inherent in the sound
Fig 1 to Fig 5 shows the pitch and intensity of different emotions of Sentence 6. The table 2
shows the mean pitch of the different emotion and Fig.6 shows the variation of mean pitch in
different emotions. It shows that mean pitch is highest in fear and lowest in sadness when compare to
other emotions.
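The mean-pitch measurements above come from PRAAT; as a rough sketch of the underlying idea, F0 for a voiced frame can be estimated from the strongest autocorrelation peak in a plausible speech band (the 75-500 Hz search band is an assumption, and the frame here is a synthetic 200 Hz tone rather than real speech):

```python
# Sketch: F0 estimation by autocorrelation, the quantity PRAAT averages into
# "mean pitch". Search band 75-500 Hz is an assumed typical speech range.
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=500.0):
    """Return F0 in Hz for one frame, from the strongest autocorrelation lag in band."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])          # strongest periodicity in the band
    return sr / lag

sr = 44100
t = np.arange(int(0.04 * sr)) / sr           # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 200.0 * t)        # synthetic 200 Hz "voiced" frame
print(estimate_f0(frame, sr))                # close to 200 Hz
```

A real mean-pitch value, as reported in Table 2, would be the average of such per-frame estimates over the voiced frames of an utterance.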
Table 4 contains the percentage of unvoiced frames per sentence for all emotions. Fig. 8 shows
that the proportion of unvoiced frames is highest in fear and lowest in happy compared with the
other emotions. Sound pressure influences the intensity, which in turn affects the power at each
formant. The PSD of the different emotions is plotted in Fig. 9 and the sound pressure in Fig. 10.
Irrespective of emotion, the lip radiation for a given sentence or utterance remains the same; it
is the change in the vocal-fold vibration rate across emotions that reduces the spectral tilt,
which greatly influences the perceived emotion. This indicates that not only prosodic features
but also excitation-source features influence the emotions. Fig. 11 shows the vocal-tract
variations in the different emotions.
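A simple way to obtain unvoiced-frame percentages like those in Table 4 is a short-time-energy voicing test; this is only a sketch (the frame size and relative threshold are assumptions, and PRAAT's own voicing decision is more elaborate), demonstrated on a synthetic half-voiced, half-silent signal:

```python
# Sketch: percentage of unvoiced frames via a short-time-energy test.
# Frame size (25 ms) and threshold (10% of peak RMS) are illustrative assumptions.
import numpy as np

def unvoiced_percentage(signal, sr, frame_ms=25, rel_threshold=0.1):
    """Frames with RMS below rel_threshold * max frame RMS count as unvoiced."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    unvoiced = rms < rel_threshold * rms.max()
    return 100.0 * unvoiced.sum() / len(frames)

sr = 16000
t = np.arange(sr) / sr                               # 1 s test signal:
voiced = np.sin(2 * np.pi * 150 * t[: sr // 2])      # 0.5 s "voiced" tone
silence = np.zeros(sr // 2)                          # 0.5 s silence (unvoiced)
pct = unvoiced_percentage(np.concatenate([voiced, silence]), sr)
print(pct)                                           # 50.0 for this half-and-half signal
```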
Table 4: Percentage of unvoiced frames in different emotions

Sent. No   Neutral   Sadness   Fear      Anger     Happy
S1         17.88%    43.08%    54.37%    28.41%    25.73%
S2         31.14%    33.93%    39.02%    19.41%    24.74%
S3         14.86%    28.37%    29.43%    23.17%    27.32%
S4         30.77%    25.65%    43.56%    19.16%    20.15%
S5         34.04%    43.28%    50.00%    37.69%    38.53%
S6         29.44%    27.38%    53.40%    31.10%    30.09%
S7         23.61%    32.16%    41.76%    22.55%    27.25%
S8         25.94%    27.45%    29.13%    32.28%    40.29%
Figure 12: Spectrogram of the neutral sentence
In the lpc routine, p is the order of the prediction filter polynomial and
a = [1, a(2), ..., a(p+1)]. If p is unspecified, lpc uses p = length(x) - 1 by default. If x is a
matrix containing a separate signal in each column, lpc returns a model estimate for each column
in the rows of the output matrix, together with a column vector of prediction-error variances g.
The order p must be less than or equal to the length of x.
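As a minimal NumPy sketch of what this lpc routine computes (not the library code itself), the autocorrelation method solved with the Levinson-Durbin recursion yields the coefficient vector a = [1, a(2), ..., a(p+1)] and the prediction-error variance g; the AR(2) test signal below is an assumption used only to check that the recursion recovers known coefficients:

```python
# Sketch of an lpc routine: autocorrelation method + Levinson-Durbin recursion.
import numpy as np

def lpc(x, p):
    """Order-p linear-prediction coefficients a (a[0] = 1) and error variance g."""
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode="full")[len(x) - 1: len(x) + p]   # r[0..p]
    a = np.zeros(p + 1)
    a[0] = 1.0
    g = r[0]                                       # prediction-error energy
    for i in range(1, p + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / g    # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * np.r_[a[1:i][::-1], 1.0]
        g *= 1.0 - k * k
    return a, g

# Sanity check: an AR(2) process is recovered by an order-2 predictor.
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.75 * x[n - 1] - 0.5 * x[n - 2] + e[n]
a, g = lpc(x, 2)
print(np.round(a, 2))                              # approximately [1, -0.75, 0.5]
```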
LPC analyses the speech signal by estimating the formants, removing their contribution from
the speech, and estimating the intensity and frequency of the remaining buzz. This process is
called inverse filtering, and the remainder is called the residue. The excitation signal obtained
from LPC analysis is mostly viewed as an error signal, yet it contains higher-order relations:
the strength of excitation, the characteristics of the glottal volume-velocity waveform, the
shapes of the glottal pulses, and the variance of the vocal folds.
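The inverse-filtering step just described can be sketched as follows; the "vocal tract" here is a known AR(2) model rather than coefficients estimated from real speech, so the residue can be checked exactly against the excitation that generated the signal:

```python
# Sketch of inverse filtering: passing the signal through the FIR filter A(z)
# built from the LPC coefficients removes the modelled formant structure and
# leaves the residue (excitation-like error signal).
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
e = rng.standard_normal(10000)               # white "glottal" excitation
a = np.array([1.0, -0.75, 0.5])              # assumed vocal-tract model A(z)

x = lfilter([1.0], a, e)                     # synthesis: excitation -> AR signal
residue = lfilter(a, [1.0], x)               # inverse filtering with A(z)

# Because A(z) is exact here, the residue matches the original excitation.
print(np.max(np.abs(residue - e)) < 1e-9)    # True
```

With real speech, A(z) only approximates the vocal tract, so the residue retains the higher-order excitation properties the text mentions rather than being pure noise.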
V. EVALUATION
Evaluation was carried out in two ways.

Evaluation by listeners: A perception test was conducted and the Mean Opinion Score was taken.
The main objective of the perception test was to validate the recorded voices for emotion
recognition. The test involved 25 people from various backgrounds. Sentences were played to the
listeners in random order, and they were asked to identify the emotion expressed in each
utterance, choosing from a list of the 4 emotions plus neutral. The MOS of the test was then
calculated.
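Scoring such a perception test amounts to tallying listener choices against the intended emotion; the sketch below uses made-up responses purely for illustration (the real test had 25 listeners and the full sentence set):

```python
# Sketch: per-emotion recognition rate from perception-test responses.
# The response data below is invented solely to illustrate the tally.
from collections import Counter

def recognition_rates(trials):
    """trials: (intended_emotion, chosen_emotion) pairs, one per listener response."""
    total, correct = Counter(), Counter()
    for intended, chosen in trials:
        total[intended] += 1
        if intended == chosen:
            correct[intended] += 1
    return {e: 100.0 * correct[e] / total[e] for e in total}

trials = [("anger", "anger"), ("anger", "anger"),
          ("anger", "fear"),  ("anger", "anger"),
          ("sad", "sad"),     ("sad", "fear")]
print(recognition_rates(trials))             # {'anger': 75.0, 'sad': 50.0}
```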
Evaluation by classifier

Probabilistic Neural Network (PNN): The PNN is closely related to the Parzen-window Probability
Density Function (PDF) estimator: it consists of several sub-networks, each a Parzen-window PDF
estimator for one of the classes. The input nodes take the set of measurements. The second layer
consists of Gaussian functions centred on the given data points. The third layer averages the
outputs of the second layer for each class, and the fourth layer performs a vote, selecting the
largest value; the associated class label is then assigned.
y_j(x) = (1 / n_j) * sum_{i=1..n_j} exp( -||x_{j,i} - x||^2 / (2 * sigma^2) )    ---------(1)

where n_j denotes the number of data points in class j and sigma is the smoothing parameter of
the Gaussian kernel. The PNN assigns x to class k if y_k(x) > y_j(x) for all j in [1, M];
||x_{j,i} - x||^2 is calculated as the sum of squares.
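The four PNN layers and the decision rule of Eq. (1) can be sketched directly; the smoothing width sigma and the toy 2-D data below are assumptions for illustration, not the paper's feature vectors:

```python
# Sketch of a PNN following Eq. (1): one Gaussian per training point (pattern
# layer), class-wise averaging (summation layer), then an argmax decision.
import numpy as np

def pnn_classify(x, train_x, train_y, sigma=0.5):
    """Return the class whose averaged Gaussian response y_j(x) is largest."""
    scores = {}
    for cls in np.unique(train_y):
        pts = train_x[train_y == cls]                          # class centres
        d2 = np.sum((pts - x) ** 2, axis=1)                    # ||x_{j,i} - x||^2
        scores[cls] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))  # y_j(x)
    return max(scores, key=scores.get)                         # decision layer

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
train_y = np.array([0, 0, 1, 1])
print(pnn_classify(np.array([0.05, 0.1]), train_x, train_y))   # 0
print(pnn_classify(np.array([2.0, 2.1]), train_x, train_y))    # 1
```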
k-Nearest Neighbours: In pattern recognition, the k-Nearest Neighbours algorithm is a
non-parametric method used for classification; its output depends on the value of k.
In k-NN classification the output is a class membership: an object is classified by a majority
vote of its neighbours and assigned to the class most common among its k nearest neighbours
(k being a small positive integer). If k = 1, the object is simply assigned to the class of its
single nearest neighbour.
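The majority-vote rule just described can be sketched in a few lines; the 1-D data, labels and k = 3 are illustrative assumptions, not the paper's LPC feature vectors:

```python
# Sketch of the k-NN rule: majority vote among the k closest training points
# in (squared) Euclidean distance.
from collections import Counter
import numpy as np

def knn_classify(x, train_x, train_y, k=3):
    """Assign x the most common label among its k nearest neighbours."""
    d = np.sum((train_x - x) ** 2, axis=1)       # squared distances to all points
    nearest = np.argsort(d)[:k]                  # indices of the k closest points
    return Counter(train_y[nearest]).most_common(1)[0][0]

train_x = np.array([[0.0], [0.2], [0.3], [5.0], [5.1], [5.2]])
train_y = np.array(["sad", "sad", "sad", "happy", "happy", "happy"])
print(knn_classify(np.array([0.1]), train_x, train_y))    # sad
print(knn_classify(np.array([5.05]), train_x, train_y))   # happy
```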
[Results: two confusion matrices (rows: actual emotion; columns: predicted emotion, over
neutral, sadness, fear, anger and happy) were reported here for different LPC orders and values
of k, one of them labelled LPC = 500, K = 5. The column alignment of the matrices was lost in
extraction; the recoverable correct-classification entries range from roughly 20% to 70%.]
VII. CONCLUSION
In this paper, the prosodic and excitation features of Kannada speech have been analysed
from spoken sentences for the important categories of emotion. It has been observed that the
prosodic features (F0, A0, D), along with the excitation parameters (PSD, pressure and
vocal-fold variance), play a significant role in the expression of emotions. The database
created to express the emotions was then evaluated: the prosodic and excitation parameters were
used to train PNN and k-NN classifiers. The results show ambiguity in the detection of emotions
such as neutral, anger and happy versus sad and fear as the number of LPC coefficients and the
value of k vary. This work can be enhanced using MFCC, LFCC and PFCC features, and further
studies should be conducted on a database created from natural conversations.
REFERENCES
[1]  Takashi & Norman D. Cook, Identifying Emotion in Speech Prosody Using Acoustical Cues of
     Harmony, INTERSPEECH, ISCA, 2004.
[2]  Paul Boersma and David Weenink, Praat: doing phonetics by computer [Online], November 2009.
     URL http://www.fon.hum.uva.nl/praat/.
[3]  Sendlmeier, W. F., Kienast, M. and Paeschke, A., F0 Contours in Emotional Speech,
     Technische Universität Berlin, Proc. ICPhS, 1999.
[4]  Mozziconacci, S. J. L. and Hermes, D. J., Role of Intonational Patterns in Conveying
     Emotion in Speech, Proc. ICPhS, 1999.
[5]  Kwon, O. W., Chan, K. L., Hao, J., et al., Emotion Recognition by Speech Signals,
     Eurospeech, Geneva, Switzerland, 2003.
[6]  Rong, J., Li, G. and Chen, Y-P. P., Acoustic feature selection for automatic emotion
     recognition from speech, Journal of Information Processing and Management, 2009.
[7]  D. J. Ravi and Sudarshan Patilkulkarni, Kannada Text to Speech Synthesis Systems: Emotion
     Analysis, International Conference on Natural Language Processing (ICON-2009).
[8]  Sushma Bahuguna and Y. P. Raiwani, A Study of Acoustic Features Pattern of Emotion
     Expression for Hindi Speech, International Journal of Computer Engineering & Technology
     (IJCET).
[9]  J. Přibil and A. Přibilová, An Experiment with Evaluation of Emotional Speech Conversion
     by Spectrograms, Measurement Science Review, Volume 10, No. 3, 2010, Institute of Photonics
     and Electronics, Academy of Sciences CR, v.v.i., Prague, Czech Republic.
[10] Slobodan T. Jovičić, Zorka Kašić, Miodrag Đorđević and Mirjana Rajković, Serbian emotional
     speech database: design, processing and evaluation, SPECOM 2004: 9th Conference on Speech
     and Computer, St. Petersburg, Russia, 20-22 September 2004.
[11] Shashidhar G. Koolagudi and Rao Sreenivasa Krothapalli, Two stage emotion recognition
     based on speaking rate, Springer Science+Business Media, published online 11 December 2010.
[12] Shashidhar G. Koolagudi and K. Sreenivasa Rao, Emotion recognition from speech: a review,
     Springer Science+Business Media, published online 4 January 2012.
[13] Syed Abbas Ali, Sitwat Zehar, Mohsin Khan and Faisal Wahab, Development and Analysis of
     Speech Emotion Corpus using Prosodic Features for Cross Linguistics, International Journal
     of Scientific & Engineering Research, Vol. 4, Issue 1, January 2013, ISSN 2229-5518.