

Speech Emotion Recognition and its Applications


Jiun-Sheng Li*    Chu-Chuan Huang**    Shin-Tzen Sheu***    Ming-Wheng Lin****

*, **, ***, **** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute

Abstract
This paper proposes a speech emotion recognition method and its applications. Several speech features, including the pitch, formant, frame energy, and Mel-scale Frequency Cepstral Coefficients (MFCC), are extracted from the speech signal, and a Support Vector Machine (SVM) classifies each utterance into one of four emotions. The system achieved average recognition rates of 63.8% in outside tests and 86.8% in inside tests.

Prior work on speech emotion recognition includes Schuller et al. [5], X.H. Le et al. [6], and T.L. Pao et al. [7]. Commonly used speech features include the Mel-scale Frequency Cepstral Coefficients (MFCC) and the linear predictor coefficients (LPC). Classifiers applied to the task include a hybrid support vector machine and belief network architecture [5], the Hidden Markov Model (HMM) [2][3][7], and the Gaussian Mixture Model (GMM) [5][6][7]. Emotion-aware interfaces are also of interest from the human-factors perspective of pleasure in product use [1].

The features commonly used for emotion recognition fall into two groups:

1. prosodic (Prosody) features, such as the pitch and energy [2-5];

2. spectral features, such as the formant [4].

The proposed system extracts the pitch, formant, frame energy, and MFCC features as follows.

1. Windowing and framing

Before any speech features are extracted, the signal x(n) is segmented by a sliding window (windowing) into frames of 15~30 ms, and the feature parameters are computed frame by frame; the window advances 5~20 ms between frames. At the 16 kHz sampling rate used here, one sample lasts 62.5 microseconds, so a 256-point frame spans 16 ms. Consecutive frames overlap by one half frame, i.e., 128 points or 8 ms, and each frame keeps 256 points.
2. Pitch

The pitch is the fundamental frequency f of voiced speech; for example, the musical note A above middle C corresponds to 440 Hz. The pitch is estimated with the auto-correlation function (ACF): for each 256-point frame, the ACF is computed, and the lag of its largest local maximum gives the pitch period, from which the pitch follows as the sampling rate divided by that lag.
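The ACF-based pitch search might look as follows; the search bounds of 60~500 Hz are an assumption, not values given in the paper:

```python
import numpy as np

def estimate_pitch(frame, fs=16000, fmin=60.0, fmax=500.0):
    """Pitch of one frame via the auto-correlation function (ACF).

    The lag of the largest ACF value inside a plausible pitch range is
    taken as the pitch period; fmin/fmax are assumed bounds.
    """
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag search range
    lag = lo + int(np.argmax(acf[lo:hi]))
    return fs / lag                           # pitch in Hz
```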

3. Formant

Formants are the resonance frequencies of the vocal tract, visible as peaks of the spectral envelope. The first two formants F1 and F2 are used as features, searched over the 20 Hz ~ 4 kHz band.
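The paper does not describe how the formants are located; a common method, shown here only as an illustration and not necessarily the authors' choice, is LPC analysis followed by root-finding on the predictor polynomial:

```python
import numpy as np

def formants_lpc(frame, fs=16000, order=10, n_formants=2):
    """Estimate formants of a (pre-windowed) frame by LPC root-finding."""
    # Autocorrelation method for the LPC coefficients; a direct linear
    # solve of the normal equations is enough for a sketch.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])         # predictor coefficients
    roots = np.roots(np.concatenate(([1.0], -a)))  # poles of 1/A(z)
    roots = roots[np.imag(roots) > 0]              # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 90][:n_formants]          # F1, F2 (90 Hz floor assumed)
```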

4. Frame energy

The frame energy measures the intensity of the speech within each N-sample frame:

$$E_x(m)=\sum_{n=m-N+1}^{m}x(n)^2 \qquad (2.4.1)$$

and its logarithm is also taken as a feature:

$$E_{Lx}(m)=\log\sum_{n=m-N+1}^{m}x(n)^2 \qquad (2.4.2)$$
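Equations (2.4.1) and (2.4.2) translate directly into code; the small epsilon guarding the logarithm is an implementation assumption:

```python
import numpy as np

def frame_energy(frames):
    """Energy (2.4.1) and log energy (2.4.2) of each N-sample frame."""
    E = np.sum(frames ** 2, axis=1)   # E_x(m), one value per frame
    return E, np.log(E + 1e-12)       # E_Lx(m); epsilon avoids log(0)
```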

5. Mel-scale Frequency Cepstral Coefficients (MFCC)

Human pitch perception is roughly linear below 1 kHz and logarithmic above it, so the spectrum is warped onto the mel-frequency scale before the cepstrum is computed with a discrete cosine transform (DCT). A bank of M triangular filters is placed on the spectrum: below 1 kHz the filter centers are spaced linearly every 100 Hz, and from 1 kHz up to 4 kHz they are spaced logarithmically, giving the 20 center frequencies 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1148, 1318, 1514, 1737, 1995, 2291, 2630, 3020, 3467, and 4000 Hz. The m-th triangular filter is

$$B_m(k)=\begin{cases}0, & k<f_{m-1}\\[4pt] \dfrac{k-f_{m-1}}{f_m-f_{m-1}}, & f_{m-1}\le k\le f_m\\[4pt] \dfrac{f_{m+1}-k}{f_{m+1}-f_m}, & f_m\le k\le f_{m+1}\\[4pt] 0, & f_{m+1}<k\end{cases} \qquad (2.5.1)$$

where f_m is the center frequency of the m-th filter, f_{m-1} and f_{m+1} are the center frequencies of its neighbours, and M is the number of filters.

The log energy of each filter output is

$$Y(m)=\log\sum_{k=f_{m-1}}^{f_{m+1}}\left|X(k)\right|^{2}B_m(k), \qquad 1\le m\le M \qquad (2.5.2)$$

where X(k) is the discrete Fourier transform of the windowed frame x(n). The M log filter-bank energies are then decorrelated with a discrete cosine transform (discrete cosine transform, DCT):

$$c_x(n)=\sqrt{\frac{2}{M}}\sum_{m=1}^{M}Y(m)\cos\!\left(\frac{\pi\,n\,(m-\tfrac{1}{2})}{M}\right) \qquad (2.5.3)$$

The coefficients c_x(n) are the Mel-frequency cepstral coefficients (MFCC) of the frame x(n); the first 13 of them, n = 1, 2, 3, ..., 13, are kept as features.
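Putting Eqs. (2.5.1)-(2.5.3) together, here is a sketch of the MFCC computation using the paper's 20 center frequencies; the filter-bank edge frequencies (0 Hz below the first filter, an assumed 4600 Hz above the last) and the log epsilon are assumptions:

```python
import numpy as np

# Filter center frequencies from the paper (Hz), 20 filters up to 4 kHz.
CENTERS_HZ = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1148,
              1318, 1514, 1737, 1995, 2291, 2630, 3020, 3467, 4000]

def mfcc(frame, fs=16000, n_fft=256, n_coef=13):
    """MFCCs of one windowed frame per Eqs. (2.5.1)-(2.5.3)."""
    edges = np.array([0] + CENTERS_HZ + [4600])        # f_0, f_{M+1}: assumed
    bins = np.floor(n_fft * edges / fs).astype(int)    # center freqs -> FFT bins
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2      # |X(k)|^2
    M = len(CENTERS_HZ)
    Y = np.empty(M)
    for m in range(1, M + 1):                          # triangular filters (2.5.1)
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        k = np.arange(left, right + 1)
        B = np.where(k <= center,
                     (k - left) / max(center - left, 1),
                     (right - k) / max(right - center, 1))
        Y[m - 1] = np.log(np.sum(spec[k] * B) + 1e-12)  # log energy (2.5.2)
    n = np.arange(1, n_coef + 1)[:, None]
    mvec = np.arange(1, M + 1)[None, :]
    # DCT of the log filter-bank energies (2.5.3); keep 13 coefficients.
    return np.sqrt(2.0 / M) * np.sum(
        Y * np.cos(np.pi * n * (mvec - 0.5) / M), axis=1)
```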

[Figure: the triangular mel filter bank -- gain (0~1) versus frequency (Hz), 0~4000 Hz]

[Figure: speech data -- waveform level versus time (sec)]

The pitch, energy, formant, and 13 MFCC features described above are combined into a 23-dimensional feature vector, and a support vector machine (SVM) [8] classifies each utterance into one of the four emotions.
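A hedged sketch of the classification stage: scikit-learn's SVC wraps LIBSVM [8], but the kernel, its parameters, and the feature scaling shown here are assumptions, since the paper does not state them:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# X_*: one 23-dimensional feature vector per utterance; y_*: emotion labels.
# RBF kernel, C, gamma, and standardization are assumed, not from the paper.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

def train_and_test(X_train, y_train, X_test, y_test):
    """Fit the SVM and return the recognition rate on the test set."""
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)
```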

[Figure 1: the waveform of an utterance and one extracted frame]

[Figure 2: the auto-correlation function (ACF) of a speech frame]

[Figure 3: formant analysis -- waveform and energy]

The speech database was recorded from 20 speakers, P1~P10 and P11~P20, aged 25 to 35. Each speaker produced 30 utterances for each of four emotion classes, for a total of 20 x 4 x 30 = 2400 utterances. In the outside test, the classifier was trained on 19 speakers and tested on the held-out speaker, rotating through all 20 speakers; the per-speaker recognition rates ranged from 35% to 86%, with an average of 63.8% (Table 2). In the inside test, the rates ranged from 70% to 94%, with an average of 86.75% (Table 3). The confusion matrices (Confusion Matrix) of the two tests are given in Tables 4 and 5.


Table 2. Per-speaker recognition rates of the outside test

P1  74.17%   P2  50%      P3  64.16%   P4  78.33%   P5  70%
P6  78%      P7  62.5%    P8  51.67%   P9  85.83%   P10 70%
P11 85%      P12 82.5%    P13 55.83%   P14 72.5%    P15 57.5%
P16 50.83%   P17 35%      P18 62.5%    P19 51.67%   P20 37.5%

Average: 63.79%

Table 3. Per-speaker recognition rates of the inside test

P1  78.33%   P2  93.33%   P3  78.33%   P4  91.67%   P5  88.33%
P6  90%      P7  78.33%   P8  86.67%   P9  93.33%   P10 88.33%
P11 91.67%   P12 90%      P13 91.67%   P14 85%      P15 88%
P16 85%      P17 95%      P18 81.67%   P19 90%      P20 70%

Average: 86.75%


Table 4. Confusion matrix of the outside test

401  130   20   32
379   14   32   18
445   30   18  246
 49   22  105  306

Table 5. Confusion matrix of the inside test

264   34    1   39
253    5    2    2
260    2    1   33
  1    3   36  264

The training and testing data were partitioned as follows:

Outside test -- training: 19 (speakers) x 4 (emotions) x 30 (utterances) = 2280; testing: 1 (speaker) x 4 (emotions) x 30 (utterances) = 120.

Inside test -- training: 19 (speakers) x 4 (emotions) x 15 (utterances) = 1140; testing: 1 (speaker) x 4 (emotions) x 15 (utterances) = 60.
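The outside-test protocol can be expressed as a leave-one-speaker-out loop; `train_and_test` is the hypothetical helper from the classification sketch above:

```python
import numpy as np

def outside_test(X, y, speakers, train_and_test):
    """Leave-one-speaker-out protocol of the outside test.

    For each of the 20 speakers, train on the other 19 speakers'
    2280 utterances and test on the held-out speaker's 120.
    """
    rates = []
    for s in np.unique(speakers):
        test = speakers == s
        rates.append(train_and_test(X[~test], y[~test], X[test], y[test]))
    return np.mean(rates)   # average per-speaker recognition rate
```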

Conclusions

A speech emotion recognition system based on pitch, formant, frame energy, and MFCC features with an SVM classifier was presented. The average recognition rate was 63.8% in the outside test and 86.8% in the inside test, with per-speaker inside-test rates ranging from 70% to 94%.

References

[1] P.W. Jordan (1998). Human factors for pleasure in product use, Applied Ergonomics, Elsevier Science Ltd, vol. 29, pp. 25-33.

[2] D.N. Jiang and L.H. Cai (2004). Speech Emotion Classification with the Combination of Statistic Features and Temporal Features, Proceedings of IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, vol. 3, pp. 1967-1970.

[3] B. Schuller, G. Rigoll and M. Lang (2003). Hidden Markov Model-based Speech Emotion Recognition, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China, vol. 2, pp. 1-4.

[4] D. Ververidis, C. Kotropoulos and I. Pitas (2004). Automatic Emotional Speech Classification, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada, vol. 1, pp. 593-596.

[5] B. Schuller, G. Rigoll and M. Lang (2004). Speech Emotion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine - Belief Network Architecture, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada, vol. 1, pp. 577-580.

[6] X.H. Le, G. Quénot and E. Castelli (2004). Recognizing Emotions for Audio-Visual Document Indexing, Proceedings of 9th Symposium on Computers and Communications, Alexandria, Egypt, vol. 2, pp. 580-584.

[7] T.L. Pao, Y.T. Chen and J.H. Yeh (2004). Emotion Recognition from Mandarin Speech Signals, Proceedings of IEEE International Symposium on Chinese Spoken Language Processing, Hong Kong, pp. 301-304.

[8] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Speech Emotion Recognition and its Applications


Jiun-Sheng Li*

Chu-Chuan Huang**

Shin-Tzen Sheu***

Ming-Wheng Lin****

*Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute,
lgs@itri.org.tw
** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute,
chuchuan@itri.org.tw
*** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute,
SerinaSheu@itri.org.tw
**** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute,
lmw@itri.org.tw

Abstract
This paper proposes a speech emotion recognition method and its applications. Several
speech features, such as pitch, formant, frame energy, and Mel-scale frequency cepstral
coefficients (MFCC), are considered in the proposed system. A support vector machine (SVM)
is used to classify speech into four emotions. The experimental results show that the proposed
system achieved recognition rates of 63.8% in outside tests and 86.8% in inside tests.
Keywords: Speech Recognition, Emotion Recognition, Speech Emotion

Вам также может понравиться