D-03
Chu-Chuan Huang**, Shin-Tzen Sheu***, Ming-Wheng Lin****
Prior studies of speech emotion recognition, such as Schuller et al. [5], X.H. Le et al. [6], and T.L. Pao et al. [7], have commonly used spectral features such as Mel-scale frequency cepstral coefficients (MFCC) and linear predictor coefficients (LPC) [5], together with classifiers such as the hidden Markov model (HMM) [2][3][7] and the Gaussian mixture model (GMM) [5][6][7] [1].
Two groups of features are adopted in this work:
1. prosodic features (prosody), namely pitch and energy [2-5];
2. formant features [4].
The extraction of each feature is described below.
1. Windowing

Speech features are extracted frame by frame: because speech is only short-time stationary, the signal x(n) is first multiplied by a sliding window (windowing) that divides it into frames of 15~30 ms, with the feature parameters updated every 5~20 ms. At a 16 kHz sampling rate one sample lasts 62.5 μs, so a frame of 256 samples spans 16 ms; adjacent frames overlap by 1/2 frame, i.e. the frame shift is 8 ms (128 samples) and each frame contains 256 samples.

The energy of the m-th frame of x(n) is

    E_x(m) = \sum_{n=(m-1)N+1}^{mN} x(n)^2                      (2.4.1)

and its logarithmic form is

    E_x^L(m) = \log \sum_{n=(m-1)N+1}^{mN} x(n)^2               (2.4.2)

where N is the frame length and E_x(m) is the frame energy.
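A minimal sketch of this framing and frame-energy computation, assuming NumPy; the small epsilon guarding log(0) is an implementation detail, not from the paper:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split x into 256-sample frames (16 ms at 16 kHz) with a
    1/2-frame overlap, i.e. an 8 ms (128-sample) frame shift."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[m * hop : m * hop + frame_len] for m in range(n_frames)])

def log_frame_energy(frames):
    """Frame energy (2.4.1) and its logarithm (2.4.2), one value per frame."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)   # E_x(m)
    return np.log(energy + 1e-12)                        # E_x^L(m); eps avoids log(0)
```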
The Mel-frequency scale reflects human auditory perception: below 1 kHz the perceived scale is approximately linear in frequency, while above 1 kHz it becomes logarithmic. On the mel-frequency scale, the 0~4 kHz band is covered by 20 triangular filters with center frequencies 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1148, 1318, 1514, 1737, 1995, 2291, 2630, 3020, 3467, and 4000 Hz; the m-th filter, the log filter energies, and the discrete cosine transform (DCT) that yields the cepstral coefficients are given in (2.5.1)~(2.5.3) below.

2. Pitch

The pitch of a voiced sound is determined by its fundamental frequency f0; for example, the musical note A above middle C corresponds to 440 Hz. For each frame of 256 samples, the autocorrelation function (ACF) is computed, and the lag n of the local maximum of the ACF gives the pitch period, whose reciprocal is the pitch (Figure 2).
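A minimal sketch of ACF-based pitch estimation as described above; the search bounds fmin/fmax are illustrative assumptions (the paper does not state them), and the global maximum within that lag range stands in for the local-maximum search:

```python
import numpy as np

def pitch_acf(frame, fs=16000, fmin=50.0, fmax=500.0):
    """Estimate the pitch of one 256-sample frame via the ACF."""
    frame = frame - np.mean(frame)           # remove DC offset
    acf = np.correlate(frame, frame, mode="full")
    acf = acf[len(frame) - 1:]               # keep non-negative lags only
    lo = int(fs / fmax)                      # smallest plausible pitch period
    hi = min(int(fs / fmin), len(acf) - 1)   # largest plausible pitch period
    lag = lo + int(np.argmax(acf[lo:hi]))    # lag of the ACF peak
    return fs / lag                          # pitch in Hz
```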
3. Formant

Formants are the resonant frequencies of the vocal tract and appear as peaks of the spectral envelope. The first and second formants, F1 and F2, are extracted from the spectrum over the 20 Hz~4 kHz band (Figure 3).
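The paper does not spell out its formant-extraction method; picking the strongest peaks of the frame's power spectrum over the 20 Hz~4 kHz band is one simple stand-in, sketched here purely for illustration:

```python
import numpy as np

def formants_spectral_peaks(frame, fs=16000, n_fft=256, n_formants=2):
    """Rough F1/F2 estimate by picking power-spectrum peaks (illustrative only)."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = np.where((freqs >= 20) & (freqs <= 4000))[0]  # search band from the text
    peaks = [k for k in band[1:-1]
             if power[k] > power[k - 1] and power[k] > power[k + 1]]
    peaks.sort(key=lambda k: power[k], reverse=True)     # strongest peaks first
    return sorted(freqs[k] for k in peaks[:n_formants])  # ascending: ~F1, F2
```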
Returning to the MFCC computation, the frequency response of the m-th triangular filter is

    B_m(k) =
    \begin{cases}
      0, & k < f_{m-1} \\
      \dfrac{k - f_{m-1}}{f_m - f_{m-1}}, & f_{m-1} \le k \le f_m \\
      \dfrac{f_{m+1} - k}{f_{m+1} - f_m}, & f_m \le k \le f_{m+1} \\
      0, & f_{m+1} < k
    \end{cases}                                                   (2.5.1)

where f_m is the center frequency of the m-th filter, f_{m-1} and f_{m+1} are its lower and upper edge frequencies, and there are M filters in total (Figure 1).

The log energy of the m-th filter output is

    Y(m) = \log \sum_{k=f_{m-1}}^{f_{m+1}} X(k)^2 B_m(k),  1 \le m \le M   (2.5.2)

where X(k) is the spectrum of the frame x(n). A discrete cosine transform (DCT) of the M log filter energies then gives

    c_x(n) = \sum_{m=1}^{M} Y(m) \cos\!\left(\frac{\pi n (m - \frac{1}{2})}{M}\right)   (2.5.3)

where c_x(n) are the Mel-frequency cepstral coefficients (MFCC) of x(n); the first 13 coefficients, n = 1, 2, 3, ..., 13, are used.

(Figure 1. Triangular filter bank: gain versus frequency, 0~4000 Hz.)

4. Frame energy

The intensity of each frame is measured by its frame energy E_x(m), as defined in (2.4.1).
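A compact sketch of the MFCC pipeline of (2.5.1)~(2.5.3), using the 20 center frequencies listed above. Mapping center frequencies to FFT bins in Hz, a 0 Hz lower edge for the first filter, and a truncated (half-triangle) last filter are assumptions made here for illustration, not details from the paper:

```python
import numpy as np

# Center frequencies (Hz) of the 20 triangular filters listed in the text.
CENTERS = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
                    1148, 1318, 1514, 1737, 1995, 2291, 2630, 3020, 3467, 4000])

def mfcc_frame(frame, fs=16000, n_fft=256, n_coef=13):
    """MFCC of one frame per (2.5.1)-(2.5.3); first 13 coefficients returned."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2       # |X(k)|^2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)           # FFT bin -> Hz
    M = len(CENTERS)
    edges = np.concatenate(([0.0], CENTERS))             # f_0 .. f_M
    Y = np.empty(M)
    for m in range(1, M + 1):
        f_lo, f_c = edges[m - 1], edges[m]
        f_hi = edges[m + 1] if m < M else edges[m]       # last filter truncated at 4 kHz
        B = np.zeros_like(freqs)
        rise = (freqs >= f_lo) & (freqs <= f_c)
        B[rise] = (freqs[rise] - f_lo) / (f_c - f_lo)    # rising edge of (2.5.1)
        if f_hi > f_c:
            fall = (freqs > f_c) & (freqs <= f_hi)
            B[fall] = (f_hi - freqs[fall]) / (f_hi - f_c)  # falling edge of (2.5.1)
        Y[m - 1] = np.log(np.sum(power * B) + 1e-12)     # (2.5.2)
    n = np.arange(1, n_coef + 1)[:, None]                # n = 1..13
    m_idx = np.arange(1, M + 1)[None, :]
    return (Y[None, :] * np.cos(np.pi * n * (m_idx - 0.5) / M)).sum(axis=1)  # (2.5.3)
```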
(Speech data: waveform level versus time (sec).)
The pitch, energy, and formant features are combined with the 13 MFCC into a 23-dimensional feature vector for each utterance, which is classified into one of the four emotions by a support vector machine (SVM) [8].
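A minimal sketch of the classification stage, assuming scikit-learn's SVC. The random arrays are stand-ins for real 23-dimensional feature vectors, and the RBF kernel and its parameters are illustrative assumptions, not settings from the paper:

```python
import numpy as np
from sklearn.svm import SVC

# X: one 23-dimensional feature vector per utterance (pitch, energy,
# formant features plus 13 MFCC); y: emotion labels 0..3.
X_train = np.random.rand(2280, 23)             # stand-in for real training features
y_train = np.random.randint(0, 4, 2280)        # four emotion classes
X_test = np.random.rand(120, 23)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # multi-class handled one-vs-one
clf.fit(X_train, y_train)
pred = clf.predict(X_test)                     # predicted emotion per utterance
```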
(One 16 ms frame of the speech waveform: level versus time (sec).)
(Figure 2. ACF of a frame.)
(Figure 3. Formant.)
(Speech waveform and frame energy: amplitude versus frame index.)
The recognition results are summarized as confusion matrices in Tables 4 and 5, using the energy, MFCC, and the other features described above. Speech data were collected from 20 speakers, P1~P10 and P11~P20, aged 25 to 35, with 30 utterances per speaker for each of the four emotions: 20 (speakers) × 4 (emotions) × 30 (utterances) = 2,400 utterances in total. In the outside test the system was trained on 19 speakers and tested on the remaining one, rotating over all 20 speakers (a sketch of this protocol follows); the per-speaker recognition rates are given in Table 2.
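A minimal sketch of the leave-one-speaker-out ("outside") evaluation just described, assuming scikit-learn's SVC; array names, shapes, and the kernel choice are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def outside_test(features, labels, speakers):
    """Leave-one-speaker-out evaluation: train on 19 speakers, test on 1.

    features: (2400, 23) feature matrix; labels: emotion ids 0..3;
    speakers: speaker ids for P1..P20 encoded as integers.
    """
    rates = {}
    for s in np.unique(speakers):
        train, test = speakers != s, speakers == s      # hold out one speaker
        clf = SVC(kernel="rbf").fit(features[train], labels[train])
        rates[s] = np.mean(clf.predict(features[test]) == labels[test])
    return rates                                        # per-speaker rates as in Table 2
```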
Table 2. Recognition rates of the outside test (average: 63.79%).

P1  74.17%    P2  50.00%    P3  64.16%    P4  78.33%    P5  70.00%
P6  78.00%    P7  62.50%    P8  51.67%    P9  85.83%    P10 70.00%
P11 85.00%    P12 82.50%    P13 55.83%    P14 72.50%    P15 57.50%
P16 50.83%    P17 35.00%    P18 62.50%    P19 51.67%    P20 37.50%
Table 3. Recognition rates of the inside test (average: 86.75%).

P1  78.33%    P2  93.33%    P3  78.33%    P4  91.67%    P5  88.33%
P6  90.00%    P7  78.33%    P8  86.67%    P9  93.33%    P10 88.33%
P11 91.67%    P12 90.00%    P13 91.67%    P14 85.00%    P15 88.00%
P16 85.00%    P17 95.00%    P18 81.67%    P19 90.00%    P20 70.00%
(Table 4. Confusion matrix of the outside test: the four emotions were correctly recognized in 401, 379, 445, and 306 of 600 utterances each, 1,531 of 2,400 overall, i.e. 63.79%.)

(Table 5. Confusion matrix of the inside test: the four emotions were correctly recognized in 264, 253, 260, and 264 of 300 utterances each, 1,041 of 1,200 overall, i.e. 86.75%.)
For the outside test, each round used 19 (speakers) × 4 (emotions) × 30 (utterances) = 2,280 utterances for training and 1 (speaker) × 4 (emotions) × 30 (utterances) = 120 for testing. For the inside test, each round used 1,140 utterances for training and 1 (speaker) × 4 (emotions) × 15 (utterances) = 60 for testing.
In conclusion, the proposed system recognizes emotion from speech using pitch, formant, frame energy, and MFCC features with an SVM classifier, and the experiments showed recognition rates of 63%~86% and 70%~94% across the test conditions.
Chu-Chuan Huang**, Shin-Tzen Sheu***, Ming-Wheng Lin****
* Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, lgs@itri.org.tw
** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, chuchuan@itri.org.tw
*** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, SerinaSheu@itri.org.tw
**** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, lmw@itri.org.tw
Abstract
This paper proposes a speech emotion recognition method and its applications. Several speech features, such as pitch, formants, frame energy, and Mel-scale frequency cepstral coefficients (MFCC), are considered in the proposed system. A support vector machine (SVM) is used to classify speech into four emotions. The experimental results show that the proposed system achieved recognition rates of 63.8% in outside tests and 86.8% in inside tests.
Keywords: Speech Recognition, Emotion Recognition, Speech Emotion