D-03
Chu-Chuan Huang**, Shin-Tzen Sheu***, Ming-Wheng Lin****
Prior studies of speech emotion recognition, such as Schuller et al. [5], X.H. Le et al. [6], and T.L. Pao et al. [7], have commonly used spectral features such as Mel-scale frequency cepstral coefficients (MFCC) and linear predictor coefficients (LPC) [5], together with classifiers such as the hidden Markov model (HMM) [2][3][7] and the Gaussian mixture model (GMM) [5][6][7] [1].
Two groups of features are adopted in this work:
1. prosodic features (prosody), namely pitch and energy [2-5];
2. formant features [4].
The extraction of each feature is described below.
1. Windowing

Speech features are extracted frame by frame: because speech is only short-time stationary, the signal x(n) is first multiplied by a sliding window (windowing) that divides it into frames of 15~30 ms, with the feature parameters updated every 5~20 ms. At a 16 kHz sampling rate one sample lasts 62.5 μs, so a frame of 256 samples spans 16 ms; adjacent frames overlap by 1/2 frame, i.e. the frame shift is 8 ms (128 samples) and each frame contains 256 samples.

The energy of the m-th frame of x(n) is

    E_x(m) = \sum_{n=(m-1)N+1}^{mN} x(n)^2                      (2.4.1)

and its logarithmic form is

    E_x^L(m) = \log \sum_{n=(m-1)N+1}^{mN} x(n)^2               (2.4.2)

where N is the frame length and E_x(m) is the frame energy.
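A minimal sketch of this framing and frame-energy computation, assuming NumPy; the small epsilon guarding log(0) is an implementation detail, not from the paper:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split x into 256-sample frames (16 ms at 16 kHz) with a
    1/2-frame overlap, i.e. an 8 ms (128-sample) frame shift."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[m * hop : m * hop + frame_len] for m in range(n_frames)])

def log_frame_energy(frames):
    """Frame energy (2.4.1) and its logarithm (2.4.2), one value per frame."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)   # E_x(m)
    return np.log(energy + 1e-12)                        # E_x^L(m); eps avoids log(0)
```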
The Mel-frequency scale reflects human auditory perception: below 1 kHz the perceived scale is approximately linear in frequency, while above 1 kHz it becomes logarithmic. On the mel-frequency scale, the 0~4 kHz band is covered by 20 triangular filters with center frequencies 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1148, 1318, 1514, 1737, 1995, 2291, 2630, 3020, 3467, and 4000 Hz; the m-th filter, the log filter energies, and the discrete cosine transform (DCT) that yields the cepstral coefficients are given in (2.5.1)~(2.5.3) below.

2. Pitch

The pitch of a voiced sound is determined by its fundamental frequency f0; for example, the musical note A above middle C corresponds to 440 Hz. For each frame of 256 samples, the autocorrelation function (ACF) is computed, and the lag n of the local maximum of the ACF gives the pitch period, whose reciprocal is the pitch (Figure 2).
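A minimal sketch of ACF-based pitch estimation as described above; the search bounds fmin/fmax are illustrative assumptions (the paper does not state them), and the global maximum within that lag range stands in for the local-maximum search:

```python
import numpy as np

def pitch_acf(frame, fs=16000, fmin=50.0, fmax=500.0):
    """Estimate the pitch of one 256-sample frame via the ACF."""
    frame = frame - np.mean(frame)           # remove DC offset
    acf = np.correlate(frame, frame, mode="full")
    acf = acf[len(frame) - 1:]               # keep non-negative lags only
    lo = int(fs / fmax)                      # smallest plausible pitch period
    hi = min(int(fs / fmin), len(acf) - 1)   # largest plausible pitch period
    lag = lo + int(np.argmax(acf[lo:hi]))    # lag of the ACF peak
    return fs / lag                          # pitch in Hz
```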
3. Formant

Formants are the resonant frequencies of the vocal tract and appear as peaks of the spectral envelope. The first and second formants, F1 and F2, are extracted from the spectrum over the 20 Hz~4 kHz band (Figure 3).
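The paper does not spell out its formant-extraction method; picking the strongest peaks of the frame's power spectrum over the 20 Hz~4 kHz band is one simple stand-in, sketched here purely for illustration:

```python
import numpy as np

def formants_spectral_peaks(frame, fs=16000, n_fft=256, n_formants=2):
    """Rough F1/F2 estimate by picking power-spectrum peaks (illustrative only)."""
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = np.where((freqs >= 20) & (freqs <= 4000))[0]  # search band from the text
    peaks = [k for k in band[1:-1]
             if power[k] > power[k - 1] and power[k] > power[k + 1]]
    peaks.sort(key=lambda k: power[k], reverse=True)     # strongest peaks first
    return sorted(freqs[k] for k in peaks[:n_formants])  # ascending: ~F1, F2
```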
Returning to the MFCC computation, the frequency response of the m-th triangular filter is

    B_m(k) =
    \begin{cases}
      0, & k < f_{m-1} \\
      \dfrac{k - f_{m-1}}{f_m - f_{m-1}}, & f_{m-1} \le k \le f_m \\
      \dfrac{f_{m+1} - k}{f_{m+1} - f_m}, & f_m \le k \le f_{m+1} \\
      0, & f_{m+1} < k
    \end{cases}                                                   (2.5.1)

where f_m is the center frequency of the m-th filter, f_{m-1} and f_{m+1} are its lower and upper edge frequencies, and there are M filters in total (Figure 1).

The log energy of the m-th filter output is

    Y(m) = \log \sum_{k=f_{m-1}}^{f_{m+1}} X(k)^2 B_m(k),  1 \le m \le M   (2.5.2)

where X(k) is the spectrum of the frame x(n). A discrete cosine transform (DCT) of the M log filter energies then gives

    c_x(n) = \sum_{m=1}^{M} Y(m) \cos\!\left(\frac{\pi n (m - \frac{1}{2})}{M}\right)   (2.5.3)

where c_x(n) are the Mel-frequency cepstral coefficients (MFCC) of x(n); the first 13 coefficients, n = 1, 2, 3, ..., 13, are used.

(Figure 1. Triangular filter bank: gain versus frequency, 0~4000 Hz.)

4. Frame energy

The intensity of each frame is measured by its frame energy E_x(m), as defined in (2.4.1).
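A compact sketch of the MFCC pipeline of (2.5.1)~(2.5.3), using the 20 center frequencies listed above. Mapping center frequencies to FFT bins in Hz, a 0 Hz lower edge for the first filter, and a truncated (half-triangle) last filter are assumptions made here for illustration, not details from the paper:

```python
import numpy as np

# Center frequencies (Hz) of the 20 triangular filters listed in the text.
CENTERS = np.array([100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
                    1148, 1318, 1514, 1737, 1995, 2291, 2630, 3020, 3467, 4000])

def mfcc_frame(frame, fs=16000, n_fft=256, n_coef=13):
    """MFCC of one frame per (2.5.1)-(2.5.3); first 13 coefficients returned."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2       # |X(k)|^2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)           # FFT bin -> Hz
    M = len(CENTERS)
    edges = np.concatenate(([0.0], CENTERS))             # f_0 .. f_M
    Y = np.empty(M)
    for m in range(1, M + 1):
        f_lo, f_c = edges[m - 1], edges[m]
        f_hi = edges[m + 1] if m < M else edges[m]       # last filter truncated at 4 kHz
        B = np.zeros_like(freqs)
        rise = (freqs >= f_lo) & (freqs <= f_c)
        B[rise] = (freqs[rise] - f_lo) / (f_c - f_lo)    # rising edge of (2.5.1)
        if f_hi > f_c:
            fall = (freqs > f_c) & (freqs <= f_hi)
            B[fall] = (f_hi - freqs[fall]) / (f_hi - f_c)  # falling edge of (2.5.1)
        Y[m - 1] = np.log(np.sum(power * B) + 1e-12)     # (2.5.2)
    n = np.arange(1, n_coef + 1)[:, None]                # n = 1..13
    m_idx = np.arange(1, M + 1)[None, :]
    return (Y[None, :] * np.cos(np.pi * n * (m_idx - 0.5) / M)).sum(axis=1)  # (2.5.3)
```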
(Speech data: waveform level versus time (sec).)
The pitch, energy, and formant features are combined with the 13 MFCC into a 23-dimensional feature vector for each utterance, which is classified into one of the four emotions by a support vector machine (SVM) [8].
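A minimal sketch of the classification stage, assuming scikit-learn's SVC. The random arrays are stand-ins for real 23-dimensional feature vectors, and the RBF kernel and its parameters are illustrative assumptions, not settings from the paper:

```python
import numpy as np
from sklearn.svm import SVC

# X: one 23-dimensional feature vector per utterance (pitch, energy,
# formant features plus 13 MFCC); y: emotion labels 0..3.
X_train = np.random.rand(2280, 23)             # stand-in for real training features
y_train = np.random.randint(0, 4, 2280)        # four emotion classes
X_test = np.random.rand(120, 23)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # multi-class handled one-vs-one
clf.fit(X_train, y_train)
pred = clf.predict(X_test)                     # predicted emotion per utterance
```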
(One 16 ms frame of the speech waveform: level versus time (sec).)
(Figure 2. ACF of a frame.)
(Figure 3. Formant.)
(Speech waveform and frame energy: amplitude versus frame index.)
The recognition results are summarized as confusion matrices in Tables 4 and 5, using the energy, MFCC, and the other features described above. Speech data were collected from 20 speakers, P1~P10 and P11~P20, aged 25 to 35, with 30 utterances per speaker for each of the four emotions: 20 (speakers) × 4 (emotions) × 30 (utterances) = 2,400 utterances in total. In the outside test the system was trained on 19 speakers and tested on the remaining one, rotating over all 20 speakers (a sketch of this protocol follows); the per-speaker recognition rates are given in Table 2.
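A minimal sketch of the leave-one-speaker-out ("outside") evaluation just described, assuming scikit-learn's SVC; array names, shapes, and the kernel choice are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def outside_test(features, labels, speakers):
    """Leave-one-speaker-out evaluation: train on 19 speakers, test on 1.

    features: (2400, 23) feature matrix; labels: emotion ids 0..3;
    speakers: speaker ids for P1..P20 encoded as integers.
    """
    rates = {}
    for s in np.unique(speakers):
        train, test = speakers != s, speakers == s      # hold out one speaker
        clf = SVC(kernel="rbf").fit(features[train], labels[train])
        rates[s] = np.mean(clf.predict(features[test]) == labels[test])
    return rates                                        # per-speaker rates as in Table 2
```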
Table 2. Recognition rates of the outside test (average: 63.79%).

P1  74.17%    P2  50.00%    P3  64.16%    P4  78.33%    P5  70.00%
P6  78.00%    P7  62.50%    P8  51.67%    P9  85.83%    P10 70.00%
P11 85.00%    P12 82.50%    P13 55.83%    P14 72.50%    P15 57.50%
P16 50.83%    P17 35.00%    P18 62.50%    P19 51.67%    P20 37.50%
Table 3. Recognition rates of the inside test (average: 86.75%).

P1  78.33%    P2  93.33%    P3  78.33%    P4  91.67%    P5  88.33%
P6  90.00%    P7  78.33%    P8  86.67%    P9  93.33%    P10 88.33%
P11 91.67%    P12 90.00%    P13 91.67%    P14 85.00%    P15 88.00%
P16 85.00%    P17 95.00%    P18 81.67%    P19 90.00%    P20 70.00%
(Table 4. Confusion matrix of the outside test: the four emotions were correctly recognized in 401, 379, 445, and 306 of 600 utterances each, 1,531 of 2,400 overall, i.e. 63.79%.)

(Table 5. Confusion matrix of the inside test: the four emotions were correctly recognized in 264, 253, 260, and 264 of 300 utterances each, 1,041 of 1,200 overall, i.e. 86.75%.)
For the outside test, each round used 19 (speakers) × 4 (emotions) × 30 (utterances) = 2,280 utterances for training and 1 (speaker) × 4 (emotions) × 30 (utterances) = 120 for testing. For the inside test, each round used 1,140 utterances for training and 1 (speaker) × 4 (emotions) × 15 (utterances) = 60 for testing.
In conclusion, the proposed system recognizes emotion from speech using pitch, formant, frame energy, and MFCC features with an SVM classifier, and the experiments showed recognition rates of 63%~86% and 70%~94% across the test conditions.
Chu-Chuan Huang**, Shin-Tzen Sheu***, Ming-Wheng Lin****
* Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, lgs@itri.org.tw
** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, chuchuan@itri.org.tw
*** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, SerinaSheu@itri.org.tw
**** Human Computer Interaction Technology, ITRI South, Industrial Technology Research Institute, lmw@itri.org.tw
Abstract
This paper proposes a speech emotion recognition method and its applications. Several speech features, such as pitch, formants, frame energy, and Mel-scale frequency cepstral coefficients (MFCC), are considered in the proposed system. A support vector machine (SVM) is used to classify speech into four emotions. The experimental results show that the proposed system achieved recognition rates of 63.8% in outside tests and 86.8% in inside tests.
Keywords: Speech Recognition, Emotion Recognition, Speech Emotion