Recognition
Wei HAN, Cheong-Fat CHAN, Chiu-Sing CHOY and Kong-Pang PUN
Department of Electronic Engineering,
The Chinese University of Hong Kong
Hong Kong
reduction in recognition accuracy compared to the
I. INTRODUCTION
Figure 1. [Block diagram; recoverable block labels: Pre-emphasis, Windowing & Overlap, FFT, Mel-frequency Filter Bank, Logged, Cepstrum, Delta.]
of the pre-emphasis
$s'_n = s_n - a\,s_{n-1}$ (1)
[Figure: consecutive frames (Frame ..., Frame f+2) with 50% overlap.]
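The pre-emphasis of equation (1) and the 50% frame overlap can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the default a = 0.97 is the value listed for feature set F1 in Table I, the 160-point frame length is taken from the text, and passing the first sample through unchanged is an assumption.

```python
def pre_emphasis(samples, a=0.97):
    """Apply s'_n = s_n - a * s_{n-1}; the first sample is passed through unchanged."""
    out = [samples[0]] if samples else []
    for n in range(1, len(samples)):
        out.append(samples[n] - a * samples[n - 1])
    return out


def split_frames(samples, frame_len=160, overlap=0.5):
    """Cut the signal into frames of frame_len points with the given fractional overlap."""
    hop = int(frame_len * (1 - overlap))  # 80 samples for 160-point frames at 50% overlap
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


# Hypothetical test signal of 400 samples.
signal = [float(i % 7) for i in range(400)]
frames = split_frames(pre_emphasis(signal))
print(len(frames), len(frames[0]))  # 4 160
```

With a hop of 80 samples, each frame shares its second half with the first half of the next frame.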
ISCAS 2006
$\mathrm{Ham}(n) = 0.54 - 0.46\cos\left(\frac{2\pi(n-1)}{N-1}\right)$ (2)
where N is equal to 160, the number of points in one frame, and n is from 1 to N.
$d_t = \frac{(c_{t+1} - c_{t-1}) + 2\,(c_{t+2} - c_{t-2})}{10}$
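The Hamming window of equation (2) can be evaluated directly. The sketch below assumes the index convention stated in the text (n from 1 to N, N = 160) with the cosine argument taken as 2π(n-1)/(N-1); the helper names are hypothetical.

```python
import math


def hamming(N=160):
    """Ham(n) = 0.54 - 0.46*cos(2*pi*(n-1)/(N-1)) for n = 1..N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * (n - 1) / (N - 1))
            for n in range(1, N + 1)]


def apply_window(frame, window):
    """Multiply a frame by the window point by point (one multiplication per sample)."""
    return [x * w for x, w in zip(frame, window)]


w = hamming(160)
windowed = apply_window([1.0] * 160, w)
print(len(windowed), round(w[0], 2))
```

Applying the window costs one multiplication per sample, which is where the 160 multiplications per frame for the window operation come from.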
After all the calculations, the total number of MFCC for
one frame is 26.
After the FFT block, the spectrum of each frame is filtered by a set of filters, and the power of each band is calculated. To obtain a good frequency resolution, a 256-point FFT is used [9]. Because of the symmetry property of the FFT, we only need to calculate the first 128 coefficients. The filter bank consists of 33 triangular band-pass filters, which are centered on equally spaced frequencies in the Mel domain between 0 Hz and 4 kHz, as shown in Fig. 3. The mapping from linear frequency to Mel frequency is shown in equation (3).
$\mathrm{Mel}(f) = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$ (3)
III. THE PROPOSED METHOD FOR MFCC EXTRACTION
The conventional algorithm described in Section II requires 160 multiplications for the window operation, $128 \times \log_2(256)$ multiplications for the FFT calculation, 128 multiplications for the filter power calculation and $33 \times 12$ multiplications for the DCT calculation. A total of 1708 multiplications are required for each frame, which demands a large amount of computational power.
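The per-frame multiplication count of the conventional algorithm can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones quoted in the text.

```python
import math

frame_len = 160                       # one multiplication per windowed sample
fft_points = 256
half_spectrum = fft_points // 2       # only the first 128 coefficients are needed
n_filters = 33
n_cepstra = 12

window_mults = frame_len
fft_mults = half_spectrum * int(math.log2(fft_points))   # 128 * 8 = 1024
power_mults = half_spectrum                              # 128
dct_mults = n_filters * n_cepstra                        # 33 * 12 = 396

total = window_mults + fft_mults + power_mults + dct_mults
print(total)  # 1708
```

The FFT term dominates, contributing 1024 of the 1708 multiplications.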
[Figure: block diagram; recoverable labels: Speech, Windowing, Mel-frequency Filter Bank.]
[Figure: filter-bank frequency response; axes: Magnitude versus Frequency, 0 to 4000 Hz.]
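A triangular Mel filter bank of the kind described in the text (33 filters between 0 Hz and 4 kHz over the first 128 bins of a 256-point FFT) can be sketched as follows. Several points are assumptions rather than details from the paper: the mapping is taken to be the widely used Mel(f) = 2595 log10(1 + f/700), the sampling rate is taken to be twice the upper band edge, each filter is given unit peak height, and all function names are hypothetical.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_filterbank(n_filters=33, n_fft=256, f_low=0.0, f_high=4000.0):
    """Return n_filters rows of n_fft//2 weights, triangles centred on equal Mel steps."""
    n_bins = n_fft // 2
    sample_rate = 2.0 * f_high            # assumption: f_high is the Nyquist frequency
    mel_lo, mel_hi = hz_to_mel(f_low), hz_to_mel(f_high)
    # n_filters + 2 edge points; filter i spans edges[i] .. edges[i + 2]
    edges = [mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bank = []
    for i in range(n_filters):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        row = []
        for k in range(n_bins):
            f = k * sample_rate / n_fft   # frequency of FFT bin k
            if left < f <= centre:
                row.append((f - left) / (centre - left))
            elif centre < f < right:
                row.append((right - f) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def band_powers(power_spectrum, bank):
    """Weighted sum of the power spectrum under each triangular filter."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


bank = triangular_filterbank()
print(len(bank), len(bank[0]))  # 33 128
```

Because the centres are equally spaced on the Mel scale, the triangles are narrow at low frequencies and widen towards 4 kHz.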
[Equation (4), garbled in extraction; only the summation limits k = 1 to K survive.]
$E = \log \sum_{n=1}^{160} s_n^2$ (5)
$s_n - a\,s_{n-1} = s_n - \frac{31}{32}\,s_{n-1}$
$s'_n = s_n - \left(s_{n-1} - \frac{s_{n-1}}{32}\right)$ (7)
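Writing the pre-emphasis coefficient as 31/32 lets the multiplication be replaced by a shift and a subtraction. The sketch below assumes integer (fixed-point) samples and implements the division by 32 as a 5-bit arithmetic right shift; the function name and the sample values are hypothetical, and the shift realization is an assumption about how the rewritten formula would be used in hardware.

```python
def pre_emphasis_shift(samples):
    """s'_n = s_n - (s_{n-1} - s_{n-1}/32), with the division done as a 5-bit right shift."""
    out = [samples[0]] if samples else []
    for n in range(1, len(samples)):
        prev = samples[n - 1]
        # prev >> 5 equals floor(prev / 32), so no multiplier is needed
        out.append(samples[n] - (prev - (prev >> 5)))
    return out


# Hypothetical PCM samples, chosen as multiples of 32 so the shift is exact.
pcm = [0, 64, 128, -96, 320]
print(pre_emphasis_shift(pcm))  # [0, 64, 66, -220, 413]
```

For samples that are not multiples of 32 the shift rounds towards negative infinity, so the result can differ slightly from an exact multiplication by 31/32.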
[Figure: Recognition Accuracy versus No. of Filters; other recoverable labels: Hamming Window, Sub-Frames.]
[Figure: frames $f_n$ and $f_{n+1}$ divided into sub-frames (sf), each followed by an FFT.]
$f_n$ and $f_{n+1}$ represent the original 160-point frames with 50% overlap, and ... calculation.
[Figure: frequency response; axes: Magnitude versus Frequency, 0 to 4000 Hz.]
IV. SIMULATION RESULTS
algorithm. The training and test speech data are from the AURORA 2 database. They are isolated English digits from "0" to "9", and for "0" there are two pronunciations: "zero" and "O". In total there are 1144 utterances for training and 468 utterances for testing. The speech signals are noisy, with a 10 dB SNR, constructed by adding noise to the clean data. The training process and the back end of the speech recognition system were implemented with the HTK toolkit. The back end of the recognizer uses the Hidden Markov Model (HMM) [1] approach. The models are left-to-right with 8 states and 2 mixtures in each state, as illustrated in Fig. 9.
TABLE I.
Feature Set   Pre-emphasis   Window Length   FFT   Filter Shape   No. of Filters   Accuracy
F1            0.97           160             256   Triangle       33               94.43%
F3            31/32          80              256   Triangle       33               92.29%
F4            31/32          80              256   Rectangle      33               92.08%
F5            31/32          80              128   Rectangle      23               92.93%
V. CONCLUSION
We have demonstrated that the new extraction algorithm
REFERENCES
[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall PTR, 1993.
[2] J. W. Picone, "Signal Modeling Techniques in Speech Recognition", Proceedings of the IEEE, vol. 81, no. 9, pp. 1215-1247, 1993.
[3] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-28, no. 4, August 1980.
[4]
[11]
1983.