
A Mandarin Text-to-Speech System Using Prosodic Hierarchy and a Large Number of Words

msyu@dragon.nchu.edu.tw, s9256047@cs.nchu.edu.tw, s9256040@cs.nchu.edu.tw, s9256013@cs.nchu.edu.tw

Abstract. This paper presents a Mandarin text-to-speech (Text-to-Speech) system. Word-sized synthesis units (synthesis units) are selected from a large recorded word database, and prosodic breaks are predicted with CART (Classification And Regression Trees). The recorded word database contains 12,224 words (2,690 …).

Keywords: Text-to-Speech, Parser, Prosodic Hierarchy.

1. Introduction

1.1. Background

Most Mandarin speech synthesizers produce speech by waveform concatenation (waveform concatenation) of pre-recorded synthesis units (synthesis units); commonly used units are the phoneme (phoneme), the di-phone (di-phone), and the syllable (syllable). A typical TTS system has three stages: text analysis (Text analysis), which assigns each word its part-of-speech (Part-of-Speech, POS); prosody prediction (Prosody prediction), which estimates the duration (Duration), energy (Energy), and pitch (Pitch) of each unit; and speech generation (Speech generation), which concatenates the selected units and adjusts them, for example with PSOLA (Pitch-Synchronous Overlap and Add) [9][8].

Corpus-based (Corpus-based) text-to-speech (Text-to-Speech) systems [6][9] select non-uniform units, up to whole sentences (sentence), from a very large speech corpus, and typically use a decision tree (Decision Tree) [6][9] to predict prosodic breaks. At each concatenation point the waveforms are smoothed by fading-in and fading-out. Prosodic breaks have also been predicted with rule-based methods [10], bigram models, and CART.

The proposed system works as follows: the input text (Text) first undergoes text and prosody analysis (Text analysis and prosody analysis); word-based unit selection (word-based unit selection) retrieves recorded words from the database; prosody modification (prosody modification) inserts pauses and smooths the concatenation boundaries; and the output speech (Speech) is produced by waveform concatenation.

1.2.
[1]

TTS(Text-to-Speech)
(treebank) probabilistic context-free grammar(
PCFG)
PCFG [5]
[15]
PCFG PCFG
Bottom-up Cocke-Younger-KasamiCYK[1][7]

CYK
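The parsing step can be illustrated with a short sketch. The following is a minimal probabilistic CYK parser over a POS sequence; it assumes a PCFG already in Chomsky normal form, and the rule encoding is an assumption for the example, not the paper's implementation.

```python
# A minimal sketch of probabilistic CYK parsing over a POS sequence,
# assuming the PCFG is in Chomsky normal form.
import math
from collections import defaultdict

def cyk_parse(pos_seq, unary_rules, binary_rules, start="S"):
    """pos_seq:      list of POS tags, e.g. ["Nh", "VC", "Na"]
       unary_rules:  {(parent, pos_tag): prob},  e.g. ("NP", "Na"): 0.4
       binary_rules: {(parent, left, right): prob}
       Returns the log-probability of the best parse and the backpointers."""
    n = len(pos_seq)
    best = defaultdict(lambda: float("-inf"))   # (i, j, label) -> log prob
    back = {}                                   # (i, j, label) -> backpointer

    # Spans of length 1 come from the unary (preterminal) rules.
    for i, tag in enumerate(pos_seq):
        for (parent, child), p in unary_rules.items():
            if child == tag and math.log(p) > best[(i, i + 1, parent)]:
                best[(i, i + 1, parent)] = math.log(p)
                back[(i, i + 1, parent)] = tag

    # Fill longer spans bottom-up by combining two adjacent sub-spans.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (parent, left, right), p in binary_rules.items():
                    score = (math.log(p)
                             + best[(i, k, left)]
                             + best[(k, j, right)])
                    if score > best[(i, j, parent)]:
                        best[(i, j, parent)] = score
                        back[(i, j, parent)] = (k, left, right)
    return best[(0, n, start)], back
```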

1.3. Prosodic hierarchy

The prosodic hierarchy used by the TTS system has four levels: the sentence (Sentence, S), the prosodic phrase (Prosodic phrase, PP), the prosodic word (Prosodic word, PW), and the lexical word (word). Lexical words are grouped into prosodic words (Prosodic Word, PW), prosodic words into prosodic phrases (Prosodic Phrase, PP), and prosodic phrases into the sentence (Sentence). A boundary between prosodic words corresponds to a minor break, and a boundary between prosodic phrases corresponds to a major break.
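As a data structure, the four-level hierarchy can be represented as nested containers. The sketch below is an illustrative representation (class and field names are not the paper's), with a helper that reads off the B0/B1/B2 break labels used in Section 5.2.

```python
# A minimal sketch of the four-level prosodic hierarchy
# (Sentence > Prosodic phrase > Prosodic word > word).
from dataclasses import dataclass
from typing import List

@dataclass
class ProsodicWord:
    words: List[str]                        # lexical words grouped into one PW

@dataclass
class ProsodicPhrase:
    prosodic_words: List[ProsodicWord]      # PW boundary -> minor break

@dataclass
class Sentence:
    prosodic_phrases: List[ProsodicPhrase]  # PP boundary -> major break

    def break_labels(self) -> List[str]:
        """Break label (B0/B1/B2) at each word boundary inside the sentence."""
        labels = []
        for pi, pp in enumerate(self.prosodic_phrases):
            for wi, pw in enumerate(pp.prosodic_words):
                for j in range(len(pw.words)):
                    if j < len(pw.words) - 1:
                        labels.append("B0")   # no break inside a prosodic word
                    elif wi < len(pp.prosodic_words) - 1:
                        labels.append("B1")   # minor break between prosodic words
                    elif pi < len(self.prosodic_phrases) - 1:
                        labels.append("B2")   # major break between prosodic phrases
        return labels
```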

1.4. Word-based synthesis units

Following the word-based (word-based) view of unit selection in [6], the system records whole words rather than a full corpus-based (corpus-based) sentence corpus; a word-based database is much smaller than a corpus-based one while still covering most of the units needed for synthesis. The recorded word database contains 12,224 words (2,690 …). Syllables not covered by any recorded word are taken from a monosyllable database covering the roughly 409 base syllables of Mandarin (409*5 possible tonal syllables, of which about 1,300 actually occur).

2. Corpus and break prediction

This section describes the text corpus and the CART model used to predict prosodic breaks.

2.1. Corpus

The corpus used for training the break predictor contains 1,923 … and 6,250 …, with 51,525 … in total (EX: …).

2.2. CART (Classification And Regression Trees)

CART grows a binary (binary) decision tree: at each node a yes/no question splits the training samples into two subsets, and the question that most reduces the Gini impurity is chosen [2]. (Figure: an example CART tree; each internal node asks a yes/no question and each leaf assigns a class.)

For break prediction, every word boundary in the corpus is labelled as no break (no break), minor break (minor break), or major break (major break). Features describing the two words around the boundary (word A followed by word B) are extracted, and a CART is trained to map these features to the three break labels; the trained CART then predicts the break after each word of an input sentence.
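As a concrete illustration of the Gini criterion, the sketch below scores candidate yes/no questions on labelled boundaries. The feature encoding and the toy data are assumptions for the example, not the paper's feature set.

```python
# A minimal sketch of choosing a binary CART split by Gini impurity.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of break labels (e.g. 'B0', 'B1', 'B2')."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(samples):
    """samples: list of (features_dict, label).
       Tries every yes/no question of the form 'feature == value' and
       returns the one with the lowest weighted Gini impurity."""
    n = len(samples)
    best = (None, None, gini([y for _, y in samples]))
    questions = {(f, v) for x, _ in samples for f, v in x.items()}
    for f, v in questions:
        yes = [y for x, y in samples if x.get(f) == v]
        no = [y for x, y in samples if x.get(f) != v]
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / n
        if score < best[2]:
            best = (f, v, score)
    return best   # (feature, value, weighted impurity)

# Example: boundaries described by the POS of the word before/after (toy data).
data = [({"pos_left": "Na", "pos_right": "VC"}, "B1"),
        ({"pos_left": "Na", "pos_right": "DE"}, "B0"),
        ({"pos_left": "VC", "pos_right": "Na"}, "B0"),
        ({"pos_left": "Na", "pos_right": "VC"}, "B2")]
print(best_split(data))
```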

2.3. From text to prosodic hierarchy

Text analysis (Text analysis) converts an input sentence into its prosodic hierarchy in the following steps: the input sentence (Input sentence) is tagged by the POS tagger (POS Tagger) to produce a POS sequence (POS sequence); the parser (Parser) builds a syntax tree (Syntax tree) from the POS sequence; and the prosodic hierarchy (prosodic hierarchy) is derived from the syntax tree [13]. The result is the prosodic hierarchy of the sentence (Output prosodic hierarchy).
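The whole analysis step can be summarised as a small pipeline. The function names and signatures below are placeholders rather than the paper's actual interfaces.

```python
# A minimal sketch of the text-analysis pipeline described above
# (sentence -> POS sequence -> syntax tree -> prosodic hierarchy).
def text_analysis(sentence,
                  pos_tagger,          # str -> list of (word, pos)
                  parser,              # list of pos tags -> syntax tree
                  derive_hierarchy):   # (tree, words) -> Sentence (see 1.3)
    words_pos = pos_tagger(sentence)
    pos_sequence = [pos for _, pos in words_pos]
    tree = parser(pos_sequence)
    return derive_hierarchy(tree, [w for w, _ in words_pos])
```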

(Figure 1: an example syntax tree produced by the parser. Node labels include S, NP, VP, DE and CKIP POS tags such as Nac, Ndabe, VH, Caa, VC31, VC1, Nab, Dc, Baa, Vj3, Nad, Neqa.)

2.3.1. Grouping words into prosodic words and phrases

Lexical words are grouped into prosodic words (prosodic word), and prosodic words into prosodic phrases (prosodic phrase), according to the length of the sentence. The grouping table distinguishes the length ranges 1-3, 4-8, 9-12, 13-18, 19-22, 23-30, 31-39, 40-41, and >42, and for each range specifies the PW and PP grouping (table values: 12, 13, 14, 15, 16, 16).

2.3.2. Bottom-up break assignment

Break labels can also be assigned directly from the syntax tree in a bottom-up (bottom-up) fashion: starting from the leaves, adjacent constituents (for example NP and VH, VC1 and VP, or Daa and Vj3) are merged level by level into NP and VP nodes, and the break strength at each word boundary is determined by the level at which the two sides of the boundary are joined [16]. (Figure: an example syntax tree with the bottom-up merging of its constituents; node labels include S, NP, VP, VH, Caa, VC1, Nab, DE, Daa, Vj3.)
3. Unit selection

Unit selection is rule-based (rule-based). After text analysis each word is available together with its phone sequence (Phone Sequence), and three rules are applied in turn:

Rule 1: if the whole word was recorded (Was the word recorded?), the recorded word is used directly.
Rule 2: otherwise, for each syllable of the word, if the syllable occurs inside some recorded word (Was the syllable in a word which was recorded?), that syllable is cut out of the recorded word.
Rule 3: otherwise, the syllable is selected from the monosyllable database (Select a syllable from monosyllable database).

The selected units are then joined by waveform concatenation (waveform concatenation). Rules 1 and 3 are straightforward; for rule 2 the candidate syllable is chosen by matching (1) its position in the word (Position in word), (2) its tone (Tone), and (3) the consonant/vowel (C/V) context of the neighbouring syllables [11].
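A minimal sketch of these three rules follows, assuming hypothetical database structures (word_db, syllable_in_word_db, mono_db) and a syllable notation with a trailing tone digit; none of these are the paper's actual data structures.

```python
# A minimal sketch of the three-rule unit selection described above.
def select_units(word, syllables, word_db, syllable_in_word_db, mono_db):
    """word:      the target word (string)
       syllables: its syllable sequence, e.g. ["guo2", "yu3"]
       Returns a list of waveform units to concatenate."""
    # Rule 1: the whole word was recorded -> use it directly.
    if word in word_db:
        return [word_db[word]]

    units = []
    for i, syl in enumerate(syllables):
        # Rule 2: the syllable occurs inside some recorded word -> cut it out,
        # preferring candidates that match position in word, tone, and the
        # C/V context of the neighbouring syllables.
        candidates = syllable_in_word_db.get(syl, [])
        if candidates:
            best = max(candidates,
                       key=lambda c: (c["position"] == i,
                                      c["tone"] == syl[-1],
                                      c["cv_context"] == _cv_context(syllables, i)))
            units.append(best["waveform"])
        else:
            # Rule 3: fall back to the monosyllable database.
            units.append(mono_db[syl])
    return units

def _cv_context(syllables, i):
    """Crude context key: previous and next syllables (or None)."""
    prev = syllables[i - 1] if i > 0 else None
    nxt = syllables[i + 1] if i + 1 < len(syllables) else None
    return (prev, nxt)
```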

4. Prosody modification

After unit selection, the TTS system adjusts the prosody of the concatenated units [13]: pauses are inserted at the prosodic boundaries, and neighbouring units are overlapped (overlapping) and cross-faded at the concatenation points. The pause lengths are:

Pause lengths (ms): Minor break 50; Major break 250; further values 400, 625, 625, 625, 500, 250, 300.

Within a prosodic word (Prosodic Word), neighbouring units are joined with an overlap (Overlapping) at the concatenation point [8]; overlap lengths of 0.05, 0.1, 0.15, and 0.2 are used depending on the boundary.

Each unit is cut off (cut off) at a PSOLA pitch mark, the trailing samples of the left unit are faded out (fading-out), the leading samples of the right unit are faded in (fading-in), and the two are overlapped and added.
For an overlap region of n samples x_1, x_2, x_3, ..., x_n, the fading-out and fading-in weights are given by equations (1) and (2):

$$x_i(\mathrm{fading\_out}) = x_i \cdot \frac{n - i + 1}{n + 1}, \qquad i = 1 \sim n \qquad (1)$$

$$x_i(\mathrm{fading\_in}) = x_i \cdot \frac{i}{n + 1}, \qquad i = 1 \sim n \qquad (2)$$
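Since the two weights in (1) and (2) sum to one at every sample, the overlap preserves the overall signal level. A minimal sketch of the cross-faded concatenation, assuming numpy for brevity (an assumption, not the paper's code):

```python
# A minimal sketch of the fading-out / fading-in overlap-add
# in equations (1) and (2).
import numpy as np

def crossfade_concat(left, right, n):
    """Concatenate two waveforms with an n-sample cross-fade."""
    i = np.arange(1, n + 1)
    fade_out = (n - i + 1) / (n + 1)      # equation (1)
    fade_in = i / (n + 1)                 # equation (2)
    overlap = left[-n:] * fade_out + right[:n] * fade_in
    return np.concatenate([left[:-n], overlap, right[n:]])

# Usage: join two units with a 100-sample overlap.
# y = crossfade_concat(unit_a, unit_b, 100)
```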

5. Experiments

5.1. Parser evaluation

The parser was trained on the treebank (Version 2.1), which contains 54,902 parse trees; 51,939 trees were used, and the evaluation is an inside test (inside test) [15].



Parser accuracy is evaluated on the tree structure (tree structure), the constituent labels (Label), and the bracketing (Bracket), following the PARSEVAL measures [4][15]:

SP (Structure Precision):
$$\mathrm{SP} = \frac{\#\,\text{correct parsing trees of testing data}}{\#\,\text{treebank parsing trees of testing data}}$$

LP (Labeled Precision):
$$\mathrm{LP} = \frac{\#\,\text{label-correct constituents in parser's parse of testing data}}{\#\,\text{label constituents in parser's parse of testing data}}$$

LR (Labeled Recall):
$$\mathrm{LR} = \frac{\#\,\text{label-correct constituents in parser's parse of testing data}}{\#\,\text{label constituents in treebank's parse of testing data}}$$

LF (Labeled F-measure):
$$\mathrm{LF} = \frac{2 \cdot \mathrm{LP} \cdot \mathrm{LR}}{\mathrm{LP} + \mathrm{LR}}$$

BP (Bracketed Precision):
$$\mathrm{BP} = \frac{\#\,\text{bracket-correct constituents in parser's parse of testing data}}{\#\,\text{bracket constituents in parser's parse of testing data}}$$

BR (Bracketed Recall):
$$\mathrm{BR} = \frac{\#\,\text{bracket-correct constituents in parser's parse of testing data}}{\#\,\text{bracket constituents in treebank's parse of testing data}}$$

BF (Bracketed F-measure):
$$\mathrm{BF} = \frac{2 \cdot \mathrm{BP} \cdot \mathrm{BR}}{\mathrm{BP} + \mathrm{BR}}$$
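A minimal sketch of the labeled and bracketed measures above, representing each constituent as a (label, start, end) span; the set-based matching is a simplification of PARSEVAL rather than the paper's exact scorer.

```python
# A minimal sketch of labeled / bracketed precision, recall and F-measure.
def parseval(parser_constituents, treebank_constituents):
    """Each argument: set of (label, start, end) tuples, pooled over the
       test set.  Assumes both sets are non-empty."""
    gold = set(treebank_constituents)
    pred = set(parser_constituents)

    # Labeled: label and span must both match.
    label_correct = len(pred & gold)
    lp = label_correct / len(pred)
    lr = label_correct / len(gold)
    lf = 2 * lp * lr / (lp + lr) if lp + lr else 0.0

    # Bracketed: only the span (start, end) must match.
    gold_spans = {(s, e) for _, s, e in gold}
    pred_spans = {(s, e) for _, s, e in pred}
    bracket_correct = len(pred_spans & gold_spans)
    bp = bracket_correct / len(pred_spans)
    br = bracket_correct / len(gold_spans)
    bf = 2 * bp * br / (bp + br) if bp + br else 0.0

    return {"LP": lp, "LR": lr, "LF": lf, "BP": bp, "BR": br, "BF": bf}
```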

Parser accuracy (%)

SP       LP       LR       LF       BP       BR       BF
38.78    61.96    64.31    63.11    70.04    72.80    71.39

The bracketed precision and recall of the generated parse trees (parse tree) reach 70.04% and 72.80%, respectively.


5.2. Break prediction

Break prediction is evaluated with a confusion matrix (confusion matrix):
                 Predicted labels
True labels      B0       B1       B2
B0               C00      C01      C02
B1               C10      C11      C12
B2               C20      C21      C22

Here B_i (i = 0, 1, 2) denotes the three break classes: B_0 is no break (no break), B_1 is a minor break (minor break), and B_2 is a major break (major break). C_ii (i = 0, 1, 2) is the number of boundaries of class B_i that are predicted correctly, and C_ij (i, j = 0, 1, 2; i ≠ j) is the number of boundaries whose true label is B_i but whose predicted label is B_j. The recall (Recall) of each class is

$$\mathrm{Rec}_i = \frac{C_{ii}}{\sum_{j=0}^{2} C_{ij}}, \qquad i = 0, 1, 2$$

for example, for B_0 (no break),

$$\mathrm{Rec}_0 = \frac{C_{00}}{C_{00} + C_{01} + C_{02}}.$$

The precision (Precision) of each class is

$$\mathrm{Pre}_i = \frac{C_{ii}}{\sum_{j=0}^{2} C_{ji}}, \qquad i = 0, 1, 2$$

for example, for B_0 (no break),

$$\mathrm{Pre}_0 = \frac{C_{00}}{C_{00} + C_{10} + C_{20}}.$$

The overall accuracy (Accuracy) is

$$\mathrm{Acc} = \frac{\sum_{i=0}^{2} C_{ii}}{\sum_{i=0}^{2}\sum_{j=0}^{2} C_{ij}} = \frac{C_{00} + C_{11} + C_{22}}{C_{00} + C_{01} + C_{02} + C_{10} + C_{11} + C_{12} + C_{20} + C_{21} + C_{22}}.$$
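A minimal sketch that computes these measures from a 3x3 confusion matrix; applied to the CART matrix reported below, it reproduces Acc1 and the per-class precision and recall figures.

```python
# A minimal sketch of recall, precision and accuracy from a 3x3 confusion
# matrix C[i][j] = count of boundaries with true label B_i predicted as B_j.
def break_metrics(C):
    total = sum(sum(row) for row in C)
    acc = sum(C[i][i] for i in range(3)) / total
    rec = [C[i][i] / sum(C[i][j] for j in range(3)) for i in range(3)]
    pre = [C[i][i] / sum(C[j][i] for j in range(3)) for i in range(3)]
    return acc, rec, pre

# CART confusion matrix from the table below.
cart = [[30434, 3198, 126],
        [5758, 6810, 372],
        [635, 1381, 514]]
acc, rec, pre = break_metrics(cart)
print(round(acc, 3), [round(r, 3) for r in rec], [round(p, 3) for p in pre])
# -> 0.767 [0.902, 0.526, 0.203] [0.826, 0.598, 0.508]
```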


Two accuracy figures, Acc1 and Acc2, are reported for each break predictor, together with the precision and recall of the B0, B1, and B2 classes. The CART predictor gives:
Table: CART break prediction

                 Predicted labels
True labels      B0        B1       B2
B0               30,434    3,198    126
B1               5,758     6,810    372
B2               635       1,381    514

Acc1 = 0.767   Acc2 = 0.791
Pre0 = 0.826   Pre1 = 0.598   Pre2 = 0.508
Rec0 = 0.902   Rec1 = 0.526   Rec2 = 0.203

Table: bottom-up break prediction

                 Predicted labels
True labels      B0         B1       B2
B0               156,090    7,384    4,502
B1               54,390     5,471    5,062
B2               7,854      1,828    2,911

Acc1 = 0.669   Acc2 = 0.698
Pre0 = 0.715   Pre1 = 0.372   Pre2 = 0.233
Rec0 = 0.929   Rec1 = 0.084   Rec2 = 0.231

5.3. Listening tests

Naturalness (naturalness) was evaluated with a preference test (preference testing) and intelligibility (intelligibility) with the mean opinion score (MOS, Mean Opinion Score) on a five-point scale: 5 excellent, 4 good, 3 fair, 2 poor, 1 unsatisfactory. Eight listeners (6 male, 2 female) took the MOS test on paragraphs (paragraphs) of length 20, 15, and 25. Within prosodic words (Prosodic Word), 97.2% of the required units could be taken directly from the recorded word database.


Table: MOS results (mean and standard deviation)

        MOS      s.d.     MOS      s.d.
M01     4.05     0.497    …        0.474
M02     3.15     0.963    3.15     0.792
M03     3.3      0.714    3.3      0.640
M04     4.385    0.504    4.67     0.181
M05     3.55     0.668    3.65     0.852
M06     3.9      0.538    …        0.632
M07     4.2      0.748    …        0.707
M08     2.95     0.804    2.95     0.804
Avg.    3.68              3.715

Table: preference test results (%)

M01     40      60
M02     50      50
M03     55      45
M04     50      50
M05     70      30
M06     70      30
M07     50      50
M08     55      45
Avg.    55      45

The MOS of the proposed system is also compared with that of a CART-based system, following the evaluation method for synthetic Mandarin speech of [3]. (Figure: MOS scores on the 1-5 scale.) The MOS values are listed below; the last value in each row is the average, and the third row is the CART-based system.

Table: MOS comparison

…       3.45    3.65    3.55    3.35    3.5     3.9     4.2     4.1     3.71
…       3.48    3.9     3.45    3.95    3.45    4.18    3.88    4.25    3.82
CART    4.66    4.66    4.66    4.66    3.66    4.42    4.33    4.83    4.52

6. Conclusion

This paper has presented a Mandarin TTS system based on a prosodic hierarchy and a large recorded word database; within prosodic words, 97.2% of the required units can be taken directly from the recorded words. An online demonstration is available at http://140.120.15.239/onlineTTS/cgitest.html.

References

[1] Aho, A. V. and Ullman, J. D., "The Theory of Parsing, Translation, and Compiling", Vol. 1, Prentice-Hall, Englewood Cliffs, NJ, 1972.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., et al., "Classification and Regression Trees", Wadsworth, Inc., 1984.
[3] Bao, H., Wang, A., Lu, S., "A Study of Evaluation Method for Synthetic Mandarin Speech", Proceedings of ISCSLP 2002, pp. 383-386, Taipei, Taiwan.
[4] Charniak, E., "Treebank Grammars", In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1031-1036, AAAI Press/MIT Press, 1996.
[5] Collins, M. J., "Head-Driven Statistical Models for Natural Language Parsing", Ph.D. Thesis, University of Pennsylvania, Philadelphia, 1999.


[6] Chu, M., Peng, H., Yang, H. Y. and Chang, E., "Selecting Non-Uniform Units from A Very Large Corpus for Concatenative Speech Synthesizer", Proceedings of ICASSP 2001, IEEE, Volume 2, pp. 785-788, Salt Lake City.
[7] Ney, H., "Dynamic Programming Parsing for Context-Free Recognition", IEEE Transactions on Signal Processing, 39(2), pp. 336-340, 1991.
[8] Hwang, S. H. and Yei, C. Y., "The Synthesis Unit Generation Algorithm for Mandarin TTS", Proceedings of ICASSP 2002, IEEE, Volume 1, pp. 457-460, Orlando, Florida.
[9] (in Chinese), 1998.
[10] (in Chinese), 2001.
[11] (in Chinese), 2005.
[12] (in Chinese), 2005.
[13] (in Chinese), 2004.
[14] (in Chinese), 2005.
[15] (in Chinese), Proceedings of ROCLING XVI, pp. 141-150, 2004.
[16] (in Chinese), 1998.

Appendix: POS tags used by the parser (CKIP tag set)

Caa, Cab, Cba, Cbb, Da, DE, Dfa, Dfb, Di, Dk, FW, Na, Nb, Nc, Ncd, Nd, Nep, Neqa, Neqb, Nes, Neu, Nf, Ng, Nh, SHI, VA, VAC, VB, VC, VCL, VD, VE, VF, VG, VH, VHC, VI, VJ, VK, VL, V_2
