
A Mandarin Text-to-Speech System Using Prosodic Hierarchy and a Large Number of Words

msyu@dragon.nchu.edu.tw, s9256047@cs.nchu.edu.tw, s9256040@cs.nchu.edu.tw, s9256013@cs.nchu.edu.tw

Abstract. This paper presents a Mandarin text-to-speech (Text-to-Speech) system. Word-sized synthesis units (synthesis units) are selected from a large recorded word database, and prosodic breaks are predicted with CART (Classification And Regression Trees). The recorded word database contains 12,224 words (2,690 …).

Keywords: Text-to-Speech, Parser, Prosodic Hierarchy.

1. Introduction

1.1. Background

Most Mandarin speech synthesizers produce speech by waveform concatenation (waveform concatenation) of pre-recorded synthesis units (synthesis units); commonly used units are the phoneme (phoneme), the di-phone (di-phone), and the syllable (syllable). A typical TTS system has three stages: text analysis (Text analysis), which assigns each word its part-of-speech (Part-of-Speech, POS); prosody prediction (Prosody prediction), which estimates the duration (Duration), energy (Energy), and pitch (Pitch) of each unit; and speech generation (Speech generation), which concatenates the selected units and adjusts them, for example with PSOLA (Pitch-Synchronous Overlap and Add) [9][8].

Corpus-based (Corpus-based) text-to-speech (Text-to-Speech) systems [6][9] select non-uniform units, up to whole sentences (sentence), from a very large speech corpus, and typically use a decision tree (Decision Tree) [6][9] to predict prosodic breaks. At each concatenation point the waveforms are smoothed by fading-in and fading-out. Prosodic breaks have also been predicted with rule-based methods [10], bigram models, and CART.

The proposed system works as follows: the input text (Text) first undergoes text and prosody analysis (Text analysis and prosody analysis); word-based unit selection (word-based unit selection) retrieves recorded words from the database; prosody modification (prosody modification) inserts pauses and smooths the concatenation boundaries; and the output speech (Speech) is produced by waveform concatenation.

1.2.
[1]

TTS(Text-to-Speech)
(treebank) probabilistic context-free grammar(
PCFG)
PCFG [5]
[15]
PCFG PCFG
Bottom-up Cocke-Younger-KasamiCYK[1][7]

CYK
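The parsing step can be illustrated with a short sketch. The following is a minimal probabilistic CYK parser over a POS sequence; it assumes a PCFG already in Chomsky normal form, and the rule encoding is an assumption for the example, not the paper's implementation.

```python
# A minimal sketch of probabilistic CYK parsing over a POS sequence,
# assuming the PCFG is in Chomsky normal form.
import math
from collections import defaultdict

def cyk_parse(pos_seq, unary_rules, binary_rules, start="S"):
    """pos_seq:      list of POS tags, e.g. ["Nh", "VC", "Na"]
       unary_rules:  {(parent, pos_tag): prob},  e.g. ("NP", "Na"): 0.4
       binary_rules: {(parent, left, right): prob}
       Returns the log-probability of the best parse and the backpointers."""
    n = len(pos_seq)
    best = defaultdict(lambda: float("-inf"))   # (i, j, label) -> log prob
    back = {}                                   # (i, j, label) -> backpointer

    # Spans of length 1 come from the unary (preterminal) rules.
    for i, tag in enumerate(pos_seq):
        for (parent, child), p in unary_rules.items():
            if child == tag and math.log(p) > best[(i, i + 1, parent)]:
                best[(i, i + 1, parent)] = math.log(p)
                back[(i, i + 1, parent)] = tag

    # Fill longer spans bottom-up by combining two adjacent sub-spans.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (parent, left, right), p in binary_rules.items():
                    score = (math.log(p)
                             + best[(i, k, left)]
                             + best[(k, j, right)])
                    if score > best[(i, j, parent)]:
                        best[(i, j, parent)] = score
                        back[(i, j, parent)] = (k, left, right)
    return best[(0, n, start)], back
```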

1.3. Prosodic hierarchy

The prosodic hierarchy used by the TTS system has four levels: the sentence (Sentence, S), the prosodic phrase (Prosodic phrase, PP), the prosodic word (Prosodic word, PW), and the lexical word (word). Lexical words are grouped into prosodic words (Prosodic Word, PW), prosodic words into prosodic phrases (Prosodic Phrase, PP), and prosodic phrases into the sentence (Sentence). A boundary between prosodic words corresponds to a minor break, and a boundary between prosodic phrases corresponds to a major break.
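As a data structure, the four-level hierarchy can be represented as nested containers. The sketch below is an illustrative representation (class and field names are not the paper's), with a helper that reads off the B0/B1/B2 break labels used in Section 5.2.

```python
# A minimal sketch of the four-level prosodic hierarchy
# (Sentence > Prosodic phrase > Prosodic word > word).
from dataclasses import dataclass
from typing import List

@dataclass
class ProsodicWord:
    words: List[str]                        # lexical words grouped into one PW

@dataclass
class ProsodicPhrase:
    prosodic_words: List[ProsodicWord]      # PW boundary -> minor break

@dataclass
class Sentence:
    prosodic_phrases: List[ProsodicPhrase]  # PP boundary -> major break

    def break_labels(self) -> List[str]:
        """Break label (B0/B1/B2) at each word boundary inside the sentence."""
        labels = []
        for pi, pp in enumerate(self.prosodic_phrases):
            for wi, pw in enumerate(pp.prosodic_words):
                for j in range(len(pw.words)):
                    if j < len(pw.words) - 1:
                        labels.append("B0")   # no break inside a prosodic word
                    elif wi < len(pp.prosodic_words) - 1:
                        labels.append("B1")   # minor break between prosodic words
                    elif pi < len(self.prosodic_phrases) - 1:
                        labels.append("B2")   # major break between prosodic phrases
        return labels
```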

1.4. Word-based synthesis units

Following the word-based (word-based) view of unit selection in [6], the system records whole words rather than a full corpus-based (corpus-based) sentence corpus; a word-based database is much smaller than a corpus-based one while still covering most of the units needed for synthesis. The recorded word database contains 12,224 words (2,690 …). Syllables not covered by any recorded word are taken from a monosyllable database covering the roughly 409 base syllables of Mandarin (409*5 possible tonal syllables, of which about 1,300 actually occur).

2. Corpus and break prediction

This section describes the text corpus and the CART model used to predict prosodic breaks.

2.1. Corpus

The corpus used for training the break predictor contains 1,923 … and 6,250 …, with 51,525 … in total (EX: …).

2.2. CART (Classification And Regression Trees)

CART grows a binary (binary) decision tree: at each node a yes/no question splits the training samples into two subsets, and the question that most reduces the Gini impurity is chosen [2]. (Figure: an example CART tree; each internal node asks a yes/no question and each leaf assigns a class.)

For break prediction, every word boundary in the corpus is labelled as no break (no break), minor break (minor break), or major break (major break). Features describing the two words around the boundary (word A followed by word B) are extracted, and a CART is trained to map these features to the three break labels; the trained CART then predicts the break after each word of an input sentence.
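As a concrete illustration of the Gini criterion, the sketch below scores candidate yes/no questions on labelled boundaries. The feature encoding and the toy data are assumptions for the example, not the paper's feature set.

```python
# A minimal sketch of choosing a binary CART split by Gini impurity.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of break labels (e.g. 'B0', 'B1', 'B2')."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(samples):
    """samples: list of (features_dict, label).
       Tries every yes/no question of the form 'feature == value' and
       returns the one with the lowest weighted Gini impurity."""
    n = len(samples)
    best = (None, None, gini([y for _, y in samples]))
    questions = {(f, v) for x, _ in samples for f, v in x.items()}
    for f, v in questions:
        yes = [y for x, y in samples if x.get(f) == v]
        no = [y for x, y in samples if x.get(f) != v]
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / n
        if score < best[2]:
            best = (f, v, score)
    return best   # (feature, value, weighted impurity)

# Example: boundaries described by the POS of the word before/after (toy data).
data = [({"pos_left": "Na", "pos_right": "VC"}, "B1"),
        ({"pos_left": "Na", "pos_right": "DE"}, "B0"),
        ({"pos_left": "VC", "pos_right": "Na"}, "B0"),
        ({"pos_left": "Na", "pos_right": "VC"}, "B2")]
print(best_split(data))
```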

2.3. From text to prosodic hierarchy

Text analysis (Text analysis) converts an input sentence into its prosodic hierarchy in the following steps: the input sentence (Input sentence) is tagged by the POS tagger (POS Tagger) to produce a POS sequence (POS sequence); the parser (Parser) builds a syntax tree (Syntax tree) from the POS sequence; and the prosodic hierarchy (prosodic hierarchy) is derived from the syntax tree [13]. The result is the prosodic hierarchy of the sentence (Output prosodic hierarchy).
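The whole analysis step can be summarised as a small pipeline. The function names and signatures below are placeholders rather than the paper's actual interfaces.

```python
# A minimal sketch of the text-analysis pipeline described above
# (sentence -> POS sequence -> syntax tree -> prosodic hierarchy).
def text_analysis(sentence,
                  pos_tagger,          # str -> list of (word, pos)
                  parser,              # list of pos tags -> syntax tree
                  derive_hierarchy):   # (tree, words) -> Sentence (see 1.3)
    words_pos = pos_tagger(sentence)
    pos_sequence = [pos for _, pos in words_pos]
    tree = parser(pos_sequence)
    return derive_hierarchy(tree, [w for w, _ in words_pos])
```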

(Figure 1: an example syntax tree produced by the parser. Node labels include S, NP, VP, DE and CKIP POS tags such as Nac, Ndabe, VH, Caa, VC31, VC1, Nab, Dc, Baa, Vj3, Nad, Neqa.)

2.3.1. Grouping words into prosodic words and phrases

Lexical words are grouped into prosodic words (prosodic word), and prosodic words into prosodic phrases (prosodic phrase), according to the length of the sentence. The grouping table distinguishes the length ranges 1-3, 4-8, 9-12, 13-18, 19-22, 23-30, 31-39, 40-41, and >42, and for each range specifies the PW and PP grouping (table values: 12, 13, 14, 15, 16, 16).

2.3.2. Bottom-up break assignment

Break labels can also be assigned directly from the syntax tree in a bottom-up (bottom-up) fashion: starting from the leaves, adjacent constituents (for example NP and VH, VC1 and VP, or Daa and Vj3) are merged level by level into NP and VP nodes, and the break strength at each word boundary is determined by the level at which the two sides of the boundary are joined [16]. (Figure: an example syntax tree with the bottom-up merging of its constituents; node labels include S, NP, VP, VH, Caa, VC1, Nab, DE, Daa, Vj3.)
3. Unit selection

Unit selection is rule-based (rule-based). After text analysis each word is available together with its phone sequence (Phone Sequence), and three rules are applied in turn:

Rule 1: if the whole word was recorded (Was the word recorded?), the recorded word is used directly.
Rule 2: otherwise, for each syllable of the word, if the syllable occurs inside some recorded word (Was the syllable in a word which was recorded?), that syllable is cut out of the recorded word.
Rule 3: otherwise, the syllable is selected from the monosyllable database (Select a syllable from monosyllable database).

The selected units are then joined by waveform concatenation (waveform concatenation). Rules 1 and 3 are straightforward; for rule 2 the candidate syllable is chosen by matching (1) its position in the word (Position in word), (2) its tone (Tone), and (3) the consonant/vowel (C/V) context of the neighbouring syllables [11].
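A minimal sketch of these three rules follows, assuming hypothetical database structures (word_db, syllable_in_word_db, mono_db) and a syllable notation with a trailing tone digit; none of these are the paper's actual data structures.

```python
# A minimal sketch of the three-rule unit selection described above.
def select_units(word, syllables, word_db, syllable_in_word_db, mono_db):
    """word:      the target word (string)
       syllables: its syllable sequence, e.g. ["guo2", "yu3"]
       Returns a list of waveform units to concatenate."""
    # Rule 1: the whole word was recorded -> use it directly.
    if word in word_db:
        return [word_db[word]]

    units = []
    for i, syl in enumerate(syllables):
        # Rule 2: the syllable occurs inside some recorded word -> cut it out,
        # preferring candidates that match position in word, tone, and the
        # C/V context of the neighbouring syllables.
        candidates = syllable_in_word_db.get(syl, [])
        if candidates:
            best = max(candidates,
                       key=lambda c: (c["position"] == i,
                                      c["tone"] == syl[-1],
                                      c["cv_context"] == _cv_context(syllables, i)))
            units.append(best["waveform"])
        else:
            # Rule 3: fall back to the monosyllable database.
            units.append(mono_db[syl])
    return units

def _cv_context(syllables, i):
    """Crude context key: previous and next syllables (or None)."""
    prev = syllables[i - 1] if i > 0 else None
    nxt = syllables[i + 1] if i + 1 < len(syllables) else None
    return (prev, nxt)
```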

4. Prosody modification

After unit selection, the TTS system adjusts the prosody of the concatenated units [13]: pauses are inserted at the prosodic boundaries, and neighbouring units are overlapped (overlapping) and cross-faded at the concatenation points. The pause lengths are:

Pause lengths (ms): Minor break 50; Major break 250; further values 400, 625, 625, 625, 500, 250, 300.

Within a prosodic word (Prosodic Word), neighbouring units are joined with an overlap (Overlapping) at the concatenation point [8]; overlap lengths of 0.05, 0.1, 0.15, and 0.2 are used depending on the boundary.

Each unit is cut off (cut off) at a PSOLA pitch mark, the trailing samples of the left unit are faded out (fading-out), the leading samples of the right unit are faded in (fading-in), and the two are overlapped and added.
For an overlap region of n samples x_1, x_2, x_3, ..., x_n, the fading-out and fading-in weights are given by equations (1) and (2):

$$x_i(\mathrm{fading\_out}) = x_i \cdot \frac{n - i + 1}{n + 1}, \qquad i = 1 \sim n \qquad (1)$$

$$x_i(\mathrm{fading\_in}) = x_i \cdot \frac{i}{n + 1}, \qquad i = 1 \sim n \qquad (2)$$
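Since the two weights in (1) and (2) sum to one at every sample, the overlap preserves the overall signal level. A minimal sketch of the cross-faded concatenation, assuming numpy for brevity (an assumption, not the paper's code):

```python
# A minimal sketch of the fading-out / fading-in overlap-add
# in equations (1) and (2).
import numpy as np

def crossfade_concat(left, right, n):
    """Concatenate two waveforms with an n-sample cross-fade."""
    i = np.arange(1, n + 1)
    fade_out = (n - i + 1) / (n + 1)      # equation (1)
    fade_in = i / (n + 1)                 # equation (2)
    overlap = left[-n:] * fade_out + right[:n] * fade_in
    return np.concatenate([left[:-n], overlap, right[n:]])

# Usage: join two units with a 100-sample overlap.
# y = crossfade_concat(unit_a, unit_b, 100)
```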

5. Experiments

5.1. Parser evaluation

The parser was trained on the treebank (Version 2.1), which contains 54,902 parse trees; 51,939 trees were used, and the evaluation is an inside test (inside test) [15].



Parser accuracy is evaluated on the tree structure (tree structure), the constituent labels (Label), and the bracketing (Bracket), following the PARSEVAL measures [4][15]:

SP (Structure Precision):
$$\mathrm{SP} = \frac{\#\,\text{correct parsing trees of testing data}}{\#\,\text{treebank parsing trees of testing data}}$$

LP (Labeled Precision):
$$\mathrm{LP} = \frac{\#\,\text{label-correct constituents in parser's parse of testing data}}{\#\,\text{label constituents in parser's parse of testing data}}$$

LR (Labeled Recall):
$$\mathrm{LR} = \frac{\#\,\text{label-correct constituents in parser's parse of testing data}}{\#\,\text{label constituents in treebank's parse of testing data}}$$

LF (Labeled F-measure):
$$\mathrm{LF} = \frac{2 \cdot \mathrm{LP} \cdot \mathrm{LR}}{\mathrm{LP} + \mathrm{LR}}$$

BP (Bracketed Precision):
$$\mathrm{BP} = \frac{\#\,\text{bracket-correct constituents in parser's parse of testing data}}{\#\,\text{bracket constituents in parser's parse of testing data}}$$

BR (Bracketed Recall):
$$\mathrm{BR} = \frac{\#\,\text{bracket-correct constituents in parser's parse of testing data}}{\#\,\text{bracket constituents in treebank's parse of testing data}}$$

BF (Bracketed F-measure):
$$\mathrm{BF} = \frac{2 \cdot \mathrm{BP} \cdot \mathrm{BR}}{\mathrm{BP} + \mathrm{BR}}$$
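A minimal sketch of the labeled and bracketed measures above, representing each constituent as a (label, start, end) span; the set-based matching is a simplification of PARSEVAL rather than the paper's exact scorer.

```python
# A minimal sketch of labeled / bracketed precision, recall and F-measure.
def parseval(parser_constituents, treebank_constituents):
    """Each argument: set of (label, start, end) tuples, pooled over the
       test set.  Assumes both sets are non-empty."""
    gold = set(treebank_constituents)
    pred = set(parser_constituents)

    # Labeled: label and span must both match.
    label_correct = len(pred & gold)
    lp = label_correct / len(pred)
    lr = label_correct / len(gold)
    lf = 2 * lp * lr / (lp + lr) if lp + lr else 0.0

    # Bracketed: only the span (start, end) must match.
    gold_spans = {(s, e) for _, s, e in gold}
    pred_spans = {(s, e) for _, s, e in pred}
    bracket_correct = len(pred_spans & gold_spans)
    bp = bracket_correct / len(pred_spans)
    br = bracket_correct / len(gold_spans)
    bf = 2 * bp * br / (bp + br) if bp + br else 0.0

    return {"LP": lp, "LR": lr, "LF": lf, "BP": bp, "BR": br, "BF": bf}
```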

Parser accuracy (%)

SP       LP       LR       LF       BP       BR       BF
38.78    61.96    64.31    63.11    70.04    72.80    71.39

The bracketed precision and recall of the generated parse trees (parse tree) reach 70.04% and 72.80%, respectively.


5.2. Break prediction

Break prediction is evaluated with a confusion matrix (confusion matrix):
                 Predicted labels
True labels      B0       B1       B2
B0               C00      C01      C02
B1               C10      C11      C12
B2               C20      C21      C22

Here B_i (i = 0, 1, 2) denotes the three break classes: B_0 is no break (no break), B_1 is a minor break (minor break), and B_2 is a major break (major break). C_ii (i = 0, 1, 2) is the number of boundaries of class B_i that are predicted correctly, and C_ij (i, j = 0, 1, 2; i ≠ j) is the number of boundaries whose true label is B_i but whose predicted label is B_j. The recall (Recall) of each class is

$$\mathrm{Rec}_i = \frac{C_{ii}}{\sum_{j=0}^{2} C_{ij}}, \qquad i = 0, 1, 2$$

for example, for B_0 (no break),

$$\mathrm{Rec}_0 = \frac{C_{00}}{C_{00} + C_{01} + C_{02}}.$$

The precision (Precision) of each class is

$$\mathrm{Pre}_i = \frac{C_{ii}}{\sum_{j=0}^{2} C_{ji}}, \qquad i = 0, 1, 2$$

for example, for B_0 (no break),

$$\mathrm{Pre}_0 = \frac{C_{00}}{C_{00} + C_{10} + C_{20}}.$$

The overall accuracy (Accuracy) is

$$\mathrm{Acc} = \frac{\sum_{i=0}^{2} C_{ii}}{\sum_{i=0}^{2}\sum_{j=0}^{2} C_{ij}} = \frac{C_{00} + C_{11} + C_{22}}{C_{00} + C_{01} + C_{02} + C_{10} + C_{11} + C_{12} + C_{20} + C_{21} + C_{22}}.$$
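A minimal sketch that computes these measures from a 3x3 confusion matrix; applied to the CART matrix reported below, it reproduces Acc1 and the per-class precision and recall figures.

```python
# A minimal sketch of recall, precision and accuracy from a 3x3 confusion
# matrix C[i][j] = count of boundaries with true label B_i predicted as B_j.
def break_metrics(C):
    total = sum(sum(row) for row in C)
    acc = sum(C[i][i] for i in range(3)) / total
    rec = [C[i][i] / sum(C[i][j] for j in range(3)) for i in range(3)]
    pre = [C[i][i] / sum(C[j][i] for j in range(3)) for i in range(3)]
    return acc, rec, pre

# CART confusion matrix from the table below.
cart = [[30434, 3198, 126],
        [5758, 6810, 372],
        [635, 1381, 514]]
acc, rec, pre = break_metrics(cart)
print(round(acc, 3), [round(r, 3) for r in rec], [round(p, 3) for p in pre])
# -> 0.767 [0.902, 0.526, 0.203] [0.826, 0.598, 0.508]
```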


Two accuracy figures, Acc1 and Acc2, are reported for each break predictor, together with the precision and recall of the B0, B1, and B2 classes. The CART predictor gives:
Table: CART break prediction

                 Predicted labels
True labels      B0        B1       B2
B0               30,434    3,198    126
B1               5,758     6,810    372
B2               635       1,381    514

Acc1 = 0.767   Acc2 = 0.791
Pre0 = 0.826   Pre1 = 0.598   Pre2 = 0.508
Rec0 = 0.902   Rec1 = 0.526   Rec2 = 0.203

Table: bottom-up break prediction

                 Predicted labels
True labels      B0         B1       B2
B0               156,090    7,384    4,502
B1               54,390     5,471    5,062
B2               7,854      1,828    2,911

Acc1 = 0.669   Acc2 = 0.698
Pre0 = 0.715   Pre1 = 0.372   Pre2 = 0.233
Rec0 = 0.929   Rec1 = 0.084   Rec2 = 0.231

5.3. Listening tests

Naturalness (naturalness) was evaluated with a preference test (preference testing) and intelligibility (intelligibility) with the mean opinion score (MOS, Mean Opinion Score) on a five-point scale: 5 excellent, 4 good, 3 fair, 2 poor, 1 unsatisfactory. Eight listeners (6 male, 2 female) took the MOS test on paragraphs (paragraphs) of length 20, 15, and 25. Within prosodic words (Prosodic Word), 97.2% of the required units could be taken directly from the recorded word database.


Table: MOS results (mean and standard deviation)

        MOS      s.d.     MOS      s.d.
M01     4.05     0.497    …        0.474
M02     3.15     0.963    3.15     0.792
M03     3.3      0.714    3.3      0.640
M04     4.385    0.504    4.67     0.181
M05     3.55     0.668    3.65     0.852
M06     3.9      0.538    …        0.632
M07     4.2      0.748    …        0.707
M08     2.95     0.804    2.95     0.804
Avg.    3.68              3.715

Table: preference test results (%)

M01     40      60
M02     50      50
M03     55      45
M04     50      50
M05     70      30
M06     70      30
M07     50      50
M08     55      45
Avg.    55      45

The MOS of the proposed system is also compared with that of a CART-based system, following the evaluation method for synthetic Mandarin speech of [3]. (Figure: MOS scores on the 1-5 scale.) The MOS values are listed below; the last value in each row is the average, and the third row is the CART-based system.

Table: MOS comparison

…       3.45    3.65    3.55    3.35    3.5     3.9     4.2     4.1     3.71
…       3.48    3.9     3.45    3.95    3.45    4.18    3.88    4.25    3.82
CART    4.66    4.66    4.66    4.66    3.66    4.42    4.33    4.83    4.52

6. Conclusion

This paper has presented a Mandarin TTS system based on a prosodic hierarchy and a large recorded word database; within prosodic words, 97.2% of the required units can be taken directly from the recorded words. An online demonstration is available at http://140.120.15.239/onlineTTS/cgitest.html.

References

[1] Aho, A. V. and Ullman, J. D., "The Theory of Parsing, Translation, and Compiling", Vol. 1, Prentice-Hall, Englewood Cliffs, NJ, 1972.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., et al., "Classification and Regression Trees", Wadsworth, Inc., 1984.
[3] Bao, H., Wang, A., Lu, S., "A Study of Evaluation Method for Synthetic Mandarin Speech", Proceedings of ISCSLP 2002, pp. 383-386, Taipei, Taiwan.
[4] Charniak, E., "Treebank Grammars", In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 1031-1036, AAAI Press/MIT Press, 1996.
[5] Collins, M. J., "Head-Driven Statistical Models for Natural Language Parsing", Ph.D. Thesis, University of Pennsylvania, Philadelphia, 1999.


[6] Chu, M., Peng, H., Yang, H. Y. and Chang, E., "Selecting Non-Uniform Units from A Very Large Corpus for Concatenative Speech Synthesizer", Proceedings of ICASSP 2001, IEEE, Volume 2, pp. 785-788, Salt Lake City.
[7] Ney, H., "Dynamic Programming Parsing for Context-Free Recognition", IEEE Transactions on Signal Processing, 39(2), pp. 336-340, 1991.
[8] Hwang, S. H. and Yei, C. Y., "The Synthesis Unit Generation Algorithm for Mandarin TTS", Proceedings of ICASSP 2002, IEEE, Volume 1, pp. 457-460, Orlando, Florida.
[9] (in Chinese), 1998.
[10] (in Chinese), 2001.
[11] (in Chinese), 2005.
[12] (in Chinese), 2005.
[13] (in Chinese), 2004.
[14] (in Chinese), 2005.
[15] (in Chinese), Proceedings of ROCLING XVI, pp. 141-150, 2004.
[16] (in Chinese), 1998.

Appendix: POS tags used by the parser (CKIP tag set)

Caa, Cab, Cba, Cbb, Da, DE, Dfa, Dfb, Di, Dk, FW, Na, Nb, Nc, Ncd, Nd, Nep, Neqa, Neqb, Nes, Neu, Nf, Ng, Nh, SHI, VA, VAC, VB, VC, VCL, VD, VE, VF, VG, VH, VHC, VI, VJ, VK, VL, V_2
