[Figure 1: Channel operations applied to the example parse tree (node labels PRP, VB, VB1, VB2, TO, NN); panels 3. Inserted, 4. Translated, 5. Channel Output: kare ha ongaku wo kiku no ga daisuki desu]
[Table 1: excerpts from the learned r-table and t-table of model parameter probabilities]
parent=VB node=PRP is the conditioning index. Using this label pair captures, for example, the regularity of inserting case-marker particles. When we decide which word to insert, no conditioning variable is used. That is, a function word like ga is just as likely to be inserted in one place as in any other. In Figure 1, we inserted four words (ha, no, ga, and desu) to create the third tree. The top VB node, the two TO nodes, and the NN node inserted nothing. Therefore, the probability of obtaining the third tree given the second tree is the product of the n-table probabilities over all nodes, which is 3.498e-9.

Finally, we apply the translate operation to each leaf. We assume that this operation depends only on the word itself and that no context is consulted.² The model's t-table specifies the probability for all cases. Suppose we obtained the translations shown in the fourth tree of Figure 1. The probability of the translate operation here is the product of the t-table probabilities t(kare|he), t(daisuki|adores), t(ongaku|music), t(kiku|listening), and t(wo|to).

² When a TM is used in machine translation, the TM's role is to provide a list of possible translations, and a language model addresses the context. See (Berger et al., 1996).

The total probability of the reorder, insert, and translate operations in this example is the product of the reorder probability, the insert probability (3.498e-9), and the translate probability, which comes to 1.828e-11. Note that there are many other combinations of such operations that yield the same Japanese sentence. Therefore, the probability of the Japanese sentence given the English parse tree is the sum of all these probabilities.

We actually obtained the probability tables (Table 1) from a corpus of about two thousand pairs of English parse trees and Japanese sentences, completely automatically. Section 2.3 and the Appendix describe the training algorithm.
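To make the mechanics concrete, here is a minimal sketch of how the three tables might be represented and how the probabilities of individual operations multiply into a derivation probability. All numeric values, key layouts, and helper names are illustrative assumptions, not the trained parameters of Table 1 and not the authors' code.

    # reorder: P(new child order | original child order)
    r_table = {("PRP", "VB1", "VB2"): {("PRP", "VB2", "VB1"): 0.7}}
    # insert position, conditioned on (parent label, node label)
    n_pos = {("VB", "PRP"): {"right": 0.4, "none": 0.5, "left": 0.1}}
    # inserted word, with no conditioning variable
    n_word = {"ha": 0.2, "ga": 0.2, "no": 0.2, "desu": 0.2}
    # translation: P(output word | English word)
    t_table = {"he": {"kare": 0.9}, "adores": {"daisuki": 0.9}}

    def insert_prob(parent, node, position, word=None):
        """n-table probability of one insert decision (position, plus the word if any)."""
        p = n_pos[(parent, node)][position]
        return p if position == "none" else p * n_word[word]

    # One reorder decision, one insert decision, and one translate decision multiply
    # together; a full derivation multiplies such factors over every node of the tree.
    p = (r_table[("PRP", "VB1", "VB2")][("PRP", "VB2", "VB1")]
         * insert_prob("VB", "PRP", "right", "ha")
         * t_table["he"]["kare"])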
2.2 Formal Description

This section formally describes our translation model. To make this paper comparable to (Brown et al., 1993), we use English-French notation in this section. We assume that an English parse tree E is transformed into a French sentence f.

Let the English parse tree E consist of nodes \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n, and let the output French sentence consist of French words f_1, f_2, \ldots, f_m.

Three random variables, N, R, and T, are channel operations applied to each node. Insertion N is an operation that inserts a French word just before or after the node. The insertion can be none, left, or right; it also decides what French word to insert. Reorder R is an operation that changes the order of the children of the node. If a node has three children, for example, there are 3! = 6 ways to reorder them. This operation applies only to non-terminal nodes in the tree. Translation T is an operation that translates a terminal English leaf word into a French word. This operation applies only to terminal nodes. Note that an English word can be translated into a French NULL word.

The notation \theta = \langle \nu, \rho, \tau \rangle stands for a set of values of \langle N, R, T \rangle. \theta_i = \langle \nu_i, \rho_i, \tau_i \rangle is the set of values of the random variables associated with \varepsilon_i, and \theta = \theta_1, \theta_2, \ldots, \theta_n is the set of all random variables associated with a parse tree E = \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n.

The probability of getting a French sentence f given an English parse tree E is

    P(f \mid E) = \sum_{\theta : Str(\theta(E)) = f} P(\theta \mid E)

where Str(\theta(E)) is the sequence of leaf words of a tree transformed by \theta from E.

The probability of having a particular set of values of random variables in a parse tree is

    P(\theta \mid E) = P(\theta_1, \ldots, \theta_n \mid \varepsilon_1, \ldots, \varepsilon_n)
                     = \prod_{i=1}^{n} P(\theta_i \mid \varepsilon_i),

assuming that the operations at each node depend only on the node itself. The random variables \theta_i = \langle \nu_i, \rho_i, \tau_i \rangle are assumed to be independent of each other. We also assume that they are dependent on particular features of the node \varepsilon_i. Then,

    P(\theta_i \mid \varepsilon_i) = P(\nu_i, \rho_i, \tau_i \mid \varepsilon_i)
                                   = P(\nu_i \mid \varepsilon_i) \, P(\rho_i \mid \varepsilon_i) \, P(\tau_i \mid \varepsilon_i)
                                   = P(\nu_i \mid N(\varepsilon_i)) \, P(\rho_i \mid R(\varepsilon_i)) \, P(\tau_i \mid T(\varepsilon_i))
                                   = n(\nu_i \mid N(\varepsilon_i)) \, r(\rho_i \mid R(\varepsilon_i)) \, t(\tau_i \mid T(\varepsilon_i)),

where N, R, and T are the features relevant to \nu, \rho, and \tau. For example, in the previous section the parent label and the node label were used for N, and the node label and the sequence of its children's labels was used for R. The last line in the above formula introduces a change in notation, meaning that the model parameters n(\nu|N), r(\rho|R), and t(\tau|T) are the probabilities P(\nu|N), P(\rho|R), and P(\tau|T), where \nu, \rho, and \tau are the possible values for N, R, and T, respectively.

In summary, the probability of getting a French sentence f given an English parse tree E is

    P(f \mid E) = \sum_{\theta : Str(\theta(E)) = f} \prod_{i=1}^{n} n(\nu_i \mid N(\varepsilon_i)) \, r(\rho_i \mid R(\varepsilon_i)) \, t(\tau_i \mid T(\varepsilon_i))

where E = \varepsilon_1, \ldots, \varepsilon_n and \theta = \theta_1, \ldots, \theta_n = \langle \nu_1, \rho_1, \tau_1 \rangle, \ldots, \langle \nu_n, \rho_n, \tau_n \rangle.

The model parameters n(\nu|N), r(\rho|R), and t(\tau|T), that is, the probabilities P(\nu|N), P(\rho|R), and P(\tau|T), decide the behavior of the translation model, and these are the probabilities we want to estimate from a training corpus.
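As a reading of the summary formula, the sketch below computes P(\theta | E) for one fixed choice of operations by walking the tree and multiplying the three table lookups at each node. The dictionary layout of the tree and of the tables is an assumption made for illustration only, not the paper's implementation.

    def derivation_prob(node, theta, n_table, r_table, t_table):
        """P(theta | E) = prod_i n(nu_i|N(e_i)) * r(rho_i|R(e_i)) * t(tau_i|T(e_i)).
        node: {'id', 'label', 'parent_label', and either 'children' or 'word'};
        theta: node id -> (nu, rho, tau); rho/tau are None where they do not apply."""
        nu, rho, tau = theta[node["id"]]
        p = n_table[(node["parent_label"], node["label"])][nu]   # N(e) = (parent label, node label)
        children = node.get("children", [])
        if children:                                             # reorder: non-terminal nodes only
            child_labels = tuple(c["label"] for c in children)
            p *= r_table[(node["label"], child_labels)][rho]     # R(e) = (label, child label sequence)
            for child in children:
                p *= derivation_prob(child, theta, n_table, r_table, t_table)
        else:                                                    # translate: terminal nodes only
            p *= t_table[node["word"]][tau]                      # T(e) = the English word
        return p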
2.3 Automatic Parameter Estimation

To estimate the model parameters, we use the EM algorithm (Dempster et al., 1977). In each iteration, events are counted and weighted by their probabilities, and the probabilities of events are calculated from the current model parameters. The model parameters are re-estimated based on the counts, and used for the next iteration. In our case, an event is a pair of a value of a random variable (such as \nu, \rho, or \tau) and a feature value (such as N, R, or T). A separate counter is used for each event. Therefore, we need the same number of counters, c(\nu|N), c(\rho|R), and c(\tau|T), as the number of entries in the probability tables, n(\nu|N), r(\rho|R), and t(\tau|T).

The training procedure is the following:

1. Initialize all probability tables: n(\nu|N), r(\rho|R), and t(\tau|T).

2. Reset all counters: c(\nu,N), c(\rho,R), and c(\tau,T).

3. For each pair \langle E, f \rangle in the training corpus, and for all \theta such that f = Str(\theta(E)):

       cnt = P(\theta | E) / \sum_{\theta : Str(\theta(E)) = f} P(\theta | E)
       for i = 1 ... n:
           c(\nu_i, N(\varepsilon_i)) += cnt
           c(\rho_i, R(\varepsilon_i)) += cnt
           c(\tau_i, T(\varepsilon_i)) += cnt

4. For each \langle \nu, N \rangle, \langle \rho, R \rangle, and \langle \tau, T \rangle:

       n(\nu|N) = c(\nu, N) / \sum_{\nu} c(\nu, N)
       r(\rho|R) = c(\rho, R) / \sum_{\rho} c(\rho, R)
       t(\tau|T) = c(\tau, T) / \sum_{\tau} c(\tau, T)

5. Repeat steps 2-4 for several iterations.

A straightforward implementation that tries all possible combinations of parameters \langle \nu, \rho \rangle is very expensive, since there are O(|\nu|^n |\rho|^n) possible combinations, where |\nu| and |\rho| are the number of possible values for \nu and \rho, respectively (\tau is uniquely decided when \nu and \rho are given for a particular \langle E, f \rangle pair). The Appendix describes an efficient implementation that estimates the probability in polynomial time.³ With this efficient implementation, it took about 50 minutes per iteration on our corpus (about two thousand pairs of English parse trees and Japanese sentences; see the next section).

³ Note that the algorithm performs full EM counting, whereas the IBM models only permit counting over a subset of possible alignments.
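The five steps above follow the usual EM template. The sketch below shows one iteration, assuming a helper enumerate_derivations(E, f, tables) that yields every \theta with Str(\theta(E)) = f together with its probability under the current tables; that helper is exactly the expensive part that the Appendix replaces with a polynomial-time graph computation. This is a schematic, not the authors' implementation.

    from collections import defaultdict

    def em_iteration(corpus, tables, enumerate_derivations):
        """One EM iteration over a corpus of (E, f) pairs. `tables` maps "n"/"r"/"t"
        to dicts keyed by (value, feature). `enumerate_derivations(E, f, tables)` is
        assumed to yield (events, prob) for every admissible theta, where `events`
        lists (table_name, value, feature) triples, one per node and operation."""
        counts = {name: defaultdict(float) for name in tables}      # step 2: reset counters
        for E, f in corpus:                                         # step 3: collect weighted counts
            derivations = list(enumerate_derivations(E, f, tables))
            total = sum(prob for _, prob in derivations)            # = P(f | E)
            for events, prob in derivations:
                cnt = prob / total
                for name, value, feature in events:
                    counts[name][(value, feature)] += cnt
        for name in tables:                                         # step 4: re-normalize per feature
            feature_totals = defaultdict(float)
            for (value, feature), c in counts[name].items():
                feature_totals[feature] += c
            for (value, feature), c in counts[name].items():
                tables[name][(value, feature)] = c / feature_totals[feature]
        return tables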
3 Experiment

To experiment, we trained our model on a small English-Japanese corpus. To evaluate performance, we examined alignments produced by the learned model. For comparison, we also trained IBM Model 5 on the same corpus.

3.1 Training

We extracted 2121 translation sentence pairs from a Japanese-English dictionary. These sentences were mostly short ones. The average sentence length was 6.9 for English and 9.7 for Japanese. However, many rare words were used, which made the task difficult. The vocabulary size was 3463 tokens for English and 3983 tokens for Japanese, with 2029 tokens for English and 2507 tokens for Japanese occurring only once in the corpus.

Brill's part-of-speech (POS) tagger (Brill, 1995) and Collins' parser (Collins, 1999) were used to obtain parse trees for the English side of the corpus. The output of Collins' parser was modified in the following way. First, to reduce the number of parameters in the model, each node was re-labelled with the POS of the node's head word, and some POS labels were collapsed. For example, labels for different verb endings (such as VBD for -ed and VBG for -ing) were changed to the same label VB. There were then 30 different node labels, and 474 unique child label sequences.

Second, a subtree was flattened if the node's head-word was the same as the parent's head-word. For example, (NN1 (VB NN2)) was flattened to (NN1 VB NN2) if the VB was a head word for both NN1 and NN2. This flattening was motivated by the various word orders in different languages. An English SVO structure is translated into SOV in Japanese, or into VSO in Arabic. These differences are easily modeled by the flattened subtree (NN1 VB NN2), rather than (NN1 (VB NN2)).
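A minimal sketch of such a flattening pass, on a toy tuple representation of trees; the head-word comparison is abstracted into a predicate supplied by the caller, since head annotations are not shown here.

    def flatten(tree, same_head):
        """Splice a child node into its parent when the two share a head word,
        e.g. ("NN1", ("VB", "NN2")) -> ("NN1", "VB", "NN2"). Leaves are strings."""
        if isinstance(tree, str):
            return tree
        label, *children = tree
        out = []
        for child in (flatten(c, same_head) for c in children):
            if isinstance(child, tuple) and same_head(tree, child):
                out.extend(child)      # the child's label and children move up one level
            else:
                out.append(child)
        return (label, *out)

    # With a predicate that always fires, (NN1 (VB NN2)) becomes (NN1 VB NN2):
    # flatten(("NN1", ("VB", "NN2")), lambda parent, child: True) == ("NN1", "VB", "NN2")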
We ran 20 iterations of the EM algorithm as described in Section 2.3. IBM Model 5 was sequentially bootstrapped with Model 1, an HMM Model, and Model 3 (Och and Ney, 2000). Each preceding model and the final Model 5 were trained with five iterations (total 20 iterations).
3.2 Evaluation

The training procedure resulted in the tables of estimated model parameters. Table 1 in Section 2.1 shows part of those parameters obtained by the training above.

To evaluate performance, we let the models generate the most probable alignment of the training corpus (called the Viterbi alignment). The alignment shows how the learned model induces the internal structure of the training data.

Figure 2 shows alignments produced by our model and by IBM Model 5. Darker lines indicate that the particular alignment link was judged correct by humans. Three humans were asked to rate each alignment as okay (1.0 point), not sure (0.5 point), or wrong (0 point). The darkness of the lines in the figure reflects the human score. We obtained the average score of the first 50 sentence pairs in the corpus. We also counted the number of perfectly aligned sentence pairs in the 50 pairs. Perfect means that all alignments in a sentence pair were judged okay by all the human judges.
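A small sketch of the scoring just described, assuming the ratings are available as one label per judge per alignment link; the data layout is an assumption for illustration.

    RATING = {"okay": 1.0, "not sure": 0.5, "wrong": 0.0}

    def score_corpus(pairs):
        """pairs: list of sentence pairs; each pair is a list of alignment links;
        each link is a list of the judges' ratings. Returns the average link score
        and the number of perfectly aligned sentence pairs."""
        ratings = [RATING[r] for pair in pairs for link in pair for r in link]
        average = sum(ratings) / len(ratings)
        perfect = sum(1 for pair in pairs
                      if all(r == "okay" for link in pair for r in link))
        return average, perfect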
Figure 2: Viterbi Alignments: our model (left) and IBM Model 5 (right). Darker lines are judged more
correct by humans.
The result was the following:

                    Alignment     Perfect
                    ave. score    sents
    Our Model         0.582         10
    IBM Model 5       0.431          0

Our model got a better result compared to IBM Model 5. Note that there were no perfect alignments from the IBM Model. Errors by the IBM Model were spread out over the whole set, while our errors were localized to some sentences. We expect that our model will therefore be easier to improve. Also, localized errors are good if the TM is used for corpus preparation or filtering.

We also measured the training perplexity of the models. The perplexity of our model was 15.79, and that of IBM Model 5 was 9.84. For reference, the perplexity after 5 iterations of Model 1 was 24.01. Perplexity values roughly indicate the predictive power of the model. Generally, lower perplexity means a better model, but it might cause over-fitting to the training data. Since the IBM Model usually requires millions of training sentences, the lower perplexity value for the IBM Model is likely due to over-fitting.
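For reference, training perplexity can be read as the exponentiated average negative log-likelihood of the training pairs (whether the average is per word or per sentence is a normalization choice and is not spelled out above):

    \mathrm{perplexity} \;=\; \exp\!\left( -\frac{1}{W} \sum_{\langle E, f \rangle} \log P(f \mid E) \right)

where W is the normalizing count. Lower values mean the model assigns higher probability to the training data, which is also why over-fitting tends to drive perplexity down.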
4 Conclusion

We have presented a syntax-based translation model that statistically models the translation process from an English parse tree into a foreign-language sentence. The model can make use of syntactic information and performs better for language pairs with different word orders and case-marking schemes. We conducted a small-scale experiment to compare the performance with IBM Model 5, and got better alignment results.
Appendix: An Efficient EM Algorithm

This appendix describes an efficient implementation of the EM algorithm for our translation model. This implementation uses a graph structure for a pair \langle E, f \rangle. A graph node is either a major-node or a subnode. A major-node shows a pairing of a subtree of E and a substring of f. A subnode shows a selection of a value for the subtree-substring pair (Figure 3).

[Figure 3: Graph structure for efficient EM training.]

Let f_k^l be the substring of f starting from the word f_k with length l. Note that this notation is different from (Brown et al., 1993). The subtree \varepsilon_i is the subtree of E below the node \varepsilon_i. We assume that the subtree \varepsilon_1 is E itself.

A major-node v(\varepsilon_i, f_k^l) is a pair of a subtree and a substring. The root of the graph is v(\varepsilon_1, f_1^m), where m is the length of f. Each major-node connects to several \nu-subnodes v(\nu; \varepsilon_i, f_k^l), showing which value of \nu is selected. The arc between v(\varepsilon_i, f_k^l) and v(\nu; \varepsilon_i, f_k^l) has weight P(\nu | \varepsilon_i).

A \nu-subnode v(\nu; \varepsilon_i, f_k^l) connects to a final-node with weight P(\tau | \varepsilon_i) if \varepsilon_i is a terminal node in E. If \varepsilon_i is a non-terminal node, a \nu-subnode connects to several \rho-subnodes v(\rho; \varepsilon_i, f_k^l), showing a selection of a value \rho. The weight of the arc is P(\rho | \varepsilon_i).

A \rho-subnode is then connected to \pi-subnodes v(\pi; \varepsilon_i, f_k^l). The partition variable \pi shows a particular way of partitioning f_k^l. A \pi-subnode v(\pi; \varepsilon_i, f_k^l) is then connected to the major-nodes which correspond to the children of \varepsilon_i and the substrings of f_k^l decided by \pi. A major-node can be connected from different \pi-subnodes. The arc weights between \pi-subnodes and major-nodes are always 1.0.

The counts c(\nu|N), c(\rho|R), and c(\tau|T) for each pair \langle E, f \rangle are also computed on this graph. Those formulae replace step 3 (in Section 2.3) for each training pair, and these counts are used in step 4.

The graph structure is generated by expanding the root node v(\varepsilon_1, f_1^m). The beta probability for each node is first calculated bottom-up, then the alpha probability for each node is calculated top-down. Once the alpha and beta probabilities for each node are obtained, the counts are calculated as above and used for updating the parameters.

The complexity of this training algorithm is polynomial, with a cubic term whose cube comes from the number of parse tree nodes and the number of possible French substrings.
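The bottom-up beta pass is an inside computation on a sum/product graph: major-nodes, \nu-subnodes, and \rho-subnodes sum over their weighted alternatives, while \pi-subnodes multiply the betas of their child major-nodes (arc weight 1.0). The sketch below computes such a beta pass over an already-built graph; the graph construction itself (which subnodes and partitions exist for a given \langle E, f \rangle pair) is omitted, and the node class is an illustrative assumption, not the authors' data structure.

    class GNode:
        """'sum' nodes add up weighted children (major-, nu-, rho-subnodes);
        'product' nodes multiply their children (pi-subnodes, arc weight 1.0);
        'leaf' nodes carry a probability (e.g. a final-node with its t-table weight)."""
        def __init__(self, kind, children=(), weights=(), prob=1.0):
            self.kind = kind
            self.children = tuple(children)
            self.weights = tuple(weights) or (1.0,) * len(self.children)
            self.prob = prob

    def beta(node, cache=None):
        """Inside probability, computed bottom-up with memoization over the DAG."""
        cache = {} if cache is None else cache
        if id(node) not in cache:
            if node.kind == "leaf":
                cache[id(node)] = node.prob
            elif node.kind == "sum":
                cache[id(node)] = sum(w * beta(c, cache)
                                      for w, c in zip(node.weights, node.children))
            else:  # "product"
                b = 1.0
                for c in node.children:
                    b *= beta(c, cache)
                cache[id(node)] = b
        return cache[id(node)]

Beta at the root node corresponds to the total probability P(f | E); an analogous top-down pass gives the alphas used in the count formulae.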
Acknowledgments

This work was supported by DARPA-ITO grant N66001-00-1-9814.

References

H. Alshawi, S. Bangalore, and S. Douglas. 2000. Learning dependency translation models as collections of finite state head transducers. Computational Linguistics, 26(1).

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

A. Dempster, N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Royal Statistical Society Series B, 39.

M. Franz, J. McCarley, and R. Ward. 1999. Ad hoc, cross-language and spoken document information retrieval at IBM. In TREC-8.

D. Jones and R. Havrilla. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In AMTA-98.

I. Melamed. 2000. Models of translational equivalence among words. Computational Linguistics, 26(2).

F. Och and H. Ney. 2000. Improved statistical alignment models. In ACL-2000.

F. Och, C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In EMNLP-99.

P. Resnik and I. Melamed. 1997. Semi-automatic acquisition of domain-specific translation lexicons. In ANLP-97.

Y. Wang. 1998. Grammar Inference and Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.

D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3).