
A Syntax-based Statistical Translation Model

Kenji Yamada and Kevin Knight


Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292

{kyamada,knight}@isi.edu

Abstract

We present a syntax-based statistical translation model. Our model transforms a source-language parse tree into a target-language string by applying stochastic operations at each node. These operations capture linguistic differences such as word order and case marking. Model parameters are estimated in polynomial time using an EM algorithm. The model produces word alignments that are better than those produced by IBM Model 5.

1 Introduction

A statistical translation model (TM) is a mathematical model in which the process of human-language translation is statistically modeled. Model parameters are automatically estimated using a corpus of translation pairs. TMs have been used for statistical machine translation (Berger et al., 1996), word alignment of a translation corpus (Melamed, 2000), multilingual document retrieval (Franz et al., 1999), automatic dictionary construction (Resnik and Melamed, 1997), and data preparation for word sense disambiguation programs (Brown et al., 1991). Developing a better TM is a fundamental issue for those applications.

Researchers at IBM first described such a statistical TM in (Brown et al., 1988). Their models are based on a string-to-string noisy channel model. The channel converts a sequence of words in one language (such as English) into another (such as French). The channel operations are movements, duplications, and translations, applied to each word independently. The movement is conditioned only on word classes and positions in the string, and the duplication and translation are conditioned only on the word identity. Mathematical details are fully described in (Brown et al., 1993).

One criticism of the IBM-style TM is that it does not model structural or syntactic aspects of the language. The TM was only demonstrated for a structurally similar language pair (English and French). It has been suspected that a language pair with very different word order such as English and Japanese would not be modeled well by these TMs.

To incorporate structural aspects of the language, our channel model accepts a parse tree as an input, i.e., the input sentence is preprocessed by a syntactic parser. The channel performs operations on each node of the parse tree. The operations are reordering child nodes, inserting extra words at each node, and translating leaf words. Figure 1 shows the overview of the operations of our model. Note that the output of our model is a string, not a parse tree. Therefore, parsing is only needed on the channel input side.

The reorder operation is intended to model translation between languages with different word orders, such as SVO-languages (English or Chinese) and SOV-languages (Japanese or Turkish). The word-insertion operation is intended to capture linguistic differences in specifying syntactic cases. E.g., English and French use structural position to specify case, while Japanese and Korean use case-marker particles.

Wang (1998) enhanced the IBM models by introducing phrases, and Och et al. (1999) used templates to capture phrasal sequences in a sentence. Both also tried to incorporate structural aspects of the language, however, neither handles nested structures.
[Figure 1 (tree drawings not reproduced) shows the example parse tree at five stages: 1. Channel Input, 2. Reordered, 3. Inserted, 4. Translated, and 5. Channel Output: kare ha ongaku wo kiku no ga daisuki desu.]

Figure 1: Channel Operations: Reorder, Insert, and Translate
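For readers who want to follow the example in code, the channel-input tree of Figure 1 can be written down as a small data structure. The labels and bracketing below are taken from the running example; the Node class itself is only an illustrative encoding assumed for the sketches in this document, not the authors' implementation.

    # A minimal encoding of the channel-input tree from Figure 1 (Python).
    from dataclasses import dataclass, field
    from typing import List, Optional


    @dataclass
    class Node:
        label: str                      # syntactic label, e.g. "VB", "PRP", "TO"
        word: Optional[str] = None      # leaf word, None for internal nodes
        children: List["Node"] = field(default_factory=list)

        def leaves(self) -> List[str]:
            """Return the leaf words left to right (the channel-input string)."""
            if self.word is not None:
                return [self.word]
            return [w for c in self.children for w in c.leaves()]


    # (VB (PRP he) (VB1 adores) (VB2 (VB listening) (TO (TO to) (NN music))))
    example_tree = Node("VB", children=[
        Node("PRP", word="he"),
        Node("VB1", word="adores"),
        Node("VB2", children=[
            Node("VB", word="listening"),
            Node("TO", children=[
                Node("TO", word="to"),
                Node("NN", word="music"),
            ]),
        ]),
    ])

    print(example_tree.leaves())                      # ['he', 'adores', 'listening', 'to', 'music']
    print([c.label for c in example_tree.children])   # ['PRP', 'VB1', 'VB2']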

Wu (1997) and Alshawi et al. (2000) showed statistical models based on syntactic structure. The way we handle syntactic parse trees is inspired by their work, although their approach is not to model the translation process, but to formalize a model that generates two languages at the same time. Our channel operations are also similar to the mechanism in Twisted Pair Grammar (Jones and Havrilla, 1998) used in their knowledge-based system.

Following (Brown et al., 1993) and the other literature on TMs, this paper focuses only on the details of the TM. Applications of our TM, such as machine translation or dictionary construction, will be described in a separate paper. Section 2 describes our model in detail. Section 3 shows experimental results. We conclude with Section 4, followed by an Appendix describing the training algorithm in more detail.

2 The Model

2.1 An Example

We first introduce our translation model with an example. Section 2.2 will describe the model more formally. We assume that an English parse tree is fed into a noisy channel and that it is translated to a Japanese sentence.¹

¹ The parse tree is flattened to work well with the model. See Section 3.1 for details.

Figure 1 shows how the channel works. First, child nodes on each internal node are stochastically reordered. A node with n children has n! possible reorderings. The probability of taking a specific reordering is given by the model's r-table. Sample model parameters are shown in Table 1. We assume that only the sequence of child node labels influences the reordering. In Figure 1, the top VB node has a child sequence PRP-VB1-VB2. The probability of reordering it into PRP-VB2-VB1 is 0.723 (the second row in the r-table in Table 1). We also reorder VB-TO into TO-VB, and TO-NN into NN-TO, so the probability of the second tree in Figure 1 is 0.723 × 0.749 × 0.893 = 0.484.
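To make the reorder step concrete, here is a minimal sketch of how an r-table keyed by child-label sequences might drive it. Only the 0.723 entry is quoted from the text above; the other values and the function names are hypothetical, and this is an illustration rather than the authors' implementation.

    import random

    # Hypothetical r-table: original child-label sequence -> distribution over
    # reordered sequences.  Only the 0.723 entry is quoted from the paper; the
    # 0.074 entry and the omitted permutations are made up for illustration.
    r_table = {
        ("PRP", "VB1", "VB2"): {
            ("PRP", "VB2", "VB1"): 0.723,
            ("PRP", "VB1", "VB2"): 0.074,
            # ... the remaining four permutations share the rest of the mass
        },
    }

    def reorder_probability(original, reordered):
        """P(reordered | original) according to the r-table."""
        return r_table.get(tuple(original), {}).get(tuple(reordered), 0.0)

    def sample_reordering(original, rng=random.Random(0)):
        """Stochastically pick one of the n! reorderings of the children."""
        dist = r_table[tuple(original)]
        seqs = list(dist)
        return list(rng.choices(seqs, weights=list(dist.values()), k=1)[0])

    print(reorder_probability(["PRP", "VB1", "VB2"], ["PRP", "VB2", "VB1"]))  # 0.723
    print(sample_reordering(["PRP", "VB1", "VB2"]))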
[Table 1 gives sample entries from the three parameter tables: the r-table (reorder probabilities over child-label sequences, e.g. r(PRP VB2 VB1 | PRP VB1 VB2) = 0.723), the n-table (split into insert-position probabilities conditioned on the parent and node labels, and unconditioned insert-word probabilities), and the t-table (word-translation probabilities). The individual numeric entries are not reproduced here.]

Table 1: Model Parameter Tables

Next, an extra word is stochastically inserted at each node. A word can be inserted either to the left of the node, to the right of the node, or nowhere. Brown et al. (1993) assume that there is an invisible NULL word in the input sentence and that it generates output words that are distributed into random positions. Here, we instead decide the position on the basis of the nodes of the input parse tree. The insertion probability is determined by the n-table. For simplicity, we split the n-table into two: a table for insert positions and a table for words to be inserted (Table 1). The node's label and its parent's label are used to index the table for insert positions. For example, the PRP node in Figure 1 has parent VB, thus parent=VB node=PRP is the conditioning index. Using this label pair captures, for example, the regularity of inserting case-marker particles. When we decide which word to insert, no conditioning variable is used. That is, a function word like ga is just as likely to be inserted in one place as any other. In Figure 1, we inserted four words (ha, no, ga and desu) to create the third tree. The top VB node, the two TO nodes, and the NN node inserted nothing. Therefore, the probability of obtaining the third tree given the second tree is the product of the n-table entries for these decisions (an insert position and an inserted word for each of the four insertions, and the probability of inserting nothing at each of the other four nodes); in this example the product is 3.498e-9.

Finally, we apply the translate operation to each leaf. We assume that this operation is dependent only on the word itself and that no context is consulted.² The model's t-table specifies the probability for all cases. Suppose we obtained the translations shown in the fourth tree of Figure 1. The probability of the translate operation here is the product of the five t-table entries for the leaf words, which is 0.0108.

² When a TM is used in machine translation, the TM's role is to provide a list of possible translations, and a language model addresses the context. See (Berger et al., 1996).

The total probability of the reorder, insert and translate operations in this example is 0.484 × 3.498e-9 × 0.0108 = 1.828e-11. Note that there are many other combinations of such operations that yield the same Japanese sentence. Therefore, the probability of the Japanese sentence given the English parse tree is the sum of all these probabilities.

We actually obtained the probability tables (Table 1) from a corpus of about two thousand pairs of English parse trees and Japanese sentences, completely automatically. Section 2.3 and the Appendix describe the training algorithm.
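Like the reorder step, the insert step can be sketched in a few lines before we turn to the formal description. The split of the n-table into a position table conditioned on (parent label, node label) and an unconditioned word table follows the description above; all numeric values, keys, and function names here are hypothetical placeholders, not the estimated parameters.

    import random

    # Hypothetical n-table, split as described in the text:
    #   position table: (parent_label, node_label) -> P(none | left | right)
    #   word table:     unconditioned P(word) over insertable function words
    insert_position_table = {
        ("VB", "PRP"): {"none": 0.30, "left": 0.10, "right": 0.60},
        ("VB", "VB1"): {"none": 0.80, "left": 0.05, "right": 0.15},
    }
    insert_word_table = {"ha": 0.30, "ga": 0.25, "wo": 0.20, "no": 0.15, "desu": 0.10}

    def insertion_probability(parent_label, node_label, position, word=None):
        """P(decision) = P(position | parent, node) * P(word); no word factor
        when nothing is inserted."""
        p_pos = insert_position_table[(parent_label, node_label)][position]
        if position == "none":
            return p_pos
        return p_pos * insert_word_table.get(word, 0.0)

    def sample_insertion(parent_label, node_label, rng=random.Random(0)):
        pos_dist = insert_position_table[(parent_label, node_label)]
        position = rng.choices(list(pos_dist), weights=list(pos_dist.values()), k=1)[0]
        if position == "none":
            return position, None
        word = rng.choices(list(insert_word_table),
                           weights=list(insert_word_table.values()), k=1)[0]
        return position, word

    # e.g. probability of inserting "ha" to the right of the PRP node under VB
    print(insertion_probability("VB", "PRP", "right", "ha"))   # 0.6 * 0.3 = 0.18
    print(sample_insertion("VB", "PRP"))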
2.2 Formal Description

This section formally describes our translation model. To make this paper comparable to (Brown et al., 1993), we use English-French notation in this section. We assume that an English parse tree ε is transformed into a French sentence f.

Let the English parse tree ε consist of nodes ε_1, ε_2, ..., ε_n, and let the output French sentence consist of French words f_1, f_2, ..., f_m.

Three random variables, N, R, and T, are channel operations applied to each node. Insertion N is an operation that inserts a French word just before or after the node. The insertion can be none, left, or right. Also it decides what French word to insert. Reorder R is an operation that changes the order of the children of the node. If a node has three children, e.g., there are 3! = 6 ways to reorder them. This operation applies only to non-terminal nodes in the tree. Translation T is an operation that translates a terminal English leaf word into a French word. This operation applies only to terminal nodes. Note that an English word can be translated into a French NULL word.

The notation θ = ⟨ν, ρ, τ⟩ stands for a set of values of ⟨N, R, T⟩. θ_i = ⟨ν_i, ρ_i, τ_i⟩ is a set of values of random variables associated with ε_i. And θ = θ_1, θ_2, ..., θ_n is the set of all random variables associated with a parse tree ε = ε_1, ε_2, ..., ε_n.

The probability of getting a French sentence f given an English parse tree ε is

    P(f | ε) = Σ_{θ : Str(θ(ε)) = f} P(θ | ε)

where Str(θ(ε)) is the sequence of leaf words of a tree transformed by θ from ε.

The probability of having a particular set of values of random variables in a parse tree is

    P(θ | ε) = P(θ_1, θ_2, ..., θ_n | ε_1, ε_2, ..., ε_n)
             = Π_{i=1..n} P(θ_i | θ_1, ..., θ_{i-1}, ε_1, ..., ε_n)

This is an exact equation. Then, we assume that a transform operation is independent from other transform operations, and that the random variables of each node are determined only by the node itself. So, we obtain

    P(θ | ε) = Π_{i=1..n} P(θ_i | ε_i)

The random variables θ_i = ⟨ν_i, ρ_i, τ_i⟩ are assumed to be independent of each other. We also assume that they are dependent on particular features of the node ε_i. Then,

    P(θ_i | ε_i) = P(ν_i, ρ_i, τ_i | ε_i)
                 = P(ν_i | ε_i) P(ρ_i | ε_i) P(τ_i | ε_i)
                 = P(ν_i | N(ε_i)) P(ρ_i | R(ε_i)) P(τ_i | T(ε_i))
                 = n(ν_i | N(ε_i)) r(ρ_i | R(ε_i)) t(τ_i | T(ε_i))

where N(ε_i), R(ε_i), and T(ε_i) are the features of ε_i relevant to ν, ρ, and τ, respectively. For example, we saw that the parent node label and the node label were used for N, and the syntactic category sequence of children was used for R. The last line in the above formula introduces a change in notation, meaning that those probabilities are the model parameters n(ν|N), r(ρ|R), and t(τ|T), where N, R, and T range over the possible feature values.

In summary, the probability of getting a French sentence f given an English parse tree ε is

    P(f | ε) = Σ_{θ : Str(θ(ε)) = f} P(θ | ε)
             = Σ_{θ : Str(θ(ε)) = f} Π_{i=1..n} n(ν_i | N(ε_i)) r(ρ_i | R(ε_i)) t(τ_i | T(ε_i))

where ε = ε_1, ε_2, ..., ε_n and f = f_1, f_2, ..., f_m. The model parameters n(ν|N), r(ρ|R), and t(τ|T), that is, the probabilities P(ν|N), P(ρ|R), and P(τ|T), decide the behavior of the translation model, and these are the probabilities we want to estimate from a training corpus.
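Before turning to parameter estimation, it may help to see the summation above written out naively. The sketch below enumerates every combination of channel decisions for a tiny tree and sums the probabilities of those whose yield equals f. To keep it short, insertions are omitted and NULL translations simply produce no output word; all table values and the data layout are hypothetical. The enumeration is exponential in the number of nodes, which is exactly why Section 2.3 and the Appendix compute the same quantities in polynomial time.

    from itertools import product

    # Toy tree: (label, [children]) for internal nodes, (label, word) for leaves.
    tree = ("VB", [("PRP", "he"), ("VB1", "adores")])

    # Hypothetical parameter tables (insertions omitted to keep the sketch short).
    r_table = {("PRP", "VB1"): {("PRP", "VB1"): 0.3, ("VB1", "PRP"): 0.7}}
    t_table = {"he": {"kare": 0.9, "NULL": 0.1}, "adores": {"daisuki": 0.8, "NULL": 0.2}}

    def is_leaf(node):
        return isinstance(node[1], str)

    def decisions(node):
        """Yield (output_words, probability) for every combination of channel
        decisions below this node."""
        label, content = node
        if is_leaf(node):
            for word, p in t_table[content].items():
                yield ([] if word == "NULL" else [word]), p
            return
        labels = tuple(c[0] for c in content)
        for order, p_r in r_table[labels].items():
            reordered = [content[labels.index(l)] for l in order]   # labels unique here
            for parts in product(*(decisions(c) for c in reordered)):
                words = [w for ws, _ in parts for w in ws]
                prob = p_r
                for _, p in parts:
                    prob *= p
                yield words, prob

    def sentence_probability(tree, f):
        """P(f | tree) = sum over all decision sequences whose yield equals f."""
        return sum(p for words, p in decisions(tree) if words == f)

    print(sentence_probability(tree, ["kare", "daisuki"]))   # 0.3 * 0.9 * 0.8 = 0.216
    print(sentence_probability(tree, ["daisuki", "kare"]))   # 0.7 * 0.8 * 0.9 = 0.504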

2.3 Automatic Parameter Estimation

To estimate the model parameters, we use the EM algorithm (Dempster et al., 1977). The algorithm iteratively updates the model parameters to maximize the likelihood of the training corpus. First, the model parameters are initialized. We used a uniform distribution, but it can be a distribution taken from other models. For each iteration, the number of events is counted, weighted by the probabilities of the events. The probabilities of events are calculated from the current model parameters. The model parameters are re-estimated based on the counts, and used for the next iteration. In our case, an event is a pair of a value of a random variable (such as ν, ρ, or τ) and a feature value (such as N, R, or T). A separate counter is used for each event. Therefore, we need the same number of counters, c(ν|N), c(ρ|R), and c(τ|T), as the number of entries in the probability tables, n(ν|N), r(ρ|R), and t(τ|T).

The training procedure is the following:

1. Initialize all probability tables: n(ν|N), r(ρ|R), and t(τ|T).

2. Reset all counters: c(ν|N), c(ρ|R), and c(τ|T).

3. For each pair ⟨ε, f⟩ in the training corpus,
   For all θ such that f = Str(θ(ε)),
     Let cnt = P(θ|ε) / Σ_{θ : Str(θ(ε)) = f} P(θ|ε)
     For i = 1, ..., n,
       c(ν_i | N(ε_i)) += cnt
       c(ρ_i | R(ε_i)) += cnt
       c(τ_i | T(ε_i)) += cnt

4. For each ⟨ν, N⟩, ⟨ρ, R⟩, and ⟨τ, T⟩,
     n(ν|N) = c(ν|N) / Σ_ν c(ν|N)
     r(ρ|R) = c(ρ|R) / Σ_ρ c(ρ|R)
     t(τ|T) = c(τ|T) / Σ_τ c(τ|T)

5. Repeat steps 2-4 for several iterations.

A straightforward implementation that tries all possible combinations of parameters ⟨ν, ρ, τ⟩ is very expensive, since there are O(|ν|^n |ρ|^n) possible combinations, where |ν| and |ρ| are the number of possible values for ν and ρ, respectively (τ is uniquely decided when ν and ρ are given for a particular ⟨ε, f⟩ pair). The Appendix describes an efficient implementation that estimates the probability in polynomial time.³ With this efficient implementation, it took about 50 minutes per iteration on our corpus (about two thousand pairs of English parse trees and Japanese sentences; see the next section).

³ Note that the algorithm performs full EM counting, whereas the IBM models only permit counting over a subset of possible alignments.

3 Experiment

To experiment, we trained our model on a small English-Japanese corpus. To evaluate performance, we examined alignments produced by the learned model. For comparison, we also trained IBM Model 5 on the same corpus.

3.1 Training

We extracted 2121 translation sentence pairs from a Japanese-English dictionary. These sentences were mostly short ones. The average sentence length was 6.9 for English and 9.7 for Japanese. However, many rare words were used, which made the task difficult. The vocabulary size was 3463 tokens for English, and 3983 tokens for Japanese, with 2029 tokens for English and 2507 tokens for Japanese occurring only once in the corpus.

Brill's part-of-speech (POS) tagger (Brill, 1995) and Collins' parser (Collins, 1999) were used to obtain parse trees for the English side of the corpus. The output of Collins' parser was modified in the following way. First, to reduce the number of parameters in the model, each node was re-labelled with the POS of the node's head word, and some POS labels were collapsed. For example, labels for different verb endings (such as VBD for -ed and VBG for -ing) were changed to the same label VB. There were then 30 different node labels, and 474 unique child label sequences.

Second, a subtree was flattened if the node's head word was the same as the parent's head word. For example, (NN1 (VB NN2)) was flattened to (NN1 VB NN2) if the VB was a head word for both NN1 and NN2. This flattening was motivated by various word orders in different languages. An English SVO structure is translated into SOV in Japanese, or into VSO in Arabic. These differences are easily modeled by the flattened subtree (NN1 VB NN2), rather than (NN1 (VB NN2)).

We ran 20 iterations of the EM algorithm as described in Section 2.3. IBM Model 5 was sequentially bootstrapped with Model 1, an HMM Model, and Model 3 (Och and Ney, 2000). Each preceding model and the final Model 5 were trained with five iterations (total 20 iterations).

3.2 Evaluation

The training procedure resulted in the tables of estimated model parameters. Table 1 in Section 2.1 shows part of those parameters obtained by the training above.

To evaluate performance, we let the models generate the most probable alignment of the training corpus (called the Viterbi alignment). The alignment shows how the learned model induces the internal structure of the training data.

Figure 2 shows alignments produced by our model and IBM Model 5. Darker lines indicate that the particular alignment link was judged correct by humans. Three humans were asked to rate each alignment as okay (1.0 point), not sure (0.5 point), or wrong (0 point). The darkness of the lines in the figure reflects the human score. We obtained the average score of the first 50 sentence pairs in the corpus. We also counted the number of perfectly aligned sentence pairs in the 50 pairs. Perfect means that all alignments in a sentence pair were judged okay by all the human judges.
[Figure 2 (alignment drawings not reproduced) shows Viterbi alignments for four sentence pairs: "he adores listening to music", "hypocrisy is abhorrent to them", "he has unusual ability in english", and "he was ablaze with anger", each aligned by our model (left) and by IBM Model 5 (right).]

Figure 2: Viterbi Alignments: our model (left) and IBM Model 5 (right). Darker lines are judged more correct by humans.

The result was the following:

                   Alignment     Perfect
                   ave. score    sents
    Our Model        0.582         10
    IBM Model 5      0.431          0

Our model got a better result compared to IBM Model 5. Note that there were no perfect alignments from the IBM Model. Errors by the IBM Model were spread out over the whole set, while our errors were localized to some sentences. We expect that our model will therefore be easier to improve. Also, localized errors are good if the TM is used for corpus preparation or filtering.

We also measured the training perplexity of the models. The perplexity of our model was 15.79, and that of IBM Model 5 was 9.84. For reference, the perplexity after 5 iterations of Model 1 was 24.01. Perplexity values roughly indicate the predictive power of the model. Generally, lower perplexity means a better model, but it might cause over-fitting to the training data. Since the IBM Model usually requires millions of training sentences, the lower perplexity value for the IBM Model is likely due to over-fitting.

4 Conclusion

We have presented a syntax-based translation model that statistically models the translation process from an English parse tree into a foreign-language sentence. The model can make use of syntactic information and performs better for language pairs with different word orders and case marking schema. We conducted a small-scale experiment to compare the performance with IBM Model 5, and got better alignment results.

Appendix: An Efficient EM algorithm

This appendix describes an efficient implementation of the EM algorithm for our translation model. This implementation uses a graph structure for a pair ⟨ε, f⟩. A graph node is either a major-node or a subnode. A major-node shows a pairing of a subtree of ε and a substring of f. A subnode shows a selection of a value (ν, ρ, or π) for the subtree-substring pair (Figure 3).

Let f_k^{k+l-1} = f_k ... f_{k+l-1} be a substring of f starting from the word f_k with length l. Note that this notation is different from (Brown et al., 1993). A subtree ε_i is the subtree of ε below the node ε_i. We assume that the subtree ε_1 is ε.

A major-node (ε_i, f_k^{k+l-1}) is a pair of a subtree and a substring. The root of the graph is (ε_1, f_1^m), where m is the length of f. Each major-node connects to several ν-subnodes (ν; ε_i, f_k^{k+l-1}), showing which value of ν is selected. The arc between (ε_i, f_k^{k+l-1}) and (ν; ε_i, f_k^{k+l-1}) has weight P(ν|ε_i).
A ν-subnode (ν; ε_i, f_k^{k+l-1}) connects to a final-node with weight P(τ|ε_i) if ε_i is a terminal node in ε. If ε_i is a non-terminal node, a ν-subnode connects to several ρ-subnodes (ρ; ν, ε_i, f_k^{k+l-1}), showing a selection of a value ρ. The weight of the arc is P(ρ|ε_i).

A ρ-subnode is then connected to π-subnodes (π; ρ, ν, ε_i, f_k^{k+l-1}). The partition variable, π, shows a particular way of partitioning f_k^{k+l-1}. A π-subnode is then connected to major-nodes which correspond to the children of ε_i and the substrings of f_k^{k+l-1} decided by ⟨ρ, π⟩. A major-node can be connected from different π-subnodes. The arc weights between π-subnodes and major-nodes are always 1.0.

[Figure 3 (graph drawing not reproduced) illustrates this structure: a major-node branching into ν-subnodes with arc weights P(ν|ε), ν-subnodes branching into ρ-subnodes with arc weights P(ρ|ε), and π-subnodes leading to the major-nodes of the children.]

Figure 3: Graph structure for efficient EM training.

This graph structure makes it easy to obtain P(θ|ε) for a particular θ and Σ_{θ : Str(θ(ε)) = f} P(θ|ε). A trace starting from the graph root, selecting one of the arcs from major-nodes, ν-subnodes, and ρ-subnodes, and all the arcs from π-subnodes, corresponds to a particular θ, and the product of the weights on the trace corresponds to P(θ|ε). Note that a trace forms a tree, making branches at the π-subnodes.

We define an alpha probability and a beta probability for each major-node, in analogy with the measures used in the inside-outside algorithm for probabilistic context-free grammars (Baker, 1979). The alpha probability (outside probability) is a path probability from the graph root to the node and the side branches of the node. The beta probability (inside probability) is a path probability below the node.

Figure 4 shows the formulae for the alpha and beta probabilities. From these definitions, Σ_{θ : Str(θ(ε)) = f} P(θ|ε) is obtained as the beta probability of the graph root. The counts c(ν|N), c(ρ|R), and c(τ|T) for each pair ⟨ε, f⟩ are also given in the figure. Those formulae replace step 3 (in Section 2.3) for each training pair, and the counts are used in step 4.

The graph structure is generated by expanding the root node (ε_1, f_1^m). The beta probability for each node is first calculated bottom-up, then the alpha probability for each node is calculated top-down. Once the alpha and beta probabilities for each node are obtained, the counts are calculated as above and used for updating the parameters.

The complexity of this training algorithm is O(n³ |ν| |ρ| |π|). The cube comes from the number of parse tree nodes (n) and the number of possible French substrings (n²).
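The polynomial-time behaviour comes from sharing: a major-node (subtree, substring) is computed once and reused by every trace that passes through it. The sketch below shows the corresponding inside (beta) computation with memoization for a stripped-down version of the model in which each internal node has two children that are either kept in order or swapped, and each leaf translates to exactly one output word. Insertions, general reorderings, NULL translations, and the alpha/count formulae of Figure 4 are left out, and all table values are hypothetical; the point is only to convey the dynamic-programming structure, not to reproduce the authors' algorithm.

    from functools import lru_cache

    # Toy tree: leaves are (label, word); internal nodes are (label, (left, right)).
    tree = ("VB", (("PRP", "he"),
                   ("VB2", (("VB", "listening"), ("NN", "music")))))

    # Hypothetical tables.  r_table: child-label pair -> P(keep order); the swap
    # probability is 1 - keep.  t_table: English word -> {output word: prob}.
    r_table = {("PRP", "VB2"): 0.4, ("VB", "NN"): 0.2}
    t_table = {"he": {"kare": 1.0},
               "listening": {"kiku": 0.7, "kikimasu": 0.3},
               "music": {"ongaku": 1.0}}

    def total_probability(f):
        """Sum of P(theta | tree) over all decision sequences yielding f,
        computed as the inside (beta) probability of the root major-node."""
        f = tuple(f)

        @lru_cache(maxsize=None)
        def beta(node, k, l):
            """Inside probability that node's subtree yields f[k:k+l]."""
            label, content = node
            if isinstance(content, str):                    # leaf: one output word
                return t_table[content].get(f[k], 0.0) if l == 1 else 0.0
            left, right = content
            p_keep = r_table[(left[0], right[0])]
            total = 0.0
            for first, second, p_order in ((left, right, p_keep),
                                           (right, left, 1.0 - p_keep)):
                for j in range(1, l):                       # partition of the span
                    total += p_order * beta(first, k, j) * beta(second, k + j, l - j)
            return total

        return beta(tree, 0, len(f))

    print(total_probability(["ongaku", "kiku", "kare"]))    # 0.6 * 0.8 * 0.7 = 0.336

In a full implementation one would also compute the alpha (outside) probabilities top-down and combine them with the betas to collect the fractional counts, exactly as the paragraphs above describe.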
[Figure 4 gives the recursive formulae for the alpha and beta probabilities over major-nodes, and the derivation of the counts c(ν|N), c(ρ|R), and c(τ|T) from them; the equations are not reproduced here.]

Figure 4: Formulae for alpha-beta probabilities, and the count derivation

Acknowledgments

This work was supported by DARPA-ITO grant N66001-00-1-9814.
References

H. Alshawi, S. Bangalore, and S. Douglas. 2000. Learning dependency translation models as collections of finite state head transducers. Computational Linguistics, 26(1).

J. Baker. 1979. Trainable grammars for speech recognition. In Speech Communication Papers for the 97th Meeting of the Acoustical Society of America.

A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, J. Gillett, J. Lafferty, R. Mercer, H. Printz, and L. Ures. 1996. Language Translation Apparatus and Method Using Context-Based Translation Models. U.S. Patent 5,510,981.

E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4).

P. Brown, J. Cocke, S. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. 1988. A statistical approach to language translation. In COLING-88.

P. Brown, J. Cocke, S. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. 1991. Word-sense disambiguation using statistical methods. In ACL-91.

P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2).

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

A. Dempster, N. Laird, and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Royal Statistical Society Series B, 39.

M. Franz, J. McCarley, and R. Ward. 1999. Ad hoc, cross-language and spoken document information retrieval at IBM. In TREC-8.

D. Jones and R. Havrilla. 1998. Twisted pair grammar: Support for rapid development of machine translation for low density languages. In AMTA-98.

I. Melamed. 2000. Models of translational equivalence among words. Computational Linguistics, 26(2).

F. Och and H. Ney. 2000. Improved statistical alignment models. In ACL-2000.

F. Och, C. Tillmann, and H. Ney. 1999. Improved alignment models for statistical machine translation. In EMNLP-99.

P. Resnik and I. Melamed. 1997. Semi-automatic acquisition of domain-specific translation lexicons. In ANLP-97.

Y. Wang. 1998. Grammar Inference and Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.

D. Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3).
