
Monolingual Machine Translation for Paraphrase Generation

This paper concentrates on applying statistical machine translation (SMT) to generate novel paraphrases of input sentences within a single language.
The ability to recognize word sequences that mean the same thing is of huge importance to applications such as search, summarization, and question answering. This paper uses SMT in the noisy-channel framework of Brown et al.
The main goal is to show that SMT can be extended to create paraphrases given a sufficient amount of monolingual parallel data. Translation-based approaches to paraphrase generation have limited scalability owing to the difficulty of finding large quantities of multiply-translated source documents.
Multi-sequence alignments (MSA) are used to identify sentences that share formal properties.
The training corpus consists mainly of different news stories reporting the same event. This paper also uses Levenshtein distance to extract likely paraphrase pairs from these clusters.
The following sentence pairs were rejected:
1. Pairs where the sentences were identical or differed only in punctuation
2. Duplicate sentence pairs
3. Pairs with significantly different lengths (the shorter less than two-thirds the length of the longer)
4. Pairs where the Levenshtein distance exceeded a fixed threshold
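The filtering steps above can be sketched as follows. This is a minimal illustration, not the paper's code; the word-level edit-distance cutoff `EDIT_DISTANCE_MAX` is a placeholder value, since the summary does not state the actual threshold.

```python
import string

EDIT_DISTANCE_MAX = 12  # hypothetical threshold; the real cutoff is not given here

def levenshtein(a, b):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, 1):
        cur = [i]
        for j, xb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (xa != xb)))    # substitution
        prev = cur
    return prev[-1]

def normalize(s):
    """Strip punctuation and collapse whitespace for the identity check."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(s.translate(table).split())

def keep_pair(s1, s2, seen):
    """Apply the four rejection filters; `seen` tracks duplicate pairs."""
    t1, t2 = s1.split(), s2.split()
    if normalize(s1) == normalize(s2):        # identical or punctuation-only difference
        return False
    if (s1, s2) in seen:                      # duplicate pair
        return False
    short, long_ = sorted((len(t1), len(t2)))
    if short < (2 / 3) * long_:               # significantly different lengths
        return False
    if levenshtein(t1, t2) > EDIT_DISTANCE_MAX:  # too dissimilar to be a paraphrase
        return False
    seen.add((s1, s2))
    return True
```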

After preprocessing the raw data, the GIZA++ word-alignment toolkit was applied.
In order to capture the many-to-many alignments that identify correspondences between idioms and other phrasal chunks, alignment is run in the forward direction and again in the backward direction, and the two unidirectional word alignments are heuristically recombined into a single bidirectional alignment. The alignments were evaluated against the standards established by Och and Ney, and precision, recall, and Alignment Error Rate (AER) were calculated with the standard formulae from that work.
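The Och and Ney metrics can be computed as below. This is a sketch under the usual definitions: S is the set of sure links, P the set of possible links (with S a subset of P), and A the hypothesized alignment, each a set of (source index, target index) pairs.

```python
def alignment_scores(A, S, P):
    """Precision, recall, and AER per Och & Ney's alignment evaluation."""
    precision = len(A & P) / len(A)   # hypothesized links that are at least possible
    recall = len(A & S) / len(S)      # sure links that were recovered
    aer = 1 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, aer
```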
Recent work shows that phrase-based MT outperforms word-based MT. The source and target sentences can be viewed as word sequences, and a word alignment A can be represented as a list of aligned source and target token pairs.
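Collecting phrase pairs from such an alignment is typically done with the standard consistency heuristic from phrase-based MT: a source span and target span form a pair only if no alignment link crosses the span boundary. A minimal sketch, not the paper's own extraction code:

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with `alignment`, a set of (s, t) links."""
    pairs = set()
    n = len(src)
    for i1 in range(n):
        for i2 in range(i1, min(i1 + max_len, n)):
            # target positions linked to the candidate source span
            tgt_pos = [t for (s, t) in alignment if i1 <= s <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 >= max_len:
                continue
            # consistency: no link from inside the target span may leave the source span
            if any(j1 <= t <= j2 and not (i1 <= s <= i2) for (s, t) in alignment):
                continue
            pairs.add((tuple(src[i1:i2 + 1]), tuple(tgt[j1:j2 + 1])))
    return pairs
```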

All phrase pairs occurring in at least one aligned sentence anywhere in the corpus were combined into a single replacement database. This database of phrasal replacements serves as the backbone of the described model. Probabilities are assigned to these phrases via IBM Model 1. This approach proves effective because it:
1. Avoids the sparsity problems associated with estimating each phrasal replacement probability by MLE
2. Appears to boost translation quality in more sophisticated translation systems by inducing lexical triggering
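IBM Model 1 scores a phrasal replacement by treating each target word as generated from a uniform mixture over the source words (plus a NULL word). A sketch of that scoring, assuming a word-translation table `tr[(t, s)] = P(t|s)` has already been learned with EM; the table values in the test are invented for illustration:

```python
import math

def model1_log_prob(src_phrase, tgt_phrase, tr, null="<NULL>"):
    """Log probability of tgt_phrase given src_phrase under IBM Model 1."""
    src = [null] + src_phrase           # Model 1 includes a NULL source word
    logp = 0.0
    for t in tgt_phrase:
        # each target word is a uniform mixture over all source positions
        p = sum(tr.get((t, s), 1e-12) for s in src) / len(src)
        logp += math.log(p)
    return logp
```

Because the probability is a product over target words backed by word-level translation probabilities, it never assigns zero to an unseen phrase pair, which is exactly the sparsity advantage over per-phrase MLE noted above.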
As the work is restricted to generating monolingual paraphrases, the problem of inter-phrase reordering can be left untouched. The described model is therefore determined solely by the phrasal replacements involved.
To generate paraphrases for a given input, the standard SMT decoding approach is used, after the following preprocessing:
1. The text was lowercased
2. The text was tokenized
3. A few classes of named entities were identified using regular expressions
Steps of the decoding process:
1. A lattice of all possible paraphrases is constructed from the phrasal replacements.
2. The lattice is a set of vertices and edges, each edge labeled with a sequence of words and a real number.
3. An edge connecting vertex vi to vj, labeled with words w1 ... wk and real number p, indicates that the token sequence s(i+1) ... sj can be replaced with w1 ... wk with probability p.
4. The database is stored as a trie with words as edges, so the worst-case complexity of populating the lattice is O(n^2).
5. Since the source and target languages are the same, an identity mapping for each word is added.
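The lattice construction above can be sketched as follows. For brevity a plain dictionary keyed on word tuples stands in for the trie, and identity edges are added so every word can also remain unchanged; the tiny phrase table in the test is invented for illustration.

```python
def build_lattice(sentence, phrase_table, identity_prob=1.0):
    """Return edges (i, j, words, prob): tokens i..j-1 can be rewritten as `words`."""
    n = len(sentence)
    edges = []
    for i in range(n):
        # identity edge so the original word is always a candidate
        edges.append((i, i + 1, (sentence[i],), identity_prob))
        for j in range(i + 1, n + 1):
            key = tuple(sentence[i:j])
            for repl, p in phrase_table.get(key, []):
                edges.append((i, j, repl, p))
    return edges
```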
The optimal path through the lattice maximizes the product of the replacement-model and trigram language-model scores. This search reduces to the Viterbi algorithm; using dynamic programming, the optimal path can be found in worst-case time O(kn), where n is the maximal target length and k is the maximal number of replacements.
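A minimal Viterbi search over such a lattice, working in log space so the product of probabilities becomes a sum. A unigram dictionary `lm` stands in here for the trigram language model the summary mentions, purely to keep the sketch short.

```python
import math

def viterbi(n, edges, lm):
    """Best path over a lattice of edges (i, j, words, prob) from vertex 0 to n."""
    best = {0: (0.0, [])}                        # vertex -> (log score, words so far)
    for j in range(1, n + 1):                    # vertices in topological order
        candidates = []
        for (i, k, words, p) in edges:
            if k == j and i in best:
                score = best[i][0] + math.log(p)                       # replacement model
                score += sum(math.log(lm.get(w, 1e-6)) for w in words) # language model
                candidates.append((score, best[i][1] + list(words)))
        if candidates:
            best[j] = max(candidates)
    return best[n]
```

Each vertex keeps only its single best incoming path, which is what bounds the search at O(kn) rather than enumerating every path through the lattice.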
Experiments show that the phrasal-replacement (PR) model produces rewordings that are judged to be paraphrases more frequently than those generated by the baseline or MSA techniques. This shows that even a relatively high AER on non-identical words is not an obstacle to successful paraphrase generation. Since MSA templates are built from entire sentence pairs, they may yield semantically different paraphrases.

The model remains limited by data sparsity despite the large initial training set. Much work remains to be done; a major agenda item is acquiring larger data sets.
The paper presented a novel approach for generating sentence-level paraphrases in a broad semantic domain.
A second important contribution of this work is a method for building, and tracking the quality of, large alignable monolingual corpora from structured news data on the Web.
