
CONTRIBUTIONS TO ENGLISH TO HINDI

MACHINE TRANSLATION USING


EXAMPLE-BASED APPROACH

DEEPA GUPTA

DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF TECHNOLOGY DELHI
HAUZ KHAS, NEW DELHI-110016, INDIA
JANUARY, 2005

CONTRIBUTIONS TO ENGLISH TO HINDI


MACHINE TRANSLATION USING
EXAMPLE-BASED APPROACH

by
DEEPA GUPTA
Department of Mathematics
Submitted
in fulfilment of the requirement of
the degree of
Doctor of Philosophy
to the

Indian Institute of Technology Delhi


Hauz Khas, New Delhi-110016, India
January, 2005

Dedicated to
My Parents,
My Brother Ashish and
My Thesis Supervisor...

Certificate

This is to certify that the thesis entitled Contributions to English to Hindi


Machine Translation Using Example-Based Approach submitted by Ms.
Deepa Gupta to the Department of Mathematics, Indian Institute of Technology
Delhi, for the award of the degree of Doctor of Philosophy, is a record of bona fide
research work carried out by her under my guidance and supervision.
The thesis has reached the standards fulfilling the requirements of the regulations
relating to the degree. The work contained in this thesis has not been submitted to
any other university or institute for the award of any degree or diploma.

Dr. Niladri Chatterjee


Assistant Professor
Department of Mathematics
Indian Institute of Technology Delhi
Delhi (INDIA)

Acknowledgement

If I said that this is my thesis alone, it would be totally untrue. It is like a dream come true. There are people in this world, some of them so wonderful, who helped in turning this dream into the product that you are holding in your hand. I would like to thank all of them, and in particular:
Dr. Niladri Chatterjee - mentor, guru and friend - taught me the basics of research and stayed with me right till the end. His efforts, comments, advice and ideas developed my thinking and improved my presentation. Without his constant encouragement, keen interest, inspiring criticism and invaluable guidance, I would not have accomplished my work. I admit that his efforts deserve much more acknowledgement than is expressed here.
I acknowledge and thank the Indian Institute of Technology Delhi and the Tata Infotech Research Lab, who funded this research. I sincerely thank all the faculty members of the Department of Mathematics; in particular, I express my gratitude to Prof. B. Chandra and Dr. R. K. Sharma for their continuous moral support and help. I
thank my SRC members, Prof. Saroj Kaushik and Prof. B. R. Handa, for their time
and efforts. I also thank the department administrative staff for their assistance. I
extend my thanks to Prof. R. B. Nair and Dr. Wagish Shukla of IIT Delhi, and
Prof. Vaishna Narang, Prof. P. K. Pandey, Prof. G. V. Singh, Dr. D. K. Lobiyal,
and Dr. Girish Nath Jha of Jawaharlal Nehru University Delhi, for the enlightening
discussions on the basics of languages.
I would like to express my sincere thanks to my friends Priya and Dharmendra
for many fruitful discussions regarding my research problem. I thank Mr. Gaurav

Kashyap for helping me in the implementation of the algorithms. In particular, I


would like to thank Inderdeep Singh, for his help in writing some part of the thesis.
I want to give special thanks to my friends, Sonia, Pranita and Nutan, for helping
me in both good and bad times. I would like to thank Prabhakhar for his brotherly
support. I extend my thanks to Manju, Anita, Sarita, Subhashini and Anju for always cheering me up.
Shailly and Geeta - amazing friends who read the manuscript and gave honest comments. Both of them also stayed with me through the process, and handled me, and sometimes my out-of-control emotions, so well. I especially wish to thank Geeta for letting me stay in her hostel room, and for her wonderful help when I fractured my leg, when we had known each other for only a month. I wish
to acknowledge Krishna for his constant help, both academic and nonacademic, and
his continuous encouragement.
I convey my sincere regards to my parents and brothers for the sacrifices they have made, for the patience they have shown, and for the love and blessings they have showered. I thank Arun for his moral support. Most important of all, I would like to express my profound sense of gratitude and appreciation to my sister Neetu. Her irrational and unbreakable belief in me bordered on craziness at times.
I cannot avoid mentioning my friend Sharad, who deserves more than a little acknowledgement. His constant inspiration and untiring support have sustained my confidence throughout this work.
Finally, I thank GOD for everything.

Deepa Gupta

Abstract

This research focuses on the development of an Example-Based Machine Translation (EBMT) system for English to Hindi. Development of a machine translation (MT) system typically demands a large volume of computational resources. For example, rule-based MT systems require extraction of syntactic and semantic knowledge in the form of rules, while statistics-based MT systems require a huge parallel corpus containing sentences in the source language and their translations in the target language. The requirement for such computational resources is much lower for EBMT. This makes development of EBMT systems for English to Hindi translation feasible, since availability of large-scale computational resources for this language pair is still scarce. The primary motivation for this work comes from the following:

a) Although a small number of English to Hindi MT systems are already available, the outputs produced by them are not always of high quality. Through this work we intend to analyze the difficulties that lead to this below-par performance, and try to provide some solutions for them.
b) There are several other major languages (e.g., Bengali, Punjabi, Gujarati) in the Indian subcontinent. Demand for developing MT systems from English to these languages is increasing rapidly. But at the same time, development of computational resources in these languages is still in its infancy. Since many of these languages are similar to Hindi, syntactically as well as lexically, the research carried out here should help in developing MT systems from English to these languages as well.

The major contributions of this research may be described as follows:

1) Development of a systematic adaptation scheme. We propose an adaptation scheme consisting of ten basic operations. These operations work not only at the word level, but at the suffix level as well. This makes adaptation less expensive in many situations.
2) Study of divergence. We observe that the occurrence of divergence causes major difficulty for any MT system. In this work we make an in-depth study of the different types of divergence, and categorize them.
3) Development of a retrieval scheme. We propose a novel approach for measuring similarity between sentences. We suggest that a retrieval strategy, with respect to an EBMT system, will be most efficient if it measures similarity on the basis of the cost of adaptation. In this work we provide a complete framework for an efficient retrieval scheme on the basis of our studies on divergence and cost of adaptation.
4) Dealing with complex sentences. Handling complex sentences is generally considered to be difficult for an MT system. In this work we propose a split-and-translate technique for translating complex sentences under an EBMT framework.
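As a rough illustration of the split-and-translate idea in (4), a complex sentence can be cut at a connective into two simpler clauses that are translated separately. The toy rule below is only a sketch under simplifying assumptions (a single connective, no clause reordering); the actual splitting rules developed later in this thesis are considerably richer:

```python
# Connectives considered in this illustrative sketch.
CONNECTIVES = {"when", "where", "whenever", "wherever", "who"}

def split_complex(sentence):
    """Split a complex sentence at the first connective into two clauses.

    Returns (main_clause, subordinate_clause); the second element is None
    if no connective is found. A toy rule, not the thesis's algorithm.
    """
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in CONNECTIVES:
            return " ".join(words[:i]), " ".join(words[i + 1:])
    return " ".join(words), None

# Each simple clause can then be translated separately by the EBMT engine
# and the outputs recombined with the connective's Hindi counterpart.
main, sub = split_complex("I was sleeping when he came.")
```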

We feel that the overall scheme proposed in this research will pave the way for developing an efficient EBMT system for translating from English to Hindi. We hope that this research will also help the development of MT systems from English to other languages of the Indian subcontinent.


Contents

1 Introduction
  1.1 Description of the Work Done and Summary of the Chapters
  1.2 Some Critical Points

2 Adaptation in English to Hindi Translation: A Systematic Approach
  2.1 Introduction
  2.2 Description of the Adaptation Operations
  2.3 Study of Adaptation Procedure for Morphological Variation of Active Verbs
      2.3.1 Same Tense Same Verb Form
      2.3.2 Different Tenses Same Verb Form
      2.3.3 Same Tense Different Verb Forms
      2.3.4 Different Tenses Different Verb Forms
  2.4 Adaptation Procedure for Morphological Variation of Passive Verbs
  2.5 Study of Adaptation Procedures for Subject/Object Functional Slot
      2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN>
      2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN>
      2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN
      2.5.4 Adaptation Rules for Variations in the Morpho Tags of Premodifier Adjective @AN>
      2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUB
  2.6 Adaptation of Interrogative Words
  2.7 Adaptation Rules for Variation in Kind of Sentences
  2.8 Concluding Remarks

3 An FT and SPAC Based Divergence Identification Technique From Example Base
  3.1 Introduction
  3.2 Divergence and Its Identification: Some Relevant Past Work
  3.3 Divergences and Their Identification in English to Hindi Translation
      3.3.1 Structural Divergence
      3.3.2 Categorial Divergence
      3.3.3 Nominal Divergence
      3.3.4 Pronominal Divergence
      3.3.5 Demotional Divergence
      3.3.6 Conflational Divergence
      3.3.7 Possessional Divergence
      3.3.8 Some Critical Comments
  3.4 Concluding Remarks

4 A Corpus-Evidence Based Approach for Prior Determination of Divergence
  4.1 Introduction
  4.2 Corpus-Based Evidences and Their Use in Divergence Identification
      4.2.1 Roles of Different Functional Tags
  4.3 The Proposed Approach
  4.4 Illustrations and Experimental Results
      4.4.1 Illustration 1
      4.4.2 Illustration 2
      4.4.3 Illustration 3
      4.4.4 Experimental Results
  4.5 Concluding Remarks

5 A Cost of Adaptation Based Scheme for Efficient Retrieval of Translation Examples
  5.1 Introduction
  5.2 Brief Review of Related Past Work
  5.3 Evaluation of Cost of Adaptation
      5.3.1 Cost of Different Adaptation Operations
  5.4 Cost Due to Different Functional Slots and Kind of Sentences
      5.4.1 Costs Due to Variation in Kind of Sentences
      5.4.2 Cost Due to Active Verb Morphological Variation
      5.4.3 Cost Due to Subject/Object Functional Slot
      5.4.4 Use of Adaptation Cost as a Measure of Similarity
  5.5 The Proposed Approach vis-à-vis Some Similarity Measurement Schemes
      5.5.1 Semantic Similarity
      5.5.2 Syntactic Similarity
      5.5.3 A Proposed Approach: Cost of Adaptation Based Similarity
      5.5.4 Drawbacks of the Proposed Scheme
  5.6 Two-level Filtration Scheme
      5.6.1 Measurement of Structural Similarity
      5.6.2 Measurement of Characteristic Feature Dissimilarity
  5.7 Complexity Analysis of the Proposed Scheme
  5.8 Difficulties in Handling Complex Sentences
  5.9 Splitting Rules for Converting Complex Sentences into Simple Sentences
      5.9.1 Splitting Rule for the Connectives when, where, whenever and wherever
      5.9.2 Splitting Rule for the Connective who
  5.10 Adaptation Procedure for Complex Sentences
      5.10.1 Adaptation Procedure for Connectives when, where, whenever and wherever
      5.10.2 Adaptation Procedure for Connective who
  5.11 Illustrations
      5.11.1 Illustration 1
      5.11.2 Illustration 2
  5.12 Concluding Remarks

6 Discussions and Conclusions
  6.1 Goals and Motivation
  6.2 Contributions Made by This Research
  6.3 Possible Extensions
  6.4 Epilogue
      6.4.1 Pre-editing and Post-editing
      6.4.2 Evaluation Measures of Machine Translation

Appendices
  A.1 English and Hindi Language Variations
  A.2 Verb Morphological and Structure Variations
      A.2.1 Conjugation of Root Verb
  B.1 Functional Tags
  B.2 Morpho Tags
  C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures
  D.1 Semantic Similarity
  E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective

Bibliography

List of Figures

1.1  An Example Sentence with Its Morpho-Functional Tags
2.1  The five possible scenarios in the SL SL TL interface of partial case matching
2.2  Example of Different Adaptation Operations
2.3  Some Typical Sentence Structures
3.1  Algorithm for Identification of Structural Divergence
3.2  Correspondence of SPACs of E and H for Identification of Structural Divergence
3.3  Algorithm for Identification of Categorial Divergence
3.4  Correspondence of SPACs for the Categorial Divergence Example of Sub-type 4
3.5  Algorithm for Identification of Nominal Divergence
3.6  Correspondence of SPAC E and SPAC H of Nominal Divergence of Sub-type 1
3.7  Algorithm for Identification of Pronominal Divergence
3.8  Correspondence of SPAC E and SPAC H of Pronominal Divergence of Sub-type 4
3.9  Algorithm for Identification of Demotional Divergence
3.10 Correspondence of SPAC E and SPAC H for Demotional Sub-type 4
3.11 SPAC Correspondence for Demotional Divergence of Sub-type 1
3.12 Algorithm for Identification of Conflational Divergence
3.13 Correspondence of SPAC E and SPAC H for Conflational Divergence of Sub-type 1
3.14 Algorithm for Identification of Possessional Divergence
3.15 Correspondence of SPAC E and SPAC H for Possessional Divergence of Sub-type 1
3.16 Correspondence of SPAC E and SPAC H for Possessional Divergence of Sub-type 4
4.1  Schematic Diagram of the Proposed Algorithm
4.2  Continuation of Figure 4.1
5.1  Schematic View of Module 1 for Identification of Complex Sentence with Connective any of when, where, whenever, or wherever
5.2  Schematic View of Module 2
5.3  Schematic View of Module 3
5.4  Schematic View of Module 1 for Identification of Complex Sentence with Connective who
5.5  Schematic View of the SUBROUTINE SPLIT
5.6  Schematic View of Module 2
5.7  Schematic View of Module 3
5.8  Schematic View of Module 4

List of Tables

1.1  Output of AnglaHindi and Shakti MT System
2.2  Notations Used in Sentence Patterns
2.3  Adaptation Operations of Verb Morphological Variations in Present Indefinite to Present Indefinite
2.4  Adaptation Operations of Verb Morphological Variations in Present Indefinite to Past Indefinite
2.5  Different Functional Tags Under the Functional Slot <S> or <O>
2.6  Different Possible Morpho Tags for Each of the Functional Tags under the Functional Slot <S> or <O>
2.8  Adaptation Operations for Genitive Case to Genitive Case
2.10 Adaptation Operations for Pre-modifier Adjective to Pre-modifier Adjective
2.11 Adaptation Operations for Subject to Subject Variations
2.12 Different Sentence Patterns of Interrogative Words
2.13 Functional & Morpho Tags Corresponding to Each Interrogative Sentence Pattern
2.14 Adaptability Rules for Group G5 Sentence Patterns
2.15 Adaptation Rules for Variation in Kind of Sentences
3.1  Different Semantic Similarity Scores between shock with trouble and panic
4.1  FT-features Instrumental for Creating Divergence
4.2  Relevance of FT-features in Different Divergence Types
4.3  FT of the Problematic Words for Each Divergence Type
4.4  Frequency of Words in Different Sections
4.5  PSD/NSD Schematic Representations
4.6  Values of s(di) and m(di) for Illustration 3
4.7  Some Illustrations
4.8  Continuation of Table 4.7
4.9  Results of Our Experiments
5.1  Cost Due to Variation in Kind of Sentences
5.2  Cost Due to Verb Morphological Variation Present Indefinite to Present Indefinite
5.3  Adaptation Operations of Verb Morphological Variation Present Indefinite to Past Indefinite
5.4  Costs Due to Adapting Genitive Case to Genitive Case
5.5  Cost of Adaptation Due to Subject/Object to Subject/Object
5.6  Best Five Matches by Using Semantic Similarity for the Input Sentence I work.
5.7  Best Five Matches by Using Semantic Similarity for the Input Sentence Sita sings ghazals.
5.8  Weighting Scheme for Different POS and Syntactic Roles
5.9  Best Five Matches by Syntactic Similarity for the Input Sentence I work.
5.10 Best Five Matches by Syntactic Similarity for the Input Sentence Sita sings ghazals.
5.11 Functional-morpho Tags for the Input English Sentence (IE) and the Retrieved English Sentence (RE)
5.12 Retrieval on the Basis of Cost of Adaptation Based Scheme for the Input Sentence I work.
5.13 Retrieval on the Basis of Cost of Adaptation Based Similarity for the Input Sentence Sita sings ghazals.
5.14 Cost of Adaptation for Retrieved Best Five Matches for the Input Sentence I work. by Using Semantic and Syntactic Based Similarity Schemes
5.15 Cost of Adaptation for Retrieved Best Five Matches for the Input Sentence Sita sings ghazals. by Using Semantic and Syntactic Based Similarity Schemes
5.16 Weights Used for Characteristic Features
5.17 Notation Used in the Complexity Analysis
5.19 Typical Examples of Complex Sentences with Connective when, where, whenever or wherever Handled by Module 2
5.20 Typical Examples of Complex Sentences with Connective when, where, whenever or wherever Handled by Module 3
5.21 Typical Complex Sentences with Relative Adverb who Handled by Module 2
5.22 Typical Complex Sentences with Relative Adverb who Handled by Module 3
5.23 Typical Complex Sentences with Relative Adverb who Handled by Module 4
5.24 Hindi Translation of Relative Adverbs
5.25 Patterns of Complex Sentences with Connectives when, where, whenever and wherever
5.26 Patterns of Complex Sentences with Connective who
5.27 Five Most Similar Sentences for RC You go to India. Using Cost of Adaptation Based Scheme
5.28 Five Most Similar Sentences for MC You should speak Hindi. Using Cost of Adaptation Based Scheme
5.29 Five Most Similar Sentences for RC He wants to learn Hindi. Using Cost of Adaptation Based Scheme
5.30 Five Most Similar Sentences for MC The student should study this book. Using Cost of Adaptation Based Scheme
A.2  Different Case Endings in Hindi
A.3  Suffixes and Morpho-Words for Hindi Verb Conjugations
A.4  Verb Morphological Changes From English to Hindi Translation
E.1  Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective

Chapter 1
Introduction


Machine Translation (MT) is the process of translating text units of one language
(source language) into a second language (target language) by using computers. The
need for MT is greatly felt in the modern age due to globalization of information,
where a global information base needs to be accessed from different parts of the world.
Although most of this information is available online, the major difficulty in dealing
with this information is that its language is primarily English. From science, technology and education to gadget manuals and commercial advertisements, the predominant presence of English as the medium of communication can easily be observed. This world, however, is multilingual, with different languages spoken
in different regions. This necessitates the development of good MT systems for
translating these works into other languages so that a larger population can access,
retrieve and understand them. Consequently, in a country like India, where English
is understood by less than 3% of the population (Sinha and Jain, 2003), the need
for developing MT systems for translating from English into some native Indian
languages is very acute. In this work we look into different aspects of designing an English to Hindi MT system using the Example-Based technique (Nagao, 1984). Two fundamental questions that we feel we should answer at this point are:

- The rationale behind choosing Example-Based Machine Translation (EBMT) as the paradigm of interest;
- The reason behind selecting Hindi as the preferred language.

Below we provide justifications behind these choices.


Development of MT systems has taken a big leap in the last two decades. Typically, machine translation requires handcrafted and complicated large-scale knowledge (Sumita and Iida, 1991).
edge (Sumita and Iida, 1991). Various MT paradigms have so far evolved depending
upon how the translation knowledge is acquired and used. For example,

1. Rule-Based Machine Translation (RBMT): Here, rules are used for the analysis and representation of the meaning of the source language texts, and for the generation of equivalent target language texts (Grishman and Kosaka, 1992), (Thurmair, 1990), (Arnold and Sadler, 1990).
2. Statistical- (or Corpus-) Based Machine Translation (SBMT): Statistical translation models are trained on a sentence-aligned translation corpus; the approach is based on n-gram modelling and on the probability distribution of the occurrence of source-target language pairs in a very large corpus. This technique was proposed by IBM in the early 1990s (Brown, 1990), (Brown et al., 1992), (Brown et al., 1993), (Germann, 2001).
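The statistical idea can be illustrated with a toy word-translation estimate computed from a sentence-aligned corpus. This is only a naive co-occurrence sketch, not any of the cited IBM models, and the three transliterated sentence pairs below are invented for illustration:

```python
from collections import Counter
from itertools import product

# Toy sentence-aligned English-Hindi (transliterated) corpus, invented for illustration.
corpus = [
    ("I work".split(), "main kaam kartaa huun".split()),
    ("I sing".split(), "main gaataa huun".split()),
    ("you work".split(), "tum kaam karte ho".split()),
]

# Count how often each (source word, target word) pair co-occurs in aligned sentences.
cooc = Counter()
for e_sent, h_sent in corpus:
    for e, h in product(e_sent, h_sent):
        cooc[(e, h)] += 1

def translation_prob(e, h):
    """Naive estimate of P(h | e) from co-occurrence counts."""
    total = sum(c for (e2, _), c in cooc.items() if e2 == e)
    return cooc[(e, h)] / total if total else 0.0

# "work" co-occurs with "kaam" in both sentences that contain it, so "kaam"
# receives the highest probability among work's candidate translations.
```

A real SBMT system refines such raw counts iteratively (e.g. with EM) and combines them with a target-language model; the sketch only conveys why a large parallel corpus is essential for reliable estimates.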

However, these techniques have their own drawbacks. The main drawback of RBMT systems is that sentences in any natural language may assume a large variety of structures. Moreover, machine translation often suffers from ambiguities of various types (Dorr et al., 1998). As a consequence, translation from one natural language into another requires enormous knowledge about the syntax and semantics of both the source and target languages. Capturing all this knowledge in rule form is a daunting task, if not an impossible one. On the other hand, SBMT techniques depend on how accurately various probabilities are measured. Realistic measurements of these probabilities can be made only if a large volume of parallel corpus is available. However, availability of such huge data is not easy. Consequently, this scheme is viable only for a small number of language pairs.


Example-based Machine Translation (Nagao, 1984), (Carl and Way, 2003) makes use of past translation examples to generate the translation of a given input. An EBMT system stores in its example base translation examples between two languages, the source language (SL) and the target language (TL). These examples are subsequently used as guidance for future translation tasks. In order to translate a new input sentence in SL, a similar SL sentence is retrieved from the example base [1], along with its translation in TL. This example is then adapted suitably to generate a translation of the given input. It has been found that EBMT has several advantages in comparison with other MT paradigms (Sumita and Iida, 1991):

1. It can be upgraded easily by adding more examples to the example base;


2. It utilizes translators' expertise, and adds a reliability factor to the translation;
3. It can be accelerated easily by indexing and parallel computing;
4. It is robust because of best-match reasoning.
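The retrieve-and-adapt cycle just described can be sketched in a few lines. The example base, the word-overlap similarity and the single-word substitution below are all invented simplifications for illustration; a real EBMT system uses far richer similarity measures and adaptation operations:

```python
# Toy example base of (English, Hindi transliterated) pairs, invented for illustration.
example_base = [
    ("I work", "main kaam kartaa huun"),
    ("I sing", "main gaataa huun"),
    ("Ram reads a book", "raam ek kitaab padhtaa hai"),
]

# Toy bilingual lexicon used by the adaptation step (also invented).
lexicon = {"work": "kaam kartaa", "sing": "gaataa", "read": "padhtaa"}

def similarity(a, b):
    """Word-overlap (Jaccard) similarity between two English sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def retrieve(input_sentence):
    """Fetch the most similar stored SL sentence along with its TL translation."""
    return max(example_base, key=lambda ex: similarity(input_sentence, ex[0]))

def adapt(input_sentence, example):
    """Swap the translation of the one differing word into the example's TL side."""
    src, tgt = example
    old = next((w for w in src.split() if w not in input_sentence.split()), None)
    new = next((w for w in input_sentence.split() if w not in src.split()), None)
    if old in lexicon and new in lexicon:
        tgt = tgt.replace(lexicon[old], lexicon[new])
    return tgt

# Translating "I read": retrieve the closest example "I work",
# then substitute the verb's translation in its Hindi side.
```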

Other researchers too (e.g. (Somers, 1999), (Kit et al., 2002)) have considered EBMT to be a major and effective approach among the different MT paradigms, primarily because it exploits the linguistic knowledge stored in an aligned text more efficiently.
We infer from the above observations that for the development of MT systems from English to Indian languages, EBMT should be one of the preferred approaches.
This is because a significant volume of parallel corpus is available between English
and different Indian languages in the form of government notices, translation books,
[1] Sometimes more than one sentence is also retrieved.

advertisement material, etc. Although this data is generally not yet available in electronic form, converting it into machine-readable form is much easier than formulating explicit translation rules as required by an RBMT system. In fact, some parallel data in electronic form has been made available through projects such as EMILLE (http://www.emille.lancs.ac.uk/home.html). Also, there has been concerted effort from various government organizations like TDIL [2], CIIL Mysore [3] and C-DAC Noida [4] (Vikas, 2001), and from various institutes, e.g., IIT Bombay [5], IIT Kanpur [6] and LTRC (IIIT Hyderabad) [7], to develop linguistic resources. At the same time, this data is not large enough to design an English to Hindi SBMT system, which typically requires several hundred thousand sentences. These resources, we hope, will be fruitfully utilized for developing different EBMT systems involving Indian languages.
Of the different Indian languages [8], Hindi has some major advantages over the others as far as work on MT is concerned. Not only is Hindi the national language of India, it is also the most popular of all Indian languages. With respect to Indian languages, all the major works reported so far (e.g. ANGLAHINDI (Sinha et al., 2002), SHIVA (http://shiva.iiit.net/), SHAKTI (Sangal, 2004), MaTra (human-aided MT) [9]) primarily concern English and Hindi as their preferred languages. In 2003, Hindi was chosen as the "surprise language" (Oard, 2003) by DARPA. As a consequence, different universities (e.g. CMU, Johns Hopkins, USC-ISI) have invested effort in developing MT systems involving Hindi.
[2] http://tdil.mit.gov.in/
[3] http://www.ciil.org/
[4] http://www.cdacnoida.com/
[5] http://www.cfilt.iitb.ac.in
[6] http://www.cse.iitk.ac.in/users/isciig/
[7] http://ltrc.iiit.net/
[8] India has 17 official languages, and more than 1000 dialects (http://azaz.essortment.com/languagesindian rsbo.htm)
[9] http://www.ncst.ernet.in/matra/about.shtml


This world-wide popularity of the language makes the study of English to Hindi
machine translation more meaningful in today's context.
One major advantage of having the above-mentioned English to Hindi translation
systems available on-line is that we could work with them to examine
the quality of their outputs. In this respect, we find that the outputs given by the
above systems are not always the correct translations of the inputs. The following
Table 1.1 illustrates the above statement with respect to the systems AnglaHindi
and Shakti. In this table we show the translations produced by the above two
systems for different inputs, and also show the correct translations of these sentences.
Input Sentence         | Output of AnglaHindi             | Output of Shakti                    | Actual Translation
Ram married Sita.      | raam ne siita vivahaa kiyaa      | raam ne siitaa vivaaha kiyaa        | raam ne siitaa se vivaaha kiyaa
Fan is on.             | pankhaa ho par                   | pankhaa lagaataar hai               | pankhaa chal rahaa hai
This dish tastes good. | yaha vyanjan achchhaa hotaa hai  | yah thalii achchhaa swaad letii hai | iss vyanjan kaa swaad achchhaa hai
The soup lacks salt.   | soop namak kam hotaa hai         | shorbaa namak kamii hai             | soop mein namak kam hai
It is raining.         | yah varshaa ho rahii hai         | yah varshaa ho rahii hai            | varshaa ho rahii hai
They have a big fight. | unke paas eka badhii ladaae hai  | unke badhii ladaaiyaan hain         | unkii ghamasan ladaii huii

Table 1.1: Output of AnglaHindi and Shakti MT System
We have found many such instances where the outputs produced by the systems
may not be considered to be correct Hindi translations of the respective inputs. This
observation prompts us to study different aspects of English to Hindi translation in
order to understand the difficulties in machine translation, and how these
shortcomings can be dealt with under an EBMT framework. This research is concerned with the above studies.

1.1 Description of the Work Done and Summary of the Chapters

The success of an EBMT system lies in two different modules: (i) similarity measurement and retrieval, and (ii) adaptation. Retrieval is the procedure by which a
suitable translation example is retrieved from a system's example base. Adaptation is the procedure by which a retrieved translation is modified to generate the
translation of the given input. Various retrieval strategies have been developed (e.g.
(Nagao, 1984), (Sato, 1992), (Collins and Cunningham, 1996)). All these retrieval
strategies aim at retrieving an example from the example base such that the retrieved
example is similar to the input sentence. This is due to the fact that the fundamental
intuition behind EBMT is that translations of similar sentences of the source language will be similar in the target language as well. Thus the concept of retrieval is
intricately related with the concept of similarity measurement between sentences.
But the main difficulty with respect to this assumption is that there is no straightforward way to measure similarity between sentences. In different works different
approaches have been defined for measuring similarity between sentences, for example word-based metrics (e.g. (Nirenburg, 1993), (Nagao, 1984)), character-based
metrics (e.g. (Sato, 1992)), syntactic/semantic-based matching (e.g. (Manning and
Schutze, 1999)), DP-matching between word sequences (e.g. (Sumita, 2001)), and hybrid
retrieval schemes (e.g. (Collins, 1998)).
In all these works similarity measurement and adaptation are considered
in isolation. This we feel is the major hindrance with respect to EBMT. In this
work we therefore propose a novel approach for measuring similarity. We intend
to look at similarity from the point of view of adaptation. We suggest that a past
example will be considered as the most similar with respect to an input sentence, if
its adaptation towards generating the desired translation is the simplest. The work
carried out in this research is aimed at achieving this goal. Our studies therefore start
in the following way. We first look at adaptation in detail. An efficient adaptation
scheme is very important for an EBMT system because even a very large example
base cannot, in general, guarantee an exact match for a given input sentence. As
a consequence, the need for an efficient and systematic adaptation scheme arises
for modifying a retrieved example, and thereby generating the required translation.
Various adaptation schemes have been proposed in literature, e.g. (Veale and Way,
1997), (Shiri et. al., 1997), (Collins, 1998) and (McTait, 2001). A scrutiny of these
schemes suggests that primarily there are four basic adaptation operations, i.e. word
addition, word deletion, word replacement and copy.
In our approach we started with these basic operations: word addition, word
deletion, word replacement and copy. However, in this respect we notice the following:
1. Both English and Hindi rely heavily on suffixes for morphological changes.
There are a number of suffixes for inflecting verbs and nouns.
Further, in Hindi there are situations when morphological changes in the adjectives are also required, depending upon the number and gender of the corresponding noun/pronoun. Since the number of suffixes is limited, we feel that
if, instead of purely word-based operations, adaptation operations are focused
on the suffixes, then in many situations a significant amount of computational
effort may be saved.
2. A further observation with respect to Hindi is that there are situations when instead of suffixes whole words are used for bringing in morphological variations.
For example, the present continuous form of Hindi verbs is: <Root form of the
verb> + <rahaa/rahii /rahe> + <hai /hain/ho>. Here the words rahaa,
rahii or rahe are used to achieve the morphological variation. Which of
these will be used depends upon the number and gender of the subject. Similarly, hai, hain and ho are used depending upon the number and person of the
subject. We term these words as
morpho-words. Appendix A gives details of different Hindi morpho-words
and their usages.
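The morpho-word pattern just described can be sketched as a small selection function. The feature encoding below (gender, number and person of the subject) is a simplified illustration of the agreement rules, not the thesis's actual implementation:

```python
# Sketch: building the Hindi present continuous form
#   <root> + <rahaa/rahii/rahe> + <hai/hain/ho>
# The feature encoding is a simplified assumption for illustration.

def present_continuous(root, gender, number, person):
    """Select the two morpho-words by agreement with the subject."""
    # rahaa: masculine singular; rahii: feminine; rahe: masculine plural
    if number == "sg":
        aux1 = "rahaa" if gender == "m" else "rahii"
    else:
        aux1 = "rahe" if gender == "m" else "rahii"
    # ho: second person; hain: plural; hai: singular
    if person == 2:
        aux2 = "ho"
    elif number == "pl":
        aux2 = "hain"
    else:
        aux2 = "hai"
    return f"{root} {aux1} {aux2}"

print(present_continuous("khaa", "m", "sg", 3))   # khaa rahaa hai
print(present_continuous("daud", "m", "pl", 3))   # daud rahe hain
```

Adapting between tenses then reduces to adding, deleting or replacing these morpho-words rather than translating whole words.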
A major fallout of the above observation is that in some situations adaptation
may be carried out by dealing with the morpho-words, which
is computationally much less expensive than dealing with constituent words as a
whole. Thus we propose an adaptation scheme consisting of ten operations: addition,
deletion, and replacement of constituent words, addition, deletion, and replacement
of morpho-words, addition, deletion, and replacement of suffixes and copy. Chapter
2 of the thesis discusses these adaptation operations in detail.
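The ten operations can be summarised as a small vocabulary, grouped by the unit each one acts on. The identifiers below are illustrative, not the thesis's own:

```python
# Sketch: the ten adaptation operations of the proposed scheme, grouped by
# the unit they act on. Operation names are illustrative.

OPERATIONS = {
    # constituent-word level
    "word_addition": "word", "word_deletion": "word", "word_replacement": "word",
    # morpho-word level
    "morpho_addition": "morpho-word", "morpho_deletion": "morpho-word",
    "morpho_replacement": "morpho-word",
    # suffix level
    "suffix_addition": "suffix", "suffix_deletion": "suffix",
    "suffix_replacement": "suffix",
    # no change needed
    "copy": "none",
}

# Only word addition and word replacement need a dictionary look-up;
# the remaining operations work on the retrieved translation in memory.
DICTIONARY_BACKED = {"word_addition", "word_replacement"}

print(len(OPERATIONS))  # 10
```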
One point we notice, however, is that the
above-mentioned operations cannot deal with translation divergences in an efficient
way. Divergence occurs when structurally similar sentences of the source language
do not translate into sentences that are similar in structure in the target language
(Dorr, 1993). We therefore felt that the study of divergence is an important aspect for any
MT system. With respect to an EBMT system the need arises for two
reasons:
- The past example that is retrieved for carrying out the task of adaptation
has a normal translation, but the translation of the input sentence should involve
divergence.
- The translation of the retrieved example involves divergence, whereas the input
sentence should have a normal translation.
In this work we made an in-depth study of divergence with respect to English to
Hindi translation. In this regard one may note that divergence is a highly language-dependent phenomenon. Its nature may change with the source and target
languages under consideration. Although divergence has been studied extensively
with respect to translation between European languages (e.g. (Dorr et. al., 2002),
(Watanabe et. al., 2000)), very few studies on divergence may be found regarding
translations involving Indian languages. The only work that came to our notice is (Dave
et. al., 2002). In this work the authors have followed the classifications given in (Dorr,
1993) and tried to find examples of each of them with respect to English to Hindi
translation. In this regard it may be noted that Dorr has described seven different divergence types: structural, categorical, conflational, promotional, demotional,
thematic and lexical, with respect to translations between European languages.
However, we find that not all the divergence types explained in Dorr's work
apply to Indian languages. In fact, we found very few (if
any) examples of thematic and promotional divergence with respect to English
to Hindi translation. On the other hand we identified three new types of divergence
that have not so far been cited in any other works on divergence. We named these
divergences as nominal, pronominal and possessional, respectively. We have
further observed that all the different divergence types (barring structural) for
which we found instances in English to Hindi translation may be further divided into
several sub-categories. Chapter 3 explains in detail different divergence types and
their sub-types that we have observed with respect to English to Hindi translation,
and illustrates them with suitable examples. Some of these results have already been
presented in (Gupta and Chatterjee, 2003a) and (Gupta and Chatterjee, 2003b).
Presence of divergence examples in the example base makes straightforward application of the above-mentioned adaptation scheme difficult. As mentioned earlier,
application of the operations discussed in Chapter 2 will not be able to generate
the correct translation if the input sentence requires normal translation, whereas
the translation of retrieved example involves divergence, or vice versa. To overcome
this difficulty we suggest that the example base may be partitioned into two parts:
one containing examples of normal translation, the other containing the examples
of divergence, so that given an input sentence an EBMT system may retrieve an
example from the appropriate part of the example base. However, implementation
of the above scheme requires design of algorithms for:
1) Partitioning the example base sentences.
2) Designing an efficient retrieval policy.
We attempt to answer the first one by designing algorithms for identification of
translation divergence, i.e. if an English sentence and its Hindi translation are given
as input, these algorithms will detect whether this translation involves any of the said
types of divergence. The remaining part of Chapter 3 discusses different algorithms


that we developed for identification of divergence from a given English-Hindi pair
of sentences. The identification algorithms designed by us consider the Functional
tag (FT10 ) of the constituent words and the Syntactic Phrasal Annotated Chunk
(SPAC11 ) of the SL and TL sentences. When these two do not match for a source
language sentence and its translation in the TL, a divergence can be identified. With
respect to each divergence category and its sub-categories we have identified
the appropriate FTs and SPACs whose presence/absence indicate possibilities of
certain divergence. By systematically analyzing the FTs and SPACs of the English
sentence and its Hindi translation the algorithms arrive at a decision on whether
this translation involves any divergence. Thus the algorithm partitions the example
base in two parts: Normal Example Base and Divergence Example Base. Some of
these algorithms have already been presented in (Gupta and Chatterjee, 2003b).
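The partitioning step can be sketched as follows. The FT comparison below is a toy stand-in for the per-category rules actually used, and the example pairs and tag sets are invented for illustration:

```python
# Sketch: partitioning an example base into Normal and Divergence parts by
# comparing the Functional Tags (FT) of a source sentence with those of its
# translation. The matching criterion is a toy stand-in for the thesis's
# per-divergence identification rules.

def involves_divergence(source_fts, target_fts):
    """Flag a pair as divergent when the FT multisets disagree."""
    return sorted(source_fts) != sorted(target_fts)

def partition(example_base):
    normal, divergent = [], []
    for example in example_base:
        if involves_divergence(example["src_fts"], example["tgt_fts"]):
            divergent.append(example)
        else:
            normal.append(example)
    return normal, divergent

examples = [
    {"pair": "Ram slept. / raam soyaa",
     "src_fts": ["SUBJ", "MAINV"], "tgt_fts": ["SUBJ", "MAINV"]},
    # invented illustration: the Hindi side carries an extra object-like tag
    {"pair": "Ram stabbed Shyam. / raam ne shyaam ko chhuraa maaraa",
     "src_fts": ["SUBJ", "MAINV", "OBJ"],
     "tgt_fts": ["SUBJ", "MAINV", "OBJ", "OBJ2"]},
]
normal, divergent = partition(examples)
```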
To answer the second question, we feel that given an input sentence if it can be
decided a priori whether its translation will involve divergence then the retrieval can
be made accordingly. To handle the situation when the translation of the input sentence
does not involve any divergence, we devise a cost of adaptation based two-level
filtration scheme that enables quick retrieval from the normal example base12. Chapter 4
describes our scheme of retrieval from divergence example base in situations involving
divergence. Here our primary attempt is to develop a procedure so that given an
input English sentence it can decide whether its Hindi translation will involve any
type of divergence. Obviously, this decision has to be made before resorting to
the actual translation. Hence we call it prior identification of divergence. The
10 Appendix B provides details on the FTs.
11 SPAC structure is discussed in detail in Appendix C.
12 This scheme is discussed in Chapter 5.
algorithm seeks evidence from the example base and the WordNet. In this work we
have used WordNet 2.013 to measure semantic similarity of the constituent words
of the input sentence, and various words present in the example base sentences to
arrive at a decision in this regard. The scheme works in the following way. We first
identified the roles of different Functional Tags (FT) towards causing divergence.
We observe with respect to different divergence types and sub-types that each FT
may have one of the following three roles:

1) Its presence is mandatory for the corresponding divergence (sub-)type to occur;


2) Its absence is mandatory for the corresponding divergence (sub-)type to occur;
3) Occurrence/non-occurrence of the divergence (sub-)type is not influenced by
the FT under consideration.

This knowledge is stored in the form of a table (Table 4.2) in Chapter 4. Given
an input sentence the scheme first determines its constituent FTs. We have used
ENGCG parser14 for parsing an input sentence and obtaining its FTs. This finding
is then compared to the above-mentioned knowledge base (Table 4.2) to identify
the set (D) of divergence types that may possibly occur in the translation of this
sentence. Further investigation is carried out to discard elements from the set D, so
that the divergence that may actually occur can be pin-pointed. In this respect we
proceed in the following way. Corresponding to each divergence type we identify the
functional tag that is at the root of causing the divergence. We call it the problematic FT corresponding to that particular divergence. Table 4.3 presents our finding
in this regard. Corresponding to each possible divergence (as found in D) the scheme
13 http://www.cogsci.princeton.edu/cgi-bin/webwn
14 http://www.lingsoft.fi/cgi-bin/engcg

works as follows. It first retrieves from the input sentence the constituent word corresponding to the problematic FT of the divergence type under consideration. Then
the semantic similarity of this word with relevant words in the example base is measured. Proximity in this
semantic space is then used as a yardstick for similarity measurement. Chapter
4 discusses this scheme in detail.
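As a rough sketch of the prior identification step, a toy role table (a stand-in for Table 4.2) can be consulted to compute the candidate set D from the FTs of the input; the subsequent pin-pointing via the problematic FT and WordNet-based similarity is omitted here. Divergence names aside, all entries below are invented for illustration:

```python
# Sketch: prior identification of divergence. ROLES is a toy stand-in for
# the knowledge base of Table 4.2: for each divergence type, "+" means the
# FT's presence is mandatory, "-" means its absence is mandatory, and an
# FT not listed (role "0") does not influence the decision.

ROLES = {
    "conflational": {"OBJ": "-", "MAINV": "+"},   # invented rules
    "categorical":  {"PCOMPL": "+"},
}

def candidate_divergences(input_fts):
    """Set D: divergence types whose mandatory conditions the input meets."""
    D = set()
    for divergence, rules in ROLES.items():
        ok = all((ft in input_fts) == (role == "+")
                 for ft, role in rules.items())
        if ok:
            D.add(divergence)
    return D

# e.g. an intransitive input sentence: no object, no predicate complement
D = candidate_divergences({"SUBJ", "MAINV"})
```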
Finally, in Chapter 5 we look at how cost of adaptation may be used as a similarity measurement scheme. It has been stated that no unique definition of similarity
exists for comparing sentences. Similarity between sentences may be viewed from
different perspectives. In this work, we have first considered the two most general similarity schemes: syntactic similarity and semantic similarity. The ideas have
been borrowed from the domain of Information Retrieval (Manning and Schutze,
1999). According to the definition given therein, semantic similarity is measured on
the basis of commonality of words: the more words two sentences have in common, the more similar are the two sentences
under consideration. However, it has been shown in (Chatterjee, 2001) that this
measurement of similarity is not always helpful from EBMT point of view. For example, it has been shown there that although the sentences The horse had a good
run. and The horse is good to run on. have most of the key words in common, the
structures of their Hindi translations are very different. Consequently, adapting
the translation of one of them to generate the translation of the other is computationally demanding. On the other hand, syntactic similarity between two sentences
is measured on the basis of commonality of morpho-functional tags between them.
In this case, adaptation may require a large number of constituent word replacement
(WR) operations. Each of these WR operations involves reference to some dictionary for picking up the appropriate words in the target language. Typically the

dictionary access will involve accessing an external storage, and thereby will incur
significant computational cost. Thus a purely syntax-based similarity measurement
scheme may not be suitable for an EBMT system.
In this work we therefore propose that from EBMT perspective retrieval and
adaptation should be looked at in a unified way. In this chapter (i.e. Chapter
5) we investigate feasibility of the above proposal in depth. In this respect we first
look into the overall adaptation operations deeply. We have already observed that
these operations are invoked successively to remove the discrepancies between the
input sentence and the retrieved example. These discrepancies, as we observe, may
be in the actual words, or in the overall structure of the sentences. For illustration,
suppose the input sentence is The boy eats rice everyday., whose Hindi translation
ladkaa har roz chaawal khaataa hai has to be generated. The nature of the adaptation varies depending upon which example is retrieved from the example base. For
illustration:

a) If the retrieved example is The boy eats rice, the adaptation procedure needs
to apply a constituent word addition operation (WA) to take care of the adverb
everyday.
b) However, if the retrieved sentence is The boy plays cricket everyday. (ladkaa
roz cricket kheltaa hai), then the adaptation procedure needs to invoke two
constituent word replacement (WR) operations: to replace the Hindi of play,
i.e. khel, with the Hindi of eat, i.e. khaa, and cricket with
chaawal (rice).
c) In case the retrieved example is The boy is eating rice., one constituent word addition (WA) operation is required for the adverb
everyday. Further to take care of verb conjugation some morpho-word and


suffix operations need to be carried out. This is because the Hindi translation of The boy is eating rice is : ladkaa (boy) chaawal (rice) khaa (eat)
rahaa (..ing) hai (is). But the translation of the input sentence The boy
eats rice everyday should be ladkaa har roz chaawal khaataa hai . Thus the
morpho-word rahaa, which is required for the present continuous tense of
the retrieved sentence needs to be deleted. Further the suffix taa is to be
added to the root main verb to get the required present indefinite verb form
of the input.
d) However, if the retrieved example is Does the boy eat rice?, then the adaptation
procedure needs to take care of the structural variation between the interrogative form of the retrieved example, and the affirmative form of the input
sentence.
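Case (b) above can be sketched at the root+suffix level, where a WR operation swaps the root while the suffix carried by the word is preserved. The token segmentation into (root, suffix) pairs and the root correspondences are illustrative assumptions:

```python
# Sketch of case (b): two constituent word replacement (WR) operations at
# the root level. The segmentation and correspondences are illustrative.

def replace_root(tokens, old_root, new_root):
    """WR on the root; the suffix carried by the word is preserved,
    mirroring the suffix-level operations described earlier."""
    return [(new_root if root == old_root else root, suffix)
            for root, suffix in tokens]

# retrieved: The boy plays cricket everyday. -> ladkaa roz cricket kheltaa hai
tokens = [("ladkaa", ""), ("roz", ""), ("cricket", ""),
          ("khel", "taa"), ("hai", "")]
tokens = replace_root(tokens, "khel", "khaa")        # play -> eat
tokens = replace_root(tokens, "cricket", "chaawal")  # cricket -> rice
adapted = " ".join(root + suffix for root, suffix in tokens)
print(adapted)  # ladkaa roz chaawal khaataa hai
```

Producing the full target ladkaa har roz chaawal khaataa hai would additionally need a word addition (WA) operation for har.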

Obviously, the greater the discrepancy between the retrieved example and
the input sentence, the larger the number of adaptation operations needed to
generate the desired translation. The above illustrations make certain points evident:

a) Adaptation operations are required for performing two general tasks: dealing
with constituent words (along with their suffixes, morpho-words), and dealing
with the overall structure of the sentence.
b) Each invocation of adaptation operation pertains to a particular part of speech,
such as, noun, verb, adverb etc.
c) Of the ten adaptation operations (described earlier with respect to Chapter
2) only the WA and WR operations require dictionary15 searches. Since a dictionary search typically involves accessing an external device (e.g. a hard disk),
it is computationally more expensive than the other operations
(e.g. constituent word deletion, morpho-word operations), which are purely
RAM16-based and hence computationally cheaper.

The above observations help us to proceed towards achieving the intended goal
of using cost of adaptation as a measurement of similarity. As a first step towards
achieving this goal, we suggest dividing the dictionary into several parts
based on the part-of-speech (POS) of the words. This division reduces the
search time for each invocation of the dictionary-based operations. The cost of
adaptation based similarity measurement approach then proceeds along the following
line:

a) We first estimate the average cost for each of the ten adaptation operations.
We observe that these costs depend on two major types of parameters. On
one hand they depend on certain linguistic aspects, such as, the average length
of the sentences in both source and target languages, the number of suffixes
(used with different POS), the number of morpho-words etc. On the other
hand, these costs are related to the machine on which the EBMT system is
working. Since we aim at analyzing the costs in a general way, we assumed
these machine-dependent costs to be variables in all our analysis. For the linguistic parameters, we used values that we have obtained by analyzing about
15 By dictionary we mean a source language to target language word dictionary available online.
16 Random Access Memory

30,000 examples of English to Hindi translations. These examples were collected from various sources, namely translation books, advertisement materials, children's story books and government notices, which are freely available
in non-electronic form.
b) At the second step, we estimated the costs incurred in adapting various functional tags17 . In particular, we have considered cost of adaptation due to variations in active and passive verb morphology, subject/object, pre-modifying
adjective, genitive case and wh-family words. These costs are stored in various
tables, in Section 5.4.
c) At the third step we have considered costs of adaptation due to differences in
sentence structure. Here, we have considered four different sentence structures:
affirmative, negative, interrogative, negative-interrogative. These adaptation
costs too are stored in tabular form. Section 5.4 gives details of this analysis.
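The steps above can be sketched as a similarity measure. The numeric costs below are symbolic placeholders standing in for the modelled linguistic and machine-dependent parameters:

```python
# Sketch: cost of adaptation as a similarity measure. The per-operation
# costs are placeholders; in the thesis they are modelled from linguistic
# statistics and machine-dependent parameters.

OP_COST = {
    "word_add": 10, "word_replace": 10,   # need an external dictionary search
    "word_delete": 1, "copy": 0,
    "morpho_add": 1, "morpho_delete": 1, "morpho_replace": 1,  # RAM-based
    "suffix_add": 1, "suffix_delete": 1, "suffix_replace": 1,
}

def adaptation_cost(operations):
    return sum(OP_COST[op] for op in operations)

def most_similar(candidates):
    """Retrieval unified with adaptation: the most similar example is the
    one whose translation is cheapest to adapt."""
    return min(candidates, key=lambda c: adaptation_cost(c["ops"]))

# operation sequences for the input "The boy eats rice everyday."
candidates = [
    {"example": "The boy eats rice.",      "ops": ["word_add"]},
    {"example": "The boy is eating rice.", "ops": ["word_add", "morpho_delete",
                                                   "suffix_add"]},
]
best = most_similar(candidates)
```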
Once these basic costs are modelled, we are in a position to experiment on costs
of adaptation as a similarity measure vis-à-vis the semantics- and syntax-based similarity
measurement schemes discussed above. Our experiments have clearly established the
efficiency of the proposed scheme over the others. Part of this work is also presented
in Gupta and Chatterjee (2003c). Two apparent drawbacks of this scheme are:
1) It may end up in comparing a given input with all the example-base sentences
to ascertain the least cost of adaptation.
2) Another major question is whether the cost of adaptation scheme is efficient
enough to handle sentences that are structurally more complicated, e.g. complex
or compound sentences. It is a generally accepted fact that complex sentences
are difficult to handle in an MT system (Dorr et. al., 1998), (Hutchins, 2003),
(Sumita, 2001), (Shimohata et. al., 2003).

17 In fact we worked on Functional Slots, which are more general than Functional Tags. This is discussed in detail in Section 2.2.
In order to deal with the first difficulty we have proposed a two-level filtration scheme.
This scheme helps in selecting a smaller number of examples from the example base,
which may subsequently be subjected to the rigorous treatment for determining their
costs of adaptation with respect to the given input. We have also justified that this
scheme does not leave out the sentences whose translations are easier to adapt for
the given input.
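The first level of this filtration can be sketched with a cheap overlap score; only the surviving candidates undergo the full cost-of-adaptation computation. The scoring function and cutoff below are illustrative assumptions, not the thesis's actual filters:

```python
# Sketch: level-1 filtration. A cheap word-overlap score prunes the example
# base before the expensive cost-of-adaptation computation (level 2).
# The score and the cutoff are illustrative assumptions.

def coarse_filter(input_words, example_base, top_k=2):
    """Keep only the top_k examples by source-side word overlap."""
    def overlap(example):
        return len(set(input_words) & set(example["src"].split()))
    return sorted(example_base, key=overlap, reverse=True)[:top_k]

base = [
    {"src": "The boy eats rice"},
    {"src": "The boy plays cricket everyday"},
    {"src": "It is raining"},
]
shortlist = coarse_filter("The boy eats rice everyday".split(), base)
# level 2 would now compute the full adaptation cost on `shortlist` only
```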
In this work we have given a solution for the second problem too. We have
given rules for splitting a complex sentence into more than one simple sentence.
Translations of these simple sentences may then be generated by the EBMT system.
These individual translations may then be combined to obtain the translation of the
given complex sentence input. If the cost of adaptation based similarity measurement
scheme is applied for translating the simple sentences, then the cost of adaptation
of the complex sentence too can be estimated, by adding the individual costs and
the cost of combining the individual translations. Since the last operation is purely
algorithmic, its computational complexity can be easily computed, and hence the
overall cost of adaptation be estimated. With respect to dealing with complex
sentences, we have however imposed certain restrictions. We considered sentences with
only one subordinate clause. Further, the presence of a connecting word is also
mandatory. Evidently, more complicated complex sentence structures are available,
and further investigations are required for developing techniques for handling them
in an EBMT framework.
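Under the stated restrictions (one subordinate clause, an explicit connecting word), the splitting rule can be sketched as follows. The connective list and the clause boundary heuristic are simplified assumptions:

```python
# Sketch: splitting a complex sentence at its connective, under the stated
# restrictions (one subordinate clause, explicit connecting word).
# The connective list and the splitting rule are simplified assumptions.

CONNECTIVES = ("because", "when", "although", "that")

def split_complex(sentence):
    """Return the clauses and the connective, or the sentence unchanged."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words[1:], start=1):  # skip sentence-initial word
        if w.lower() in CONNECTIVES:
            return ([" ".join(words[:i]) + ".",
                     " ".join(words[i + 1:]) + "."], w)
    return [sentence], None

clauses, conn = split_complex("He stayed home because it was raining.")
# each clause is translated separately and the outputs recombined; the total
# cost is the sum of the clause costs plus the (algorithmic) combining cost
```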
In this connection we would like to mention that we have explained the cost of
adaptation with respect to a selected set of sentence structures, and for a selected set of
Functional slots. Definitely many more variations are available with respect to these
parameters. Consequently, more work has to be done to form rules for handling
these variations. However, we feel that the work described in this research provides a
suitable guideline for its further continuation.

1.2 Some Critical Points

1) The aim of this research is not to construct an English to Hindi EBMT system.
Rather, our intention is to analyze the requirements that help in building an
effective EBMT system. The motivation behind this research came from two
major observations:
- Although some MT systems for translation from English to Hindi already
exist, the quality of their translation is often not up to the mark. This
prompted us to look into the process of MT to ascertain the inherent
difficulties.
- We have chosen EBMT as our preferred paradigm because of certain
advantages it has over other MT paradigms such as RBMT and SBMT. One major
advantage of EBMT is that it requires neither a huge parallel corpus as
required by SBMT, nor the framing of a large rule base as required by
RBMT. Study of EBMT is therefore feasible for us, as we did not have
access to such linguistic resources.
2) In order to design our scheme we have studied about 30,000 English to Hindi
translation examples available off-line. Although large volumes of English
to Hindi parallel text are now available on-line (EMILLE: http://www.emille.lancs.ac.uk/home.htm), at the time this work was started no such parallel corpus
was available to us. For our work we prepared an on-line parallel example base
of 4,500 sentences. These example pairs were chosen carefully so that different
sentence structures as well as translation variations (divergences) are taken care of
as much as possible.
3) Each translation example record in our example base contains morpho-functional
tag18 information for each of the constituent words of the source language (English) sentence, along with the sentence, its Hindi translation, and the root
word correspondence. Figure 1.1 provides an example of the records stored in
our example base.

English sentence: The horses have been running for one hour.
Tagged form: @DN> ART the, @SUBJ N PL horse %ghodaa%,
@+FAUXV V PRES have, @-FAUXV V PCP2 be, @-FMAINV V PCP1
run %daudaa%, @ADVL PREP for, @QN> NUM CARD one %ek%, @<P
N SG hour %ghantaa%.
Hindi sentence: ghode ek ghante se daudaa rahen hain

Figure 1.1: An Example Sentence with Its Morpho-Functional Tags

The morpho-functional tags of a word indicate its syntactic function within
the sentence. The tags are helpful in identifying the root words, their roles
in the sentence, and the roles of the different suffixes (used for declensions) in the
overall sentence construction.
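The tagged form of Figure 1.1 can be parsed into structured records with a short routine. The record layout and the field handling below are simplified assumptions about the storage format, not its exact definition:

```python
# Sketch: parsing one tagged chunk of an example-base record, as shown in
# Figure 1.1: functional tag, morpho tags, English root, and the optional
# %...% Hindi root correspondence. Field handling is simplified.
import re

def parse_tagged_word(chunk):
    """'@SUBJ N PL horse %ghodaa%' -> (FT, morpho tags, root, Hindi root)"""
    m = re.match(r"(@\S+)\s+(.*?)\s+(\S+?)(?:\s+%(.+)%)?$", chunk.strip())
    ft, morpho, root, hindi = m.groups()
    return ft, morpho.split(), root, hindi  # hindi is None if absent

record = {
    "english": "The horses have been running for one hour.",
    "hindi": "ghode ek ghante se daudaa rahen hain",
    "tags": [parse_tagged_word(c) for c in (
        "@DN> ART the",
        "@SUBJ N PL horse %ghodaa%",
        "@-FMAINV V PCP1 run %daudaa%",
    )],
}
```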

In this work we have studied the two major pillars of EBMT: Retrieval and
Adaptation. We feel that the studies made as well as the techniques developed by
18 Appendix B provides different morpho tags and functional tags that have been used in this work. These tags are obtained by editing the sentence tagging given by the ENGCG parser (http://www.lingsoft.fi/cgi-bin/engcg) for English sentences.

this research will be helpful for developing MT systems not only for Hindi but also for
other Indian languages (e.g. Bangla, Gujrati, Panjabi). All these languages suffer
from the same drawback - unavailability of linguistic resources. However, the demand
for developing MT systems from English to these languages is increasing with time,
not only because these are prominent regional languages of India, but also because they
are important minority languages in other countries such as the U.K. (Somers, 1997).
The studies made in this research should pave the way for developing EBMT systems
involving these languages as well.

Chapter 2

Adaptation in English to Hindi Translation: A Systematic Approach

2.1 Introduction

The need for an efficient and systematic adaptation scheme arises for modifying a
retrieved example, and thereby generating the required translation. This chapter is
devoted to the study of a systematic adaptation approach. Various approaches have
been pursued in dealing with the adaptation aspect of an EBMT system. Some of the
major approaches are described below.
1. Adaptation in Gaijin (Veale and Way, 1997) is modelled via two categories:
high-level grafting and keyhole surgery. High-level grafting deals with phrases.
Here an entire phrasal segment of the target sentence is replaced with another
phrasal segment from a different example. On the other hand, keyhole surgery
deals with individual words in an existing target segment of an example. Under
this operation words are replaced or morphologically fine-tuned to suit the
current translation task. For instance, suppose the input sentence is The girl
is playing in the park., and in the example base we have the following examples:
(a) The boy is playing.
(b) Rita knows that girl.
(c) It is a big park.
(d) Ram studies in the school.
For the high-level grafting the sentences (a) and (d) will be used. Then keyhole
surgery will be applied for putting in the translations of the words park and
girl. These translations will be extracted from (b) and (c).
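The two operations can be sketched on this very example. The phrase segmentation and the word correspondences below are toy assumptions, not Gaijin's actual representation:

```python
# Sketch: high-level grafting vs. keyhole surgery in the style of Gaijin.
# The phrase segmentation and replacements are toy assumptions.

def graft(template_phrases, slot, new_phrase):
    """High-level grafting: swap in an entire phrasal segment."""
    grafted = dict(template_phrases)
    grafted[slot] = new_phrase
    return grafted

def keyhole(phrase, old_word, new_word):
    """Keyhole surgery: replace a single word inside a segment."""
    return [new_word if w == old_word else w for w in phrase]

# (a) "The boy is playing." grafted with the PP of (d) "Ram studies in the school."
sent = graft({"NP": ["the", "boy"], "VP": ["is", "playing"]},
             "PP", ["in", "the", "school"])
# keyhole surgery using words found via (b) and (c): boy -> girl, school -> park
sent["NP"] = keyhole(sent["NP"], "boy", "girl")
sent["PP"] = keyhole(sent["PP"], "school", "park")
result = " ".join(sent["NP"] + sent["VP"] + sent["PP"])
print(result)  # the girl is playing in the park
```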
2. Shiri et. al. (1997) have proposed another adaptation procedure. It is based on
three steps: finding the difference, replacing the difference, and smoothing the
output. The differing segments of the input sentence and the source template
are identified. Translations of these different segments in the input sentence
are produced by rule-based methods, and these translated segments are fitted
into a translation template. The resulting sentence is then smoothed over by
checking for person and number agreement, and inflection mismatches. For
example, assume the input sentence and selected template are:
SI: A very efficient lady doctor is busy.
St: A lady doctor is busy.
Tt: mahilaa chikitsak vyasta hai

The parsing process, however, shows that The very efficient lady doctor is a
noun phrase, and so matches it with The lady doctor (ek mahilaa chikitsak).
The very efficient lady doctor is translated as ek bahut yogya mahilaa
chikitsak by the rule-based noun phrase translation system. This is inserted
into Tt, giving the following: Tt : ek bahut yogya mahilaa chikitsak vyasta hai.
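The replace step above can be sketched as a small string operation. This is only an illustrative sketch of our own, not the actual system: the lookup dictionary stands in for the rule-based noun phrase translator, the differing segments are assumed to be already identified by the parser, and the final smoothing step is omitted.

```python
# Sketch of the "replace the difference" step: the TL segment corresponding to
# the differing SL segment is swapped for the rule-based translation of the
# new SL segment.

def adapt(t_template, tl_segment, new_sl_segment, translate):
    # Substitute the translated differing segment into the translation template.
    return t_template.replace(tl_segment, translate(new_sl_segment))

# Stand-in for the rule-based noun phrase translation system.
np_translations = {"A very efficient lady doctor": "ek bahut yogya mahilaa chikitsak"}

result = adapt("mahilaa chikitsak vyasta hai",      # Tt
               "mahilaa chikitsak",                 # TL side of the matched NP
               "A very efficient lady doctor",      # differing SL segment
               np_translations.get)
print(result)   # → ek bahut yogya mahilaa chikitsak vyasta hai
```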
3. The ReVerb system (Collins, 1998) proposed the following adaptation scheme.
Here two different cases are considered: full-case adaptation and partial-case
adaptation. Full-case adaptation is employed when a problem is fully covered
by the retrieved example. Here the desired translation is created by substitution
alone: no addition or deletion is required for adapting TL0 to generate the
translation of SL. Here TL0 and SL denote the example base target language
sentence and the input source language sentence, respectively. In this case five
scenarios are possible: SAME, ADAPT, IGNORE, ADAPTZERO and IGNOREZERO.
Partial-case adaptation is used when a single unifying example does not exist.
Here three more operations are required on top of the above five. These
three operations are ADD, DELETE and DELETZERO.

Adaptation in English to Hindi Translation: A Systematic Approach

Figure 2.1: The five possible scenarios in the SL, SL0 and TL0 interface of
partial case matching

Note that there is a subtle difference between ADAPT and ADAPTZERO.
For ADAPT as well as for ADAPTZERO, both SL and SL0 have the same links
but different chunks. If TL0 has words corresponding to the chunk which is
different in SL and SL0, then the words in TL0 should be modified; this is
the case of ADAPT. On the other hand, if no corresponding chunk is present
in TL0 then it is the case of ADAPTZERO, and in that case no work is
needed for adaptation. Similar subtleties may be observed between DELETE
and DELETZERO, and also between IGNORE and IGNOREZERO. Other
operations (such as SAME and ADD) have obvious interpretations. Figure 2.1
provides the conceptual view of partial case matching.
4. Somers (2001) proposes adaptation from a case-based reasoning (CBR) point of
view. The simplest of the CBR adaptation methods is null adaptation, where no
changes are recommended. In more general situations various substitution
methods (e.g. reinstantiation, parameter adjustment) or transformation methods
(e.g. commonsense transformation and model-guided repair) may be applied.
For example, suppose the input sentence (I) and the retrieved example (R) are:
I : That old woman has died.
R : That old man has died. wah boodhaa aadmii mar gayaa
To generate the desired translation, the translation of the word man, i.e.
aadmii, is first replaced with the translation of woman, i.e. aurat, in R.
This operation is called reinstantiation. At this stage an intermediate
translation wah boodhaa aurat mar gayaa is obtained. To obtain the final
translation wah boodhii aurat mar gayii, the system must also change the
adjective boodhaa to boodhii and the word gayaa to gayii. This is called
parameter adjustment.
5. The adaptation scheme proposed by McTait (2001) works in the following way.
Translation patterns that share lexical items with the input and partially cover
it are retrieved in a pattern matching procedure. From these, the patterns
whose SL side covers the SL input to the greatest extent (longest cover) are
selected. They are termed base patterns, as they provide sentential context in
the translation process. Intuitively, the greater the extent of the cover
provided by the base patterns, the more the context, and the lesser the
ambiguity and complexity in the translation process. If the SL side of the base
pattern does not fully cover the SL input, any unmatched segments are bound
to the variables on the SL side of the base pattern. The translations of the SL
segments bound to the SL variables of the base pattern are retrieved from the
remaining set of translation patterns, and the text fragments and variables on
the TL side of the base pattern form the translation string.
The following is a simple example. Given the source language input I: AIDS
control programme for Ethiopia, suppose the longest covering base pattern is:
D1: AIDS control programme for (...) (...) ke liye AIDS contral smahaaroo.
To complete the match between I and the source language side of D1, a
translation pattern containing the text fragment Ethiopia is required, i.e.
D2: (...) Ethiopia (...) ethiopia (...).
The TL translation T: ethiopia ke liye AIDS contral smahaaroo is generated
by recombining the text fragments: since Ethiopia and ethiopia are aligned
on a 1:1 basis in D2, as are the variables in the base pattern D1, the TL text
fragment ethiopia is bound to the variable on the TL side of D1 to produce T.
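The recombination just described can be sketched as follows; the pattern encoding (a literal "X" marking the variable on each side) is our own illustration, not McTait's notation.

```python
# Sketch of McTait-style recombination: bind the part of the SL input not
# covered by the base pattern to the variable, look up its translation from
# another pattern, and substitute it on the TL side.

def recombine(base_sl, base_tl, sl_input, fragment_patterns):
    prefix, _, suffix = base_sl.partition("X")
    end = len(sl_input) - len(suffix) if suffix else len(sl_input)
    bound = sl_input[len(prefix):end]            # segment bound to the variable
    return base_tl.replace("X", fragment_patterns[bound])

t = recombine("AIDS control programme for X",          # SL side of D1
              "X ke liye AIDS contral smahaaroo",      # TL side of D1
              "AIDS control programme for Ethiopia",   # input I
              {"Ethiopia": "ethiopia"})                # alignment from D2
print(t)   # → ethiopia ke liye AIDS contral smahaaroo
```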
6. In HEBMT (Jain, 1995) examples are stored in an abstracted form for
determining the structural similarity between the input sentence and the
example sentences. The target language sentence is generated using the target
pattern of the example sentence that has the least distance from the input
sentence. The system substitutes the corresponding translations of syntactic
units, identified by a finite state machine, in the target pattern. Variation in
the tense of the verb, and variations due to number, gender etc., are taken care
of at this stage for generating the appropriate translation. Since this system
translates from Hindi to English, we explain its adaptation process with an
example of Hindi to English translation.
For example, suppose the input sentence is merii somavara ko jaa rahii hai,
and it matches with the example sentence R: meraa dosta itavaar ko aayegaa.
Steps (a) to (f) below show the process of translation.
(a) merii somavara ko jaa rahii hai (input sentence)


(b) <snp>1 <npk2>2 <mv>3 (syntactic grouping)


(c) [Mary] [Monday] [go] (English translation of syntactic groups)
(d) <snp> <mv> {on} <npk2> (target pattern of example R)
(e) [Mary] [is going] on [Monday] (Translation after substitution)
(f) Mary is going on Monday (Final translated output)

Many other EBMT systems are found in the literature, e.g. GEBMT (Brown, 1996,
1999, 2000, 2001), EDGAR (Carl and Hansen, 1999) and TTL (Güvenir and Cicekli,
1998). But overall, in our view, the adaptation procedures employed in different
EBMT systems primarily consist of four operations:

Copy, where the same chunk of the retrieved translation example is used in
the generated translation;
Add, where a new chunk is added in the retrieved translation example;
Delete, when some chunk of the retrieved example is deleted; and
Replace, where some chunk of the retrieved example is replaced with a new
one to meet the requirements of the current input.

The operations prescribed in different systems vary in the chunks they deal with.
Depending upon the case it may be a phrase, a word or a sub-word (e.g. declensional
suffix).
1 snp: noun, adj+noun, noun+kaa+noun
2 npk2: noun+ko
3 mv: verb-part
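These four operations can be viewed as an edit script applied to the chunks of the retrieved translation. The following is a minimal chunk-level sketch of our own (real systems apply the operations to phrases, words or sub-words as discussed above); it replays the squirrel/groundnuts example used later in Section 2.2.

```python
# Apply a chunk-level edit script (Copy/Add/Delete/Replace) to a retrieved
# translation. Each script entry names an operation and, where needed, the
# new chunk to be added or substituted.

def apply_script(retrieved_chunks, script):
    chunks = list(retrieved_chunks)   # do not mutate the caller's example
    out = []
    for op, arg in script:
        if op == "copy":              # reuse the next retrieved chunk as is
            out.append(chunks.pop(0))
        elif op == "add":             # insert a new chunk
            out.append(arg)
        elif op == "delete":          # drop the next retrieved chunk
            chunks.pop(0)
        elif op == "replace":         # substitute the next retrieved chunk
            chunks.pop(0)
            out.append(arg)
    return " ".join(out)

# haathii phal khaa rahaa thaa, adapted for The squirrel was eating groundnuts.
result = apply_script(
    ["haathii", "phal", "khaa", "rahaa", "thaa"],
    [("replace", "gilharii"), ("replace", "moongphalii"),
     ("copy", None), ("copy", None), ("copy", None)])
print(result)   # → gilharii moongphalii khaa rahaa thaa
```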



With respect to English and Hindi, we find that both languages depend
heavily on suffixes for verb morphology, changing numbers from singular to
plural and vice versa, case endings, etc. Appendix A provides detailed
descriptions of various Hindi suffixes. Keeping the above in view, we
differentiate the adaptation operations into two groups: word based and suffix
based. The word based operations are further subdivided into two categories:
constituent word based and morpho-word based. Thus the adaptation scheme
proposed here consists of ten operations: Copy (CP), Constituent word deletion
(WD), Constituent word addition (WA), Constituent word replacement (WR),
Morpho-word deletion (MD), Morpho-word addition (MA), Morpho-word
replacement (MR), Suffix addition (SA), Suffix deletion (SD) and Suffix
replacement (SR). Section 2.2 illustrates the roles of these operations in
adapting a retrieved translation example.
The advantage of the above classification of adaptation operations is twofold.
Firstly, it helps in identifying the specific task that has to be carried out in the
step-by-step adaptation for a given input. Secondly, it helps in measuring the
average cost of each of the above operations in a meaningful way, which in turn
helps in estimating the total adaptation cost for a given sentence. This estimate
can be used as a tool for similarity measurement between an input and the
stored examples. These issues are discussed in Chapter 5.

2.2 Description of the Adaptation Operations

The ten adaptation operations mentioned above are described below.


1. Constituent Word Replacement (WR): One may get the translation of the
input sentence by replacing some words in the retrieved translation example.

Suppose the input sentence is: The squirrel was eating groundnuts., and the
most similar example retrieved by the system (along with its Hindi translation)
is: The elephant was eating fruits. haathii phal khaa rahaa thaa. The
desired translation may be generated by replacing haathii with the Hindi of
squirrel, i.e. gilharii and replacing phal with the Hindi of groundnuts,
i.e. moongphalii . These are examples of the operation of constituent word
replacement.
2. Constituent Word Deletion (WD): In some cases one may have to delete some
words from the translation example to generate the required translation. For
example, suppose the input sentence is: Animals were dying of thirst. If the
retrieved translation example is: Birds and animals were dying of thirst.
pakshii aur pashu pyaas se mar rahe the, then the desired translation can
be obtained by deleting pakshii aur (i.e. the Hindi of birds and) from the
retrieved translation. Thus the adaptation here requires two constituent word
deletions.
3. Constituent Word Addition (WA): This operation is the opposite of constituent
word deletion. Here some additional words need to be added to the retrieved
translation example for generating the translation. For illustration, one
may consider the example given above with the roles of the input and retrieved
sentences reversed.
4. Morpho-word Replacement (MR): In this case one morpho-word is replaced by
another morpho-word in the retrieved translation example. Consider a case
when the input sentence is: The squirrel was eating groundnuts., and the
retrieved example is: The squirrel is eating groundnuts. gilharii moongfalii
khaa rahii hai. In order to take care of the variation in tense, the morpho-word
hai is to be replaced with thaa. This is an example of morpho-word
replacement.
5. Morpho-word Deletion (MD): Here some morpho-word(s) are deleted from the
retrieved translation example. For illustration, if the input sentence is He
eats rice., and the retrieved example is: He is eating rice. wah chaawal
khaa rahaa hai, then to obtain the desired translation4 first the morpho-word
rahaa is to be deleted from the retrieved translation example.
6. Morpho-word Addition (MA): This is the opposite case of morpho-word
deletion. Here some morpho-words need to be added to the retrieved example
in order to generate the required translation.
7. Suffix Replacement (SR): Here the suffix attached to some constituent word
of the retrieved sentence is replaced with a different suffix to meet the current
translation requirements. This may happen with respect to a noun, adjective,
verb, or case ending. For illustration,
(a) To change the number of nouns
Boy (ladkaa) → Boys (ladke)
The suffix aa is replaced with e in order to get the plural form in
Hindi.
(b) Change of adjectives
Bad boy (buraa ladkaa) → Bad girl (burii ladkii)
The suffix aa is replaced with ii to get the adjective burii.
4 Of course the final translation will be obtained by adding the suffix taa to the word
khaa.


(c) Morphological changes in verb
He reads. (wah padtaa hai) → She reads. (wah padtii hai)
The suffix taa is replaced with tii to get the verb padtii, which is
required to indicate that the subject is feminine.
(d) Morphological changes due to case ending
boy (ladkaa) → from boy (ladke se)
room (kamraa) → in room (kamre mein)
The suffix aa is replaced with e to get the nouns ladke and kamre.
8. Suffix Deletion (SD): By this operation the suffix attached to some constituent
word may be removed, and thereby the root word may be obtained. This
operation is illustrated in the following examples:
(a) To change the number of nouns
women (aauraten) → woman (aaurat)
The suffix en is deleted from aauraten to get the Hindi translation
of woman.
(b) Morphological changes in verb
He reads. (wah padtaa hai) → He is reading. (wah pad rahaa hai)
The suffix taa is deleted from padtaa to get the root form pad of
the English verb read.
(c) Morphological changes due to case ending
in the houses (gharon mein) → houses (ghar)
in words (shabdon mein) → words (shabd)
The suffix on is deleted from gharon and shabdon to get the Hindi
translations of the nouns houses and words, respectively.


9. Suffix Addition (SA): Here a suffix is added to some constituent word in the
retrieved example. Note that here the word concerned is in its root form in
the retrieved example. One may consider the examples given above, with the
roles of the input and retrieved sentences reversed, as suitable examples of the
suffix addition operation.
10. Copy (CP): When some word (with or without suffix) of the retrieved example
is retained in toto in the required translation then it is called a copy operation.
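The three suffix operations reduce to simple edits on word endings. A naive sketch of our own (real morphological analysis needs a lexicon to identify the root reliably):

```python
# Suffix replacement, deletion and addition as plain string edits, using the
# Hindi examples given above.

def suffix_replace(word, old, new):
    assert word.endswith(old)
    return word[: -len(old)] + new

def suffix_delete(word, suffix):
    assert word.endswith(suffix)
    return word[: -len(suffix)]

def suffix_add(root, suffix):
    return root + suffix

print(suffix_replace("padtaa", "taa", "tii"))  # → padtii
print(suffix_delete("gharon", "on"))           # → ghar
print(suffix_add("khaa", "taa"))               # → khaataa
```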
Figure 2.2 provides an example of adaptation using the above operations. In this
example the input sentence is He plays football daily., and the retrieved translation
example is:
They are playing football. we football khel rahe hain
(They) (football) (play) (are ...ing)

The translation to be generated is: wah roz football kheltaa hai. When adaptation
is carried out using both word and suffix operations, the adaptation steps look as
given in Figure 2.2. In this respect one may note that Hindi is a free word order
language, and consequently the position of the adverb is not fixed. Hence the above
input sentence may have different Hindi translations:
wah roz football kheltaa hai
wah football roz kheltaa hai
roz wah football kheltaa hai
While implementing an EBMT system one has to stick to some specific format.
The adverb will be added according to the format adopted by the system.
Input      Operation   Output
we         WR          wah
-          WA          roz
football   CP          football
khel       SA          kheltaa
rahe       MD          -
hain       MR          hai

Figure 2.2: Example of Different Adaptation Operations
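The step sequence of Figure 2.2 can be written out and executed directly; the flat (chunk, operation, argument) encoding is our own illustration of the ten operations defined above.

```python
# Replay the adaptation steps of Figure 2.2: adapting
# "we football khel rahe hain" (They are playing football.)
# into "wah roz football kheltaa hai" (He plays football daily.).

steps = [
    ("we",       "WR", "wah"),   # constituent word replacement
    (None,       "WA", "roz"),   # constituent word addition
    ("football", "CP", None),    # copy
    ("khel",     "SA", "taa"),   # suffix addition: khel -> kheltaa
    ("rahe",     "MD", None),    # morpho-word deletion
    ("hain",     "MR", "hai"),   # morpho-word replacement
]

out = []
for chunk, op, arg in steps:
    if op in ("WR", "MR", "WA", "MA"):   # replacements/additions emit the new item
        out.append(arg)
    elif op == "CP":                     # copy keeps the retrieved chunk
        out.append(chunk)
    elif op == "SA":                     # suffix addition extends the chunk
        out.append(chunk + arg)
    # deletions (WD, MD, SD) emit nothing

translation = " ".join(out)
print(translation)   # → wah roz football kheltaa hai
```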


Which adaptation operations will be required to translate a given input sentence
depends upon the translation example retrieved from the example base. A variety
of examples may be adapted to generate the desired translation, but obviously with
varying computational costs. For efficient performance, an EBMT system, therefore,
needs to retrieve an example that can be adapted into the desired translation with
the least cost. This brings in the notion of similarity among sentences. The proposed
adaptation procedure has the advantage that it provides a systematic way of
evaluating the overall adaptation cost. This estimated cost may then be used as a
good measure of similarity for appropriate retrieval from the example base. How the
cost of adaptation may be used as a yardstick to measure similarity between
sentences will be described in Chapter 5.
Here our aim is to count the number of adaptation operations required in adapting
a retrieved example to generate the translation of a given input. Obviously,
depending upon the situation, one has to apply some adaptation operations for
changing different functional slots5 (Singh, 2003), such as subject (<S>), object
(<O>) and verb (<V>). Also certain operations are required for changing the kind
of sentence, e.g.
5 The following example illustrates the difference between functional slots and functional tags.
Consider the sentence The old man is weak.. The subject of this sentence is the noun phrase The
old man. It consists of three functional tags, viz. @DN>, @AN> and @SUBJ, stating that the
is a determiner, old is an adjective, and man is the subject. But, as mentioned above, the entire
noun phrase plays the role of subject of the sentence. Thus the functional slot for this phrase is
<S>, i.e. the subject slot. Note that a particular functional slot may have a variable number of
words. The sequence of functional slots in a sentence provides the sentence pattern. The difference
between various tags (e.g. POS tag, functional tag) is explained in detail in Appendix B.



affirmative to negative, negative to interrogative etc. Table 2.2 contains the
notations for the roles of different functional slots and operators, which are
required for the subsequent discussion.
Operators        Role of operators
<>               For a functional slot or part of speech and its transformation,
                 e.g. <S>, <V> etc.
&                Both functional slots or parts of speech and their
                 transformations should be present.
or               Either the first slot/tag, the second slot/tag, or both.
{}               For a non-obligatory functional tag/slot or for an optional
                 adaptation operation.
[]               For the property of a functional slot/tag.

Functional Slot    Role of functional slot
<LV>               Linking verbs; in English: are, am, was, were, become,
                   seem etc., and in Hindi: hai, hain, ho, thaa, the etc.
<V>                Auxiliary verb (if any) and main verb of the sentence
<AuxV>             Auxiliary verb
<MainV>            Main verb
<S>                Subject
<O>                Object
<O1>               First object
<O2>               Second object
<SC>               Subjective complement
<PCP1 form>        -ing verb form other than the main verb
<PCP2 form>        -ed or -en verb forms other than the main verb
<to-infinitive>    to-infinitive form of verb
<Adverb>           Adverb
<AdjP>             Adjective phrase
<PP>               Preposition phrase
<preposition>      Preposition

Table 2.2: Notations Used in Sentence Patterns



The following sections describe how many such operations are required in
different cases. In particular we consider the following functional slots and
sentence kinds:
1. Tense and form of the verb. Since there are three tenses (viz. Present,
Past and Future) and four forms (Indefinite, Continuous, Perfect, and Perfect
Continuous), in all one can have 12 different active verb structures, besides
the passive verb structures.
2. Subject/object functional slot. Variations in the subject/object functional slot
may happen in many different ways, such as Proper Noun, Common Noun
(Singular or Plural), Pronoun, PCP1 form6 and PCP2 form7, along with
variations in pre-modifier adjective, genitive case, quantifier and determiner
tags.
3. Wh-family interrogative sentences.
4. Kind of sentence: whether the sentence is affirmative, negative, interrogative
or negative interrogative.
Systematic study of these patterns and their components helps in estimating
the adaptation costs between them.

2.3 Study of Adaptation Procedure for Morphological Variation of Active Verbs

Hindi verb morphological variations depend on four aspects: the gender, number
and person of the subject, and the tense (and form) of the sentence. All these
variations affect the adaptation procedure. In Hindi, these conjugations are
realized by using suffixes attached to the root verbs, and/or by adding some
auxiliary verbs (see Table A.3 of Appendix A). Since there are 12 different
structures (depending upon the tense and form), the adaptation scheme should
have the capability to adapt any one of them for any of the input types. Hence
altogether 12 x 12, i.e. 144, different combinations are possible. However, Table
A.3 (Appendix A) shows that in Hindi the perfect continuous form of any tense
has the same verb structure as the continuous form of the same tense. Therefore
we exclude the perfect continuous form from our discussion. Thus our work
concentrates on 9 x 9, i.e. 81, possibilities.
6 -ing verb form other than the main verb
7 -ed or -en verb forms other than the main verb
These 81 possible combinations of verb morphology variations are divided into
four groups. These are:
1. Same tense same verb form
2. Different tenses same verb form
3. Same tense different verb forms
4. Different tenses different verb forms
In the following subsections we explain in detail how adaptation is carried out in
these four groups. One may note that in both English and Hindi verb morphological
variations depend not only on the tense and form of the verb, but also on the gender,
number and person of the subject of the sentence. However, since Hindi grammar
does not support neuter gender, every noun is considered as masculine or feminine.
The adaptation rules have been developed keeping the above in view. In
general, these rules are represented in the form of tables, where the column and
row headers specify the nature of the subject of the input sentence and of the retrieved

example, respectively. The row and column headers are of the form
gender-person-number, where the gender can be M or F, the person can be 1, 2
or 3, specifying first, second or third person, and the number is either S or P,
denoting singular or plural. Note that here the gender of the English sentence
subject is assigned according to Hindi grammar rules. The content of
the (i, j)th cell gives the adaptation operations that need to be carried out when
the subject of the input sentence matches the specification of the jth column
header, and that of the retrieved example matches the specification of the ith row
header.

2.3.1 Same Tense Same Verb Form

Here the input sentence and the retrieved example both have the same tense and
form. Yet, verb morphological variations may occur in the translation depending
upon variations in the number, gender and person of the subject.
For illustration, we consider the case when both the input and the retrieved
sentences have the main verb in present indefinite form. Table 2.3 lists the
adaptation operations involved for verb morphological variations. In general, in
this situation the verb adaptation requires at most one suffix replacement and one
morpho-word replacement. Suffix replacement is confined to the set {taa, te, tii}
(call it S1), while morpho-word replacement is associated with the set {hain, hai,
ho, hoon} (call it M1) (refer to Table A.3). Note that if the person, number and
gender of the subject in both the input and retrieved sentences are the same, then
only copy operations will be performed.
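The cell entries of Table 2.3 follow mechanically from two lookups: SR is needed when the suffixes of the two subject categories differ, MR when their morpho-words differ, and CP when neither does. A sketch of our own (the per-category suffixes and morpho-words follow the present indefinite conjugation of Table A.3):

```python
# Derive a Table 2.3 cell from the retrieved and input subject categories.
# Suffix and morpho-word per category, present indefinite (from Table A.3).

SUFFIX = {"M1S": "taa", "F1S": "tii", "M1P": "te", "F1P": "tii",
          "M2S": "te",  "F2S": "tii", "M3S": "taa", "M3P": "te",
          "F3S": "tii", "F3P": "tii"}
MWORD  = {"M1S": "hoon", "F1S": "hoon", "M1P": "hain", "F1P": "hain",
          "M2S": "ho",   "F2S": "ho",   "M3S": "hai",  "M3P": "hain",
          "F3S": "hai",  "F3P": "hain"}

def cell(retrieved, inp):
    ops = []
    if SUFFIX[retrieved] != SUFFIX[inp]:
        ops.append("SR")                  # suffix replacement needed
    if MWORD[retrieved] != MWORD[inp]:
        ops.append("MR")                  # morpho-word replacement needed
    return "+".join(ops) or "CP"          # nothing to change: copy

print(cell("M1P", "F3S"))  # → SR+MR   (the worked example that follows)
print(cell("M3S", "M3S"))  # → CP
```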
We illustrate with an example how Table 2.3 is to be used for adaptation of verb
morphological variations. Suppose the input sentence is She eats rice., and

Retd\Input  M1S    F1S    M1P    F1P    M2S    F2S    M3S    M3P    F3S    F3P
M1S         CP     SR     SR+MR  SR+MR  SR+MR  SR+MR  MR     SR+MR  SR+MR  SR+MR
F1S         SR     CP     SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     MR
M1P         SR+MR  SR+MR  CP     SR     MR     SR+MR  SR+MR  CP     SR+MR  SR
F1P         SR+MR  MR     SR     CP     SR+MR  MR     SR+MR  SR     MR     CP
M2S         SR+MR  SR+MR  MR     SR+MR  CP     SR     SR+MR  MR     SR+MR  SR+MR
F2S         SR+MR  MR     SR+MR  MR     SR     CP     SR+MR  SR+MR  MR     MR
M3S         MR     SR+MR  SR+MR  SR+MR  SR+MR  SR+MR  CP     SR+MR  SR     SR+MR
M3P         SR+MR  SR+MR  CP     SR     MR     SR+MR  SR+MR  CP     SR+MR  SR
F3S         SR+MR  MR     SR+MR  MR     SR+MR  MR     SR     SR+MR  CP     MR
F3P         SR+MR  MR     SR     CP     SR+MR  MR     SR+MR  SR     MR     CP

Table 2.3: Adaptation Operations of Verb Morphological
Variations in Present Indefinite to Present Indefinite

the retrieved example is We eat rice. ham chaawal khaate hain. In the input
sentence the subject is 3rd person, feminine and singular, whereas in the retrieved
sentence the subject is 1st person, masculine and plural.
The cell (3, 9), i.e. the cell corresponding to (M1P, F3S), suggests that two
adaptation operations are required: suffix replacement (SR) and morpho-word
replacement (MR). The suffix te is replaced with tii in the main verb khaate
as a suffix replacement operation, and the morpho-word hain is replaced with
hai in the retrieved Hindi sentence to get the Hindi translation of the input
sentence. Although the subject ham also needs to be replaced with wah to
get the appropriate Hindi translation, this is not considered in the discussion
on verbs. The translation of the input sentence is: wah chaawal khaatii hai.
Under this group nine combinations are possible, taking into account the three
tenses and the three forms for each of them. Adaptation rule tables for the other
eight possibilities have been developed in a similar way. Salient features of these
verb morphological variations are discussed below:

1. Past indefinite to past indefinite: Here the verb morphological variation is done
in a way similar to the present indefinite case discussed just above. However,
for morpho-word replacement, the set to be considered is {thaa, the, thii} (call
it M2) instead of the set M1.
2. Future indefinite to future indefinite: In this case either a copy operation or
a suffix replacement operation is used to handle the verb morphological
variations. Accordingly, from Table 2.3 all the morpho-word replacement
operations have to be removed in order to handle the future indefinite case.
Further, it has to be taken into account that for the suffix replacement (SR)
operations the set {oongaa, oongii, oge, ogii, egaa, egii, enge, engii} (call it S2)
is to be considered (see Table A.3 of Appendix A) instead of the set S1, i.e.
{taa, te, tii}, used for present indefinite.
3. Present continuous to present continuous: In this case either a copy operation
or one/two morpho-word replacements are required to deal with the verb
morphological variations, depending upon the variations in the gender, number
and person of the subjects concerned (see Section A.2 of Appendix A). Thus
the rule table for handling this case may be obtained by modifying Table 2.3
in the following way. Each suffix replacement operation is to be replaced with
an additional morpho-word replacement to take care of the number, gender and
person of the subject. This new morpho-word replacement operation is
restricted to the set {rahaa, rahii, rahe} (call it M3).
4. Past continuous to past continuous: Here the verb morpho variation is done
in a way similar to the present continuous case discussed above. Hence, here
too, one may have two morpho-word replacements. For one of them the set
M2 , i.e. {thaa, thii, the} is to be considered instead of the set M1 , i.e. {hai,
hain, ho, hoon}. The set required for other morpho-word replacement is M3 ,
i.e. {rahaa, rahii, rahe}.
5. Future continuous to future continuous: In this case too either a copy operation or one/two morpho-word replacements are required. If morpho-word
replacement operations are carried out then the relevant sets are as follows:
first set is M3 as discussed above. The other morpho-word replacement will
take care of the sense of the future tense, and therefore, instead of the set M1
the set {hoongaa, hoongii, honge, hogaa, hogii, hoge} (call it M4 ) has to be


used in Table 2.3.


6. Present perfect, past perfect, future perfect: If the input and the retrieved
example both have any one of these three, then the verb morphology and
adaptation operations imitate the rules of the continuous form of the respective
tense. The only relevant change is that instead of the set M3, in all three cases
the set {chukaa, chukii, chuke} (call it M5) is to be considered.

The morpho-words and suffixes for the adaptation operations in all the cases
discussed above can be found in Table A.3.
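The suffix and morpho-word sets introduced above can be collected in one place. The grouping below restates items 1-6 (set contents from Table A.3; the dictionary encoding is our own):

```python
# Replacement sets for same-tense-same-form adaptation, labelled S1, S2 and
# M1-M5 as in the text.

S1 = {"taa", "te", "tii"}                                       # indefinite suffixes
S2 = {"oongaa", "oongii", "oge", "ogii", "egaa", "egii", "enge", "engii"}
M1 = {"hoon", "hai", "hain", "ho"}                              # present
M2 = {"thaa", "thii", "the"}                                    # past
M3 = {"rahaa", "rahii", "rahe"}                                 # continuous
M4 = {"hoongaa", "hoongii", "honge", "hogaa", "hogii", "hoge"}  # future
M5 = {"chukaa", "chukii", "chuke"}                              # perfect

# Which sets drive the replacements for each same-tense-same-form pair:
REPLACE_SETS = {
    ("present", "indefinite"): [S1, M1],
    ("past", "indefinite"):    [S1, M2],
    ("future", "indefinite"):  [S2],        # no morpho-word in Hindi future indefinite
    ("present", "continuous"): [M3, M1],
    ("past", "continuous"):    [M3, M2],
    ("future", "continuous"):  [M3, M4],
    ("present", "perfect"):    [M5, M1],
    ("past", "perfect"):       [M5, M2],
    ("future", "perfect"):     [M5, M4],
}
```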
In the case of present perfect, past indefinite and past perfect, sometimes there is
a case ending ne with the subject (see Section A.1). In that case, the verb
morphological variation will change according to the gender and number of the
object, instead of the gender, number and person of the subject. For past
indefinite to past indefinite transformation, the adaptation operation will either
be a copy operation or a suffix replacement, whereas in the other two cases the
adaptation operations can be either a copy operation, or a suffix replacement and
a morpho-word replacement. All possible suffix variations and morpho-word
variations are listed in Section A.2.

2.3.2 Different Tenses Same Verb Form

In this group the verb morphological variation depends on the gender, number and
person of the subject, and also on the variation in the tenses of the input and the
retrieved example. This group comprises eighteen combinations of verb morphology
variations. These 18 possibilities occur due to three different tenses (present,
past, future), and three verb forms (indefinite, continuous, perfect). Some members



of this group are present indefinite to past indefinite, present indefinite to future
indefinite, present continuous to past continuous, etc.
For illustration, we consider the following examples, where the input sentence is in
present indefinite and the retrieved example is in past indefinite.

Example 1: Suppose the input sentence is She drinks water., and the retrieved
sentence is She drank water., with the Hindi translation wah paanii piitii thii. The
subjects of both the input and the retrieved example are feminine, 3rd person and
singular. In this situation, only one adaptation operation is required, i.e. a
morpho-word replacement. The morpho-word thii is to be replaced with the
morpho-word hai to convey the sense of the present indefinite form of the input
sentence. Therefore, the desired translation is wah paanii piitii hai.

Example 2: Here the input sentence is She reads books., and the retrieved sentence
is He read books., with the Hindi translation wah kitaabe padhtaa thaa. The
subject of the input is feminine, 3rd person and singular, whereas in the retrieved
sentence the subject is masculine, 3rd person and singular. In this situation two
adaptation operations are required:
1. One suffix replacement: the suffix taa is to be replaced with tii; and
2. One morpho-word replacement: the morpho-word thaa is replaced with
hai.

Hence the appropriate translation of the input sentence in Hindi is generated as
wah kitaabe padhtii hai.
Retd\Input  M1S    F1S    M1P    F1P    M2S    F2S    M3S    M3P    F3S    F3P
M1S         MR     SR+MR  SR+MR  SR+MR  SR+MR  SR+MR  MR     SR+MR  SR+MR  SR+MR
F1S         SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     MR
M1P         SR+MR  SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     SR+MR  SR+MR
F1P         SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     MR
M2S         SR+MR  SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     SR+MR  SR+MR
F2S         SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     MR
M3S         MR     SR+MR  SR+MR  SR+MR  SR+MR  SR+MR  MR     SR+MR  SR+MR  SR+MR
M3P         SR+MR  SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     SR+MR  SR+MR
F3S         SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     MR
F3P         SR+MR  MR     SR+MR  MR     SR+MR  MR     SR+MR  SR+MR  MR     MR

Table 2.4: Adaptation Operations of Verb Morphological
Variations in Present Indefinite to Past Indefinite

The above two examples summarize the relevant adaptation operations needed to
deal with verb morphology variations while adapting a past indefinite sentence to
a present indefinite sentence. Adaptation may be carried out by either one MR
(morpho-word replacement) operation, or one SR (suffix replacement) and one MR
(morpho-word replacement) operation. Typically, one morpho-word from the set
{thaa, the, thii} is replaced with one from the set {hain, hai, ho, hoon} under the
MR operation. Under suffix replacement, when necessary, one element of the set
{taa, te, tii} is replaced with another from the same set.
Table 2.4 provides all possible adaptation operations which occur due to the
variation in the gender, number and person of the subject.
Some important points regarding the adaptation rules for the remaining 17
combinations of verb morphological variation are discussed below.

1. If the input sentence is in future indefinite form, and the retrieved example is
in either past indefinite or present indefinite form, then a single set of
adaptation operations suffices for dealing with the verb morphological variations
due to variation in the gender, number and person of the subject. The adaptation
operations for this set are suffix replacement and morpho-word deletion. In the
suffix replacement, a suffix from the set {taa, tee, tii} is replaced by one from
the set {oongaa, oongii, oge, ogii, egaa, egii, enge, engii}. Also, since no
additional morpho-word is required in the future indefinite case in Hindi, the
additional morpho-word that comes with the present indefinite (i.e. one of {hoon,
hai, ho, hain}) or with the past indefinite (i.e. one of {thaa, the, thii}) has to
be deleted.

As an illustration, suppose the input sentence is I will eat rice., and the
retrieved sentence is I eat rice., with Hindi translation main chaawal khaataa
hoon. Here the suffix taa is replaced with the suffix oongaa, and the
morpho-word hoon is deleted from the retrieved Hindi translation. Therefore, the
Hindi translation of the input sentence is main chaawal khaaoongaa.
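As a rough sketch of this rule, the SR and MD operations of point 1 can be combined as below. Names and data are illustrative only, not the thesis's code.

```python
def adapt_to_future(words, suffix_map, drop):
    """Apply one MD (delete tense morpho-word) and one SR (swap verb suffix)."""
    out = []
    for w in words:
        if w in drop:                        # MD: delete the tense morpho-word
            continue
        for old, new in suffix_map.items():  # SR: taa/tee/tii -> future suffix
            if w.endswith(old):
                w = w[: -len(old)] + new
                break
        out.append(w)
    return out

words = "main chaawal khaataa hoon".split()
future = adapt_to_future(words, {"taa": "oongaa"},
                         {"hoon", "hai", "ho", "hain", "thaa", "the", "thii"})
print(" ".join(future))  # main chaawal khaaoongaa
```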
2. Similarly, to adapt from the future indefinite to the present or past
indefinite, one suffix replacement and one morpho-word addition have to be carried
out. The suffix replacement is just the opposite of the one discussed above, and
the morpho-word addition is done in the same spirit. This becomes clear from the
example discussed above if the roles of the input and the retrieved example are
reversed.
3. If the verb form is continuous or perfect then, regardless of the tense of the
sentence, the same Table 2.4 applies. These verb forms and tenses may occur in the
input sentence or in the retrieved sentence. The only change to be incorporated is
that a morpho-word replacement (MR) has to be carried out instead of the suffix
replacement (SR) operation given in Table 2.4. For these tenses and verb forms,
the suffixes for the suffix replacement and the morpho-words for the morpho-word
replacement can be obtained from Table A.3.

2.3.3 Same Tense Different Verb Forms

Here the input sentence and the retrieved example have the same tense but
different verb forms. Eighteen such combinations of verb morphological variation
are possible, for example: present indefinite to present continuous, past
indefinite to past perfect, future perfect to future indefinite, etc. Different
cases are discussed below.
Suppose the verb of the input sentence is in future indefinite form, and the verb in
the retrieved example is in future continuous form. Three adaptation operations are
required to take care of all the possible variations in gender, number and person of
the subject. These operations are one suffix addition and two morpho-word deletions.
In the suffix addition, one item from the suffix set {oongaa, oongii, oge, ogii,
egaa, egii, enge, engii} is added to the root form of the main verb of the retrieved
Hindi translation. Note that, since the retrieved example is in future continuous
form, the main verb will already be in its root form, and, therefore, no suffix
deletion or replacement is required. The two morpho-word deletions are restricted to
the sets {rahaa, rahii, rahe} and {hoongaa, hoongii, honge, hogaa, hogii, hoge},
respectively. The following example illustrates this adaptation procedure:
Let the input sentence be She will eat rice., and the retrieved example She
will be eating rice. wah chaawal khaa rahii hogii . In this case, the suffix
oongii is added to khaa, and the last two words of the retrieved Hindi
sentence, rahii and hogii , are deleted. The suffix oongii is chosen
because the subject of the input sentence is feminine, 3rd person and singular.
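The SA plus two MD operations just described can be sketched as follows, mirroring the data of the example above; the function and variable names are illustrative assumptions, not the thesis's implementation.

```python
def adapt_future_cont_to_indef(words, main_verb, suffix):
    """One SA on the root verb plus two MDs on the trailing morpho-words."""
    cont = {"rahaa", "rahii", "rahe"}                     # continuous markers
    fut = {"hoongaa", "hoongii", "honge", "hogaa", "hogii", "hoge"}
    out = [w for w in words if w not in cont | fut]       # the two MD operations
    return [w + suffix if w == main_verb else w for w in out]  # SA on the root

words = "wah chaawal khaa rahii hogii".split()
print(" ".join(adapt_future_cont_to_indef(words, "khaa", "oongii")))
```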
If the retrieved example is in future perfect form, instead of future continuous
form as above, the adaptation operations remain the same. The only modification
is that one of the morpho-word deletions will be from the set {chukaa, chukii,
chuke} instead of the set {rahaa, rahii, rahe}.
For illustration, suppose for the same input as given above, the retrieved example
is She will have eaten rice. wah chaawal khaa chukii hogii . In order to adapt
the verb morphology, the morpho-words chukii and hogii are to be deleted, and
the suffix oongii is to be added to the verb khaa. Thus one gets the required
verb morphology.
If the roles of the input and the retrieved sentence are reversed in the above
cases, then, in place of suffix addition, suffix deletion has to be carried out. Further,
the two morpho-word deletions are to be replaced with corresponding morpho-word
additions.
Adaptation rules for dealing with other verb morphological variations belonging
to this group have been developed in a similar way. One may refer to Section A.2
of Appendix A to figure out the appropriate suffixes and morpho-words that will be
involved in the necessary addition/deletion/replacement operation.

2.3.4 Different Tenses Different Verb Forms

The remaining thirty-six possibilities out of the total eighty-one combinations of
verb morphological variations belong to this group. Since it is not possible to
discuss all of them in this report, some typical ones are considered for the present
discussion. In particular, we discuss the case where the input sentence is in
present indefinite form. For illustration, we consider retrieved examples of the
following types: (i) past continuous, (ii) past perfect, (iii) future continuous,
(iv) future perfect. It will be shown that a single set of adaptation operations is
sufficient for all the four cases mentioned above. These adaptation operations are
one suffix addition (SA), one morpho-word replacement (MR) and one morpho-word
deletion (MD). The purpose of these three operations is as follows:



1. For the present indefinite tense, the relevant suffix for the main verb is one of
{taa, tii, te}, depending upon the gender, person and number of the subject.
However, if the retrieved sentence is one of the four types mentioned above, then
the main verb in its Hindi translation is in root form. Consequently, the suffix
addition is mandatory.

2. In the present indefinite form, one of the morpho-words {hoon, hai, ho, hain}
has to be used, depending upon the number, gender and person of the subject.
However, if the retrieved example is in past tense (irrespective of continuous or
perfect verb form), then the relevant morpho-word set is {thaa, thii, the}. On the
other hand, if the retrieved sentence is in future tense, whether continuous or
perfect verb form, then the relevant morpho-word set is {hoongaa, honge, hogii,
hoge, hogaa, hongii}. The morpho-word replacement is required to have the right
morpho-word in the generated translation.

3. The morpho-word deletion operation is required to take care of the indefinite
verb form of the input sentence. In this case no other morpho-word is necessary.
However, in order to indicate the continuous form of the verb (irrespective of past
or future tense) one of the morpho-words {rahaa, rahii, rahe} is required.
Similarly, for the perfect verb form an additional morpho-word is required from the
set {chukaa, chuke, chukii}. For adaptation of the verb morphology this additional
morpho-word has to be deleted from the retrieved example.

For illustration, suppose the input sentence is She eats rice., and the retrieved
example is one of the following:


(A) She was eating rice. - wah chaawal khaa rahii thii
(B) She had eaten rice. - wah chaawal khaa chukii thii
(C) She will be eating rice. - wah chaawal khaa rahii hogii
(D) She will have eaten rice. - wah chaawal khaa chukii hogii

Evidently, sentences (A), (B), (C) and (D) are in past continuous, past perfect,
future continuous and future perfect form, respectively. The modifications in the
retrieved Hindi translations are as follows:

1. In the translation of every retrieved example the suffix tii is added to
khaa (the Hindi of eat).
2. The morpho-word rahii or chukii (depending upon the case) is deleted.
3. The morpho-word thii is replaced with hai if the retrieved example is
either (A) or (B).
4. The morpho-word hogii is replaced with hai in the Hindi translation of
example (C) or (D).

Therefore, the required translation of the input sentence, obtained by
incorporating these modifications in the respective translations of the retrieved
examples, is wah chaawal khaatii hai .
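The three operations (SA, MR, MD) that map any of translations (A) to (D) to the same present indefinite output can be sketched as below. This is an illustrative fragment under assumed names, not the thesis's implementation.

```python
# Morpho-words that mark continuous/perfect forms (to be deleted, MD) and
# tense (to be replaced, MR); sets follow the discussion in the text.
CONT_PERF = {"rahaa", "rahii", "rahe", "chukaa", "chukii", "chuke"}
TENSE = {"thaa", "thii", "the", "hoongaa", "hongii", "honge", "hogaa", "hogii", "hoge"}

def to_present_indefinite(words, main_verb, suffix, tense_word):
    out = []
    for w in words:
        if w in CONT_PERF:          # MD: drop rahii/chukii etc.
            continue
        if w in TENSE:              # MR: thii/hogii -> hai
            w = tense_word
        if w == main_verb:          # SA: add taa/tii/te to the root verb
            w = w + suffix
        out.append(w)
    return " ".join(out)

for s in ["wah chaawal khaa rahii thii", "wah chaawal khaa chukii thii",
          "wah chaawal khaa rahii hogii", "wah chaawal khaa chukii hogii"]:
    print(to_present_indefinite(s.split(), "khaa", "tii", "hai"))
# each line prints: wah chaawal khaatii hai
```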
In a similar way one can identify that the same set of three adaptation operations
will be required if the input is in past indefinite form, and the retrieved example
is in one of present continuous, present perfect, future continuous or future
perfect form. However, in order to carry out the morpho-word replacement one has to
confine the selection to the set {thaa, the, thii}. It will replace the relevant
morpho-word of the retrieved Hindi example, which is one of {hoon, hai, ho, hain}
for present tense, or one of {hoongaa, honge, hogii, hoge, hogaa, hongii} for
future tense. The suffix addition and morpho-word deletion operations are
restricted to the same sets as mentioned above.
Similarly, one can identify adaptation operations when the roles of the input
and the retrieved sentence are reversed in the cases discussed above. Evidently, in
these cases one suffix deletion, one morpho-word replacement and one morpho-word
addition will be required for adapting the verb morphology variations. One can
easily figure out the relevant sets of morpho-words and suffixes keeping in view the
above discussions.
The above discussion takes care of sixteen possible combinations of different
verb morphological variations. Other verb morphological variations of this group
have been identified as well. However, owing to the repetitive nature of the
discussion, we do not present all the remaining cases in this report. One may
identify the relevant suffixes and the morpho-words by referring to Section A.2
of Appendix A.

2.4 Adaptation Procedure for Morphological Variation of Passive Verbs

The above discussion of adaptation procedures for verb morphological variation has
been limited to the active form of the verb. Similar adaptation procedures have also
been studied when the verb is in the passive form. Ideally, the passive form should
exist for all the three tenses and all the four verb forms. However, the passive
forms of verbs for the present perfect continuous, past perfect continuous, future
continuous, and future perfect continuous tenses are cumbersome, and are rarely used
(Ansell, 2000). We, therefore, restrict our discussion to the other eight more
commonly used forms of the passive voice only. Since adaptation may take place from
an active voice sentence to a passive one, and vice versa, we classify these
adaptation procedures into three broad groups:

1. Passive verb form to passive verb form (8 x 8, i.e. 64 cases)
2. Passive verb form to active verb form (8 x 9, i.e. 72 cases)
3. Active verb form to passive verb form (9 x 8, i.e. 72 cases)

For each of the three groups mentioned above, we discuss a few cases in detail.
Passive verb form to passive verb form
If the input sentence is in past indefinite passive verb form, and the retrieved
example is in present continuous passive verb form or past continuous passive verb
form, then a single set of adaptation operations is sufficient. These adaptation
operations are one morpho-word replacement, two morpho-word deletions and one
suffix replacement. The suffix replacement depends upon the particular Hindi verb
under consideration, and also upon the gender and number of the subject; hence
this operation is required only in some instances of such cases. The other three
adaptation operations are mandatory. The purposes of these four operations are as
follows:
1. In the past indefinite passive verb form, one of the morpho-words {gayaa,
gayii, gaye} has to be used, depending upon the number and gender of the subject.
However, if the retrieved example is in a continuous passive form (irrespective of
present or past tense), then the relevant morpho-word is jaa. This necessitates
the morpho-word replacement operation.

2. The morpho-word deletion operations are required to take care of the indefinite
passive verb form of the input sentence. In this case no other morpho-word is
necessary. However, in order to indicate the continuous form of the verb
(irrespective of present or past tense) one morpho-word from the set {rahaa,
rahii, rahe} is required, and in order to indicate the present or past tense one
more morpho-word is required. This morpho-word comes from the set {hain, ho, hoon,
hai} if the retrieved sentence is in present tense, or from the set {thaa, the,
thii} in case of past tense. For adaptation of the verb morphology these two
morpho-words are to be deleted from the retrieved example.

3. In case of the optional suffix replacement, the appropriate suffix has to be
decided according to the rules of the PCP form of verb given in Section A.2 of
Appendix A.
Consider the following examples.

Example 1 : The input sentence is in past indefinite passive verb form, and the
retrieved example is in present continuous passive verb form.

Input sentence : The nut was eaten by the squirrel.
Retrieved example : The nuts are being eaten by the squirrel. - gilharii ke
dwaaraa moongphaliyaan khaayii jaa rahii hain

The Hindi translation of the input sentence is gilharii ke dwaaraa moongphalii
khaayii gayii . Evidently, to generate this translation the morpho-words rahii
and hain are to be deleted, and the morpho-word jaa is to be replaced with
the morpho-word gayii . Note that there is no change to the main verb khaayii .
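Example 1's verb-group operations (two MD and one MR) can be sketched as below; the constituent-word replacement (moongphaliyaan to moongphalii) is a separate word operation and is not shown. All names are illustrative assumptions, not the thesis's code.

```python
def adapt_passive_verb_group(words):
    """Two MDs (continuous and tense morpho-words) plus one MR (jaa -> gayii)."""
    deletions = {"rahii", "rahaa", "rahe",                  # continuous markers
                 "hain", "hai", "ho", "hoon",               # present-tense words
                 "thaa", "thii", "the"}                     # past-tense words
    out = [w for w in words if w not in deletions]          # the two MDs
    return ["gayii" if w == "jaa" else w for w in out]      # MR: jaa -> gayii

verb_group = "khaayii jaa rahii hain".split()
print(" ".join(adapt_passive_verb_group(verb_group)))  # khaayii gayii
```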
Example 2 : Here we consider the same input sentence, but the retrieved example is
The apple was being eaten by the squirrel. gilharii ke dwaaraa seb khaayaa jaa
rahaa thaa. Evidently, to generate the required translation gilharii ke dwaaraa
moongphalii khaayii gayii , all the operations given in Example 1 are to be carried
out. Further, due to the change in the gender of the subject8 , the optional suffix
replacement is also needed: the suffix yaa is replaced with the suffix yii . To
generate the final translation of the input sentence one more adaptation operation
is required, namely the constituent word seb is replaced with moongphalii , but
that is not a part of the set of adaptation operations mentioned above.
Passive verb form to active verb form

Here too we illustrate the verb morphology adaptation with the help of a specific
case: the input sentence is in the present indefinite passive verb form, and the
retrieved sentence is in the present indefinite active verb form. Here one can
identify that one suffix replacement, one morpho-word addition and one morpho-word
replacement (depending upon the situation) are required to carry out the verb
morphology adaptation task. The significance of these three operations is as
follows:

1. The suffix {taa, te, tii} in the main verb of the active retrieved sentence is
replaced with an appropriate suffix according to the rules of the PCP form of verb
given in Section A.2 of Appendix A.

2. A morpho-word from the set {jaataa, jaatii, jaate}, whose elements are
essentially declensions of the verb jaa, is to be added after the main verb.
The appropriate morpho-word depends on the gender and number of the subject.

3. Since the retrieved example is in present tense, it must contain one of the
morpho-words {hai, hain, ho, hoon}. Again, since the input sentence is also in
present tense, its Hindi translation will also have one morpho-word from the same
set. Hence, depending upon the gender, number and person of the respective
subjects, the same morpho-word may be retained, or it may have to be replaced with
another morpho-word from the same set.

8 seb (apple) is masculine but moongphalii (nut) is feminine
The above is explained with the help of the following example.

Input sentence : This food is cooked by Sita.
Retrieved example : Sita cooks this food. - sitaa yah khaanaa banaatii hai

The Hindi translation of the input sentence is yah khaanaa sitaa ke dwaaraa
banaayaa jaataa hai . Evidently, to deal with the verb morphology in the generated
translation, two adaptation operations have to be performed: the suffix tii of
the main verb of the retrieved sentence is replaced with yaa, and the
morpho-word jaataa is added after the main verb.

As the input sentence is the passive form of the retrieved sentence, ke dwaaraa
is added before the subject siitaa. This is necessary to generate the
appropriate translation of the input sentence, but it is not a part of the set of
adaptation operations mentioned above.
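The two verb operations just discussed, suffix replacement on the main verb and morpho-word addition after it, can be sketched as below. Word reordering and the insertion of ke dwaaraa are separate steps not shown here; all names are illustrative assumptions.

```python
def passivize_verb(words, main_verb, old_sfx, new_sfx, morpho):
    """SR on the main verb (tii -> yaa) followed by MA (insert jaataa after it)."""
    out = []
    for w in words:
        if w == main_verb and w.endswith(old_sfx):
            out.append(w[: -len(old_sfx)] + new_sfx)  # SR: banaatii -> banaayaa
            out.append(morpho)                        # MA: add jaataa
        else:
            out.append(w)
    return out

words = "sitaa yah khaanaa banaatii hai".split()
print(" ".join(passivize_verb(words, "banaatii", "tii", "yaa", "jaataa")))
# sitaa yah khaanaa banaayaa jaataa hai  (verb group only; word order unchanged)
```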
Active verb form to passive verb form

If the roles of the above mentioned input and retrieved example are reversed, one
suffix replacement, one morpho-word replacement and one morpho-word deletion will
be required for adapting the verb morphology. One can easily figure out the
relevant sets of morpho-words and suffixes keeping in view the above discussions.

The adaptation rules for all other possible variations mentioned earlier have been
formulated in a similar way. However, the similar nature of the discussions
prevents us from describing all of them in detail.

2.5 Study of Adaptation Procedures for Subject/Object Functional Slot

Subject (<S>) and Object (<O>) functional slots can be sub-divided into a number
of functional tags. These tags act as pre-modifier and post-modifier of the subject
(@SUBJ) and/or object (@OBJ) functional tag. The maximum possible structure
of the <S> or <O> functional slot using different tags is:

Functional Slot   Functional Tag Patterns
<S> or <O>:       {@DN> or @GN> or @QN> or @AN>} & (@SUBJ or @OBJ)
                  & {@<NOM-OF & {@DN> or @GN> or @QN> or @AN>} & @<P}
<S> or <O>:       {@DN> or @GN> or @QN> or @AN>} & (@SUBJ or @OBJ)
                  & {@<NOM & {@DN> or @GN> or @QN> or @AN>} & @<P}

Table 2.5: Different Functional Tags Under the Functional Slot <S> or <O>



Table 2.5 lists only those structures which are present in our example base, and
which have been studied in the course of the present research work. Here, {} is
used for showing non-obligatory (see Table 2.2) functional tags/slots. The
definitions of the functional tags are given in detail in Appendix B. The parts of
speech and their transformations under the morpho tags for the <S> or <O>
functional slots are: noun (N), pronoun (PRON), proper noun (<Proper>), adjective
(A) with transformations ABS, PCP1 (-ing participle form) and PCP2 (-ed participle
form), adverb (ADV) and gerund (PCP1 form). All possible variations in the morpho
tags of the functional tags under the <S> and <O> functional slots are listed in
Table 2.6.
Functional tags

Functional tags and their morpho tags

@DN>:

@DN> ART
@DN> DEM

@GN>:

@GN> PRON PERS GEN SG1


@GN> PRON PERS GEN PL1
@GN> PRON PERS GEN SG2/PL2
@GN> PRON PERS GEN PL3
@GN> PRON PERS GEN SG3
@GN> <Proper> GEN SG
@GN> N GEN SG
@GN> N GEN PL

@QN>:

@QN> NUM CARD


@QN> NUM ORD
@QN> NUM <Fraction> SG
@QN> NUM <Fraction> PL
@QN> <Quant> DET SG
@QN> <Quant> DET PL
@QN> <Quant> DET SG/PL

@AN>:

@AN> A ABS
@AN> A PCP1
@AN> A PCP2

@SUBJ or @OBJ

(@SUBJ or @OBJ or @<P) PRON PERS SG1

or @<P :

(@SUBJ or @OBJ or @<P) PRON PERS PL1


(@SUBJ or @OBJ or @<P) PRON PERS SG2/PL2
(@SUBJ or @OBJ or @<P) PRON PERS PL3
(@SUBJ or @OBJ or @<P) PRON PERS SG3
(@SUBJ or @OBJ or @<P) <Proper> N SG
(@SUBJ or @OBJ or @<P) N SG
(@SUBJ or @OBJ or @<P) N PL
(@SUBJ or @OBJ or @<P) PCP1

@<NOM-OF:

@<NOM-OF PREP

@<NOM:

@<NOM PREP

Table 2.6: Different Possible Morpho Tags for Each Functional Tag under the
Functional Slot <S> or <O>

We explain Table 2.5 and Table 2.6 with an example. Consider the sentence This
old man is sitting in Ram's office. Its parsed version, obtained using the ENGCG
parser, is:
@DN> DEM this,
@AN> A ABS old,
@SUBJ N SG man,
@+FAUXV V PRES be,
@-FMAINV V PCP1 sit,
@ADVL PREP in,
@GN> <Proper> GEN SG Ram,


@OBJ N SG office
< $. >.

Here, the tags that start with @ are called functional tags, e.g. @DN> (determiner),
@GN> (genitive case), @AN> (pre-modifier adjective), etc. In Table 2.6 these tags
are followed by morpho tags, such as SG (singular), PERS (personal pronoun) and
GEN (genitive). Appendix B provides more details on these tags.
In the following discussion, the adaptation rules for functional tags due to the
variation of morpho tags are given.

2.5.1 Adaptation Rules for Variations in the Morpho Tags of @DN>

The morpho tags ART and DEM are associated with the functional tag @DN> (see
Table 2.6). The morpho tag ART is associated with the English words the, a
and an, and DEM is associated with this, these, that, etc. The word the
does not have any Hindi equivalent; hence it is absent in all Hindi translations.
Corresponding to the articles a and an, often no Hindi word is used in the
translation either. However, in some cases the word ek (meaning one) is used,
depending upon the context. No morphological changes take place in the adaptation
of these words. Therefore, if @DN> ART is present in the parsed version of either
the input or the retrieved sentence, and it corresponds to the word the, then
no adaptation operation is performed. With respect to determiners (words having
the DEM morpho tag, such as this (yah), these (ye), that (wah), etc.),
the adaptation procedure is straightforward.

For illustration, consider the input sentence This man is kind., and the retrieved
example is The man is kind. aadmii dayaalu hai . Note that no Hindi word
exists in the retrieved Hindi sentence corresponding to the word the. But the input
sentence contains the determiner this. Therefore, its Hindi translation yah is
required to be added before the subject aadmii in the generated translation. Hence
the translation of the input sentence is yah aadmii dayaalu hai .

2.5.2 Adaptation Rules for Variations in the Morpho Tags of @GN>

The functional tag @GN> is used for a genitive (i.e. possessive) case. Eight possible
morpho tag variations are listed in Table 2.6. These variations occur due to the
variations in gender, number and person of three different POS, which are N, PRON
and <Proper>. When the part of speech of the genitive word is N or <Proper>,
then the genitive case in Hindi is indicated with one of the case endings from the set
{kaa, ke, kii } as a morpho-word. Its usage depends upon the gender and number
of the noun following the word corresponding to the tag @GN>. When the genitive
word is a pronoun (PRON), the case endings are transformed into suffixes. The
following examples illustrate different genitive case structures in Hindi.
kaa is used when the noun following it is masculine singular. For example:

the washerman's son - dhobii kaa betaa
the pundits' house - panditon kaa ghar

ke is used when the noun following it is masculine plural. For example:

the gardener's sons - malii ke bete
these men's horses - in aadmiyoon ke ghodhe


ke is also used when the noun following it is masculine singular with a case
ending. For example:

on the doctor's child - daaktar ke bachche par
in the king's villages - raja ke gaon mein

kii is used when the noun following it is feminine, irrespective of whether it
is singular or plural, with or without any case ending. For example:

the Brahmin's book - brahmin kii pothii
on the king's command - raja kii agya par
on the mountains' peaks - pahadon kii chotiion par

There are occasions when morpho changes occur to the genitive word (when it is a
noun) due to the case endings kaa, ke and kii . These rules are listed in
Appendix A. For example: the boy's horse - ladke kaa ghodha. Although the
Hindi of boy is ladkaa, its oblique form ladke has been used in the above
example. This happens because of the case ending kaa.
If the POS of the genitive word is a proper noun, then too the same case endings
{kaa, ke, kii} are used as morpho-words according to the gender and number of the
noun following it. In this case no morpho changes occur in the genitive word due
to the case ending. For example: Parul's home - paarul kaa ghar , Ram's book -
raam kii kitaab, in Ram's home - raam ke ghar mein.
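The selection of kaa/ke/kii described above can be sketched as a small rule function keyed on the noun that follows the genitive word; this is illustrative only, not the thesis's implementation, and the parameter names are assumptions.

```python
def genitive_marker(gender, number, has_case_ending=False):
    """Pick the genitive case ending from the following noun's features."""
    if gender == "F":
        return "kii"                  # feminine, any number, any case ending
    if number == "PL" or has_case_ending:
        return "ke"                   # masc. plural, or masc. sg. with case ending
    return "kaa"                      # masc. singular, no case ending

assert genitive_marker("M", "SG") == "kaa"           # dhobii kaa betaa
assert genitive_marker("M", "PL") == "ke"            # malii ke bete
assert genitive_marker("M", "SG", True) == "ke"      # daaktar ke bachche par
assert genitive_marker("F", "SG") == "kii"           # brahmin kii pothii
```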
As mentioned above, when the POS of the genitive word is a pronoun the case ending
is attached to it in the form of a suffix. In case of first and second person
pronouns the suffix comes from the set {aa, e, ii}. However, in case of third
person pronouns the entire morpho-word is used as a suffix. The following examples
illustrate the genitive case with respect to pronouns.


my son - meraa betaa
my sons - mere bete
my daughter - merii betii
my daughters - merii betiyaan
his son - uskaa betaa
his sons - uske bete
on my son - mere bete par
on our son - hamaare bete par
his daughter - uskii betii
his daughters - uskii betiyaan
in my book - merii kitaab mein
in their villages - unke gaon mein

Once these structures are known, adaptation rules for different variations of the
genitive case may be formulated by referring to Table 2.6. Table 2.8 has been
designed to indicate the adaptation procedures for the different genitive cases.
The row and column headers of this table correspond to three POS: <Proper>, N and
PRON.

Retd \ Input   <Proper>                N                                          PRON
<Proper>       (CP or ({WR} + {MR}))   WR + {MR} + {SR or SA}                     WR + MD + SA
N              WR + {MR}               (CP or ({WR} + {MR} + {(SR or SA or SD)})) WR + MD + SA
PRON           WR + MA                 WR + MA + {SR or SA}                       (CP or SR or (WR + SA))

Table 2.8: Adaptation Operations for Genitive Case to Genitive Case



We explain these adaptation rules with the help of the following example. Suppose
the input sentence is The boy's uniform is new., and the retrieved example is
Parul's toy is new. paarul kaa khiloonaa nayaa hai . The translation of the
input sentence is ladke kii wardii naii hai . In order to generate this
translation from the retrieved example, the following adaptation operations need
to be carried out.

The word boy corresponds to the genitive case in the input sentence, and its
part of speech is noun (N), whereas in the retrieved sentence the part of speech
of paarul is proper noun (<Proper>). Hence, according to cell(1, 2), i.e.
(<Proper>, N), the set of adaptation operations is WR + {MR} + {SR or SA}. This
indicates that one word replacement is mandatory, while the other two operations
are carried out depending upon the particular example under consideration.
Here, the nouns that follow the genitive cases are uniform and toy,
respectively. Their Hindi translations are wardii (feminine and singular) and
khiloonaa (masculine and singular), respectively. The possessive case ending,
therefore, will not be the same: one morpho-word replacement is needed to adapt
the genitive case ending, viz. the morpho-word kaa is replaced with kii .
Thus the optional morpho-word replacement is required in this example. Further,
the genitive word paarul in the retrieved Hindi sentence is to be replaced
with ladkaa. However, a suffix replacement is also necessary in this genitive
word, viz. ladkaa becomes ladke.
Thus, in this example all the three adaptation operations are needed to adapt the
genitive case. In some situations all these operations may not be required. For
illustration, to adapt the genitive case Parul's uniform - paarul kii wardii
to the boy's uniform , no change is required in the case ending kii .

2.5.3 Adaptation Rules for Variations in the Morpho Tags of @QN>

The functional tag @QN> is a quantifier tag. It is of two types: numeral (NUM)
(e.g. two - do, fourth - choothaa, one-third - ek-tihaaii , two-thirds -
do-tihaaii ), and quantitative (<Quant>) (e.g. some - kuchh, all - sab,
many - bahut). More details of this functional tag and its morpho tags are
given in Appendix B. Seven variations in total (see Table 2.6) are possible due to
changes in number (SG, PL, SG/PL) and numeral properties, i.e. ordinal or cardinal
number, etc. But as far as the Hindi translation is concerned, these seven
variations do not play any role. Therefore, no suffix operations or morpho-word
operations are relevant in this case. Only a single word operation, i.e.
deletion/addition/replacement/copy, is required, depending upon the tags in the
input and the retrieved sentences.
For illustration, to adapt the translation of the retrieved example Two men are
coming here. do aadmii yahaan aa rahe hain in order to generate the
translation of the input sentence Some men are coming here., the adaptation
procedure should replace do (i.e. two) with kuchh (i.e. some). The
Hindi translation of the input sentence is, therefore, kuchh aadmii yahaan aa
rahe hain.

Other cases may be dealt with in a similar way.

2.5.4 Adaptation Rules for Variations in the Morpho Tags of Pre-modifier Adjective @AN>

Adjectives fall into two classes, viz., uninflected and inflected (Kellogg and
Bailey, 1965). Uninflected adjectives, as the term implies, remain unchanged
before all nouns and under all circumstances. English adjectives are necessarily
uninflected: they undergo no morphological changes with variation in the nouns
they qualify. Hindi adjectives, however, may fall under both categories. For
example, achchhaa (good) is an inflected adjective, while iimaandaar
(honest) is uninflected. For illustration:
illustration:
good boy - achchhaa ladkaa
honest boy - iimaandaar ladkaa
good girl - achchhii ladkii
honest girl - iimaandaar ladkii
good boys - achchhe ladke
honest boys - iimaandaar ladke
good girls - achchhii ladkiyaan
honest girls - iimaandaar ladkiyaan

Adjectives are of two types: basic adjectives, and participle forms, i.e. those that
are derived from verbs (Kachru, 1980). The inflection rules of these two types are
discussed below.

Basic adjectives: These adjectives are those which are adjective themselves such as
sundar beautiful, achchhaa good. ENGCG parser denotes them as ABS.
The rules of inflection for these adjectives are as follows.

1. If an adjective in Hindi ends with aa, then the ending changes into e for plural. For example, buraa ladkaa (bad boy) and bure ladke (bad boys).
2. An adjective ending with aa changes into ii for feminine, e.g. burii ladkii (bad girl) and burii ladkiyaan (bad girls).
3. If an adjective in Hindi ends with any other vowel, it does not change in any case.
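The three rules above can be sketched as a small function. The Hindi forms are written in roman transliteration, and the function name and signature are illustrative assumptions of this sketch, not code from the thesis.

```python
def inflect_basic_adjective(adj, gender, number):
    """Inflect a Hindi basic (ABS) adjective per the three rules above.

    gender: "m" or "f"; number: "sg" or "pl".
    """
    if adj.endswith("aa"):
        stem = adj[:-2]
        if gender == "f":        # rule 2: feminine takes -ii (sg and pl)
            return stem + "ii"
        if number == "pl":       # rule 1: masculine plural takes -e
            return stem + "e"
        return adj               # masculine singular stays unchanged
    return adj                   # rule 3: any other ending never changes

# buraa (bad) qualifying different nouns:
print(inflect_basic_adjective("buraa", "m", "pl"))       # bure (ladke)
print(inflect_basic_adjective("buraa", "f", "sg"))       # burii (ladkii)
print(inflect_basic_adjective("iimaandaar", "f", "pl"))  # iimaandaar (uninflected)
```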


2.5. Study of Adaptation Procedures for Subject/ Object Functional Slot

Participle form of adjectives: Participle forms are of two types:

- present participle form of adjective (-ing form), denoted as A(PCP1);
- past participle form of adjective (-ed form), denoted as A(PCP2).

In Hindi the following rules govern the structures of these adjective forms.

1. In order to attain the A(PCP1) form of an adjective, a suffix from the set {taa, te, tii} is added to the root form of the verb. In case of the past participle form A(PCP2), an appropriate suffix is attached according to the rules of the PCP form of the verb (see Section A.2).
2. Further, in most cases a morpho-word (from the set {huaa, huye, huii}) also needs to be added after the modified verb.
3. Participle forms of adjectives are also inflected according to the gender and number of the corresponding noun.

The following examples illustrate the above points.


A(PCP1) form of adjective:
A falling stone - girtaa huaa patthar
A dancing girl - naachtii huii ladkii
Running horses - daudte huye ghode

A(PCP2) form of adjective:
A tired man - thakaa huaa aadmii
A broken chair - tutii huii kursii
Rotten apples - sade huye seb



The pre-modifier adjective tag @AN>, therefore, has three possible morpho tag variations (see Table 2.6): @AN> A ABS, @AN> A PCP1, and @AN> A PCP2. Adaptation rules for adjectives have been formulated keeping in view all the morpho-transformations discussed above. Table 2.10 presents these rules.
The following examples illustrate the usage of the rules in Table 2.10. Suppose the input sentence is Faded flower does not look good.. We consider two different retrieved examples to describe the adaptation procedure.
Example 1: Here the retrieved example is:
Beautiful flower looks good. → sundar phool acchaa dikhtaa hai
The adaptation operations for this example should follow cell(1, 3), i.e. (ABS, A(PCP2)), of the above table, as the pre-modifier adjective of the subject is of the form ABS in the retrieved sentence, and of the form A(PCP2) in the input sentence.
Retd \ Input | ABS | A(PCP1) | A(PCP2)
ABS | CP or SR or (WR + {SR}) | WR + MA + SA | WR + MA + (SR or SA)
A(PCP1) | WR + {SR} + MD | CP or ({WR} + {2SR}) | (SR + {SR}) or (WR + SR + {SR})
A(PCP2) | WR + {SR} + MD | (SR + {SR}) or (WR + SR + {SR}) | CP or ({WR} + {SR} + {SR or SA})

Table 2.10: Adaptation Operations for Pre-modifier Adjective to Pre-modifier Adjective


The pre-modifier adjective sundar is replaced with murjhaa, and the suffix yaa is added to that word. The morpho-word huaa is then added after this modified verb in the retrieved Hindi sentence, as the subject (phool) is singular masculine. Hence three adaptation operations (viz. one constituent word replacement, one morpho-word addition and one suffix addition) are required for carrying out the adaptation task. However, there may be situations when a suffix replacement has to be carried out in place of the suffix addition.
As not is present in the input, its Hindi translation nahiin is added to the retrieved Hindi sentence to generate the appropriate translation of the input sentence. This modification, however, is not a part of the adaptation operations listed in Table 2.10. Hence, the Hindi translation of the input sentence is murjhaayaa huaa phool acchaa nahiin dikhtaa hai.
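The operation sequence of Example 1 can be traced at the token level as a sketch. The helper names (word_replace, suffix_add, insert_after, insert_before) are our own; only the operation labels (WR, SA, MA) come from the text, and the nahiin insertion mirrors the extra step described above.

```python
def word_replace(tokens, old, new):          # WR: constituent word replacement
    return [new if t == old else t for t in tokens]

def suffix_add(tokens, word, suffix):        # SA: suffix addition
    return [t + suffix if t == word else t for t in tokens]

def insert_after(tokens, word, new):         # MA realized as insertion after a word
    out = []
    for t in tokens:
        out.append(t)
        if t == word:
            out.append(new)
    return out

def insert_before(tokens, word, new):        # used here for the "nahiin" addition
    out = []
    for t in tokens:
        if t == word:
            out.append(new)
        out.append(t)
    return out

retrieved = "sundar phool acchaa dikhtaa hai".split()
step1 = word_replace(retrieved, "sundar", "murjhaa")   # WR: sundar -> murjhaa
step2 = suffix_add(step1, "murjhaa", "yaa")            # SA: murjhaa -> murjhaayaa
step3 = insert_after(step2, "murjhaayaa", "huaa")      # MA: morpho-word huaa
step4 = insert_before(step3, "dikhtaa", "nahiin")      # extra step for "not"
print(" ".join(step4))  # murjhaayaa huaa phool acchaa nahiin dikhtaa hai
```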
Example 2: Another retrieved example is:
Fading flowers do not look good. → murjhate huye phool achchhe nahiin dikhte hain
Cell(2, 3), i.e. (A(PCP1), A(PCP2)), of Table 2.10 lists the necessary adaptation operations. The two possible sets of operations are ((SR + {SR}) or (WR + SR + {SR})), i.e. one suffix operation is optional in both sets.
In this example only two suffix replacements are required, i.e. the first set of operations. The suffix te is replaced with yaa in the pre-modifying adjective murjhate, and the suffix ye is replaced with aa in the morpho-word huye. In some situations, the second suffix operation may not be needed, if the gender and number of the qualified nouns are the same in both the input and the retrieved example. If the input and the retrieved example have different verbs in the participle


form then, accordingly, one word replacement operation has to be invoked. This
operation will take care of the variation in the participle verb. One can realize this
if blooming flowers khilte huye phool has to be adapted to translate fading
flowers (murjhaate huye phool ).
Note that the present discussion is limited to the adaptation procedure for the pre-modifier adjective. Therefore, in order to generate the Hindi translation of the input sentence, it has been assumed that the other modifications in the sentence have already been incorporated; the Hindi translation is thus murjhaayaa huaa phool achchhaa nahiin dikhtaa hai.
The present discussion concentrates on variations in the pre-modifier adjective form. The adaptation rule table developed therein corresponds to nouns belonging to subject and object functional slots, for which the adjective works as an attributive one. The same rule Table 2.10 works for an attributive adjective corresponding to a noun belonging to any functional slot/tag other than subject and object. Another usage of the adjective, in both English and Hindi, is the predicative one. In Hindi, a predicative adjective (subjective complement) agrees with its subject in number and gender. For example, He is good → wah achchhaa hai and She is good → wah achchhii hai. The rules given in Table 2.10 work for predicative adjectives as well.

2.5.5 Adaptation Rules for Variations in the Morpho Tags of @SUBJ

The subject tag @SUBJ is the main and obligatory tag under the subject slot.
As listed in Table 2.6, nine possible morpho tag variations have been observed for

the subject functional tag. Within these nine possible morpho tags, there are in total four parts of speech: noun (N), proper noun (<Proper>), pronoun (PRON) and gerund (PCP1). The variations in these parts of speech may occur due to either a case ending or number. In this respect the following may be observed.
The only case ending that may occur with respect to the subject is ne. If the POS of the subject is noun or pronoun, then morphological changes may occur due to this case ending. For example,
ladkaa + ne → ladke ne (boy)
bachchaa + ne → bachche ne (child)
wah + ne → usne (he/she)
ham + ne → hamne (we)

More details of this case ending are given in Appendix A. It may be noted that
no morphological changes occur to the subject due to this case ending if the
POS of the subject is proper noun or PCP1.
Morphological changes may occur in nouns due to variations in number (singular or plural) also. For example,

boy - ladkaa | boys - ladke
house - ghar | houses - ghar (no change)
cloth - kapadaa | clothes - kapade
girl - ladkii | girls - ladkiyaan
class - kakshaa | classes - kakshaayen

In the PCP1 form, the suffix naa is always added to the root form of the verb. For example, Swimming is a good exercise. → tairnaa achchaa wyaayaam hai.


Retd \ Input | N | <proper> | PRON | PCP1
N | CP or ({WR} + {SR or SA or SD}) | WR | WR | WR + SA
<proper> | WR + {SR or SA or SD} | CP or WR | WR | WR + SA
PRON | WR + {SR or SA or SD} | WR | CP or WR | WR + SA
PCP1 | WR + {SR or SA or SD} | WR | WR | CP or WR

Table 2.11: Adaptation Operations for Subject to Subject Variations

The rule Table 2.11 presents the relevant adaptation operations for the different variations in the subject discussed above. The following examples illustrate some of these rules.
Example 1: Suppose the input sentence is The boy is playing., and the retrieved example is Boys are playing. → ladke khel rahe hain. Since the subject of the input sentence is boy, to generate its Hindi translation ladkaa only the suffix e is replaced with aa in the subject ladke of the retrieved Hindi sentence. This is because the root word of the subject in both the input and the retrieved sentence is the same, that is boy. However, if the root words of the subjects differ, then a word replacement will definitely be needed. Additionally, some suffix replacement or addition may be needed to take care of the number of the subject. For example, to adapt boy to sister only one constituent word replacement is needed: ladkaa is to be replaced with bahan. On the other hand, if boy is to be adapted to sisters (i.e. plural form), then a word replacement (ladkaa → bahan) followed by a suffix addition (bahan → bahanen) will be required.


Therefore, the cell(1,1), i.e. (N, N), corresponds to the above discussed operation set, i.e. (CP or ({WR} + {SR or SA or SD})).
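The (N, N) cell can be illustrated with a minimal suffix-replacement helper; the function below is a sketch with names of our own choosing, not code from the thesis.

```python
def suffix_replace(word, old, new):
    """SR: swap a word-final suffix on a roman-transliterated Hindi word."""
    return word[:-len(old)] + new if word.endswith(old) else word

# (N, N) cell: CP or ({WR} + {SR or SA or SD}).
# "Boys are playing." -> "The boy is playing.": same root (boy), so SR alone.
print(suffix_replace("ladke", "e", "aa"))   # ladkaa

# "boy" -> "sister": WR only (ladkaa -> bahan).
subject = "bahan"
# "boy" -> "sisters": WR followed by SA (bahan -> bahanen).
print(subject + "en")                       # bahanen
```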

Example 2: Consider the input sentence He is a good man.. Let the corresponding retrieved example be Walking is a good exercise. → sair karnaa ek achchhaa vyaayaam hai. The subject of the input sentence is he, and its POS is pronoun (PRON), while the subject of the retrieved example is walking, and its POS is the -ing verb form, i.e. gerund (PCP1). In this case, the adaptation operation mentioned in cell(4, 3), i.e. (PCP1, PRON), is required for making the changes in the retrieved translation. Here the adaptation operation is word replacement: the word sair karnaa is to be replaced with wah.
For the functional tags @OBJ and @>P, the same adaptation rule table can be used, because the morpho variations for these functional tags are the same as those for @SUBJ, as given in Table 2.6.
For the last two functional tags (@<NOM-OF, @<NOM) there is only one possible morpho tag variation, where the POS is preposition in both cases. The functional tag @<NOM-OF corresponds to the preposition of, and its translation in Hindi is either kaa, ke or kii, based on the gender, number and person of the word corresponding to the @<P tag. In case of @<NOM, since there is no particular postposition as in @<NOM-OF, a fixed Hindi translation cannot be specified; the translation takes place according to the preposition in the input sentence.


2.6 Adaptation of Interrogative Words

This section discusses sentences that start with interrogative words, which are of two types: interrogative pronouns (such as who, what, whom, which, whose) and interrogative adverbs (such as when, where, how, why). This study has been done on a selected set of representative sentences from the example base, and focuses on finding the usages of different interrogative words and the corresponding translation patterns. The major findings of this study are as follows.

- An interrogative word may have many different structures in English.
- The same interrogative word may have different Hindi translations in different contexts, and consequently, the structures of the corresponding Hindi translations may also vary.
- Different interrogative words may generate Hindi translations of the same structure.
The above findings are important from the EBMT point of view, because commonality of the interrogative words may not lead to the most useful retrieval. In order to retrieve the most similar translation example, one may have to look into sentences involving some other interrogative words. Table 2.12 shows the examples and their patterns. The interrogative sentence patterns are denoted as INi, i = 1, 2, ..., 26. These examples have been taken from the example base. The patterns of the sentences are decided from the parsed versions of various examples given by the ENGCG parser.


IN1: Who &<LV> &<S>?
     Who are you? → tum kaun ho?
IN01: Who &<LV> &<PP>?
     Who is at the door? → darwaaje par kaun hai?
IN2: Who &<V> &<O> {<PP> or <Adverb> or (<PP> <Adverb>)}?
     Who knows music? → sangiit kaun jaantaa hai?
     Who has played tunes on guitar well? → gitaar par dhun kisne achchhii bajaaii hai?
IN3: Who &<AuxV> &<S> &<MainV> {<Adverb>}?
     Who do you like most? → tum kisko sab se zyaadaa pasand karte ho?
IN4: Who &<AuxV> &<S> &<MainV> &<Preposition>?
     Who are you laughing at? → tum kis par hans rahe ho?
IN5: What &<LV> &<SC>?
     What is this? → yah kyaa hai?
IN6: What &<AuxV> &<S> &<MainV>?
     What do you like? → tum kyaa pasand karte ho?
IN7: What &<N> &<LV> &<SC/S>?
     What color is the cap? → topii kaun se rang kii hai? OR topii kis rang kii hai?
     What shapes are these balls? → ye gendhen kaun se aakaaro kii hain? OR ye gendhen kin aakaaro kii hain?
IN8: What &<N> &<AuxV> &<S> &<MainV> {<Adverb>}?
     What book have you read recently? → tum ne kaun sii kitaab abhii padhii hai? OR tum ne kis kitaab ko abhii padhaa hai?
IN9: Which &<LV> &<SC>?
     Which is the best book? → kaun sii kitaab sab se achchhii hai?
IN10: Which &<AuxV> &<S> &<MainV> &<Adverb>?
     Which do you feel better? → tumhe kaun saa zyaadaa sahii lagtaa hai?
IN11: Which &<N> &<LV> &(<SC> or <S>) {<PP>}?
     Which fruit is good for his health? → kaun saa phal us kii sehat ke liye achchhaa hai?
     Which fruit is this? → yah kaun saa phal hai?
IN12: Which &<N> &<AuxV> &<S> &<MainV> {<Adverb>}?
     Which book will you take? → tum kaun sii kitaab logii? OR tum kis kitaab ko logii?
     Which student will you call? → tum kis chhaatra ko bulaaoge? OR tum kaun saa chhaatra bulaaoge?
IN13: <Preposition> &which &<N> &<AuxV> &<S> &<MainV> {<O>}?
     In which hotel will you stay? → tum kis hotal main rukoge? OR tum kaun se hotal main rukoge?
     To which boy did you give the book? → tum ne kis ladke ko kitaab dii? OR tum ne kaun se ladke ko kitaab dii?
     To which boys did you give the books? → tum ne kin ladkon ko kitaaben dii? OR tum ne kaun se ladkon ko kitaaben dii?
IN14: Whose &<N> &<LV> &(<S> or <SC>)?
     Whose book is this? → yah kiskii kitaab hai?
IN15: Whose &<N> &<AuxV> &<S> &<MainV>?
     Whose book are you reading? → tum kiskii kitaab padh rahe ho?
IN16: Whose &<N> &<V> &{(<PP>) or (<Adverb>) or (<PP> <Adverb>)}?
     Whose pen is lying on the table? → kiskii kalam mez par padii hai?
IN17: <Preposition> &whose &<N> &<AuxV> &<S> &<MainV>?
     In whose name will you transfer this home? → tum kiske naam par yah ghar hastaantaran karoge?
IN18: <Preposition> &whom &<AuxV> &<S> &<MainV>?
     To whom are you listening? → tum kis ko sun rahe ho?
IN19: Whom &<AuxV> &<S> &<MainV>?
     Whom do you prefer? → tum kisko pasand karte ho?
IN20: Why &<LV> &<S> &(<Adverb> or <AdjP>)?
     Why are you here? → tum yahaan kyon ho?
     Why are you so stupid? → tum itne murkh kyon ho?
IN21: Why &<AuxV> &<S> &<MainV> {(<O>) or (<Adverb>) or (<O> <Adverb>)}?
     Why did you abuse the old man? → tum budhe aadmii ko gaalii kyon de rahe ho?
     Why are you weeping? → tum kyon ro rahe ho?
IN22: Where &<LV> &<S> &{<Preposition> or <Adverb>}?
     Where are you? → tum kahaan ho?
IN23: Where &<AuxV> &<S> &<MainV>?
     Where do you live? → tum kahaan rahte ho?
IN24: When &<AuxV> &<S> &<MainV> &{<O> or <Adverb> or <PP> or (<O> <Adverb>)}?
     When did the British quit India? → british bhaarat ko kab choda kar gaye?
     When does he go to bed? → wah bistar par kab jataa hai?
IN25: How &<LV> &<S> &{<Adverb>}?
     How are you? → tum kaise ho?
     How is she? → wah kaisii hai?
     How is he? → wah kaisaa hai?
IN26: How &<AuxV> &<S> &<MainV> {<O> or <Adverb> or (<O> <Adverb>)}?
     How are you feeling now? → tumhe ab kaisaa lag rahaa hai?
     How is she looking today? → wah aaj kaisii dikh rahii hai?

Table 2.12: Different Sentence Patterns of Interrogative Words

Each interrogative word has a particular role in a sentence. According to these roles, the parser assigns different functional tags and corresponding morpho tags to the interrogative words. The functional tags and corresponding morpho tags assigned by the parser to the interrogative words of the above sentences are given in Table 2.13.
Interrogative word | Sentence pattern No. | Functional & Morpho tags
Who | IN1 | @PCOMPL-S PRON WH SG/PL
Who | IN01, IN2 | @SUB PRON WH SG/PL
Who | IN3 | @OBJ PRON WH SG/PL
Who | IN4 | @<P PRON WH SG/PL
What | IN5 | @SUB PRON WH SG/PL
What | IN6 | @OBJ PRON WH SG/PL
What | IN7, IN8 | @DN> DET WH SG/PL
Which | IN9 | @SUB PRON WH SG/PL
Which | IN10 | @OBJ PRON WH SG/PL
Which | IN11, IN12, IN13 | @DN> DET WH SG/PL
Whose | IN14, IN15, IN16, IN17 | @GN> DET WH GEN SG/PL
Whom | IN18 | @<P PRON WH SG/PL
Whom | IN19 | @OBJ PRON WH SG/PL
Why | IN20, IN21 | @ADVL ADV WH
Where | IN22, IN23 | @ADVL ADV WH
When | IN24 | @ADVL ADV WH
How | IN25, IN26 | @ADVL ADV WH

Table 2.13: Functional & Morpho Tags Corresponding to Each Interrogative Sentence Pattern

Note that Table 2.12 by no means provides an exhaustive list of English sentence patterns involving interrogative words. However, these are the sentence patterns that are predominantly present in our example base. By examining the structures of the corresponding Hindi sentences one can easily see that it is the role of the word concerned that is most important in determining the Hindi sentence structure. One may also find that an interrogative word having more than one functional tag may have different translations in Hindi, which certainly implies different translation structures. These variations corresponding to each interrogative word are explained below.
Variation in translation of who: Table 2.13 shows four different functional tags for this word. The observed translation patterns for these are as follows:

1. @PCOMPL-S: If who is used as the subjective complement, as in pattern IN1, then the only way it is translated into Hindi is kaun.
2. @SUB: When who is used as a subject, as in patterns IN01 and IN2, its translation into Hindi may have two possibilities, depending upon the tense of the sentence in case of the IN2 sentence pattern. If the tense and verb form are present perfect, past indefinite or past perfect, the translation of who in Hindi is kisne. In all other tense and verb forms who is translated as kaun. However, the translation of who in case of the IN01 sentence pattern is always kaun.
3. @OBJ: If the functional tag assigned to who is @OBJ (as in pattern IN3), then its Hindi translation is kisko.
4. @<P: This tag implies that who is used as a complement of a preposition (as in pattern IN4). In this case the Hindi translation is kis.
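The four cases above can be summarized as a toy lookup. The tag names follow the ENGCG-style tags used in the text; the function itself, including how tense is passed in, is an illustrative assumption of this sketch.

```python
def translate_who(functional_tag, tense=None):
    """Pick the Hindi rendering of 'who' from its functional tag (and tense)."""
    if functional_tag == "@PCOMPL-S":
        return "kaun"
    if functional_tag == "@SUB":
        # "kisne" only for present perfect / past indefinite / past perfect
        if tense in {"present perfect", "past indefinite", "past perfect"}:
            return "kisne"
        return "kaun"
    if functional_tag == "@OBJ":
        return "kisko"
    if functional_tag == "@<P":
        return "kis"
    raise ValueError("unexpected functional tag for 'who'")

print(translate_who("@SUB", "past indefinite"))     # kisne
print(translate_who("@SUB", "present continuous"))  # kaun
print(translate_who("@<P"))                         # kis
```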

Variation in translation of the word what: The four translation patterns of sentences involving what as the interrogative word show only three functional tags for the word what (see Table 2.13). Their translations have the following variations.

1. In translation patterns IN5 and IN6 the interrogative word what is used as subject (@SUB) and object (@OBJ), respectively. In both cases what is translated as kyaa.
2. In case of sentence patterns IN7 and IN8 the word what is used as a determiner and its functional tag is @DN>. However, due to the variations in the overall sentence patterns, different translations for the word what have been observed in these two cases. In both cases, the Hindi translation is of the form kaun followed by one of {saa, se, sii} depending upon the number and gender of the noun following the word what. However, in both IN7 and IN8 one more translation of what has been observed, i.e. kis or kin according to the number of the noun following the word what. Further, the morpho-word kii is added after the noun in case of the IN7 sentence pattern, while in case of IN8 the morpho-word ko is added after the noun.

Variation in translation of the word which: As shown in Table 2.12, five different sentence patterns, viz. IN9 to IN13, have been observed corresponding to the word which. In all these cases, although the functional tag for the word which varies, its translation into Hindi is done in the same way, using the word kaun followed by one of the morpho-words from the set {saa, sii, se} depending upon the number and gender of the noun following the word which.
However, in both IN12 and IN13, one more translation of which has been observed, i.e. kis or kin according to the number of the noun following the word which. Further, the morpho-word ko is added after the noun.
Variation in translation of the word whose: Although four different sentence patterns (i.e. IN14, IN15, IN16 and IN17) have been observed for English sentences involving the word whose, in all of them the functional tag of this word has been found to be @GN>. Consequently, its translation into Hindi is also found to be the same, i.e. one from the set {kiskaa, kiske, kiskii, kinkii, kinkaa, kinke}. The actual usage depends upon the gender and number of the noun following the word whose.
Variation in translation of the word whom: Two possibilities have been observed in this case:

1. @<P: Under this functional tag the word whom is used as a complement of a preposition, as in sentence pattern IN18. In this case the Hindi translation of this word is kis.
2. @OBJ: In this case the functional tag corresponding to whom is object, as in sentence pattern IN19. The corresponding Hindi translation of this word is kisko.

Variation in translation of interrogative adverbs: Under this case four words have been studied: why, where, when and how. Their Hindi translations are as follows:

- Irrespective of the sentence patterns (i.e. IN20, IN21, IN22, IN23 and IN24), the first three of the above four interrogative adverbs have unique translations in Hindi. The Hindi translation of why is kyon, that of where is kahaan, while when is translated as kab.
- In both the sentence patterns IN25 and IN26, the translation of the word how into Hindi is one from the set {kaisaa, kaisii, kaise}. This variation in the translation is governed by the gender and number of the subject of the underlying sentence pattern.
The above study of interrogative words suggests that sentences having different interrogative words may have the same translation pattern. The following examples illustrate this point.
(A) Why are you going today? → tum aaj kyon jaa rahe ho?
(B) Where are you going today? → tum aaj kahaan jaa rahe ho?
(C) When are you going today? → tum aaj kab jaa rahe ho?

Suppose one has to generate the translation of the English sentence How are you
going today?. Its Hindi translation is tum aaj kaise jaa rahe ho? . Obviously, this
translation can be generated easily if one of the above three examples is considered
as a retrieved sentence.
Based on the above observations we cluster the above sentence patterns into
several groups as given below.

G1: IN21, IN23, IN24 and IN26
G2: IN1, IN01, IN20, IN22 and IN25
G3: IN2, IN3, IN6, IN10 and IN19
G4: IN7, IN11 and IN14
G5: IN8, IN12, IN15 and IN16
G6: IN5 and IN9
G7: IN4 and IN18
G8: IN13 and IN17

Adaptation of the interrogative word within each group of sentences is relatively easy, and typically can be done using simple operations. Table 2.14 shows the operations required for adaptation within the group G5. However, adaptation between two different groups may not be so simple, because the remaining part of the sentence also needs to be taken into consideration, and therefore more structural transformation of the retrieved examples will be needed.
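The grouping above can be represented as data so that a retrieval component can test same-group membership before attempting a simple adaptation. The helper names below are assumptions of this sketch, not part of the thesis.

```python
# The eight groups of interrogative sentence patterns, as listed above.
GROUPS = {
    "G1": {"IN21", "IN23", "IN24", "IN26"},
    "G2": {"IN1", "IN01", "IN20", "IN22", "IN25"},
    "G3": {"IN2", "IN3", "IN6", "IN10", "IN19"},
    "G4": {"IN7", "IN11", "IN14"},
    "G5": {"IN8", "IN12", "IN15", "IN16"},
    "G6": {"IN5", "IN9"},
    "G7": {"IN4", "IN18"},
    "G8": {"IN13", "IN17"},
}

def group_of(pattern):
    """Return the group name of a pattern, or None if it is not listed."""
    return next((g for g, members in GROUPS.items() if pattern in members), None)

def easily_adaptable(input_pat, retrieved_pat):
    """Same-group patterns adapt with simple operations (see Table 2.14)."""
    g = group_of(input_pat)
    return g is not None and g == group_of(retrieved_pat)

# Why/Where/When/How &<AuxV> &<S> &<MainV> ... patterns all sit in G1,
# so those examples adapt to one another by a simple word replacement.
print(easily_adaptable("IN26", "IN21"))  # True  (both in G1)
print(easily_adaptable("IN26", "IN1"))   # False (G1 vs G2)
```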


Retd \ Input | IN8 | IN12 | IN15 | IN16
IN8 | CP or WR | CP or WR | WR + SA + WD | WR + SA + WD
IN12 | CP or WR | CP or WR | WR + SA + WD | WR + SA + WD
IN15 | WR + WA | WR + WA | CP or SR | CP or SR
IN16 | WR + WA | WR + WA | CP or SR | CP or SR

Table 2.14: Adaptability Rules for Group G5 Sentence Patterns

2.7 Adaptation Rules for Variation in Kind of Sentences

Here we consider four kinds of sentences: Affirmative (AFF), Interrogative (INT),


Negative (NEG) and Negative-Interrogative (NINT). Typical sentence structures of
these four types are given in Figure 2.3.

Ram eats rice. → ram chaawal khaataa hai
Ram does not eat rice. → ram chaawal nahiin khaataa hai
Does Ram eat rice? → kyaa ram chaawal khaataa hai?
Does Ram not eat rice? → kyaa ram chaawal nahiin khaataa hai?

Figure 2.3: Some Typical Sentence Structures


One may notice that in Hindi the negative and interrogative structures are obtained by the addition of the words nahiin and kyaa, respectively. Also note that the position of kyaa is always at the beginning of the sentence; hence its addition or deletion needs no traversal through the sentence. Typically, nahiin occurs before the main verb of the Hindi sentence. However, since Hindi is a relatively free word-order language, it may occur at some other position also. The adaptation operations are, therefore, as follows:

- Word addition (WA) (for nahiin);
- Word deletion (WD) (for nahiin);
- Morpho-word addition (MA) (for kyaa); and
- Morpho-word deletion (MD) (for kyaa).

Table 2.15 gives the required operations for all types of variation in the kind of
sentences. The expressions are obtained by deciding upon which of the words are
being added and/or deleted for the adaptation.
Retd \ Input | AFF | NEG | INT | NINT
AFF | CP | WA | MA | WA + MA
NEG | WD | CP | MA + WD | MA
INT | MD | WA + MD | CP | WA
NINT | WD + MD | MD | WD | CP

Table 2.15: Adaptation Rules for Variation in Kind of Sentences
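Table 2.15 lends itself to a direct lookup implementation. The sketch below realizes the four operations on a Hindi token list; passing the main verb in explicitly is a simplification of this sketch, since locating it would normally require the parse.

```python
# (retrieved kind, input kind) -> required operations, per Table 2.15.
OPS = {
    ("AFF", "AFF"): [],            ("AFF", "NEG"): ["WA"],
    ("AFF", "INT"): ["MA"],        ("AFF", "NINT"): ["WA", "MA"],
    ("NEG", "AFF"): ["WD"],        ("NEG", "NEG"): [],
    ("NEG", "INT"): ["MA", "WD"],  ("NEG", "NINT"): ["MA"],
    ("INT", "AFF"): ["MD"],        ("INT", "NEG"): ["WA", "MD"],
    ("INT", "INT"): [],            ("INT", "NINT"): ["WA"],
    ("NINT", "AFF"): ["WD", "MD"], ("NINT", "NEG"): ["MD"],
    ("NINT", "INT"): ["WD"],       ("NINT", "NINT"): [],
}

def adapt_kind(tokens, retd, inp, main_verb):
    """Apply the Table 2.15 operations to a retrieved Hindi token list."""
    tokens = list(tokens)
    for op in OPS[(retd, inp)]:
        if op == "WA":                                  # add "nahiin"
            tokens.insert(tokens.index(main_verb), "nahiin")
        elif op == "WD":                                # drop "nahiin"
            tokens.remove("nahiin")
        elif op == "MA":                                # prepend "kyaa"
            tokens.insert(0, "kyaa")
        elif op == "MD":                                # drop "kyaa"
            tokens.remove("kyaa")
    return tokens

aff = "ram chaawal khaataa hai".split()
print(" ".join(adapt_kind(aff, "AFF", "NINT", "khaataa")))
# kyaa ram chaawal nahiin khaataa hai
```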


2.8 Concluding Remarks

In this chapter we have described different adaptation operations that may be used for adapting a retrieved Hindi translation example to generate the translation of a given input. The novelty of the scheme is that not only does it work at the word level, it deals with suffixes as well. The advantage of the scheme is that, since the number of suffixes is very limited, it reduces the overall cost of adaptation. Chapter 5 discusses how the cost for each of the operations is evaluated.
This chapter looks into the process of adaptation itself. The adaptation operations described in this chapter are to be used in succession in order to generate the required translation. The overall adaptation scheme first has to look into the discrepancies between the input sentence and the retrieved example. Discrepancies may occur in different functional slots of the sentences, and also in the kind of sentences. Once the discrepancies are identified, appropriate adaptation operations have to be applied to remove them. Thus successive applications of these operations will generate the required translation in an incremental way.
In this chapter we have considered variations in the different tense and verb forms, both active and passive, variations in the subject/object functional slots, and variations in wh-family words (e.g. what, who, where, when) and their sentence patterns. We have also worked on modal verbs (e.g. should, might, can, could, may) and their respective sentence patterns. However, due to the similar nature of the discussion we do not elaborate on them in this report.
Of the different sentence kinds, we have discussed four (viz. affirmative, negative, interrogative and negative-interrogative) in this chapter. Evidently one may find many other kinds of sentences (e.g. imperative, exclamatory). We have not dealt with them in this work; however, we feel that they can be treated in a similar fashion.
With respect to each of the variations we have identified the minimum number of operations that are required for the overall adaptation of the retrieved example. We have presented these required operations in the form of various tables. The advantage of these tables is that they can be used as yardsticks for measuring the total adaptation cost, which in turn may be used as a measure of similarity between an input sentence and the sentences of the example base. These issues are discussed in Chapter 5.
The above-mentioned scheme of adaptation works well under the implicit assumption that translations of similar source language sentences are similar in the target language as well. However, in reality one may find examples where this assumption does not hold good. For example, consider the two English sentences It is running. and It is raining.. Although these two sentences are structurally very similar, their Hindi translations are structurally very different. The first sentence is translated as wah (it) bhaag (run) rahaa (..ing) hai (is), but the second one is translated as baarish (rain) ho (be) rahii (..ing) hai (is). Hence, if the second sentence is retrieved from the example base in order to translate the first one, the above-mentioned adaptation procedure will not be able to produce the correct translation of the said input. Such instances arise primarily due to some inherent characteristics of the source and target languages, and the phenomenon is termed translation divergence (Dorr, 1993). The existence of translation divergences makes the straightforward transfer from source structures into target structures difficult. A study of adaptation therefore needs a careful study of divergence as well. The following chapter discusses divergences in English to Hindi translation in detail.

Chapter 3
An FT and SPAC Based
Divergence Identification
Technique From Example Base


3.1 Introduction

Divergence is a common phenomenon in translation between two natural languages. Typically, translation divergence occurs when structurally similar sentences of the source language do not translate into sentences that are similar in structure in the target language (Dorr, 1993). As a consequence, dealing with divergence assumes special significance in the domain of EBMT.
For illustration, consider the following English sentences and their Hindi translations:
(A) She is in a shock. → wah (she) sadme (shock) mein (in) hai (is)
(B) She is in trouble. → wah (she) pareshaanii (trouble) mein (in) hai (is)
(C) She is in panic. → wah (she) ghabraa (panic) rahii (..ing) hai (is)

Items (A) and (B) above are examples of a normal translation pattern. The prepositional phrases (PP) of the English sentences are realized as PPs in Hindi, and the fact that the (post)positions occur after the corresponding nouns is in accordance with Hindi syntax. However, in example (C) one may notice a huge structural variation. Here, the sense of the prepositional phrase in panic is realized by the verb ghabraa rahii hai (is panicking). Hence this is an instance of a translation divergence.
Assuming that the English sentence in (A) is given as the input to an English to
Hindi EBMT system, two scenarios may be considered:

1. The retrieved example is (B) i.e. She is in trouble. In this case, the correct
Hindi translation may be generated in a straightforward way by using word

replacement operation to replace pareshaanii with sadme.


2. If example (C) is retrieved for adaptation, the generated translation may be wah (she) sadmaa (shock) rahii (..ing) hai (is), which is not a syntactically correct Hindi sentence.

Thus the output of the system will depend entirely on the sentence ((B) or (C)) which is retrieved to generate the translation of the input (A). Given the very similar structures of the three sentences, the retrieval may eventually depend on the semantic similarity of the prepositional phrase (PP) of the input with the PPs of the stored examples. With respect to the above illustration, this implies that similarity between the sentences may be measured by the semantic similarity between shock and trouble in case (1), and between shock and panic in case (2). Table 3.1 gives these similarity values under different schemes given in the WordNet::Similarity web interface (http://www.d.umn.edu/mich0212/cgi-bin/similarity/similarity.cgi), considering the words as nouns, and taking their sense number 1 as given by WordNet 2.0.
Similarity measure       shock and trouble    shock and panic
Lin                      0.2989               0.5172
Leacock & Chodorow       1.3863               1.6376
Resnik                   2.734                5.2654
Jiang & Conrath          0.078                0.1017
Wu & Palmer              0.3333               0.5
Path lengths             0.1111               0.1429
Adapted Lesk             1                    2

Table 3.1: Different Semantic Similarity Scores between shock and trouble, and between shock and panic



The above values show that under all the above measures panic is more similar to shock. From the translation point of view, however, example (B) proves to be more useful in producing the appropriate translation. This happens because of the presence of divergence in the translation of example (C).
Identification of divergence may therefore be considered paramount for an EBMT system. An identification algorithm may be used to partition the example base into two classes: divergence and normal. This, in turn, helps in efficient retrieval of past examples, which enhances the performance of an EBMT system. The present work aims at designing algorithms for identification of divergence examples in an example base of translations.
This chapter is organized as follows. Section 3.2 discusses some related past
work on divergence and its identification. Section 3.3 presents a detailed study of
divergence categories for English to Hindi translation along with their identification
algorithms.

3.2 Divergence and Its Identification: Some Relevant Past Work

Various approaches have been pursued in dealing with translation divergence. These
may be classified into four categories:

1. Transfer approach. Here transfer rules are used for transforming a source language (SL) sentence into the target language (TL) by performing lexical and structural manipulations. These rules may be formed in several ways: by manual encoding (Han et al., 2000), by analysis of parsed aligned bilingual corpora (Watanabe et al., 2000), etc.
2. Interlingua approach. Here, identification and resolution of divergence are based on two mappings, the Generalized Linking Routine (GLR) and the Canonical Syntactic Realization (CSR), together with a set of Lexical Conceptual Structure (LCS) parameters. In general, translation divergence occurs when there is an
exception either to the GLR or to the CSR (or to both) in one language but
not in the other. This premise allows one to formally define a classification
of all possible lexical-semantic divergences that could arise during translation.
This approach has been pursued in the UNITRAN (Dorr, 1993) system that
deals with translation from English to Spanish and English to German.
3. Generation-Heavy Machine Translation (GHMT) approach. This scheme works in two steps. In the first step, rich target language resources, such as word lexical semantics, categorial variations and sub-categorization frames, are used for generating multiple structural variations from a target-glossed syntactic dependency representation of SL sentences. This is the symbolic overgeneration step, and it is constrained by a statistical TL model that accounts for possible translation divergences. Finally, a statistical extractor is used for extracting a preferred sentence from the word lattice of possibilities. Evidently, this scheme bypasses explicit identification of divergence and generates translations (which may include divergent sentences) directly. MATADOR (Habash, 2003), a system for translation between Spanish and English, uses this approach.
4. Universal Networking Language (UNL) based approach. UNL has been developed to play the role of an Interlingua to access, transfer and process information on the Internet (Uchida and Zhu, 1998). In UNL, sentences are represented using hypergraphs with concepts as nodes and relations as directed arcs. A dictionary of concepts (termed Universal Words, or UWs) is maintained. A divergence is said to occur if the UNL expressions generated by the source and target language analyzers differ in structure. This approach has been proposed for English to Hindi machine translation in (Dave et al., 2002).
Each of the above schemes, however, has its own shortcomings when applied in the English to Hindi context. For example, the Generation-Heavy approach requires rich resources for the target language. Creation of such heavy resources requires a significant amount of effort, and they are not currently available for Hindi. The Interlingua approach requires deep semantic analysis of the sentences, but it has been observed elsewhere that an MT system can work even without such semantic details (Dorr et al., 1998). Similarly, creating an exhaustive set of rules to capture all the lexical and structural variations that may be witnessed in English to Hindi translation is too formidable a task. Even in the case of the UNL based approach, each UW of the dictionary contains deep syntactic, semantic and morphological knowledge about the word. Creation of such a dictionary even for a restricted domain is difficult, and needs deep semantic analysis of each word.
With respect to Hindi, the major problem in applying the above techniques is that such linguistic resources are not freely available. As a consequence, application of the above techniques in the English-Hindi context is severely constrained, at least presently, due to the scarcity of linguistic resources for Hindi. Although Hindi is one of the major languages of the present world, research in NLP on Hindi (and on other Indian languages too) is still in its infancy. Even though research in NLP involving Indian languages has been enthusiastically pursued, and has been sponsored by the government and several educational institutes over the last few years (http://tdil.mit.gov.in/tdilsept2001.pdf), it will take some time before various linguistic resources are easily available. This motivates us to develop a simpler algorithm that requires as few linguistic resources as possible. The usefulness of such techniques will be twofold:

1. Study of EBMT for Hindi can be pursued successfully.

2. The methods can be used for other languages too, where linguistic resources are scarce.

The proposed approach uses only the functional tags (FT) and the syntactic phrasal annotated chunk (SPAC) structures of the source language (SL) and target language (TL) sentences for identification of divergence in a translation example. A translation divergence occurs when some particular FT, upon translation, is realized with the help of some other FT in the target language. The occurrence of divergence may thus be identified by comparing the roles of the different constituent words in the source and target language sentences. The proposed approach therefore aims at designing an algorithm that uses as few linguistic resources as possible.
The most fundamental task before developing any such algorithm is to determine the different types of divergence that may be found in English to Hindi translation. Since divergence is a language-dependent phenomenon, it is not expected that the same set of divergences will occur across all languages. In this respect one may refer to (Dorr, 1993), which provides the most detailed categorization of lexical-semantic divergences for translation among the European languages. There divergence has been put into seven broad types: structural, conflational, categorial, promotional, demotional, thematic and lexical. Section 3.3 discusses these divergence types in detail. In more recent work (Dorr et al., 2002; Habash and Dorr, 2002), the divergence categories have been redefined. Under the new scheme six different types of divergence are considered: light verb construction, manner conflation, head swapping, thematic, categorial, and structural. The differences between the two categorizations may be summarized as follows:

1. A light verb construction involves a single verb in one language being translated using a combination of a semantically light verb and another meaning unit (generally a noun) to convey the appropriate meaning. In the English to Hindi context (and perhaps for many other Indian languages) such constructions are very common. Hence this is not considered a divergence for English to Hindi translation. This point will be discussed in detail later, under conflational divergence.

2. Head swapping essentially combines both promotional and demotional divergences under one heading.

3. Lexical divergence, which is a mixture of more than one divergence, has not been considered.

4. All other divergence categories remain as they are under the new scheme.

Thus, the new categorization is essentially a regrouping of some of the above types. The basic motivation behind the present work is to study the relevance of the above-mentioned seven types of divergence in the context of English to Hindi translation. For this work we have analyzed more than 4500 translation examples

obtained from different bilingual sources (such as storybooks, translation books and recipe books). This analysis suggests that English to Hindi translation divergence is in many cases somewhat different in its characteristics, and therefore needs to be redefined. In the following subsections we describe the various types of divergence that may be found in the context of English to Hindi translation, and their sub-types. We also discuss the algorithm to identify each type of divergence, and its characteristics, in more detail.
It may be noted that Dave et al. (2002) also studied English to Hindi divergence in detail. However, they restricted their discussion to the above-mentioned seven categories only. Our studies of English to Hindi translation divergences reveal the following:

1. Not all of the above-mentioned seven categories apply to English to Hindi translation.

2. Instances of thematic and promotional divergence have not been found in English to Hindi translation.

3. Structural divergence, in the English to Hindi context, occurs in the same way as in European languages.

4. Some variations from the definitions given in (Dorr, 1993) may be noticed in the occurrence of categorial, conflational and demotional divergences.

5. Three new types of divergence may be found with respect to English to Hindi translation. These are named nominal, pronominal and possessional.

6. Most of the divergence types may be further subdivided into several sub-types.



In Section 3.3 we discuss all the relevant divergence types and their sub-types
that we have observed in English to Hindi translation, and provide algorithms for
their identification. As mentioned earlier, the identification technique uses functional
tags (FT) and syntactic phrase annotated chunk (SPAC) of both the source language
sentence and its translation. For each divergence type we identify the FTs that are
instrumental in causing the divergence. Each divergence type is defined on the basis
of which FTs of the English sentence it is concerned with, and also to which FTs it
is mapped upon translation.
The proposed algorithm requires the following FTs and SPAC categories for both languages:

FTs: subject (S), object (O), verb (V), subjective complement (SC), adjectival complement by preposition (SC C), subjective predicative adjunct (PA), verb complement (VC) and adjunct (A).

Categories in the SPAC structure are as follows:

POS tags: noun (N), adjective (Adj), verb (V), auxiliary verb (AuxV), preposition (P), adverb (Adv), determiner (DT), personal pronoun (PRP), possessive case of personal pronoun (PRP$) and cardinal number (CD).

Phrases: N, Adj, V, Adv and P are called the lexical heads of the phrases. For each category a suffix P is used to denote the corresponding phrase.

In Appendix B and Appendix C, definitions of these FTs and SPACs are discussed in detail. With this background we proceed to define the divergence types/sub-types and their identification schemes.


3.3 Divergences and Their Identification in English to Hindi Translation

We order the different divergence types on the basis of the FTs of the source language sentence with which they are concerned. Accordingly, we observe the following:

- Structural divergence is concerned with the object of the English sentence.

- Categorial divergence is characterized by how the subjective complement (SC) and predicative adjunct (PA) of the English sentence are realized upon translation.

- Nominal divergence concerns the SC of the English sentence.

- Pronominal divergence is related to both the SC and the verb of the English sentence.

- Demotional divergence, conflational divergence and possessional divergence may be identified by studying how the main verb of the English sentence is realized upon translation.

In the following subsections we describe the different divergence types and their identification schemes. In the description of all the algorithms the following conventions for representation will be followed:

a) The input to the algorithms will be an English sentence and its Hindi translation. These will be denoted as E and H, respectively.

b) Each identification algorithm will return 0 if the particular divergence is absent in the sentence pair. Otherwise it will return a value n, indicating that the corresponding divergence is present in the translation, and that its sub-type is n. It may be noted that the number of possible sub-types may differ between divergence types.
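Conventions (a) and (b) can be sketched as a small driver that runs every identifier over a translation pair and partitions the example base accordingly. This is a minimal illustrative sketch, not the thesis's implementation: the dictionary representation of E and H and the toy identifier function are assumptions made here for demonstration.

```python
def classify_example(e, h, identifiers):
    """Run every divergence identifier on a translation pair (E, H).

    `identifiers` maps a divergence name to an identification function
    that returns 0 if the divergence is absent, or the sub-type number n
    otherwise (conventions (a) and (b) above). Returns the divergences
    found, or marks the pair as a normal translation.
    """
    found = {name: identify(e, h)
             for name, identify in identifiers.items()
             if identify(e, h) != 0}
    return found if found else {"normal": 0}

# Toy identifier standing in for the algorithms of Section 3.3.
identifiers = {
    "structural": lambda e, h: 1 if (e["object_spac"] == "NP"
                                     and h["object_spac"] == "PP") else 0,
}
e = {"object_spac": "NP"}   # hypothetical parse of an English sentence
h = {"object_spac": "PP"}   # hypothetical parse of its Hindi translation
print(classify_example(e, h, identifiers))  # -> {'structural': 1}
```

A pair on which every identifier returns 0 would be filed in the normal partition of the example base.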

3.3.1 Structural Divergence

A structural divergence is said to have occurred if the object of the English sentence
is realized as a noun phrase (NP) but upon translation in Hindi is realized as a
prepositional phrase (PP). The following examples illustrate this. One may note
that different Hindi prepositions (e.g. se, par, ko, kaa) have been used in different
contexts leading to structural divergence.

Ram will attend this meeting.    ram (Ram) iss (this) sabhaa (meeting) mein (in) jaayegaa (will go)

Ram married Sita.    ram ne (Ram) sita (Sita) se (with) vivah kiyaa (married)

Ram will answer this question.    ram (Ram) iss (this) prashn (question) kaa uttar degaa (will answer)

Ram will beat Mohan.    ram (Ram) mohan ko (Mohan) maregaa (will beat)

Ram has signed the paper.    ram ne (Ram) kagaj (paper) par (on) hastaakshar (signature) kar diyaa hai (has done)

Analysis of various translation examples reveals the following points with respect to structural divergence, which we use to design the identification algorithm:

- If the main verb of the English sentence is a declension of the be verb, then structural divergence cannot occur.

- Structural divergence deals with the objects of both the English sentence and its Hindi translation. Therefore, if either of the two sentences has no object, then structural divergence cannot occur.

- If both sentences have objects and their SPAC structures are the same, then structural divergence does not occur either.

- In other situations structural divergence may occur only if the SPAC of the object of the English sentence is an NP, and the SPAC of the object of the Hindi sentence is a PP.
The algorithm for identification of structural divergence has been designed to take care of the above conditions. Figure 3.1 gives the corresponding algorithm.

Step1. IF(root word of the main verb of E is "be") THEN RETURN(0)
Step2. IF((the object of E is null) OR (the object of H is null)) THEN RETURN(0)
Step3. IF(the SPAC of the object of E EQUALS the SPAC of the object of H) THEN RETURN(0)
Step4. IF((the SPAC of the object of E is NP) AND (the SPAC of the object of H is PP)) THEN RETURN(1)

Figure 3.1: Algorithm for Identification of Structural Divergence

Figure 3.2: Correspondence of SPACs of E and H for Identification of Structural Divergence

For structural divergence, as discussed above, there is only one possible sub-type. Thus, depending upon the case, the algorithm given in Figure 3.1 returns either 0 or 1.

Illustration

Consider for illustration the following sentence pair:

E: Andre will marry Steffi.
H: andre (Andre) steffi (Steffi) se (from) vivaah karegaa (will marry)

The SPACs of these two sentences and their correspondences are given in Figure 3.2. Here bold arrows represent correspondence, and dotted lines indicate no correspondence. Note that the objects of E and H are not null; in E the object is Steffi, whereas in H the object is steffi se. But their SPACs are [NP [Steffi / N]] and [PP [NP [steffi / N]] [se / P]], respectively, which are not equal. Therefore, structural divergence is identified.
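The steps of Figure 3.1 can be sketched as a short Python function. The dictionary representation of the parsed sentences used here is an assumption made for illustration; in the thesis, E and H would carry full FT and SPAC structures produced by a parser and chunker.

```python
def identify_structural(e, h):
    """Sketch of the structural-divergence check (Figure 3.1).
    Returns 1 if structural divergence is present, 0 otherwise."""
    # Step 1: a 'be' main verb rules out structural divergence.
    if e["main_verb_root"] == "be":
        return 0
    # Step 2: both E and H must have an object.
    if e.get("object_spac") is None or h.get("object_spac") is None:
        return 0
    # Step 3: identical object SPACs indicate a normal translation.
    if e["object_spac"] == h["object_spac"]:
        return 0
    # Step 4: an English NP object realized as a Hindi PP object.
    if e["object_spac"] == "NP" and h["object_spac"] == "PP":
        return 1
    return 0

# "Andre will marry Steffi." -> "andre steffi se vivaah karegaa"
e = {"main_verb_root": "marry", "object_spac": "NP"}
h = {"main_verb_root": "vivaah kar", "object_spac": "PP"}
print(identify_structural(e, h))  # -> 1
```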

3.3.2 Categorial Divergence

Categorial divergence concerns the subjective complement (SC) or predicative adjunct (PA), if any, of the English sentence. In the event of categorial divergence, the SC or PA, upon translation, is realized as the main verb of the Hindi sentence. This happens irrespective of whether the SC is an NP or an AdjP, or whether the PA is a PP or an adverb, in the underlying English sentence. Thus categorial divergence in the English to Hindi translation context is different from the definition given by Dorr (1993) in the context of European languages. There, categorial divergence is concerned with adjectival SCs which upon translation map into a noun, verb or PP. This subtle difference allows us to redefine categorial divergence in the English to Hindi context. In particular, depending upon the nature of the SC or PA, four sub-types have been identified.
The definitions and characteristics of the four sub-types are given below.
1. Categorial sub-type 1: This divergence occurs when the SC of the English sentence is an adjective, but upon translation is realized as the main verb of the Hindi sentence. For illustration, consider:

Ram is afraid of lion.    ram (Ram) sher (lion) se (of) dartaa hai (fears)

The adjective afraid of the English sentence is realized in Hindi by the verb darnaa, meaning to fear; dartaa hai is its conjugation for the present indefinite tense, when the subject is third person, singular and masculine.
2. Categorial sub-type 2: Here the SC is an NP in the English sentence. Upon translation the noun part gives the verb of the corresponding Hindi sentence, and the adjective part is realized as an adverb. Consider for illustration the following:

Ram is a regular user of the library.    ram (Ram) pustakaalay (library) ko (of) baraabar (regularly) istemaal kartaa hai (uses)

Here the focus is on the word user, which is a noun, and has been used as an SC in the above English sentence. It provides the main verb istemaal karnaa (meaning to use) of the Hindi sentence. Its conjugation for the present indefinite tense is istemaal kartaa hai, when the subject is third person, singular and masculine. The adjective regular of the noun user is realized as the adverb baraabar.
3. Categorial sub-type 3: In the event of this divergence an adverbial PA of an English sentence is realized as the main verb of the Hindi sentence. Consider for illustration the following translation:

The fan is on.    paankhaa (fan) chal (move) rahaa (..ing) hai (is)

The main verb of the Hindi sentence is chalnaa, i.e. to move. Its sense comes from the adverbial PA on of the English sentence. The present continuous form of this verb is chal rahaa hai, when the subject is third person, singular and masculine. It may be noted that in Hindi grammar the neuter gender does not exist. Inanimate objects are treated as masculine or feminine, and this categorization follows some systematic rules, occasionally with exceptions (see Appendix A).
4. Categorial sub-type 4: This sub-type concerns predicative adjuncts that are realized in English as PPs, but in Hindi as the main verb. For example, one may consider the following pair:

The train is in motion.    railgaadii (train) chal (move) rahii (..ing) hai (is)

Here, the PA in motion is a prepositional phrase whose sense is realized by the verb chalnaa. One may notice that in the Hindi translation the auxiliary verb is rahii, in order to convey that the subject of the sentence is feminine and singular.

Our analysis of a large number of translation examples reveals the following:

- Categorial divergence occurs if the main verb of the English sentence is a declension of be, but the main verb of the Hindi translation is not the be verb, i.e. ho.

- We further notice that for categorial divergence to occur, the Hindi translation should not have any subjective complement or predicative adjunct.

- If the SPAC structure of the subjective complement (SC) is an AdjP or NP, then it is a case of categorial divergence of sub-type 1 or 2, respectively.

- Otherwise, if the SPAC structure of the predicative adjunct (PA) is an AdvP or PP, then it is categorial divergence of sub-type 3 or 4, respectively.


Step1. IF(root word of the main verb of E is not "be") THEN RETURN(0)
Step2. IF(root word of the main verb of H is "ho") THEN RETURN(0)
Step3. IF((the SC of H is not null) OR (the PA of H is not null)) THEN RETURN(0)
Step4. IF(the SPAC of the SC of E is AdjP) THEN RETURN(1)
Step5. IF(the SPAC of the SC of E is NP) THEN RETURN(2)
Step6. IF(the SPAC of the PA of E is AdvP) THEN RETURN(3)
Step7. IF(the SPAC of the PA of E is PP) THEN RETURN(4)

Figure 3.3: Algorithm for Identification of Categorial Divergence

The identification algorithm has been designed taking care of the above observations.
The algorithm returns 0 if the translation does not involve any categorial divergence.
Otherwise, depending upon the case it returns 1, 2, 3, or 4. Figure 3.3 provides the
schematic view of the proposed algorithm.

Illustration

Let E be the sentence She is in tears, and let its Hindi translation H be wah (she) ro rahii hai (is crying). Once the sentences are parsed and their SPACs obtained, the algorithm proceeds as follows.

In Step 1, the algorithm finds that the root form of the main verb of the English sentence is be, hence it proceeds to Step 2. In Step 2 the root form of the main verb of the Hindi sentence is determined. In this case it is ronaa (i.e. to cry), which is not the be verb. The algorithm therefore proceeds to Step 3, where it detects that the Hindi sentence has neither an SC nor a PA. Thus this is a case of categorial divergence.

The algorithm now checks the SPAC of the PA in tears, which is a prepositional phrase comprising a preposition and a noun. The algorithm therefore detects categorial divergence of sub-type 4. Figure 3.4 shows the correspondence of the SPACs of the English sentence and its translation.

Figure 3.4: Correspondence of SPACs for the Categorial Divergence Example of Sub-type 4
Other sub-types may be detected in a similar way.
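The checks of Figure 3.3 can likewise be sketched in Python. As before, the dictionary keys (main_verb_root, sc_spac, pa_spac) are illustrative assumptions standing in for the FT and SPAC structures of the parsed sentences.

```python
def identify_categorial(e, h):
    """Sketch of the categorial-divergence check (Figure 3.3).
    Returns 0, or the sub-type number 1-4."""
    # Steps 1-2: E must have a 'be' main verb, and H must not have 'ho'.
    if e["main_verb_root"] != "be":
        return 0
    if h["main_verb_root"] == "ho":
        return 0
    # Step 3: H must have neither an SC nor a PA.
    if h.get("sc_spac") is not None or h.get("pa_spac") is not None:
        return 0
    # Steps 4-7: the sub-type follows from the SPAC of the SC or PA of E.
    if e.get("sc_spac") == "AdjP":
        return 1
    if e.get("sc_spac") == "NP":
        return 2
    if e.get("pa_spac") == "AdvP":
        return 3
    if e.get("pa_spac") == "PP":
        return 4
    return 0

# "She is in tears." -> "wah ro rahii hai": the PA 'in tears' is a PP.
e = {"main_verb_root": "be", "sc_spac": None, "pa_spac": "PP"}
h = {"main_verb_root": "ro", "sc_spac": None, "pa_spac": None}
print(identify_categorial(e, h))  # -> 4
```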

3.3.3 Nominal Divergence

Nominal divergence is concerned with the subject of the English sentence. In the event of nominal divergence, upon translation the subject of the English sentence becomes the object or verb complement. In this respect this divergence is somewhat similar to the thematic divergence defined in (Dorr, 1993). However, in the case of thematic divergence the object of the source language sentence becomes the subject upon translation, whereas in the case of nominal divergence the subject of the Hindi translation is derived from the adjectival complement of the English sentence. Thus, characteristically, nominal divergence differs from thematic divergence.

The subject of the English sentence is realized in Hindi with the help of a prepositional phrase. In particular, with respect to nominal divergence the use of two postpositions, ko and se, can be observed, which are typically used for an object or an ablative case, respectively (Kachru, 1980). Hence the latter is called a verb complement.
In view of the above discussion, we define two sub-types of nominal divergence:

1. Nominal sub-type 1: Here the subject of the English sentence becomes the object upon translation. For illustration, the following example may be considered:

Ram is feeling hungry.    ram ko (to Ram) bhukh (hunger) lag (feel) rahii (..ing) hai (is)

Here, the adjective hungry is an SC. Its sense is realized in Hindi by the word bhukh, meaning hunger, which acts as the subject of the Hindi sentence. The subject Ram of the English sentence becomes the object ram ko of the Hindi translation. Sometimes such an object is also termed a dative subject (Kachru, 1980). However, because of the use of the postposition ko, we feel that calling it the object of the sentence is more appropriate.

2. Nominal sub-type 2: In this case the subject of the English sentence provides a verb complement (VC) in the Hindi translation. The following example illustrates this point:

This gutter smells foul.    iss (this) naale (gutter) se (from) badboo (bad smell) aatii hai (comes)

Note that the subject of the English sentence, This gutter, is realized as the modifier iss naale se of the verb aatii hai.


Step1. IF(the main verb root form of E is "be") THEN RETURN(0)
Step2. IF(the SC of E is null) THEN RETURN(0)
Step3. IF(the SPAC of the SC of E is not AdjP) THEN RETURN(0)
Step4. IF(the SC of H is not null) THEN RETURN(0)
Step5. IF(the object of H is not null) THEN RETURN(1)
Step6. IF(the VC of H is not null) THEN RETURN(2)

Figure 3.5: Algorithm for Identification of Nominal Divergence

The analysis of different examples of nominal divergence establishes the following points:

1. Nominal divergence cannot occur if the main verb of the English sentence is a declension of the be verb. This is because in that case the English sentence does not have an SC, which is essential for a nominal divergence to occur.

2. Otherwise, even if the root word of the main verb of the English sentence is not be, nominal divergence cannot occur if the English sentence does not have an SC.

3. Otherwise, if the SC of H is null and the object of H is not null, then it is an instance of nominal divergence of sub-type 1. If, in place of the object, a verb complement (VC) is present in H, then it is nominal divergence of sub-type 2.

The algorithm has been designed by taking care of the above observations. Figure
3.5 provides a schematic view of the proposed algorithm.

Illustration

Let E be the sentence I am feeling sleepy, and H be its translation mujhe (to me) niind (sleep) aa rahii hai (is coming). The root form of the main verb of E is not be. Therefore, the condition of Step 1 is not satisfied, and we proceed to check the further steps. Here, the SC of E is sleepy, which is an adjective (Adj). Hence Steps 2 and 3 do not apply. In Step 4 the SC of H is checked, and it is found to be null, so the condition of Step 4 is not satisfied either. In Step 5 the object of H is identified. This implies that the given example pair has nominal divergence of sub-type 1. Figure 3.6 gives the correspondence of the SPACs of the example discussed above.

Figure 3.6: Correspondence of SPAC E and SPAC H of Nominal Divergence of Sub-type 1
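Figure 3.5 can be sketched in the same style. The dict-based sentence representation remains an illustrative assumption; Steps 2 and 3 of the figure are merged into a single check on the SC's SPAC.

```python
def identify_nominal(e, h):
    """Sketch of the nominal-divergence check (Figure 3.5).
    Returns 0, or sub-type 1 (object) / 2 (verb complement)."""
    # Step 1: a 'be' main verb in E rules out nominal divergence.
    if e["main_verb_root"] == "be":
        return 0
    # Steps 2-3: E must have an SC, and its SPAC must be an AdjP.
    if e.get("sc_spac") != "AdjP":
        return 0
    # Step 4: H must not have an SC of its own.
    if h.get("sc_spac") is not None:
        return 0
    # Steps 5-6: the subject of E resurfaces as the object or VC of H.
    if h.get("object_spac") is not None:
        return 1
    if h.get("vc_spac") is not None:
        return 2
    return 0

# "I am feeling sleepy." -> "mujhe niind aa rahii hai" (sub-type 1)
e = {"main_verb_root": "feel", "sc_spac": "AdjP"}
h = {"sc_spac": None, "object_spac": "PP", "vc_spac": None}
print(identify_nominal(e, h))  # -> 1
```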

3.3.4 Pronominal Divergence

Pronominal divergence pertains to English sentences in which the pronoun it is used as the subject. The Hindi equivalent of it is wah or yah. Thus, typically, the Hindi translation of such a sentence should have one of these two words as its subject. For example, the following translations may be considered:

It is crying.    wah (it) ro (cry) rahaa (..ing) hai (is)
It is small.    yah (it) chhotaa (small) hai (is)


However, sentences of similar structure with an impersonal pronoun as subject are sometimes translated into Hindi in different ways. One may observe different variations here, depending upon which part of speech/FT of the English sentence becomes the subject upon translation. This observation helps in defining four different sub-types of pronominal divergence, which are illustrated below.
1. Pronominal sub-type 1: Here a subjective complement, which may be a noun with or without a qualifying adjective, becomes the subject of the Hindi translation. For illustration, consider the following sentences:

(a) It is morning.    subaha (morning) ho gayii hai (has become)

(b) It was a dark night.    ek (one) andherii (dark) raat (night) thii (was)

In example (a) the word morning, a noun, acts as an SC. Upon translation it provides the subject subaha of the Hindi sentence. In example (b) the SC is still a noun, but it is preceded by an adjective. Upon translation the whole noun phrase andherii raat becomes the subject of the corresponding Hindi sentence.
2. Pronominal sub-type 2: In this case, the adjectival complement of the subject it becomes the subject of the Hindi translation. For illustration:

It is very humid today.    aaj (today) bahut (very) umas (humidity) hai (is)

In this example, the adjectival complement humid and its adverb very of the English sentence are together realized with the help of the noun phrase bahut umas, which acts as the subject of the Hindi sentence. As a consequence, pronominal divergence occurs.
3. Pronominal sub-type 3: Under this sub-type of pronominal divergence the subject of the Hindi translation comes from the infinitive form of a verb. For illustration, one may consider the English sentence It is difficult to run in the Sun. The Hindi translation of the sentence is dhoop (sunshine) mein (in) daudhnaa (to run) kathin (difficult) hai (is). The subject of the Hindi translation is daudhnaa, which means to run. One may note that the adjunct in the Sun of the infinitive to run translates to dhoop mein, which becomes a post-modifier of the subject daudhnaa.

4. Pronominal sub-type 4: Here the subject of the Hindi translation is realized from the main verb of the source language sentence. Consider, for example, the following translation:

It is raining.    barsaat (rain) ho (be) rahii (..ing) hai (is)

The main verb to rain of the English sentence provides the subject barsaat of the Hindi translation. One may notice the difference between this translation and the translation of the sentence It is crying given earlier in this section, to appreciate the divergence.
Thus we find four different sub-types of the pronominal divergence, each having
its own characteristics. If the subject of the English sentence is not it, then the
possibility of pronominal divergence can be ruled out. Further, even if the English
sentence has it at the subject position, if the subject of the Hindi sentence is

3.3. Divergences and Their Identification in English to Hindi Translation

Step 1. IF (subject of E is not "It") THEN RETURN(0)
Step 2. IF (subject of H is "wah" or "yah") THEN RETURN(0)
Step 3. IF (root form of the main verb of E is "be") THEN
            IF (SC of E is null) THEN RETURN(0)
            ELSE
                IF (the SC of H is not null) THEN
                    IF (SPAC of the SC of E is NP) THEN RETURN(1)
                    IF (SPAC of the SC of E is AdjP) THEN RETURN(2)
                ELSEIF (E contains infinitive form of verb)
                    THEN RETURN(3)
                ELSE RETURN(0)
Step 4. IF (root form of the main verb of H is "ho")
            THEN RETURN(4)
        ELSE RETURN(0)

Figure 3.7: Algorithm for Identification of Pronominal Divergence

Figure 3.8: Correspondence of SPAC E and SPAC H of Pronominal Divergence of Sub-type 4



one of wah or yah, then too pronominal divergence cannot occur. Otherwise,
depending upon the SC or main verb of the English sentence the sub-type of the
pronominal divergence is identified. Figure 3.7 gives the corresponding algorithm.

Illustration
Consider the English sentence (E) It is raining, and its Hindi translation (H) barsaat ho rahii hai. The syntactic phrase annotated chunk (SPAC) structures of the
example pair, and their correspondences are given in Figure 3.8. Here the subject
of E is it and the subject of H is barsaat, not yah or wah, so the condition of step 2 is not satisfied. In step 3, the algorithm finds that the root form of the main verb of the English sentence is rain, which is not be. Therefore, the condition of step 3 is also not satisfied. Hence step 4 detects pronominal divergence of sub-type 4.

3.3.5 Demotional Divergence

The characteristic feature of demotional divergence is that here the role of the main
verb of the source language sentence is demoted upon translation. In case of European languages this implies that the main verb of the target language is realized
from the object of the source language, and the main verb of the source language
upon translation becomes the adverbial modifier. However, with respect to English
to Hindi translations a subtle variation may be noticed. We observed several examples where the main verb of the English sentence upon translation is demoted
to the subjective complement or predicative adjunct of the Hindi sentence, but not to an adverbial modifier (which we call an adjunct). Hence, in the event of demotional
divergence, the main verb of the Hindi translation is realized as a be verb. Thus


for English to Hindi translation, demotional divergence needs to be redefined accordingly. Depending upon how the roles of different constituent words change, four
different sub-types of demotional divergence may be obtained. The four sub-types
are defined as follows:
1. Demotional sub-type 1: This divergence occurs when the main verb and the
object of the English sentence are realized as predicative adjunct in the Hindi
sentence. However, the subject of the English sentence remains the subject
after translation to Hindi. For illustration, we consider the following example:
This dish feeds four people. yah pakvaan chaar logon ke liye hai
(this) (dish) (four) (people) (for) (is)

In this example the main verb feeds and the object four people of the
English sentence together give the predicative adjunct, which is the PP, chaar
logon ke liye (in English for four people) of the Hindi sentence. The subject
this dish remains the subject after translation.
2. Demotional sub-type 2: Unlike the above sub-type, here the main verb and its
complement (instead of the object) of the English sentence are realized as the
predicative adjunct of the Hindi sentence. The following example illustrates
this point:
This house belongs to a doctor. yah ghar ek daaktaar kaa hai
(this) (house) (one) (doctor) (of) (is)



In this example, belong to and a doctor are the main verb and its complement of the English sentence, respectively. They jointly provide the predicative
adjunct (daaktaar kaa) of the Hindi sentence.
3. Demotional sub-type 3 : Under this sub-type the main verb and the object
of the English sentence are realized as the adjectival SC and the adjectival
complementation by preposition (SC C), respectively, in the Hindi translation.
Here also, the subject of the English sentence remains the subject of the Hindi
sentence. The following example explains this sub-type:
These two sofas face each other. yeh do sofa ek dusre ke saamne hain
(these) (two) (sofa) (one) (of other) (opposite) (are)

In this example, the main verb of the English sentence face is realized as
the SC saamne in the Hindi sentence. Also, the object each other of the
English sentence becomes an SC C, i.e. ek dusre ke. Thus, this translation
belongs to demotional divergence of sub-type 3. The literal meaning of this
translation is These two sofas are opposite to each other.
4. Demotional sub-type 4 : Here also, the main verb of source is realized as SC
(adjective) of the target language. But the object and subject of the English
sentence become the subject of the translation and the post modifier of the
SC of the target language sentence, respectively. We illustrate this with the
following example:
This soup lacks salt. iss soop mein namak kam hai
(this) (soup) (in) (salt) (less) (is)


Step 1. IF (root word of the main verb of E is "be/have")
            THEN RETURN(0)
Step 2. IF (root word of the main verb of H is not "ho")
            THEN RETURN(0)
Step 3. IF ((the subject of E) EQUAL (the subject of H)) THEN
            IF (the PA of H is not null) THEN
                IF (the object of E is not null) THEN RETURN(1)
                ELSEIF (the VC of E is not null) THEN RETURN(2)
                ELSE RETURN(0)
            ELSE
                IF (the SC C of H is not null) THEN
                    IF (the object of E is not null) THEN RETURN(3)
                    ELSE RETURN(0)
Step 4. IF (the SC of H is null) THEN RETURN(0)
Step 5. IF (the SC C of H is null) THEN RETURN(0)
Step 6. IF (the object of E is not null) THEN RETURN(4)
        ELSE RETURN(0)

Figure 3.9: Algorithm for Identification of Demotional Divergence

In the above example, the main verb lack of the English sentence is realized as kam, the SC of the Hindi sentence. The object salt (namak) becomes the subject of the target language, and the sense of the soup is realized as soop mein, the post modifier of the SC. In particular, this is an adjective complementation, and is expressed through the said PP. The literal meaning of the translation is Salt is less in this soup.
Analysis of the translation examples that involve demotional divergence highlights the following points:
1. In all the instances of demotional divergence we find that the main verb of
the English sentence is different from be or have. Thus if the main verb
of an input sentence is either be or have, the possibility of demotional
divergence in its Hindi translation may be ruled out.


2. On the other hand, if the main verb of the Hindi translation is not the ho
verb (i.e. in English be), then demotional divergence cannot occur.
3. If the Hindi equivalent of the subject of the English sentence E is the same as the subject of the Hindi sentence H, then the occurrence of demotional divergence is
decided as follows. Since the English main verb is realized upon translation as
SC or PA, if the Hindi translation has no SC or PA, then here also demotional
divergence cannot occur.
4. Otherwise, depending upon whether the PA or SC is present in H, the method returns sub-type 1, 2, 3 or 4 accordingly, indicating occurrence of the
corresponding sub-type of demotional divergence.

Figure 3.9 provides a schematic view of the proposed algorithm.

Illustration 1.
Consider the English sentence (E) The soup lacks salt, and its Hindi translation
(H) soop mein namak kam hai. The SPACs of these sentences and their term correspondences are given in Figure 3.10.

Figure 3.10: Correspondence of SPAC E and SPAC H for Demotional Sub-type 4


Here the root form of the main verb of H is ho (i.e. be), and for E it is lack. Hence, the conditions of steps 1 and 2 are not satisfied, and therefore, computation proceeds to step 3. However, the condition of step 3 fails as the subjects of E and H are not the same. Steps 4 and 5 check that both SC and SC C are present in the Hindi sentence. Hence, step 6 is considered. Since the object of E is not null, the algorithm returns 4, indicating that the above sentence pair has a demotional divergence of sub-type 4.

Illustration 2.
Consider another example, where E is This dish feeds four people., and H is yah
pakvaan chaar logon ke liye hai . The SPACs of these two sentences and their
correspondences are given in Figure 3.11.

Figure 3.11: SPAC Correspondence for Demotional Divergence of Sub-type 1


The root forms of the main verb of E and H are feed and ho, respectively. Therefore, neither step 1 nor step 2 is satisfied. Further, the algorithm checks the other steps for determining the sub-type of demotional divergence. The subjects of E and H are the same, the PA is present in H, and the object of E is not null. This implies that the conditions of step 3 are satisfied. The algorithm, therefore,


returns 1, i.e. the demotional divergence of sub-type 1 is present.
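The checks of Figure 3.9 can likewise be sketched in Python. The field names are hypothetical, and the plain string comparison of subjects stands in for the dictionary-mediated matching used in the thesis.

```python
# Illustrative sketch (not the thesis's implementation) of the demotional
# divergence checks of Figure 3.9. Field names are hypothetical; subject
# equality here stands in for dictionary-mediated matching of E and H.

def demotional_subtype(e, h):
    """Return the demotional divergence sub-type (1-4), or 0 if none."""
    # Steps 1-2: E must not be a be/have sentence; H's verb must be "ho".
    if e["verb_root"] in ("be", "have"):
        return 0
    if h["verb_root"] != "ho":
        return 0
    # Step 3: the subject is preserved across the translation.
    if e["subject"] == h["subject"]:
        if h["pa"] is not None:              # predicative adjunct in H
            if e["obj"] is not None:
                return 1                     # verb + object -> PA
            if e["vc"] is not None:
                return 2                     # verb + complement -> PA
            return 0
        if h["sc_c"] is not None and e["obj"] is not None:
            return 3                         # verb -> SC, object -> SC C
        return 0
    # Steps 4-6: subjects differ; both SC and SC C must appear in H.
    if h["sc"] is None or h["sc_c"] is None:
        return 0
    return 4 if e["obj"] is not None else 0

# "This soup lacks salt" -> "iss soop mein namak kam hai" (sub-type 4).
e = {"verb_root": "lack", "subject": "this soup", "obj": "salt", "vc": None}
h = {"verb_root": "ho", "subject": "namak", "pa": None,
     "sc": "kam", "sc_c": "iss soop mein"}
print(demotional_subtype(e, h))  # 4
```

The two illustrations above correspond to the step-6 and step-3 paths of this sketch, respectively.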

3.3.6 Conflational Divergence

Conflational divergence pertains to the main verb of the source language sentence.
Typically, as characterized in (Dorr, 1993), conflational divergence occurs when some
new words are required to be incorporated in the target language sentence in order to convey the proper sense of the verb of the input. However, with respect to
English to Hindi translation we need to deviate from this definition because of the
following reason. Many English verbs do not have a single-word equivalent in Hindi.
In fact, a large number of English verbs are expressed in Hindi with the help of a
noun followed by a simple verb. Such a combination is called a Verb Part (Singh,
2003), where the verb used in the Verb Part is some basic verb such as honaa (to
become), karnaa (to do) etc. Some examples of Hindi Verb Parts are given below.

Begin - aarambh karnaa
Answer - uttar denaa
Fail - asafal honaa
Allow - aagyaa denaa
Ride - savaarii karnaa
Wonder - hairaan honaa

For illustration, consider the verb to begin. Its Hindi equivalent is aarambh
karnaa. In Hindi, aarambh is the abstract noun meaning the beginning;
whereas, karnaa means to do. Thus the verb is realized in Hindi as a combination of noun and verb. In a similar vein, the verbs denaa (meaning to give)
and honaa (meaning to become) are used as the basic verbs along with appropriate nouns to provide the meanings of the English verbs cited above. There are
also examples of Verb Parts involving other basic verbs, such as maarnaa.
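The noun-plus-basic-verb combinations above can be pictured as a small lookup table. The table layout and the `hindi_verb_part` function are illustrative assumptions, not the thesis's lexicon format.

```python
# Sketch of a Verb Part lexicon built from the examples above. The table
# layout and function are illustrative assumptions, not the thesis's format.

VERB_PARTS = {
    "begin":  ("aarambh",  "karnaa"),   # "do the beginning"
    "answer": ("uttar",    "denaa"),    # "give an answer"
    "fail":   ("asafal",   "honaa"),    # "become unsuccessful"
    "allow":  ("aagyaa",   "denaa"),    # "give permission"
    "ride":   ("savaarii", "karnaa"),   # "do a ride"
    "wonder": ("hairaan",  "honaa"),    # "become surprised"
}

def hindi_verb_part(english_verb):
    """Return the noun + basic-verb realization, or None if not listed."""
    pair = VERB_PARTS.get(english_verb.lower())
    return " ".join(pair) if pair else None

print(hindi_verb_part("Begin"))  # aarambh karnaa
```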


Thus, if Dorr's definition is adopted in English to Hindi translation, there will be a large number of instances of conflational divergence. Calling such a large set of
translation examples as divergence makes little sense. Hence we propose that the
event of introduction of a noun to convey the sense of a verb should not be called a
divergence for English to Hindi translation.
However, there are situations when the action, suggested by the main verb of
an English sentence, needs the help of a prepositional phrase or adverbial phrase to
convey the proper sense of the verb. These cases are encountered occasionally, and
therefore deviate from the normal Hindi verb structure. We call these variations divergences in English to Hindi translation. Below we provide two sub-types of
this divergence.

1. Conflational sub-type 1 : Divergence of this type occurs when the new words
are added as adjunct to the verb. Typically, this adjunct is realized as a
prepositional phrase. For illustration, consider the following English sentences
and their Hindi translations:
Ram stabbed John. ram ne john ko chaaku se maaraa
(Ram) (to John) (knife) (by) (hit)

The sense of the verb stab is conveyed through the introduction of the prepositional phrase chaaku se. There are cases when the adjunct appears in the
form of an adverbial phrase instead of a prepositional phrase.
Mary hurried to market. mary jaldi se bazaar gayii
(mary) (hurriedly) (market) (went)

To convey the proper sense of the verb hurry, the adverbial phrase jaldi
se is used along with the main verb jaanaa meaning to go. Note that


gayii is the past form (with feminine singular subject) of jaanaa.
Although the conflational verb normally adds the new lexical material in the adjunct, in English to Hindi translation we have found some examples in which it adds the material in the subject of the target language sentence. This we call sub-type 2 of conflational divergence.
2. Conflational sub-type 2 : Under this sub-type the new word added acts as
the subject of the Hindi translation, and the original subject of the English
sentence becomes the post modifier or possessive case of the subject of the
Hindi sentence.
Example 1. He resembles his mother.
uskii shakal uskii maa se miltii hai
(his) (face) (his) (mother) (with) (matches)

The literal meaning of the translation is: His face is similar to his mother. The subject of the Hindi sentence, viz. uskii shakal (meaning his face), is realized from the source language verb to resemble. Here uskii (his) is the possessive pronoun of the original subject (he) of the English sentence.

Example 2. This dish tastes good.
iss pakvaan kaa swaad acchaa hai
(this) (dish) (of) (taste) (good) (is)

In this example too, the subject of the Hindi sentence iss pakvaan kaa swaad
(the taste of this dish) is realized from the verb to taste.
Step 1. IF (root word of the main verb of E is "be/have")
            THEN RETURN(0)
Step 2. IF (# adjunct(s) of E < # adjunct(s) of H) THEN RETURN(1)
Step 3. S1 = number of nouns in the SPAC of the subject of E
        S2 = number of nouns in the SPAC of the subject of H
        IF (S1 < S2) THEN
            IF (((SPAC of the subject of E has "PRP")
                 AND (SPAC of the subject of H has "PRP$"))
                OR (SPAC of the subject of H has "POSS")
                OR (SPAC of the subject of H has "P"))
            THEN RETURN(2)
            ELSE RETURN(0)
        ELSE RETURN(0)

Figure 3.12: Algorithm for Identification of Conflational Divergence

Figure 3.12 provides a schematic representation of the proposed algorithm, keeping in view the following points.

1. If the English sentence E has a declension of the be/have verb at the main verb position, then conflational divergence cannot occur.
2. If H has more adjuncts than E, then it is the case of conflational divergence
sub-type 1.
3. If the number of nouns in the SPAC of the subject of E is less than the number
of nouns in the SPAC of the subject of H, and the SPAC of the subject of
H further contains a possessive personal pronoun (PRP$), or a possessive case (POSS), or a preposition (P), then conflational divergence of sub-type 2 occurs.

The algorithm returns 0 if the translation does not involve any conflational divergence. Otherwise, depending upon the case it returns 1 or 2.



Illustration 1.
Let E be the sentence I stabbed John, and let its translation H be main ne john ko chaaku se maaraa. The corresponding SPACs of both the sentences and the
term correspondences are given in Figure 3.13.

Figure 3.13: Correspondence of SPAC E and SPAC H for Conflational Divergence of Sub-type 1
In step 1, the algorithm finds that the root form of the main verb of the English sentence is stab, which is different from be/have; hence the condition of step 1 is not satisfied. The algorithm then checks the further steps for determining the sub-type of the conflational divergence.
In step 2, the algorithm finds that the numbers of adjuncts of E and H are 0 and 1, respectively, as E does not have any adjunct but H has one. This implies that the given translation pair has conflational divergence of sub-type 1.
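The checks of Figure 3.12 can be sketched in Python. The adjunct lists and SPAC tag lists are assumed input encodings, not the representation actually used in the thesis.

```python
# Illustrative sketch (not the thesis's implementation) of the conflational
# divergence checks of Figure 3.12. Adjunct lists and SPAC tag lists are
# assumed input encodings.

def conflational_subtype(e, h):
    """Return the conflational divergence sub-type (1-2), or 0 if none."""
    # Step 1: a be/have main verb in E rules conflational divergence out.
    if e["verb_root"] in ("be", "have"):
        return 0
    # Step 2: extra adjunct(s) on the Hindi side -> sub-type 1.
    if len(e["adjuncts"]) < len(h["adjuncts"]):
        return 1
    # Step 3: extra noun(s) in the Hindi subject, possessively marked -> 2.
    s1 = sum(1 for tag in e["subject_spac"] if tag.startswith("N"))
    s2 = sum(1 for tag in h["subject_spac"] if tag.startswith("N"))
    if s1 < s2:
        possessive = (("PRP" in e["subject_spac"] and "PRP$" in h["subject_spac"])
                      or "POSS" in h["subject_spac"]
                      or "P" in h["subject_spac"])
        if possessive:
            return 2
    return 0

# "I stabbed John" -> "main ne john ko chaaku se maaraa" (sub-type 1).
e = {"verb_root": "stab", "adjuncts": [], "subject_spac": ["PRP"]}
h = {"adjuncts": ["chaaku se"], "subject_spac": ["N"]}
print(conflational_subtype(e, h))  # 1
```

A resemble-type pair, whose Hindi subject gains a noun marked PRP$, would instead reach the step-3 branch and report sub-type 2.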

3.3.7 Possessional Divergence

Possessional divergence deals with English sentences in which a declension of the verb have is used as the main verb. An interesting feature of Hindi is that it has

no possessive verb, i.e. one equivalent to the have verb of English. The normal
translation pattern of English sentences with declensions of have as main verb is
illustrated below:
Ram has many enemies. ram ke bahut shatru hai
(ram's) (many) (enemies) (is)

Ram has a holiday today. ram kii aaj chhuttii hai
(ram's) (today) (holiday) (is)

Ram has an inkpot. ram ke paas davaat hai
(with ram) (inkpot) (is)

The above examples demonstrate that the normal translation pattern of these
sentences is as follows:

1. The main verb of the translated sentence is honaa which means to be.
2. The verb is used along with some genitive postpositions (viz. kaa, ke or kii ), or the locative postpositional phrase, viz. ke paas, to convey the meaning of possession (Kachru, 1980).
3. Which one of the three genitive postpositions will be used depends upon the number and gender of the object. It is kaa if the object is masculine singular, kii if the object is feminine singular, and ke for plural, both masculine and feminine.
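Rule 3 above amounts to a two-way decision that can be sketched as follows. The function name and the sg/pl and m/f argument encodings are hypothetical.

```python
# Sketch of rule 3 above: selecting the genitive postposition from the
# number and gender of the possessed object. The sg/pl and m/f encodings
# are illustrative assumptions.

def genitive_postposition(number, gender):
    """Return kaa, kii or ke for a possessed noun."""
    if number == "pl":
        return "ke"                               # plural, either gender
    return "kaa" if gender == "m" else "kii"      # masc. vs fem. singular

# "Ram has many enemies" -> ram ke bahut shatru hai (plural object)
print(genitive_postposition("pl", "m"))  # ke
# "Ram has a holiday today" -> ram kii aaj chhuttii hai (fem. sg. object)
print(genitive_postposition("sg", "f"))  # kii
```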

However, there are many examples where the translation structure deviates from
this normal pattern, giving rise to divergence. We call this the possessional



divergence. Depending upon how the roles of different FTs change, six different
sub-types are identified. These sub-types are explained below.
1. Possessional sub-type 1 : Here the roles of the subject and the object are
reversed upon translation. Thus this sub-type is akin to thematic divergence.
But in Hindi this pattern is observed only when the main verb of the English
sentence is have or its declensions. Hence we categorize this as possessional
divergence. For illustration, consider the following examples:
(a) He has a bad headache. use tez sirdard hai
(to him) (bad) (headache) (is)

(b) Ram has fever. ram ko bukhaar hai
(to ram) (fever) (is)

In sentence (a), he and a bad headache are the subject and the object,
respectively. In the Hindi translation the subject is tez sirdard , i.e. bad
headache, and the object of the Hindi sentence is use which is the accusative
case of he. Thus the roles of subject and object are reversed upon translation.
Similarly, in (b) upon translation the roles of the subject Ram and object
fever are reversed.
2. Possessional sub-type 2 : In this case the object and its premodifying adjective
in the English sentence are realized as the subject and SC, respectively, in the
Hindi sentence. The subject of the English sentence is realized as possessive
case of the subject of the target language sentence. The following example
illustrates this.


These birds have sweet voice. in chidiyon kii aawaaz miithii hain
(these) (birds) (voice) (sweet) (is)

The object voice and its premodifying adjective sweet of the English sentence are realized in Hindi as the subject aawaaz and its adjectival complement miithii . Note that the subject these birds of the English sentence is
realized as a possessive case (in chidiyon kii) in the Hindi translation.
3. Possessional sub-type 3 : Here, the object and its post modifier (normally, a
PP) in the English sentence are realized as the subject and the predicative
adjunct, respectively, in the Hindi translation. The subject of the English
sentence also contributes as the possessive case to the predicative adjunct. For
illustration, consider the following:
Boys have books in their satchels. ladkon ke baston mein kitaaben hain
(boys) (satchels) (in) (books) (are)

Ram has two rupees in his pocket. ram kii zeb mein do rupaye hain
(ram's) (pocket) (in) (two) (rupees) (are)

In the first example, the object (books) provides the subject (kitaaben) of
the Hindi translation. The post-modifier in their satchels of the object of the
English sentence is realized as a predicative adjunct ladkon ke baston mein
of the Hindi sentence. One may notice that the subject boys is present as
the possessive case in the predicative adjunct.


Similar transformation takes place in the second example. The object (two
rupees) and the post modifier of the object (in his pocket) are realized upon
translation as the subject (do rupaye) and the predicative adjunct (raam
kii zeb mein), respectively. Thus, the literal meaning of the Hindi sentence is Two rupees are in Ram's pocket.
4. Possessional sub-type 4 : In this case the subject of the Hindi translation is
derived from the object of the English sentence; and the subject of the English
sentence becomes a predicative adjunct upon translation. For illustration,
consider the following:
This city has a museum. iss shahar mein ek sangrahaalay hai
(this) (city) (in) (one) (museum) (is)

The subject of the Hindi sentence is ek sangrahaalay, which comes from the object of the English sentence. The subject this city translates to iss shahar, which becomes the predicative adjunct iss shahar mein in Hindi.
5. Possessional sub-type 5 : Here, the object, which is a noun with/without any
premodifier, becomes the main verb of the Hindi sentence. The premodifier
may be an adjective or a noun, which becomes an adjunct of the translated
sentence. Consider, for illustration, the following translations:


(a) Mary has regards for her uncle. mary apne chaachaa kii izzat kartii hai
(Mary) (her) (uncle) (of) (respect) (does)

(b) They had a narrow escape. woye baal baal bache the
(they) (marginally) (escaped)

In example (a) the main verb of the Hindi sentence (izzat kartii hai) is
realized from the object regards of the English one. Similarly, in example
(b), the object escape of the English sentence is realized as the main verb
(bache the) of the Hindi sentence. Further, the premodifying adjective of
the object (narrow) is realized as an adjunct (baal baal ) in the translated
sentence.
6. Possessional sub-type 6 : Here, the main verb of the translated sentence is not
ho. Moreover, this verb does not come from any of the functional tags of
the English sentence. Consider for example the following translations:
(a) Radha had a good time here. raadhaa ne yahaan acchaa samay bitaayaa
(radha) (here) (good) (time) (spent)

(b) Ram had heavy breakfast. ram ne bhaarii naashtaa kiyaa
(ram) (heavy) (breakfast) (did)

In example (a), the main verb of the Hindi sentence is bitaayaa, which is different from the verb ho and does not come from any FT of the English


sentence. The literal meaning of the Hindi translation of (a) is Radha spent a good time here. Similarly, in (b) the introduction of a new verb kiyaa (meaning did) may be noticed.
We have the following observations on the translation examples that involve
possessional divergence:
1. Possessional divergence cannot occur if the main verb of the English sentence
is not a declension of have.
2. Possessional divergence cannot occur if the subject of H has a postposition (ke paas, ke, kaa or kii ).
3. If the root form of the main verb of H is not ho, then divergence of sub-type 6 is identified if the object of H is present, and sub-type 5 if it is not.
4. If the root form of the main verb of H is ho, the object of H is not present, and the predicative adjunct is present in H, then the decision between divergence sub-types 3 and 4 will be taken on the basis of the postmodifier of the object of E.
5. To check the precondition of sub-type 1, one has to first find out the translations of the subject and the object of the English sentence E with the help of a bilingual dictionary. If these act as the object and subject of the Hindi
translation (i.e. their roles are reversed) then possessional divergence sub-type
1 occurs.
6. If the postmodifier of the object of E is not present, then it gives divergence of sub-type 4; otherwise, divergence of sub-type 3. Under sub-type 3, the subject

of E becomes a possessive case of the subject of H. This implies that if the subject of E is either a personal pronoun (PRP) or an NP, then the subject of E becomes a possessive personal pronoun (PRP$) or a possessive noun phrase, respectively. This can be identified if the SPAC of the subject contains one of POSS, PRP$ or P.
7. For sub-type 2 to occur the following three conditions are necessary. The root
form of the main verb of H should be ho, the SPAC of the object of E should contain an Adj (i.e. adjective), and the SC of H should not be null. When all the three conditions are met, possessional divergence of
sub-type 2 is identified.

We have designed our algorithm taking care of the above observations. Figure 3.14 provides a schematic view of the proposed algorithm. We illustrate the
algorithm with the help of the following examples.

Illustration 1.
Consider the English sentence (E) Suresh has fever. Its Hindi translation (H) is suresh ko (suresh) bukhaar (fever) hai (is). The SPACs of these sentences and their term correspondences are given in Figure 3.15.
The root forms of the main verb of E and H are have and ho, respectively. This implies that the conditions of steps 1 and 2 are not satisfied. In step 3, the algorithm checks the postposition condition on the subject of H, and finds that none of the relevant postpositions is present. In step 4, the algorithm finds that the subject of E and the object of H are suresh and suresh ko, respectively, which are translations of each other. Further, it finds that the object

Step 1. IF (root word of the main verb of E is not "have")
            THEN RETURN(0)
Step 2. IF (root word of the main verb of H is not "ho") THEN
            IF (the object of H is null) THEN RETURN(5)
            ELSE RETURN(6)
Step 3. IF ((postposition of the subject of H) EQUAL
            ("ke paas" OR "ke" OR "kaa" OR "kii"))
            THEN RETURN(0)
Step 4. IF (((the object of E) EQUAL (the subject of H))
            AND ((the subject of E) EQUAL (the object of H)))
            THEN RETURN(1)
Step 5. IF (the object of H is not null) THEN RETURN(0)
        {
        IF (the PA of H is not null) THEN
            IF (the post modifier of the object in E is not null) THEN
                IF (((SPAC of the subject of E has "PRP")
                     AND (SPAC of the subject of H has "PRP$"))
                    OR (SPAC of the subject of H has "POSS")
                    OR (SPAC of the subject of H has "P")) THEN RETURN(3)
            ELSE RETURN(4)
        ELSE
            IF (SPAC of the object of E has "Adj") THEN
                IF (the SC of H is not null) THEN
                    IF (((SPAC of the subject of E has "PRP")
                         AND (SPAC of the subject of H has "PRP$"))
                        OR (SPAC of the subject of H has "POSS")
                        OR (SPAC of the subject of H has "P")) THEN RETURN(2)
                    ELSE RETURN(0)
            ELSE RETURN(0)
        }

Figure 3.14: Algorithm for Identification of Possessional Divergence


Figure 3.15: Correspondence of SPAC E and SPAC H for Possessional Divergence of Sub-type 1
fever of E became the subject bukhaar of the Hindi sentence H. Therefore, step
4 returns 1 indicating the occurrence of possessional divergence of sub-type 1 in the
above translation.

Illustration 2.
Consider the English sentence (E) This city has a museum. Its Hindi translation (H) is iss (this) shahar (city) mein (in) ek (one) sangrahaalaya (museum) hai (is). The SPACs of these sentences and their term correspondences are given in Figure 3.16.
The root forms of the main verb of E and H are have and ho, respectively. Therefore, the algorithm arrives at step 3, where it finds that the subject of H does not have any postposition kaa, ke or kii. Hence the algorithm proceeds further. Since the conditions of step 4 are not met, the algorithm arrives at step 5. Here it finds that H has no object, but a PA (iss shahar mein) is present. Also, since there is no postmodifier of the object of E, the algorithm returns 4. Thus, the algorithm diagnoses possessional divergence of sub-type 4 in the above translation example.
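The checks of Figure 3.14 can be sketched in Python. Everything here is an illustrative assumption: the dict fields are hypothetical, and the tiny BILINGUAL table merely stands in for the bilingual dictionary through which the thesis matches constituents of E and H.

```python
# Illustrative sketch (not the thesis's implementation) of the possessional
# divergence checks of Figure 3.14. The dict fields and the tiny BILINGUAL
# table are hypothetical stand-ins for the thesis's bilingual dictionary.

BILINGUAL = {"a museum": "ek sangrahaalay", "fever": "bukhaar", "he": "use"}

def same(x, y):
    """True if x (English) and y (Hindi) are translations of each other."""
    return x is not None and y is not None and BILINGUAL.get(x) == y

def possessional_subtype(e, h):
    """Return the possessional divergence sub-type (1-6), or 0 if none."""
    if e["verb_root"] != "have":                       # step 1
        return 0
    if h["verb_root"] != "ho":                         # step 2
        return 5 if h["obj"] is None else 6
    if h["subject_postposition"] in ("ke paas", "ke", "kaa", "kii"):
        return 0                                       # step 3: normal pattern
    if same(e["obj"], h["subject"]) and same(e["subject"], h["obj"]):
        return 1                                       # step 4: roles reversed
    if h["obj"] is not None:                           # step 5
        return 0
    possessive = (("PRP" in e["subject_spac"] and "PRP$" in h["subject_spac"])
                  or "POSS" in h["subject_spac"] or "P" in h["subject_spac"])
    if h["pa"] is not None:
        if e["obj_postmodifier"] is not None:
            return 3 if possessive else 0
        return 4
    if "Adj" in e["obj_spac"] and h["sc"] is not None:
        return 2 if possessive else 0
    return 0

# "This city has a museum" -> "iss shahar mein ek sangrahaalay hai" (sub-type 4).
e = {"verb_root": "have", "subject": "this city", "obj": "a museum",
     "obj_postmodifier": None, "subject_spac": ["DT", "N"], "obj_spac": ["DT", "N"]}
h = {"verb_root": "ho", "subject": "ek sangrahaalay", "obj": None,
     "subject_postposition": None, "pa": "iss shahar mein", "sc": None,
     "subject_spac": ["N"]}
print(possessional_subtype(e, h))  # 4
```

On the Suresh has fever pair, the same sketch would stop at step 4 and report sub-type 1.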

Figure 3.16: Correspondence of SPAC E and SPAC H for Possessional Divergence of Sub-type 4

3.3.8 Some Critical Comments

In this chapter we have discussed the various types of divergences that have been
observed in English to Hindi translation. By analyzing the characteristics of various
examples, we have been able to identify different sub-types under each divergence
type. These observations helped us to design algorithms for their identification.
However, we still have some examples of divergence which do not fall under any of
the above-mentioned types. At the same time, we do not have a sufficient number of examples of these cases so as to classify them under some new type or sub-types.
Efficiency of the algorithms, however, is dependent on the availability of the following:
- Cleaned and aligned parallel corpus of both the source and the target languages.
- An on-line bilingual dictionary. For this work, we have used Shabdanjali, an English-Hindi on-line dictionary (http://www.iiit.net/ltrc/Dictionaries/Dict Frame.html).


- Appropriate parsers have to be designed for the source language and target language. The parsers should be able to provide the FT and SPAC information for both the languages. Note that, presently, no such parser is available for Hindi. For our experiments we have used manually annotated Hindi corpora.

3.4 Concluding Remarks

This chapter deals with the characterization and identification of different types
of divergence that may occur in English to Hindi translation. We observed that
identification of divergence can be made without going into the semantic details of
the two sentences. This can be achieved by comparing the Functional Tags (FT)
and Syntactic Phrase Annotated Chunks (SPAC) of the source language sentence
and its translation.
The work described here may be broadly classified into two parts:

1. Characterization of English to Hindi divergence. Divergence is essentially a language-dependent phenomenon. Depending upon the semantic and syntactic
properties of the source and target language the nature of divergence may
change. Although divergence has been studied in great detail for European
languages, not much has been done with respect to Indian languages in this
regard. This work describes in detail the various types (and sub-types) of
divergence that may occur in English to Hindi translation. The work also
identifies three new types of divergence that have hitherto not been found in translation between any other language pair.
2. Identification of divergence. This chapter makes a meticulous study of the


structural changes in the sentences that occur due to various types of divergence. Seven different types of divergence have been studied, and all of
them have a number of sub-types. The necessary preconditions in the English
sentence corresponding to each of these sub-types have been identified, and
consequent variations in the translated Hindi sentences have been observed.
These observations enabled us to form rules on the basis of the FTs and SPACs
of both the English sentences and their Hindi translations to identify the type
and sub-type of divergence, if any has occurred.

An obvious question that arises at this point is how an EBMT system is expected to handle divergences. In this regard our suggestions are as follows. Once divergences are identified, the focus of a system designer should be on the following:

- To split the system's example base into two parts: a normal and a divergence example base. The translation examples are to be put in the appropriate part of the example base.
- To design an appropriate retrieval policy, so that for a given input sentence, an EBMT system can heuristically judge whether its translation may involve any divergence, and retrieval may be made accordingly.
- To design appropriate adaptation strategies for modifying retrieved translation examples. Since translations having divergence do not follow any standard patterns, their adaptations may need specialized handling that may vary with the type/sub-type of divergence.

The following chapter discusses these issues in detail.


Chapter 4

A Corpus-Evidence Based Approach for Prior Determination of Divergence

4.1 Introduction

This chapter presents a corpus-evidence based scheme for deciding whether the translation of an English sentence into Hindi will involve divergence. Indeed, the occurrence of divergence poses a great hindrance to the efficient adaptation of retrieved sentences. A possible solution may lie in separating the example base (EB) into two parts, a Divergence EB and a Normal EB, so that given an input sentence, retrieval can be made from the appropriate part of the example base. However, this scheme can work successfully only if the EBMT system has the capability to judge from the input sentence itself whether its translation will involve any divergence. Making such a decision is not straightforward, since the occurrence of divergence does not follow any patterns or rules. In fact, a divergence may be induced by various factors, such as the structure of the input sentence, the semantics of its constituent words, etc. In this chapter we propose a corpus-evidence based approach to deal with this difficulty. Under this scheme, upon receiving an input sentence, the system looks into its example base to glean evidence both in support of and against any possible type of divergence that may occur in the translation of the input sentence. Based on this evidence the system decides whether the retrieval has to be made from the Normal EB or from the Divergence EB.
The algorithm proposed here works for the structural, categorial, conflational, demotional, pronominal and nominal types of divergence.¹ For convenience of presentation we denote them as d1, d2, d3, d4, d5 and d6, respectively. Barring structural divergence (d1), all of the other five types of divergence (i.e. d2, ..., d6) have further been classified into several sub-types depending upon the variations in the role of different functional tags upon translation to Hindi.

¹ Prior identification of possessional divergence has been kept out of the discussion here. This is because possessional divergence depends upon several factors, such as the subject, the object, and even the sense in which the verb have is used. Our work (Goyal et al., 2004) discusses these issues in detail.
In this chapter, we have identified the necessary FT-features that the source language (English) sentences should have in order that a particular type/sub-type of divergence may occur. This, however, does not mean that any sentence having those FT-features will necessarily produce a divergence upon translation. As a consequence, mere examination of the FTs of an input sentence cannot ascertain whether its translation will induce any divergence or not. Hence more evidences need to be considered.

This chapter describes all these evidences and how they are to be used for making an a priori decision regarding whether the input English sentence will involve any divergence upon translation to Hindi.

4.2 Corpus-Based Evidences and Their Use in Divergence Identification

The proposed scheme makes use of three different types of evidence to decide whether a given input sentence will have a normal translation, or whether it will involve one (or more) type(s) of divergence when translated into Hindi. These evidences are used in succession to obtain the overall evidence in support of divergence(s)/non-divergence in the translation of the input sentence. The three steps are explained below:

Step 1: Here the Functional Tags (FTs) of the constituent words of the input sentence are used to determine the divergence types that certainly cannot occur in the translation of that sentence. The output of this step is a set D of divergence types that may possibly occur in the translation of a given input sentence.

Step 2: Here the semantic similarities of the constituent word(s) of the input sentence with the constituent words of sentences in the divergence EB and the normal EB are determined. Depending on the occurrence of similar words in the divergence and/or normal EB, the scheme decides whether upon translation the input sentence may induce any divergence.
Step 3: Sometimes the above two steps may suggest more than one type of divergence. In such a situation the algorithm should consult its knowledge base to ascertain which combinations of divergence types are possible in the translation of a single sentence. A scrutiny of our example base, and an examination of the syntactic rules of Hindi grammar, suggest that only the following combinations of divergence are possible with respect to English to Hindi translation:

1. structural (d1) and conflational (d3)
2. conflational (d3) and demotional (d4)
3. categorial (d2) and pronominal (d5)

This knowledge is stored in a set CD := {{d1, d3}, {d3, d4}, {d2, d5}}. The possible combinations of divergence can be used as evidence to rule out any suggestions given by the earlier two steps that do not conform with the knowledge stored in the set CD described just above.

The following subsections elaborate the above steps.


4.2.1 Roles of Different Functional Tags

Analysis of the divergence examples suggests that for each divergence type to occur the underlying sentence needs to have some specific functional tags (FTs) and/or some specific attributes of these FTs. Together we call them the FT-features of a sentence. Considering all the divergences together, we found that ten different FT-features are particularly useful for identification of divergence. Table 4.1 provides a list of these features, which we label as f1, f2, ..., f10.
FT-feature   Property of feature
f1           Root form of the main verb is be
f2           To-infinitive form of a verb is present
f3           Root form of the main verb is not be/have
f4           Subject is present
f5           Object is present
f6           Subjective complement (SC) is present
f7           Subjective complement is adjective
f8           Subject of the sentence is it
f9           Verb complement (VC) is present and is a PP
f10          Predicative adjunct (PA) is present

Table 4.1: FT-features Instrumental for Creating Divergence


With respect to a particular type of divergence, an FT-feature may have one of the following three roles:

- Its presence in the input sentence is necessary for the corresponding divergence type to occur;
- It should necessarily be absent in the input sentence if the corresponding divergence is to occur;
- The FT-feature has no role in the occurrence of the corresponding divergence.

We denote the above three possibilities as P (present), A (absent), and X (don't care). Table 4.2 gives the roles of the 10 FT-features discussed above in the occurrence of the different types of divergence and their sub-types. We call this table the Relevance Table.
[Table 4.2: Relevance of FT-features in Different Divergence Types. Each row records, for one sub-type of a divergence type di, the role (P, A or X) of every FT-feature f1, ..., f10.]


Each row of the Relevance Table provides the necessary conditions on the FT-features of an input sentence in order that the corresponding divergence may occur. The advantage of this evidence is that it helps in quickly discarding those types of divergence that cannot occur in the translation of the given input sentence.

The information given in Table 4.2 may be used in the following way. Given an input sentence, the algorithm first extracts the values of the ten FT-features, fj, j = 1, 2, ..., 10, from the sentence. These values are then compared with the row entries of the Relevance Table. If the FT-features of the sentence conform with the entries of some particular row, then evidence is obtained towards the occurrence of that particular divergence for which this row corresponds to one of the sub-types. If a particular sentence has evidence supporting more than one divergence, then all these possible divergence types are to be considered in Step 2 of the algorithm. This set of possible divergence types for a given input is denoted as D.
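The row-matching described above can be sketched in code as follows. This is a minimal illustration: the two sample rows and all function names are ours, and the rows are placeholders rather than the actual Table 4.2 entries.

```python
# Each Relevance Table row maps FT-features f1..f10 to 'P' (must be
# present), 'A' (must be absent) or 'X' (don't care).  The two rows
# below are illustrative placeholders, not the actual Table 4.2 entries.
ROWS = {
    ("d6", "sub-type 1"): {"f1": "P", "f4": "P", "f6": "P", "f7": "P"},
    ("d3", "sub-type 1"): {"f3": "P", "f4": "P"},
}
ALL_FEATURES = ["f%d" % i for i in range(1, 11)]

def row_matches(row, present):
    """Check a sentence's FT-features against one Relevance Table row.
    Features not listed in the row default to 'X' (don't care)."""
    for f in ALL_FEATURES:
        role = row.get(f, "X")
        if role == "P" and f not in present:
            return False
        if role == "A" and f in present:
            return False
    return True

def possible_divergences(present):
    """Set D: divergence types with at least one matching sub-type row."""
    return {d for (d, _), row in ROWS.items() if row_matches(row, present)}

# "Ram is friendly to me." has FT-features f1, f4, f6 and f7.
print(possible_divergences({"f1", "f4", "f6", "f7"}))  # -> {'d6'}
```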
For illustration, consider the following input sentence: Ram is friendly to me. As the sentence is parsed (with some unnecessary components edited) one may get the following:

@SUBJ <Proper> N SG Ram, @+FMAINV V PRES be, @PCOMPL-S A ABS friendly, @ADVL PREP to, @<P PRON PERS SG1 i <$.>
The notations used here are from ENGCG parser and explained in Appendix B.
We can summarize the parsed version as follows. Of the ten FT-features discussed
above (see Table 4.1), only four are present in the above sentence. These are:

- f1, because the main verb of the sentence is be;
- f4, since the sentence has a subject, viz. Ram;
- f6, as an SC (friendly) is present in the sentence;
- f7, since the SC of this sentence is an adjective.
Thus in the Hindi translation of this sentence only those divergence sub-types can occur for which the entries corresponding to FT-features f1, f4, f6, and f7 are either P or X. For the other FT-features the entries have to be either A or X. This algorithm assumes that the occurrence of a particular divergence type is possible only if at least one of its sub-types satisfies the above conditions. Thus for the above input sentence the possible divergences are:

- Categorial (d2), since sub-types 1 and 2 conform with the above requirements.
- Nominal (d6), since sub-type 1 satisfies the above requirements.

Also note that sub-type 1 of d5 has values either P or X for the FT-features f1, f4, f6, and f7. But divergence d5 cannot occur in this case, as the sub-type has an extra requirement that FT-feature f8 should also be present, which is not true for this sentence. Therefore, the output of this step is the set D = {d2, d6}.
It, however, should be noted that the FT-features specified in the Relevance
Table do not provide conclusive evidence towards the presence of some particular
divergence type. For example, consider the following two sentences.
Example (A):
She is in trouble. → wah (she) musiibat (trouble) mein (in) hai (is)
She is in tears. → wah (she) ro (cry) rahii (-ing) hai (is)
Since both the sentences given in Example (A) have the same FT-features, i.e.
f1 , f4 and f10 , the Relevance Table gives evidence supporting categorial divergence
d2 (check the rows for sub-types 3 and 4) for both the sentences. But of the two
sentences the translation of the first one is a normal one. It is only the second
sentence that involves categorial divergence upon translation to Hindi. Thus, to
determine the possible divergence type(s) in a sentence, only the FT-features cannot
be taken as the sole evidence, and more evidences need to be sought.
From the above example, it can be surmised that it is the prepositional phrase in tears that is instrumental in causing the categorial divergence in the second sentence. In general, corresponding to each divergence type one can associate some functional tags that are instrumental in causing the divergence. We call this the Problematic FT of the corresponding divergence type. Table 4.3 provides the Problematic FT corresponding to all six divergence types relevant in the context of English to Hindi translation. This table has been obtained by examining the sentences in our example base.

Divergence Type   Problematic FT
Structural        Main Verb
Categorial        Subjective Complement (SC: adjective, noun) or Predicative Adjunct (adverb, PP)
Conflational      Main Verb
Demotional        Main Verb
Pronominal        Main Verb or Subjective Complement (adjective, noun)
Nominal           SC (adjective)

Table 4.3: FT of the Problematic Words for Each Divergence Type

Table 4.3 is to be used in the following way. If the FT-features of a given input conform with the requirements of a particular divergence type (as given in the Relevance Table), then the corresponding problematic FT in the sentence needs to be examined more carefully. Since both the sentences of Example (A) have the structures required for categorial divergence, Table 4.3 suggests that to gather more evidence the scheme should concentrate on the SC or PA of the sentences.
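The mapping of Table 4.3 can be kept as a small lookup structure. The sketch below is ours: the tag names and the flat tag-to-word representation of a parsed sentence are simplifications for illustration.

```python
# Problematic FT per divergence type, following Table 4.3.
PROBLEMATIC_FT = {
    "structural":   ["main_verb"],
    "categorial":   ["subjective_complement", "predicative_adjunct"],
    "conflational": ["main_verb"],
    "demotional":   ["main_verb"],
    "pronominal":   ["main_verb", "subjective_complement"],
    "nominal":      ["subjective_complement"],
}

def problematic_word(divergence, parsed):
    """Pick the word(s) to examine for a given divergence type from a
    parsed sentence represented as a tag -> word mapping."""
    return [parsed[ft] for ft in PROBLEMATIC_FT[divergence] if ft in parsed]

# For "She is in tears." the PA head is "tears".
parsed = {"subject": "she", "main_verb": "be", "predicative_adjunct": "tears"}
print(problematic_word("categorial", parsed))  # -> ['tears']
```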
In this respect one major difficulty is that a particular word may convey different senses in different contexts, even under the same FT. For example, consider the two sentences and their Hindi translations given in Example (B) below:
Example (B):
Mohan beat the drum in the school.
Mohan (Mohan) ne widyaalay (school) mein (in) drum (drum) bajaayaa (beat)

Agassi beat Becker in the final.
Agassi (Agassi) ne final (final) mein (in) Becker (Becker) ko haraayaa (beat)

Here, the first one is an example of normal translation, while the second one is a case of structural divergence because of the introduction of the postposition ko in the object of the Hindi sentence. A careful examination suggests that although the main verb of both the sentences is beat, its translation causes divergence when used in a particular sense, but not when used in some other sense. By referring to WordNet 2.0² one may find that the first sentence has the 6th sense of the word beat, which is "to make a rhythmic sound"; while the second sentence has the 1st sense of the word beat, which is "to come out better in a competition, race, or conflict". Therefore, while dealing with words one needs to pay attention to the particular sense in which a word is being used: in some senses it may cause divergence, and in some other senses it may not induce any divergence at all.

² http://www.cogsci.princeton.edu/cgi-bin/webwn
Since an exhaustive list of words (along with their relevant senses) that lead to divergence is impossible to make, the proposed algorithm tries to gather more evidence by using the semantic similarity of the constituent words to the word senses that are already known to cause divergence, or known to deliver a normal translation. To achieve this, two dictionaries have been created: the Problematic Sense Dictionary (PSD) and the Normal Sense Dictionary (NSD). The PSD contains the words, along with their senses, that have been found to cause divergence. Similarly, the NSD contains the words, along with their senses, for which normal translation has been observed.
Divergence type (di)    No. of words in PSDi    No. of words in NSDi
Structural   (d1)       163                     1078
Categorial   (d2)       57                      167
Conflational (d3)       43                      997
Demotional   (d4)       66                      1422
Pronominal   (d5)       75                      170
Nominal      (d6)       12                      97
Total                   416                     3931

Table 4.4: Frequency of Words in Different Sections

These dictionaries are further grouped into six sections, one section corresponding to each divergence type. Section PSDi contains the problematic words occurring in sentences whose translations involve divergence of type di. Similarly, section NSDi contains the problematic words of sentences having the FT-features required for divergence type di (as specified in the Relevance Table), but actually having a normal translation. Table 4.4 gives the number of words currently present in each section of the PSD and the NSD for our example base.
PSD1           NSD1           PSD2            NSD2
Attend#v#1     Beat#v#6       Afraid#a#1      Brave#a#1
Beat#v#1       Do#v#13        Friendly#a#4    Good#a#1
Love#v#3       Eat#v#4        On#r#2          Illusion#n#2
Marry#v#1      Purchase#v#1   Pain#n#1        Monitor#n#2
Occupy#v#4     See#v#1        Tear#n#1        Trouble#n#1
...            ...            ...             ...

PSD3           NSD3           PSD4            NSD4
Face#v#3       Agree#v#4      Belong#v#1      Continue#v#9
Look#v#5       Feel#v#4       Face#v#3        Ride#v#9
Resemble#v#1   Go#v#10        Front#v#1       Sell#v#2
Rush#v#4       Look#v#3       Smell#v#2       Solve#v#1
Stab#v#1       Solve#v#1      Suffice#v#1     Walk#v#6
...            ...            ...             ...

PSD5           NSD5           PSD6            NSD6
Freeze#v#6     Bright#a#10    Cold#a#1        Dull#a#4
Humid#a#1      Light#a#1      Hot#a#1         Good#a#1
Morning#n#3    Plain#a#2      Hungry#a#1      Happy#a#2
Rain#v#1       Shiny#a#3      Sleepy#a#1      Helpful#a#1
Winter#n#1     Wrong#a#1      Thirsty#a#2     Innocent#a#4
...            ...            ...             ...

Table 4.5: PSD/NSD Schematic Representations

Each PSD/NSD entry contains, along with the relevant word, its part of speech and the appropriate sense number (as given by WordNet 2.0). Table 4.5 shows some entries corresponding to each PSDi and NSDi, i = 1, 2, ..., 6. The entries are stored in the format word#pos#k, where pos stands for the particular part of speech, which can be one of n, v, a or r (corresponding to noun, verb, adjective and adverb, respectively), and k stands for the sense number.

For illustration, consider the two sentences given in Example (A). Both of them
have the structure required for categorial divergence, i.e. d2 . Problematic FT for
this divergence type is the predicative adjunct (PA), which is a prepositional phrase.
Hence, in PSD2 and NSD2 we store tears#n#1 and trouble#n#1, respectively. Similarly, corresponding to Example (B) where the relevant divergence is structural, i.e.
d1 , the entries in PSD1 and NSD1 are beat#v#1 and beat#v#6, respectively.
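The word#pos#k format is straightforward to manipulate programmatically; a small sketch (the function name is ours, the sample entries are from the text):

```python
def parse_entry(entry):
    """Split a dictionary entry of the form word#pos#k into its parts."""
    word, pos, k = entry.split("#")
    return word, pos, int(k)

# Entries taken from Examples (A) and (B) in the text.
PSD1 = {"beat#v#1"}
NSD1 = {"beat#v#6"}

print(parse_entry("beat#v#1"))   # -> ('beat', 'v', 1)
print("beat#v#1" in PSD1)        # -> True
```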
In order to ascertain whether a given input sentence may involve a divergence di, the proposed scheme proceeds as follows. It first identifies the problematic word ai of the sentence corresponding to the divergence di. The evidence is collected on the basis of four parameters, viz. sim(ai, wi), s(di), sim(ai, wi′) and s(ni), as described below:

1. sim(ai, wi) gives the maximum similarity score between ai and the words in PSDi, where sim(x, y) denotes the semantic similarity between two words x and y (see Appendix C).
2. The quantity s(di), corresponding to divergence type di, is defined as follows:

      s(di) = 0                           if xi = 0;
      s(di) = (1/2)(xi/ci + ci/S)         otherwise.        ...(1)

   where ci, xi and S are as follows:


   (a) ci is the total number of entries in PSDi (given in Table 4.4);
   (b) xi is the number of words in PSDi that are semantically similar to ai;
   (c) S is the total number of words in the PSD. Note that currently the total number of words in the PSD is 416 (see Table 4.4). This number will increase as more divergence examples are obtained, and the corresponding problematic words are added to the dictionary.

3. The quantity sim(ai, wi′) is similar to sim(ai, wi). While computing sim(ai, wi′), the scheme uses NSDi and the NSD instead of PSDi and the PSD.

4. The quantity s(ni) is similar to s(di), and is calculated using NSDi and the NSD. The value used for S here is the cardinality of the NSD, which is at present 3931 (see Table 4.4).

These four quantities are used to determine the possibility of occurrence of divergence di in the translation of the given input sentence.
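Formula (1) translates directly into code. In the sketch below the function name is ours; the counts in the usage line come from Table 4.4.

```python
def s_score(x_i, c_i, S):
    """Evidence score of formula (1): s(di) = (1/2)(xi/ci + ci/S),
    and 0 when no semantically similar word exists (xi = 0)."""
    if x_i == 0:
        return 0.0
    return 0.5 * (x_i / c_i + c_i / S)

# e.g. 3 similar words among the c2 = 57 entries of PSD2, with S = 416:
print(round(s_score(3, 57, 416), 4))  # -> 0.0948
```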

4.3 The Proposed Approach

In order to determine whether a given input sentence, say e, may involve some divergence upon translation, the evidences mentioned in the previous section are used in the following way. First the input sentence e is parsed, and then using the Relevance Table a set D is determined that contains the divergence types that may possibly occur in the translation of e. For each possible divergence type di ∈ D the problematic word ai is extracted from the sentence e. From PSDi, the word wi that is semantically most similar to ai is retrieved. The subsequent steps depend upon the value of sim(ai, wi). If the value is 1, then ai is present in PSDi. On the other hand, a small value of sim(ai, wi) implies that there is not enough evidence in support of divergence di; hence it may be concluded that divergence di will not occur in the translation of e. Note that whether the value of sim(ai, wi) is sufficiently small is determined by comparing it with a threshold t, which is to be determined experimentally from the corpus. If the value of sim(ai, wi) lies between t and 1, then some evidence in support of divergence di is obtained. In order to reach a conclusion from this point, the algorithm refers to NSDi to obtain the word wi′ that is semantically most similar to ai. Depending upon the values of sim(ai, wi) and sim(ai, wi′), a decision is taken regarding whether the translation of e will involve divergence di or not. Based on this decision, the retrieval is to be made from the appropriate part of the example base, i.e. the Divergence EB or the Normal EB.
The overall scheme, which involves four major steps, is explained below.

Step 1: At this stage, the input sentence e is parsed, and its FT-features are obtained. From these FT-features, using Table 4.2, the set D of possible divergence types is determined.

The main objective now is to determine the divergence types, out of all the di ∈ D, for which there is positive evidence of occurrence in the translation of e. Steps 2 and 3 are designed for this purpose. A set of flags, Flagi, corresponding to each di ∈ D, is used to store this information. Initially each of these flags is set to -1. Step 2 and Step 3 are now carried out for each di ∈ D in order to reassign the value of Flagi. At each iteration the next di with the minimum index i is chosen such that Flagi is -1.
Step 2: From the input sentence e, the problematic word ai corresponding to divergence di (see Section 4.2) is determined. The set Wi, comprising the words belonging to PSDi that have a positive semantic similarity score with ai, is determined. Thus Wi = {b : b ∈ PSDi and sim(ai, b) > 0}. From Wi the word wi is obtained such that sim(ai, wi) = max sim(ai, b) ∀ b ∈ Wi. If Wi is empty then sim(ai, wi) is considered to be 0. Depending on the similarity score sim(ai, wi), a decision is taken regarding di, as follows.
Case 2a: If sim(ai, wi) = 1, then set Flagi = 1. This is because the condition implies that the word ai is present in PSDi; hence this sentence will certainly have divergence di upon translation.

Case 2b: This case occurs when ai ∉ PSD. But if ai is a noun or verb, and further ai is a coordinate term of wi (i.e., in WordNet terminology, ai and wi have the same hypernym), then it can be decided that ai will not create divergence of type di upon translation. This is because all those coordinate terms of wi that may cause divergence are already stored in the PSD. Therefore Flagi is set to 0.
Case 2c: If sim(ai , wi ) < t, where t is some pre-defined threshold, then too it
may be decided that ai will not cause divergence di . Consequently, Flagi is set to 0.
The main difficulty here is to decide upon the right value for the threshold t. After
a sequence of experiments with different values for t, we found that the best results
are obtained for t = 0.5. However, since this value is corpus dependent, for other
corpora the value of t should be determined experimentally.
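Cases 2a-2c can be summarized as a small decision function. This is a sketch: the function and parameter names are ours, and t = 0.5 is the threshold chosen above.

```python
def step2_decision(sim_max, in_psd, is_coordinate_term, t=0.5):
    """Cases 2a-2c of Step 2.  Returns 1 (divergence certain),
    0 (divergence ruled out) or None (proceed to Step 3)."""
    if in_psd:                  # Case 2a: sim(ai, wi) = 1, ai is in PSDi
        return 1
    if is_coordinate_term:      # Case 2b: same hypernym as wi, yet not in PSD
        return 0
    if sim_max < t:             # Case 2c: too dissimilar to any PSDi entry
        return 0
    return None                 # t <= sim < 1: consult the NSD in Step 3

print(step2_decision(1.0, True, False))    # -> 1
print(step2_decision(0.3, False, False))   # -> 0
print(step2_decision(0.7, False, False))   # -> None
```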
Since in all three cases above the scheme arrives at a decision regarding the divergence type di, the computation may skip Step 3 and go to Step 4 directly. But there may be cases when the similarity score sim(ai, wi) lies between t and 1. In these cases, as mentioned above, the NSD has to be referred to. Hence Step 3 is executed.
Step 3: Here, first the set Wi′ = {b : b ∈ NSDi and sim(ai, b) > 0} is computed. From this set the word wi′ is picked such that sim(ai, wi′) = max sim(ai, b) ∀ b ∈ Wi′. If Wi′ is empty then sim(ai, wi′) is considered to be 0. Depending on sim(ai, wi′), one of the following cases is executed.
Case 3a: If sim(ai, wi′) = 0, then there is no evidence that the word will lead to a normal translation. Consequently, Flagi is set to 1, indicating that divergence di has a positive chance of occurring.

Case 3b: If sim(ai, wi′) = 1, then the evidence suggests that the word ai should give the sentence a normal translation, and there is no possibility of divergence di occurring in the translation of this sentence. Consequently, Flagi is set to 0.

Case 3c: Decision making becomes most difficult when 0 < sim(ai, wi′) < 1. This implies that words sufficiently similar to ai exist neither in the PSD nor in the NSD. Thus, no decision about divergence/non-divergence can be taken yet.
In this case the scheme proposes to look into how many words similar to ai are available in PSDi and NSDi. This evidence is given by the scores s(di) and s(ni), computed using formula (1) (given in Section 4.2). Finally, the similarity scores sim(ai, wi) and sim(ai, wi′) are combined with s(di) and s(ni), respectively, to take into consideration the importance of both the evidences. If the evidence supporting divergence di is greater, then the value of Flagi is set to 1; otherwise it is set to 0. Thus, in this case, the following computations are performed:

- Compute s(di) and s(ni).
- Determine m(di) := (1/2)(s(di) + sim(ai, wi)), and m(ni) := (1/2)(s(ni) + sim(ai, wi′)).
- If m(di) > m(ni), then set Flagi = 1 and go to Step 4;
  else if m(di) < m(ni), set Flagi = 0 and go to Step 4;
  else set Flagi = 1/2 and break.
The last case refers to the rare situation when m(di) and m(ni) are equal. In this case, the algorithm cannot recommend whether the translation will involve divergence di, or whether it will be normal. In such a situation the system can at best pick the most similar examples from both the normal EB and the divergence EB, and leave the final decision to the user. Therefore, in such cases, Flagi is set to 1/2.
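Cases 3a-3c, including the combined scores m(di) and m(ni), can be sketched as follows (the function and parameter names are ours):

```python
def step3_decision(sim_psd, sim_nsd, s_d, s_n):
    """Cases 3a-3c of Step 3, returning the value assigned to Flagi."""
    if sim_nsd == 0:            # Case 3a: no evidence of a normal translation
        return 1
    if sim_nsd == 1:            # Case 3b: word known to translate normally
        return 0
    # Case 3c: combine similarity evidence with corpus-frequency evidence.
    m_d = 0.5 * (s_d + sim_psd)
    m_n = 0.5 * (s_n + sim_nsd)
    if m_d > m_n:
        return 1
    if m_d < m_n:
        return 0
    return 0.5                  # tie: leave the final choice to the user

print(step3_decision(0.7, 0.6, 0.2, 0.1))  # m(di)=0.45 > m(ni)=0.35 -> 1
```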
Once the evidence supporting/against each divergence type di ∈ D is obtained, that is, the value of Flagi ∀ di ∈ D is determined, Step 4 is performed to make a final decision regarding the possible divergence types in the translation of the given input e. Here it should be noted that Flagi = 0 implies that e cannot have divergence di, while Flagi = 1 implies that upon translation e may have divergence di. A set D′ is constituted, such that D′ = {di ∈ D : Flagi = 1}, i.e. D′ stores all those di for which positive evidence is obtained.
Step 4: The final decision is computed in the following way.

Case 4a: If D′ = ∅, then sufficient evidence has not been obtained for any of the divergence types. Hence, the decision is that the translation of the input sentence e will not involve any divergence.

Case 4b: If |D′| = 1, i.e. D′ = {dk}, then evidence has been obtained in support of just one divergence type dk. The algorithm therefore decides that the translation of the input sentence will have divergence dk.


Case 4c: If |D′| > 1, there is a possibility of more than one type of divergence. The algorithm therefore seeks further evidence before making a decision. The evidence provided by CD (Section 4.2) may be used here. A set C = {{di, dj} ∈ CD | di, dj ∈ D′} is constructed. Depending upon |C|, a further decision is taken in the following way.

- If |C| = 0, no permissible combination has been found. In this case, the algorithm computes s(di) and m(di) ∀ di ∈ D′ as in Case 3c. The algorithm concludes that the translation of the input sentence will have divergence dk, where k is such that m(dk) = max over di ∈ D′ of m(di).
- If |C| = 1, there is evidence for only one permissible combination, say {dk, dl}. The algorithm suggests that the input sentence e will involve both divergences dk and dl upon translation to Hindi.
- If |C| > 1, that is, if evidence is obtained in support of more than one permissible combination of divergences, then the scheme needs to select the most likely of them. It therefore determines the quantity (1/2)(m(di) + m(dj)) for all combinations {di, dj} ∈ C. The scheme recommends the combination of divergences for which this quantity is maximum.
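Step 4 can be sketched end to end as follows. This is a minimal illustration: the function names are ours, while the set CD is taken from Section 4.2.

```python
# Permissible divergence combinations (the set CD of Section 4.2).
CD = {frozenset({"d1", "d3"}), frozenset({"d3", "d4"}), frozenset({"d2", "d5"})}

def step4_decision(d_prime, m):
    """Cases 4a-4c of Step 4.  d_prime is the set of divergence types
    with positive evidence; m maps each type to its score m(di)."""
    if not d_prime:                       # Case 4a: normal translation
        return []
    if len(d_prime) == 1:                 # Case 4b: a single divergence
        return sorted(d_prime)
    # Case 4c: more than one candidate -- consult CD.
    combos = [c for c in CD if c <= d_prime]
    if not combos:                        # no permissible pair: best single
        return [max(d_prime, key=lambda d: m[d])]
    # pick the permissible combination with the highest average score
    best = max(combos, key=lambda c: sum(m[d] for d in c) / 2)
    return sorted(best)

print(step4_decision({"d2", "d5", "d6"}, {"d2": 0.8, "d5": 0.6, "d6": 0.7}))
# -> ['d2', 'd5']
```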

The flowchart of the proposed scheme is given in Figures 4.1 and 4.2.

Figure 4.1: Schematic Diagram of the Proposed Algorithm

Figure 4.2: Continuation of Figure 4.1

4.4 Illustrations and Experimental Results

In this section we first illustrate with examples how the above algorithm works towards prior identification of divergence, if any, in translation from English to Hindi. The examples considered are of increasing difficulty. Later, in Subsection 4.4.4, a consolidated result of several experiments is presented, and certain limitations of the algorithm are discussed.

4.4.1 Illustration 1

Consider the input sentence: I am feeling hungry.

The parsed version of the above sentence is: @SUBJ PRON PERS SG1 i, @+FAUXV V PRES be, @-FMAINV V PCP1 feel, @PCOMPL-S A ABS hungry <$.>.

Of the ten FT-features (see Table 4.1) only four are present in the above sentence. These are:

- f3, since the main verb (feel) of the sentence is not be or have;
- f4, as the sentence has a subject, viz. I;
- f6, because the sentence has an SC;
- f7, since the SC of this sentence is an adjective (hungry).

Note that the FT-features of the given input sentence conform with both the sub-types of d3 and only sub-type 2 of d6 (see Table 4.2). Hence the set D of possible divergence types is obtained as D = {d3, d6}, i.e. the conflational and nominal types of divergence, respectively. Therefore, evidence needs to be collected for both of the divergence types.

Evidences for conflational divergence (d3):

Table 4.3 suggests that the problematic word for d3 is the main verb, i.e. feel. WordNet 2.0 provides thirteen different senses for the word feel when used as a verb, such as:

- sense 1: feel, experience -- undergo an emotional sensation
- sense 2: find, feel -- come to believe on the basis of emotion, intuition, or indefinite grounds
- sense 3: feel, sense -- perceive by a physical sensation, e.g., coming from the skin or muscles

For the given input sentence the appropriate sense is sense 1. Thus a3 is feel#v#1. A scrutiny of PSD3 reveals that it contains no word w such that sim(w, a3) > 0. Thus W3 = ∅, and therefore Flag3 is set to 0.

Evidences for nominal divergence (d6):

The Problematic FT for d6 is the subjective complement (adjective). Hence the problematic word of the input sentence is hungry. WordNet 2.0 provides two senses for hungry, of which the first one, "feeling hunger", is appropriate in this case. Thus the problematic word a6 is hungry#a#1. PSD6 is then scrutinized to find the word semantically most similar to a6. It is found that PSD6 already contains hungry#a#1. Therefore w6 is the same as a6, and hence the similarity score is 1. Thus Flag6 is set to 1.
156

A Corpus-Evidence Based Approach


Now the set D′ is constructed as D′ = {di ∈ D : Flagi = 1}. Evidently, for the given input sentence D′ contains the single element d6. Thus the algorithm suggests that the above input sentence will cause nominal divergence upon translation into Hindi, which is a correct decision.
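The detection flow of this illustration can be sketched as follows. This is only an illustrative reconstruction: the relevance-table fragment and the function names are hypothetical, and the evidence-collection step that produces the flags (the PSD/NSD look-ups) is abstracted into a callback.

```python
# Illustrative sketch of the detection flow (names and the relevance-table
# fragment are hypothetical, not taken from the thesis implementation).

RELEVANCE = {
    # FT-feature pattern -> candidate divergence types (cf. Table 4.2)
    frozenset(["f3", "f4", "f6", "f7"]): ["d3", "d6"],
}

def detect_divergences(ft_features, flag_of):
    """Build D from the relevance table, then D' = {d in D : Flag_d = 1}."""
    candidates = RELEVANCE.get(frozenset(ft_features), [])
    return [d for d in candidates if flag_of(d) == 1]

# For "I am feeling hungry": Flag(d3) = 0, Flag(d6) = 1, hence D' = ['d6'].
flags = {"d3": 0, "d6": 1}
print(detect_divergences(["f3", "f4", "f6", "f7"], flags.get))  # ['d6']
```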

4.4.2 Illustration 2

Consider the input sentence: She is in a dilemma.


Its parsed version is @SUBJ PRON PERS FEM SG3 she, @+FMAINV V PRES
be, @ADVL PREP in, @<P N SG dilemma <$.>.
The FT-features present in this sentence are:

- f1, as the root form of the main verb is be;
- f4, because the sentence has a subject, viz. she;
- f10, since the sentence has a PA, viz. in dilemma.

Using the Relevance Table, the set D of possible divergence types is obtained as {d2}.
The algorithm now collects evidences in support of categorial divergence (d2). Table 4.3 suggests that the problematic FT for d2 is the predicative adjunct, i.e. in dilemma. Thus the problematic word is dilemma. WordNet 2.0 provides only one sense for dilemma: state of uncertainty or perplexity especially as requiring a choice between equally unfavorable options. Thus the problematic word a2 is dilemma#n#1.
A search in PSD2 for the word that is semantically most similar to a2 retrieves the entry motion#n#4 as w2, and the similarity score sim(a2, w2) is computed to be 0.578.
It may be noted that the similarity between dilemma and motion is not apparent at the surface level. However, since the algorithm uses the hypernyms of the words concerned for computing the similarity value, a positive semantic score has been obtained: the last abstraction level in the hypernym hierarchies of dilemma and motion is the same, viz. state.
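The effect described above can be imitated with hypernym chains. The chains below are hand-made stand-ins (not WordNet's actual hierarchies), and the scoring function is a generic sketch rather than the thesis's exact similarity formula.

```python
# Illustrative hypernym-overlap similarity, assuming hand-listed hypernym
# chains (root first). Chains and scoring are assumptions for this sketch.

def hypernym_similarity(chain_a, chain_b):
    """Fraction of the shorter chain shared from the root downwards."""
    common = 0
    for x, y in zip(chain_a, chain_b):
        if x != y:
            break
        common += 1
    return common / min(len(chain_a), len(chain_b))

dilemma = ["entity", "abstraction", "attribute", "state",
           "cognitive_state", "perplexity", "dilemma"]
motion = ["entity", "abstraction", "attribute", "state", "motion"]

# Both chains pass through the common abstraction "state", so the score is > 0.
print(round(hypernym_similarity(dilemma, motion), 2))  # 0.8
```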
Since 0.5 ≤ sim(a2, w2) < 1, Step 2 of the algorithm suggests that NSD2 has to be checked for further evidence. From NSD2, the word w′2 most similar to a2 is determined, and it is found to be confusion#n#2 with sim(a2, w′2) = 0.960. The algorithm therefore determines s(d2), s(n2), m(d2) and m(n2) (see Case 3c). These values are found to be 0.086, 0.035, 0.332 and 0.497, respectively. Since m(n2) > m(d2), Flag2 is set to 0.
Using Step 3, the algorithm now constructs the set D′ consisting of the divergence types di for which the flags have been set to 1. Evidently, D′ is found to be empty. Thus the algorithm suggests that the above input sentence does not give rise to any divergence upon translation into Hindi. It may be noted that the above decision made by the algorithm is a correct one.

4.4.3 Illustration 3

Now consider the sentence: My house faces east.


Its parsed version is: @GN> PRON PERS GEN SG1 i, @SUBJ N SG house, @+FMAINV V PRES face, @OBJ N SG east <$.>

Note that the main verb of the input sentence is face, which is not be or have. Further, the sentence has a subject, my house, and an object, east. Thus the FT-features of the given input sentence are f3, f4 and f5.
According to the Relevance Table the set D is constructed, and it has three elements:

- d1, i.e. structural divergence;
- d3, i.e. conflational divergence, because of sub-types 1 and 2;
- d4, i.e. demotional divergence, due to sub-types 1, 3 and 4.

Evidences for structural divergence (d1):
The problematic FT for d1 is the main verb, which is face. Nine senses are provided by WordNet 2.0 for the verb face, of which sense3 (be oriented in a certain direction, often with respect to another reference point; be opposite to) is the relevant one in this case. Thus the problematic word a1 is face#v#3. From PSD1 the word w1 that is most similar to a1 is retrieved. Note that w1 is obtained as attend#v#1, and the similarity score sim(a1, w1) is calculated to be 0.660. Since 0.5 ≤ sim(a1, w1) < 1, the algorithm now checks NSD1. From NSD1, W′1 is constructed, and w′1 is found to be cap#v#1 with sim(a1, w′1) = 0.889. In this case the algorithm has to determine s(d1) and s(n1). These are found to be 0.444 and 0.151, respectively. Thus m(d1) = ½(sim(a1, w1) + s(d1)) = 0.552 and m(n1) = ½(sim(a1, w′1) + s(n1)) = 0.520. Since m(d1) > m(n1), the algorithm sets Flag1 to 1.
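The arithmetic of this tie-break can be checked directly; the helper name below is invented, but the numbers are those of Illustration 3.

```python
# Case-3c decision: average the similarity evidence with the frequency
# evidence on each side, then compare (helper name is illustrative).
def mean_score(sim, s):
    return 0.5 * (sim + s)

m_d1 = mean_score(0.660, 0.444)  # divergence side: sim(a1, w1) and s(d1)
m_n1 = mean_score(0.889, 0.151)  # normal side: sim(a1, w'1) and s(n1)
flag1 = 1 if m_d1 > m_n1 else 0
print(round(m_d1, 3), round(m_n1, 3), flag1)  # 0.552 0.52 1
```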


Evidences for conflational divergence (d3):

The problematic FT for d3 is also the main verb (see Table 4.3), and therefore the problematic word (a3) here too is face#v#3. From PSD3 the word w3 that is most similar to a3 is retrieved. In this case the same word face#v#3 exists in PSD3, and therefore sim(a3, w3) = 1.0. Therefore, due to Case 2a, Flag3 is set to 1.

Evidences for demotional divergence (d4):

The problematic word a4 for d4 is also face#v#3, which too exists in PSD4. Hence Flag4 is also set to 1.
Divergence type (di) | s(di) | m(di)
structural (d1) | 0.444 | 0.552
conflational (d3) | 0.086 | 0.543
demotional (d4) | 0.204 | 0.602

Table 4.6: Values of s(di) and m(di) for Illustration 3

In Step 3, the set D′ = {d1, d3, d4} is constructed. The set of possible combinations C (see Case 4c) is found to be {{d1, d3}, {d3, d4}}. For a final decision the algorithm now computes the values of s(di) and m(di) (see Case 3c). These values are given in Table 4.6. Using the values given therein, the algorithm computes ½(m(d1) + m(d3)) = 0.548 and ½(m(d3) + m(d4)) = 0.673.

Since the latter is the maximum, the algorithm suggests that the above input sentence will have divergences d3 and d4 upon translation into Hindi. The above decision of the algorithm is also correct.
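The final combination step reduces to a maximisation over the admissible pairs. A sketch with the values of Table 4.6 follows; the variable names are illustrative.

```python
# Case-4c: among the possible combinations of co-occurring divergence types,
# choose the one with the highest average m(di).
m = {"d1": 0.552, "d3": 0.543, "d4": 0.602}
combinations = [("d1", "d3"), ("d3", "d4")]

best = max(combinations, key=lambda c: sum(m[d] for d in c) / len(c))
print(best)  # ('d3', 'd4')
```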
Tables 4.7 and 4.8 provide a few more examples with brief explanations. The overall analysis of each example sentence requires 17 columns. Table 4.7 contains columns (i) to (viii), and Table 4.8 contains columns (ix) to (xvii). For ease of understanding, the serial number (S. No.) column and column (ii) are given in both tables. In these tables, NA is used when a particular condition is not applicable, and Nil implies that no word having a semantic similarity score greater than 0 has been found in the PSD/NSD.

S.No. | Sentence | di | Problematic word, ai | Most similar word, wi | sim(ai, wi) | Is wi a coordinate term? | Most similar word, w′i | sim(ai, w′i)
(i) | | (ii) | (iii) | (iv) | (v) | (vi) | (vii) | (viii)
1. | She will resolve this issue. | d1 | resolve#v#6 | calculate#v#1 | 0.984 | No | resolve#v#6 | 1.0
 | | d3 | resolve#v#6 | Nil | 0.0 | NA | NA | NA
 | | d4 | resolve#v#6 | Nil | 0.0 | NA | NA | NA
2. | I will attend this meeting. | d1 | attend#v#1 | attend#v#1 | 1.0 | NA | NA | NA
 | | d3 | attend#v#1 | look#v#5 | 0.66 | No | ride#v#9 | 0.66
 | | d4 | attend#v#1 | face#v#4 | 0.75 | Yes | NA | NA
3. | This exercise will hurt your back. | d1 | hurt#v#2 | trample#v#2 | 0.96 | No | twist#v#9 | 0.96
 | | d3 | hurt#v#2 | knife#v#1 | 0.96 | No | twist#v#9 | 0.96
 | | d4 | hurt#v#2 | Nil | 0.0 | NA | NA | NA
4. | John stabbed Mary. | d1 | stab#v#1 | stab#v#1 | 1.0 | NA | NA | NA
 | | d3 | stab#v#1 | stab#v#1 | 1.0 | NA | NA | NA
 | | d4 | stab#v#1 | Nil | 0.0 | NA | NA | NA
5. | This dish tastes good. | d3 | taste#v#1 | taste#v#1 | 1.0 | NA | NA | NA
 | | d6 | good#a#1 | Nil | 0.0 | NA | NA | NA
6. | This table weighs 100kg. | d1 | weigh#v#1 | encounter#v#3 | 0.660 | No | stay#v#1 | 0.660
 | | d3 | weigh#v#1 | measure#v#3 | 0.972 | No | look#v#3 | 0.660
 | | d4 | weigh#v#1 | suffer#v#6 | 0.660 | No | look#v#3 | 0.660
7. | It is windy today. | d2 | windy#a#1 | stormy#a#1 | 0.75 | No | Nil | 0.0
 | | d5 | windy#a#1 | stormy#a#1 | 0.75 | No | Nil | 0.0
8. | It will be morning soon. | d2 | morning#n#3 | pain#n#2 | 0.406 | No | NA | NA
 | | d5 | morning#n#3 | morning#n#3 | 1.0 | NA | NA | NA
9. | She is in pain. | d2 | pain#n#1 | pain#n#2 | 0.438 | No | NA | NA
10. | It suffices. | d3 | suffice#v#1 | resemble#v#1 | 0.782 | No | meet#v#5 | 0.96
 | | d4 | suffice#v#1 | suffice#v#1 | 1.0 | NA | NA | NA
 | | d5 | suffice#v#1 | Nil | 0.0 | NA | NA | NA

Table 4.7: Some Illustrations

S.No. | di | s(di) | s(ni) | m(di) | m(ni) | Flagi | D′ | C | ½(m(di) + m(dj)) | Result
(i) | (ii) | (ix) | (x) | (xi) | (xii) | (xiii) | (xiv) | (xv) | (xvi) | (xvii)
1. | d1 | NA | NA | NA | NA | 0 | | NA | NA | Normal
 | d3 | NA | NA | NA | NA | 0 | | |
 | d4 | NA | NA | NA | NA | 0 | | |
2. | d1 | NA | NA | NA | NA | 1 | d1 | NA | NA | d1
 | d3 | 0.075 | 0.141 | 0.368 | 0.401 | 0 | | |
 | d4 | NA | NA | NA | NA | 0 | | |
3. | d1 | 0.241 | 0.142 | 0.601 | 0.551 | 1 | d1 | NA | NA | d1
 | d3 | 0.08 | 0.145 | 0.52 | 0.553 | 0 | | |
 | d4 | NA | NA | NA | NA | 0 | | |
4. | d1 | NA | NA | NA | NA | 1 | d1, d3 | {d1, d3} | NA | d1, d3
 | d3 | NA | NA | NA | NA | 1 | | |
 | d4 | NA | NA | NA | NA | 0 | | |
5. | d3 | NA | NA | NA | NA | 1 | d3 | NA | NA | d3
 | d6 | NA | NA | NA | NA | 0 | | |
6. | d1 | 0.224 | 0.231 | 0.442 | 0.495 | 0 | d3 | NA | NA | d3; no decision about d4
 | d3 | 0.186 | 0.219 | 0.579 | 0.439 | 1 | | |
 | d4 | 0.287 | 0.287 | 0.473 | 0.473 | | | |
7. | d2 | NA | NA | NA | NA | 1 | d2, d5 | {d2, d5} | NA | d2, d5
 | d5 | NA | NA | NA | NA | 1 | | |
8. | d2 | NA | NA | NA | NA | 0 | d5 | NA | NA | d5
 | d5 | NA | NA | NA | NA | 1 | | |
9. | d2 | NA | NA | NA | NA | 0 | | NA | NA | Normal (a wrong decision, as sim(a2, w2) < 0.5)
10. | d3 | | | | | 0 | d4 | NA | NA | d4
 | d4 | NA | NA | NA | NA | 1 | | |
 | d5 | NA | NA | NA | NA | 0 | | |

Table 4.8: Continuation of Table 4.7


4.4.4 Experimental Results

In order to evaluate the performance, we have applied the above algorithm to 300 randomly selected sentences that are not present in our example base. Manual analysis of the translations of these 300 sentences revealed that 32 of them involve some type of divergence when translated from English to Hindi. The remaining 268 sentences have normal translations.
The output of the algorithm is as follows: it recognized 36 of the sentences as having divergence upon translation, and 261 as having normal translation. For 3 sentences the algorithm could not make any decision. Table 4.9 summarizes the overall outcome.
Parameters | Divergence | Normal
Number of examples | 32 | 268
Experimental results | 36 | 261
Correct results | 30 | 260
Recall % | 83.33% | 99.62%
Precision % | 93.75% | 97.39%

Table 4.9: Results of Our Experiments

The very high value (above 90%) for precision establishes the efficiency of the algorithm in detecting the possible occurrence of divergence even before the actual translation is carried out.
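The percentages in Table 4.9 can be re-derived from the raw counts; the naming of recall and precision below follows the table's own convention.

```python
# Sanity check of Table 4.9: 300 test sentences, of which 32 are divergent.
predicted_div, correct_div, actual_div = 36, 30, 32
predicted_norm, correct_norm = 261, 260
no_decision = 3

assert predicted_div + predicted_norm + no_decision == 300
recall_div = 100 * correct_div / predicted_div     # 30/36  -> 83.33%
precision_div = 100 * correct_div / actual_div     # 30/32  -> 93.75%
recall_norm = 100 * correct_norm / predicted_norm  # 260/261 -> 99.62%
print(round(recall_div, 2), round(precision_div, 2), round(recall_norm, 2))
```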
There are a few examples where the algorithm failed to produce the correct decision. These may be put into three categories:


1. The translation of the input sentence actually involves divergence, but the algorithm predicts normal translation. Table 4.9 indicates that there is one such case in our experiments. Although the algorithm suggests that 261 sentences will be translated normally, it has been found that only 260 of these are correct decisions.

2. The input sentence actually has a normal translation, but the algorithm predicts divergence. In the experiments carried out by us, we found six such examples. While the algorithm suggests that 36 sentences will involve some type of divergence, only 30 of them are correct decisions (see Table 4.9).

3. The algorithm is unable to decide the nature of the translation of the input sentence. Out of the 300 examples tried, the algorithm could provide decisions for only 297 (36 + 261) sentences. For the remaining three sentences the algorithm could not arrive at any decision regarding whether they will be translated normally, or whether their translations will involve some type of divergence. These are the situations that fall under Case 3c of the algorithm.

   Table 4.7 provides one example of this type. Here the input sentence and its translation are: This table weighs 100kg → iss (this) mez kaa vajan (weight of this table) 100 kilo (100 kg) hai (is). This example has demotional divergence, i.e. d4. However, the algorithm could not give any decision regarding the occurrence/non-occurrence of d4, since the values of both m(d4) and m(n4) are computed to be 0.473.

The algorithm is not able to give the correct result in the first two cases. We feel that the possible reasons behind the incorrect decisions taken by the algorithm are the following:

- Lack of robust PSD and NSD. The present sizes of the PSD and NSD are 416 and 3931 entries, respectively. Evidently, these numbers are not large enough to deal with all different sentences. As more examples (particularly those involving divergence) are collected, both the PSD and NSD may be enriched with additional entries. This will in turn enable the algorithm to measure semantic similarity in a more direct way. As a consequence, the number of erroneous decisions will reduce.

- The value of the threshold. For our experiments we have used 0.5 as the value of the threshold t. This value has been obtained by carrying out a number of experiments on our example base. However, with more examples this value of t may have to be reassigned, which may in turn improve the quality of the results. Further experiments with more examples need to be carried out to arrive at an optimal value of the threshold t.

4.5 Concluding Remarks

The occurrence of divergence poses a great hindrance to the efficient adaptation of retrieved sentences in an EBMT system. This can be dealt with efficiently provided an EBMT system is capable of making an a priori decision regarding whether an input sentence will cause any divergence upon translation. This will enable the EBMT system to retrieve a past example more judiciously. However, the primary difficulty in handling divergence is that its occurrences are not governed by any linguistic rules. Hence no straightforward method exists for determining whether a source language sentence will involve any divergence upon translation. In this work we attempted to bridge this gap. We developed a scheme so that an a priori decision may be made by seeking evidences from the existing example base. In order to achieve the above goal, we first analyzed different divergence examples to ascertain the root cause behind the occurrence of a divergence. We found that each divergence type can be associated with some Functional Tag (FT) that is instrumental in causing this type of divergence. We call it the problematic FT corresponding to that particular divergence. In fact, a detailed analysis of a large number of translation examples revealed that the occurrence of each type of divergence invariably demands certain patterns in the structure of the input sentence. While the presence of certain FTs (including the problematic FT) in the input sentence is mandatory, some other FT-features should necessarily be absent in order for the particular divergence type to occur.
Since divergence is an occasional phenomenon, it is not true that any sentence having the structure required by a particular divergence will certainly involve divergence upon translation. The occurrence of divergence also depends upon the semantics of some constituent words. To measure the semantic similarity between words, two dictionaries, viz. the problematic sense dictionary (PSD) and the normal sense dictionary (NSD), have been created.
Given an input, these knowledge bases are referred to in order to seek evidence in support of, or against, divergence. The evidences used are of the following types:

(a) the Functional Tags of the constituent words of a given input;
(b) the semantic similarity of these constituent words with words in the PSD and NSD;
(c) the frequency of occurrence of different divergence types in the example base; and
(d) which divergence types may co-occur in the translation of an input sentence.


The experiments carried out by us resulted in very high values of precision and recall. However, more experiments need to be done to establish this scheme as a key technique for dealing with divergences in an EBMT system. The following points may be noted with respect to the scheme presented here:

1) The creation of the sense dictionaries is an important piece of background work required for the implementation of the proposed scheme. The sense dictionaries (PSD and NSD) used in this work have been created manually. Suitable Word Sense Disambiguation techniques may have to be developed/used to accomplish this task.

2) The decisions made by the scheme concern divergence types only. We feel that the scheme may be further extended to deal with the various sub-types that are associated with each divergence type. Our present example base does not have a sufficient number of examples for each sub-type. More examples involving each of these sub-types need to be obtained and analyzed for any such extension, and also to improve upon the performance of the present scheme.


Chapter 5
A Cost of Adaptation Based
Scheme for Efficient Retrieval of
Translation Examples


5.1 Introduction

Similarity measurement is an essential part of any EBMT system, as it leads to the development of an effective retrieval scheme. The closer the retrieved sentence is to the input one, the easier is its adaptation towards generating the required translation. However, no standard technique has been developed for measuring sentential similarity. Typically, similarity between sentences is measured using syntax and semantics (Manning and Schutze, 1999). In this chapter we show that if adaptation is the main concern, neither of them is adequate for similarity measurement.

In this work we look at similarity from the adaptation point of view. A new algorithm is proposed that considers the cost of adaptation as the key concept for measuring similarity. This means that the proposed algorithm measures the computational cost involved in adapting the translation of a sentence E1 to generate the translation of another sentence E2. The lower the cost, the more similar the two sentences are considered to be. The algorithm has been tested on our normal example base. For convenience, in subsequent discussions the normal example base (see Chapter 4) is referred to simply as the example base. The results obtained are compared with two other algorithms based on syntactic and semantic metrics. It is shown that the algorithm proposed in this chapter performs better than the other two.

5.2 Brief Review of Related Past Work

Various similarity metrics reported in the literature can be characterized by the text units they are applied to. These units may be words, characters, sentences or chunks¹. Some of these metrics are discussed below:


1. Word-based metrics: Word-based metrics are among the most basic similarity metrics, suggested by Nagao (1984) and used in many early EBMT systems. The metric uses a thesaurus or similar means for identifying word similarity on the basis of meaning or usage. According to Nirenburg (1993), individual words of the two sentences are compared in terms of their morphological paradigms, synonyms, hypernyms, hyponyms and antonyms. On the other hand, Sumita et al. (1990) used a semantic distance d (0 ≤ d ≤ 1) that is determined by the Most Specific Common Abstraction (MSCA), obtained from a thesaurus abstraction hierarchy. A similar technique was used by Sumita and Iida (1991) for translating Japanese adnominal particle constructions (Noun1 preposition Noun2). This shows that the technique works for measuring sub-sentence level similarity as well.
2. Character-based metrics: Another approach, based on a character-based metric, has been proposed in (Sato, 1992). This is a highly language-dependent approach that requires analysis of the characteristics of the language under consideration. Sato's work has been applied to Japanese, taking advantage of certain characteristics of the language:
(a) The character-based method does not need morphological analysis.
(b) It can retrieve some kinds of synonyms without a thesaurus, because synonyms often share the same Kanji character in Japanese.
The character-based best match can be determined by defining a distance or similarity measure between two strings. Considering character order constraints, the simplest measure of similarity between two strings is the number of matching characters.

¹ A chunk is a segment or substring of words from a sentence or text.
3. Chunk/substring-based matching scheme: This approach was proposed by Nirenburg et al. (1993). Here, the search for matching candidates proceeds as follows. A sentence is broken into segments at punctuation marks or at unknown words, and thus a list of all contiguous substrings (chunks) of a segment is produced. For every input chunk the algorithm looks for sentences in the corpus that contain a matching substring. The algorithm uses a relaxed definition of matching that allows not only complete matches but also matches in which (i) there are gaps between the words, or (ii) the word order is different. It also considers matching on the basis of a subset of the words in the input chunk, and takes care of word inflections.
For each inexact match, a penalty is calculated. This penalty is based on fixed numbers on a scale of 1 to 15 reflecting the degree of inexactness. For example, the penalty for unmatched words is set to 10, and the penalty for disordering is set to 15. Match scores are first calculated separately for each incomplete match; then a cumulative score is produced. The candidate-finding procedure retains only those matches whose match scores are above a threshold, which is set at 10 for the best matches.
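A toy version of this penalty scheme can be written with just the two penalties quoted above. The function and the matching policy are simplified assumptions for illustration, not Nirenburg et al.'s actual procedure.

```python
# Toy chunk-matching penalty with fixed penalties for each kind of
# inexactness (the two constants are the figures quoted in the text).
PENALTY_UNMATCHED_WORD = 10
PENALTY_DISORDER = 15

def match_penalty(chunk, candidate):
    """Cumulative penalty of matching an input chunk against a candidate."""
    # each chunk word absent from the candidate is an unmatched word
    penalty = sum(PENALTY_UNMATCHED_WORD for w in chunk if w not in candidate)
    # shared words appearing in a different order incur a disorder penalty
    shared = [w for w in chunk if w in candidate]
    if shared and sorted(shared, key=candidate.index) != shared:
        penalty += PENALTY_DISORDER
    return penalty

print(match_penalty(["book", "the", "flight"],
                    ["book", "the", "flight", "now"]))  # 0 (exact, in order)
```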
4. Syntactic/semantic-based matching (Gupta and Chatterjee, 2002): This idea has been borrowed from the domain of information retrieval, as proposed in (Manning and Schutze, 1999). Here, each of the example base sentences and the input are represented in a high-dimensional space, in which each dimension corresponds to a distinct word in the database. The similarity is calculated as the dot product of the vectors. In both cases the measurement score depends to a significant extent on the word weights (word frequency and sentence frequency)², which in turn depend on the sentences in the example base. Thus the schemes become highly subjective. In particular, sentences having a similar structure (in terms of tense, subject, number of objects, etc.) have higher similarity measurement values for a given input sentence. Different weights have been assigned to the similarity of different syntactic tags. For example, a score of 20 is given to verb or auxiliary verb matching, and a score of 5 is given to adjective or adverb matching.

² These terms are explained in detail in Section 5.5.1.
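The vector-space idea, stripped of the tag-specific weights, is just a dot product of word-count vectors; the following is a minimal, unweighted sketch rather than the weighted scheme described above.

```python
# Minimal vector-space similarity: each sentence becomes a word-count
# vector, and similarity is the dot product of the two vectors.
from collections import Counter

def dot_similarity(s1, s2):
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    return sum(v1[w] * v2[w] for w in v1)

print(dot_similarity("he reads a book", "she reads a novel"))  # 2
```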
5. Hybrid retrieval scheme: This scheme has been used in the ReVerb system (Collins and Cunningham, 1996; Collins, 1998), which utilizes two different levels of case retrieval: string-matching retrieval (Phase 1), and activation passing for syntactic retrieval (Phase 2). In Phase 1, only exact words are matched, and near morphological neighbours (such as variations due to number or tense) are not considered. The highest score is allocated to those cases that have been activated the greatest number of times. In Phase 2, for structural retrieval, the input sentence is first pre-chunked, such that each chunk has an explicit head-word. The algorithm initiates activation from each word in the chunk, giving the head word an increased weight to reflect its pivotal role in the chunk. The final score is evaluated by summing the above two scores.
6. DP-matching between word sequences (Sumita, 2001): This scheme scans the source parts of all example sentences in a bilingual corpus. By measuring the semantic distance between the word sequences of the input and example sentences, it retrieves the examples with the minimum distance, provided the distance is smaller than a given threshold. Otherwise, the whole translation fails with no output. The semantic distance (dist) is calculated as

dist = (I + D + 2 Σ SEMDIST) / (L_input + L_example)

where I is the number of insertions, D is the number of deletions, and SEMDIST = K/N, where K is the level of the least common abstraction in the thesaurus and N is the height of the abstraction hierarchy. The value of SEMDIST ranges from 0 to 1. The denominator of the above expression is the sum of the lengths of the input and the example sentence.
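Written out directly, the distance is straightforward to compute once an alignment has produced the insertion, deletion and substitution counts; those counts are assumed as inputs in this sketch.

```python
# Sumita's (2001) distance: (I + D + 2 * sum of SEMDISTs) / (L_input + L_example).
def dp_distance(insertions, deletions, semdists, len_input, len_example):
    return (insertions + deletions + 2 * sum(semdists)) / (len_input + len_example)

# One insertion plus one substitution with SEMDIST = K/N = 1/4:
print(dp_distance(1, 0, [0.25], 5, 4))  # (1 + 0.5) / 9
```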
7. Semantic matching procedure (Jain, 1995): This scheme first looks at the verb-part of the input sentence, and on the basis of the type of verb-part it chooses an appropriate partition. The syntactic units of the input sentence are counted for entering the next level of partition. After reaching the correct sub-partition, exact pattern matching is performed. For all such examples, the distance from the input sentence is found using a distance formula. The distance d between I (the input sentence) and E (the example sentence) is defined as

d(I, E) = Σ_{p=1..n} dp(IG, EG) + dv(IV, EV)

where n is the number of noun syntactic groups in the source language sentence, IG and EG are the input sentence and example sentence noun syntactic groups, respectively, and IV and EV are the input and example sentence verb groups, respectively.
The above distance d is calculated on the basis of a weighted average of the attribute difference, status difference, gender difference, number difference, person difference, additional semantic difference, and verb category difference between the example sentence E and the input sentence I. Pre-assigned values in the range of 0 to 1 have been used as the weighting factors for the above parameters.
8. Retrieving meaning-equivalent sentences (Shimohata et al., 2003): Retrieval of meaning-equivalent sentences is based on content words (e.g. nouns, adjectives, verbs), modality (request, desire, question) and tense. This method does not rely on functional word (e.g. conjunction, preposition, auxiliary verb) information. A thesaurus is utilized to extend the coverage of the example base. Two types of content words, identical and synonymous, have been used. Sentences that satisfy the following conditions are recognized as meaning-equivalent sentences:
- The retrieved sentence should have the same modality and tense as the input sentence.
- All content words (identical or synonymous) are included in the input sentence. This means that the set of content words of a meaning-equivalent sentence is a subset of that of the input.
- At least one identical content word is included in the input sentence.
If more than one sentence is retrieved, the algorithm ranks them by introducing a focus area to select the most similar one. The focus area is defined as the last N words of the word list of an input sentence, where the value of N varies according to the length of the input sentence.
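The three conditions above can be expressed as a small predicate. The data layout (pre-extracted content words, modality and tense, plus a synonym map) is an assumption made for this sketch.

```python
# Sketch of Shimohata et al.'s meaning-equivalence test; data layout and
# names are assumptions, not the paper's implementation.
def is_meaning_equivalent(inp, cand, synonyms):
    # condition 1: same modality and tense
    if cand["modality"] != inp["modality"] or cand["tense"] != inp["tense"]:
        return False
    # condition 2: every candidate content word is identical or synonymous
    # to some input content word (subset property)
    covered = all(
        w in inp["content"] or synonyms.get(w, set()) & set(inp["content"])
        for w in cand["content"]
    )
    # condition 3: at least one identical content word is shared
    identical = any(w in inp["content"] for w in cand["content"])
    return covered and identical

inp = {"modality": "question", "tense": "present", "content": ["room", "vacant"]}
cand = {"modality": "question", "tense": "present", "content": ["room", "free"]}
print(is_meaning_equivalent(inp, cand, {"free": {"vacant"}}))  # True
```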


Many other similarity measurement schemes are found in the literature. Every metric has its own advantages and disadvantages. The demerits mentioned below motivate us to define a new metric.

- Character-based metrics are highly script dependent. Hence a scheme designed for a specific language may not be usable for another language.
- Word-based metrics are generally dependent on the size of the database. If the database does not contain sentences having words in common with a given sentence, then these methods may fail to retrieve any similar sentence from the example base.
- Most importantly, in almost all the schemes³ described above, adaptation and retrieval have been dealt with independently. However, we feel that adaptation and retrieval should go hand in hand. A retrieval scheme should be considered efficient (for an EBMT system) if the adaptation of the retrieved sentence is computationally less expensive.

In order to avoid the above difficulties we propose the cost of adaptation as the major yardstick for measuring similarity between sentences in an EBMT system. The following section describes the metric proposed by us. The cost of adaptation is based on the constituent word, morpho-word and suffix operations already discussed in Chapter 2.

³ The only metrics that we found to have considered the concept of adaptation while measuring similarity are (Sumita, 2001) and (Collins, 1998). These schemes rely only on counting the number of adaptation operations, and a fixed penalty is assigned to these operations. However, this assumption is not very realistic.

5.3 Evaluation of Cost of Adaptation

As discussed in Chapter 2, the cost of adaptation depends on the number of operations required for adapting a retrieved example. The total cost may then be computed as the sum of the individual costs of the operations used for the adaptation. An important point to be noted in this respect is that some adaptation operations (e.g. constituent word addition and constituent word replacement) require a search of an English-to-Hindi dictionary. Typically, this dictionary will not be stored in the RAM of the system, and accessing it requires retrieval from external storage. This cost is much more than the cost of any operation that can be accomplished in RAM. Resorting to morpho-word or suffix operations reduces the number of dictionary searches, since the number of morpho-tags and suffixes is much smaller in comparison with the total content of a dictionary. However, since the complete avoidance of constituent word additions and replacements is impossible, we had to take into account the search time due to the different operations in our analysis of computational cost. To deal with dictionary search, we make the following assumptions/observations:

1. We assume that the dictionary is stored on a hard drive. We also assume that the search will be done using a binary search algorithm. One may also consider multi-way search trees (e.g. B+-trees) (Loomis, 1997). But since a successful search in a dictionary of size D takes log2 D comparisons for a binary tree and logm D for an m-way tree, the difference between the search times in these two cases is due to a constant factor only⁴.

2. We further assume that the index tree that is used to facilitate the dictionary search is already in RAM. Typically, the index tree is designed with the help of a set of keys. In this case, we assume that the keys are the English words, which are used for the search operation. The record corresponding to each key contains all other relevant information, e.g. the Hindi meaning of the word, its POS, and other information. These records are stored in the external storage.

⁴ logm D = log2 D · logm 2
3. The search procedure refers to the index tree for identifying the location of the word in the dictionary. This operation is carried out by accessing the RAM only. For the actual retrieval the external storage is accessed, and this has its associated factors, e.g. latency and seek time (Weiderhold, 1987). However, in our analysis we do not consider all these factors. We make a simple but realistic assumption following the studies on temporal requirements given in http://www.kingston.com/tools/umg/umg01a.asp. Accordingly, we assume that the access time of RAM for the CPU is 200 ns (nanoseconds), while the access time of the hard disk is 12,000,000 ns. Thus the time requirements differ by a constant of the order of 10⁵.
4. In order to reduce the search time, instead of using one dictionary, we recommend using different dictionaries for different POS. The dictionaries used for this work are of the following sizes: Noun - 13953, Adjective - 5449, Adverb - 1027, Preposition - 87, Pronoun - 72 and Verb - 4330. This database has been taken from the Shabdanjali English to Hindi dictionary (http://ltrc.iiit.net/onlineServices/Dictionaries/Shabdanjali/data source.html). Thus the approximate search times for all these parts of speech are as follows:


5.3. Evaluation of Cost of Adaptation

POS           Size     Search time
Noun          13953    log2 13953 ≈ 13.77
Adjective     5449     log2 5449 ≈ 12.41
Adverb        1027     log2 1027 ≈ 10.00
Preposition   87       log2 87 ≈ 6.44
Pronoun       72       log2 72 ≈ 6.17
Verb          4330     log2 4330 ≈ 12.08
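The search times above can be reproduced with a short script. This is a purely illustrative sketch (the function and variable names are ours, not part of any thesis implementation):

```python
import math

# Dictionary sizes per POS, taken from the Shabdanjali-based database above.
dictionary_sizes = {
    "Noun": 13953,
    "Adjective": 5449,
    "Adverb": 1027,
    "Preposition": 87,
    "Pronoun": 72,
    "Verb": 4330,
}

def binary_search_depth(size):
    """Approximate number of comparisons for a successful binary search."""
    return round(math.log2(size), 2)

for pos, size in dictionary_sizes.items():
    print(f"{pos:12s} {size:6d}  log2 {size} = {binary_search_depth(size)}")
```

Running this reproduces the table entries, e.g. 13.77 comparisons for the noun dictionary.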

5. Constituent word addition operations require another element in an EBMT system. If a new word needs to be added into a retrieved translation, the addition should be according to the syntax rules of the target language (here Hindi). Since one should not expect that all possible examples are available in the knowledge base of an EBMT system, a pure example-based approach may not be able to obtain the right place for the new word to be added. In the absence of suitable examples the system needs to wade through a large set of syntax rules to determine the appropriate position where the new word has to be added. We denote this cost by δ, and assume that this cost is much more than the cost of finding the Hindi equivalent of a given English word from the dictionary.
6. For applying any constituent word, morpho-word or suffix operation, first one needs to find the appropriate word position in a retrieved example. If the retrieved Hindi sentence length is L, we consider the average search time for finding the appropriate word position to be proportional to L/2.

7. Since the dictionary search required for constituent word addition (WA) and constituent word replacement (WR) operations is computationally expensive, we introduce the following step before referring to the dictionary. We suggest that the scheme should first check whether the word to be added is already present in the sentence (possibly with a different functional tag). In that case the Hindi equivalent of the word may be taken directly from the retrieved sentence, and thereby the dictionary search may be avoided. The cost of this step is proportional to Lp/2, which should be added to the overall cost of constituent word addition and replacement. Here Lp is the length of the parsed version of the retrieved English sentence.
For illustration, consider two examples:

(A) The car runs on diesel.
    gaadii (car) diijal (diesel) par (on) chaltii (runs) hai (is)

(B) Diesel is a suitable fuel for this car.
    diijal (diesel) iss (this) gaadii (car) ke liye (for) upyukta (suitable) iindhan (fuel) hai (is)

Note that in sentence (A) the words car and diesel are the subject and the complement of the preposition, respectively. On the other hand, in sentence (B) their roles are reversed. If sentence (B) is retrieved in order to generate the translation of (A), then for these two positions constituent word replacement operations are required. Typically, this operation demands a dictionary search to get the Hindi equivalents of these words. However, in the above example the dictionary search may be avoided since the Hindi equivalents of the desired words (car and diesel) may be obtained from the retrieved example itself. Hence the computational cost can be minimized.
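The dictionary-avoidance check described in item 7 can be sketched as follows. The word-aligned representation of the retrieved pair and the function names are our own illustrative assumptions, not the thesis implementation:

```python
def hindi_equivalent(word, aligned_pair, dictionary):
    """Return the Hindi equivalent of an English word, preferring the
    retrieved example itself over a (costly) dictionary lookup.

    aligned_pair: dict mapping English words of the retrieved sentence
    to their Hindi equivalents (under any functional tag)."""
    # Step introduced in item 7: scan the retrieved sentence first
    # (cost proportional to Lp/2, cheap since it stays in RAM).
    if word in aligned_pair:
        return aligned_pair[word]
    # Fall back to the POS dictionary (expensive external-storage access).
    return dictionary.get(word)

# Retrieved example (B): "Diesel is a suitable fuel for this car."
aligned_b = {"diesel": "diijal", "this": "iss", "car": "gaadii",
             "for": "ke liye", "suitable": "upyukta", "fuel": "iindhan"}

# Adapting towards (A) "The car runs on diesel.": both replacements
# are resolved from the retrieved pair, so no dictionary search occurs.
print(hindi_equivalent("car", aligned_b, {}))     # gaadii
print(hindi_equivalent("diesel", aligned_b, {}))  # diijal
```

Only a word absent from the retrieved pair (e.g. "runs") would trigger the dictionary lookup.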
8. Morpho-word operations or suffix operations do not require any dictionary search. Only a set of fixed rules (which may be in a tabular form) is needed for finding the appropriate morpho-word or suffix for addition, deletion or replacement. If the total number of morpho-words is M, then the average cost to find the relevant morpho-word is proportional to M/2. In a similar way, the average cost of finding the appropriate suffix is proportional to K/2, where K is the total number of suffixes.


Section 5.3.1 describes how the computational cost of each of the adaptation
operations is computed in view of the above assumptions.

5.3.1 Cost of Different Adaptation Operations

Based on the above observations, the costs of the ten different adaptation operations (discussed in Chapter 2) are estimated in the following way:
1. Constituent Word Deletion (WD): To delete a word from a retrieved example, first the word is located in the sentence, and then it is deleted. Thus the average cost is (l1 · L/2) + ε, where L is the length of the retrieved Hindi sentence, and l1 is the constant of proportionality. ε is a small positive quantity reflecting the cost of the actual deletion operation (e.g. adjustment of pointers if sentences are stored in a linked-list structure of words).
2. Constituent Word Addition (WA): Constituent word addition is done in three steps:
First, the Hindi equivalent of the word to be added has to be found in the dictionary. This involves the cost {(d · log2 D) + (c · 10^5)}, where D is the size of the relevant dictionary, and c and d are the constants of proportionality. The two terms correspond to searching the binary tree of keys and then retrieving the related record from the external storage, respectively.
In the second step the position (in the sentence) where the new word has to be added is located. This requires referring to the syntactic rules of the target language grammar to find the proper position of the word. The cost of this operation has been denoted by δ (see item 5 of Section 5.3). Thus the overall cost of this step is δ + (l1 · L/2) + (l2 · Lp/2). Here l1 and L are the same as in the case of WD discussed above. Lp is the length of the parsed version of the retrieved English sentence, and l2 is the corresponding constant of proportionality.
Finally, the actual addition is done. The cost involved for this is ε, indicating the cost of adding the new word in the retrieved translation.
Therefore, the average time requirement for a WA operation is (l1 · L/2) + (l2 · Lp/2) + {(d · log2 D) + (c · 10^5)} + δ + ε.
3. Constituent Word Replacement (WR): The work here is similar to what needs to be done in WA, except that here no grammar rules need to be referred to for finding the proper position of the new word. Consequently, no space is required to be created for the new word. The cost, therefore, is reduced by δ and ε in comparison with constituent word addition. Hence the average cost is (l1 · L/2) + (l2 · Lp/2) + {(d · log2 D) + (c · 10^5)}.

4. Morpho-word Deletion (MD): As discussed in item 8 of Section 5.3, to delete a morpho-word from a retrieved example first the relevant morpho-word has to be identified. Hence an additional cost (m · M/2) is to be added to the cost of the constituent word deletion to get the cost of morpho-word deletion. Here m is the constant of proportionality. Therefore, the average cost is (l1 · L/2) + (m · M/2) + ε.

5. Morpho-word Addition (MA): For morpho-word addition, the cost for dictionary search and access in constituent word addition (see item 2 above) is replaced with the average cost (m · M/2). Moreover, the cost (l2 · Lp/2) in constituent word addition is not considered for morpho-word addition, as morpho-words are not present in the tagged version. Therefore, the average cost of morpho-word addition is (l1 · L/2) + (m · M/2) + δ + ε.

6. Morpho-word Replacement (MR): To compute the cost of morpho-word replacement one may refer to the morpho-word addition cost explained just above. However, two of its components, viz. δ and ε, need not be considered in the cost of morpho-word replacement. This is because the grammar rules need not be used to find the location of the new word, and consequently no extra space needs to be created. Further, an additional cost (m · M1/2) is to be added for finding the morpho-word to be replaced, where M1 is the size of the set from which the new morpho-word is to be picked. Therefore, the average cost for morpho-word replacement is (l1 · L/2) + (m · M/2) + (m · M1/2). It may be noted that M1 and M can be equal if the word is replaced with some morpho-word from the same set.
7. Suffix Deletion (SD): Here the work involved is first to identify the right suffix, then to do the stripping. So the cost is (l1 · L/2) + (k · K/2), where k is the constant of proportionality, and K is the total number of suffixes (as explained in item 8 of Section 5.3).
8. Suffix Addition (SA): Suffix addition is done in two steps. First the position of the word where the suffix has to be added is determined. The average cost for this operation is (l1 · L/2) (as explained above). Next the suffix database is searched for obtaining the appropriate suffix. The average cost therefore is (l1 · L/2) + (k · K/2).

9. Suffix Replacement (SR): In a similar manner, here the average cost may be computed as (l1 · L/2) + (k · K/2) + (k · K1/2). This operation is costlier than SA because here, on top of adding the suffix, some extra computational effort (k · K1/2) is spent in identifying the suffix to be replaced, and then in stripping it from the word. It may be noted that K1 and K can be equal if the suffix is replaced with some suffix from the same set.
10. For the Copy operation no computational cost is taken into account.

These individual costs may be used for determining the overall cost of adaptation. Section 5.4 discusses how the cost may be calculated for adaptation between different functional slots and kinds of sentences.
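The ten operation costs above can be written down as simple functions. The following sketch is our own illustration: the symbols l1, l2, d, c, m and k are the constants of proportionality from the discussion above, delta stands for the grammar-rule search cost (item 5 of Section 5.3) and eps for the small elementary edit cost; the numeric values chosen here are placeholders, not calibrated values.

```python
import math

# Illustrative constants of proportionality (not calibrated values).
l1 = l2 = d = c = m = k = 1.0
delta = 50.0   # grammar-rule search cost (item 5 of Section 5.3)
eps = 1.0      # elementary edit cost (pointer adjustment etc.)

def dict_cost(D):
    """Binary-tree key search plus external-storage record access."""
    return d * math.log2(D) + c * 1e5

def wd(L):            return l1 * L / 2 + eps                                   # word deletion
def wa(L, Lp, D):     return l1 * L / 2 + l2 * Lp / 2 + dict_cost(D) + delta + eps  # word addition
def wr(L, Lp, D):     return l1 * L / 2 + l2 * Lp / 2 + dict_cost(D)            # word replacement
def md(L, M):         return l1 * L / 2 + m * M / 2 + eps                       # morpho-word deletion
def ma(L, M):         return l1 * L / 2 + m * M / 2 + delta + eps               # morpho-word addition
def mr(L, M, M1):     return l1 * L / 2 + m * M / 2 + m * M1 / 2                # morpho-word replacement
def sd(L, K):         return l1 * L / 2 + k * K / 2                             # suffix deletion
def sa(L, K):         return l1 * L / 2 + k * K / 2                             # suffix addition
def sr(L, K, K1):     return l1 * L / 2 + k * K / 2 + k * K1 / 2                # suffix replacement
def cp():             return 0.0                                                # copy
```

For instance, wr(L, Lp, D) differs from wa(L, Lp, D) by exactly delta + eps, mirroring item 3, and md(L, M) exceeds wd(L) by exactly the morpho-word search term.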

5.4 Cost Due to Different Functional Slots and Kind of Sentences

In this section we discuss the cost of adaptation corresponding to these features by referring to the adaptation rules as presented in the various rule tables given in Sections 2.3, 2.4, 2.5, 2.6 and 2.7.


5.4.1 Costs Due to Variation in Kind of Sentences

The adaptation rule Table 2.15 suggests that adapting a particular kind of sentence into another kind requires one or more of the following operations: either addition or deletion of the adverb "nahiin"; or addition or deletion of the morpho-word "kyaa". Hence the cost table can be generated by computing the costs with respect to the above four adaptation operations only.
By referring to the notation of the adaptation cost operations given in Section 5.3.1, the costs of these operations are:
- The cost (k1) of WA for the adverb "nahiin" is (l1 · L/2) + δ + ε. Here dictionary search is not required as the translation "nahiin" may be stored in some readily accessible location.
- The cost (k2) of WD for the adverb "nahiin" is (l1 · L/2) + ε.
- The cost of MA (for the morpho-word "kyaa") is ε. Since "kyaa" always comes at the beginning of the sentence, no search is required to find the correct position of the word in the retrieved Hindi sentence. We call this cost k3.
- Similarly, the cost of MD of the morpho-word "kyaa" may be computed as k4 = ε.

Table 5.1 gives the adaptation cost due to kind of sentences for different combinations of input and retrieved sentence. Cost of adaptation due to variations in
kind of sentences can now be calculated by referring to the required set of adaptation
operations for different cases as given in Table 2.15.


Retd \ Input   AFF        NEG        INT        NINT
AFF            0          k1         k3         k1 + k3
NEG            k2         0          k3 + k2    k3
INT            k4         k1 + k4    0          k1
NINT           k2 + k4    k4         k2         0

Table 5.1: Cost Due to Variation in Kind of Sentences
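Table 5.1 can be generated mechanically from k1–k4 by composing the "nahiin"/"kyaa" additions and deletions each transformation needs (per Table 2.15). The following sketch uses symbolic cost names and is our own illustration, not the thesis implementation:

```python
# Which kinds of sentence carry "nahiin" (negation) and "kyaa" (question).
NEEDS_NAHIIN = {"NEG", "NINT"}   # negative and negative-interrogative
NEEDS_KYAA = {"INT", "NINT"}     # interrogative and negative-interrogative

def adaptation_ops(retrieved, target):
    """Operations needed to adapt a retrieved kind into the input kind:
    k1 = add "nahiin", k2 = delete "nahiin", k3 = add "kyaa", k4 = delete "kyaa"."""
    ops = []
    if target in NEEDS_NAHIIN and retrieved not in NEEDS_NAHIIN:
        ops.append("k1")
    if retrieved in NEEDS_NAHIIN and target not in NEEDS_NAHIIN:
        ops.append("k2")
    if target in NEEDS_KYAA and retrieved not in NEEDS_KYAA:
        ops.append("k3")
    if retrieved in NEEDS_KYAA and target not in NEEDS_KYAA:
        ops.append("k4")
    return ops

kinds = ["AFF", "NEG", "INT", "NINT"]
for r in kinds:
    row = {t: " + ".join(adaptation_ops(r, t)) or "0" for t in kinds}
    print(r, row)
```

The printed rows reproduce Table 5.1, e.g. adapting a retrieved AFF sentence to a NINT input needs k1 + k3.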

5.4.2 Cost Due to Active Verb Morphological Variation

Below we discuss the cost of adaptation for certain types of verb morphological variations. In particular, we discuss two groups:
(1) the input and the retrieved sentence have the same tense and the same verb form;
(2) the input and the retrieved sentence have the same tense but different verb forms.

Cost due to same tense same verb form


In Section 2.3.1 different cases of this group have been discussed. Further, in Table
2.3 different adaptation rules for present indefinite to present indefinite have been
illustrated in detail. It has also been argued that all other cases belonging to this
group can be dealt with in a similar way. Below we discuss the adaptation cost for
present indefinite to present indefinite by referring to the corresponding adaptation
rule Table 2.3.
The above mentioned table suggests that the relevant adaptation operations are
copy (CP), suffix replacement (SR) and morpho-word replacement (MR). The costs
of these basic operations may be computed in the following way.

The cost of CP is considered to be 0.
The cost of SR is (l1 · L/2) + (k · 3/2) + (k · 3/2). Note that, as discussed in item 9 of Section 5.3.1, the term (k · 3/2) occurs twice in determining the average cost for SR. This is because the algorithm has to decide which suffix in the verb of the retrieved sentence needs replacement, followed by identification of the appropriate suffix which will replace the present suffix. With respect to the present indefinite case the relevant suffix set is {taa, te, tii}. Hence the above expression is obtained. We shall denote the overall cost of SR by s.
In a similar way the average cost of morpho-word replacement may be computed to be (l1 · L/2) + (m · 4/2) + (m · 4/2). Note that here the relevant set of morpho-words is {ho, hain, hai, hoon}. Hence the cost factor (m · 4/2) has been considered twice in the overall expression. The overall cost of MR is denoted by n.
[In the original table each populated entry is the combined cost s + n, indexed by the gender-number-person category of the subject (M1S, F1S, M1P, F1P, M2S, F2S, M3S, M3P, F3S, F3P) of the input and of the retrieved sentence.]

Table 5.2: Cost Due to Verb Morphological Variation Present Indefinite to Present Indefinite



The cost table corresponding to present indefinite to present indefinite is given in
Table 5.2. It has been formulated in accordance with the adaptation rule Table 2.3.
Here the cost of adapting present indefinite to present indefinite is picked according
to the gender, number and person of the subject of the input and the retrieved
sentence.
The cost tables for the other verb morphological variations under same tense same verb form can be formulated in a similar way. Some relevant points in this regard are discussed below.

The same Table 5.2 works for adaptation from past indefinite to past indefinite with a slight modification. In this case morpho-word replacement is done from the morpho-word set {thaa, the, thii} instead of the morpho-word set {hain, ho, hoon, hai}. Hence if the value of n is replaced by (l1 · L/2) + (m · 3/2) + (m · 3/2) in the cost Table 5.2, one gets the cost table for past indefinite to past indefinite.
In case of adaptation from future indefinite to future indefinite, the cost depends upon two operations, CP and SR. Hence the cost n due to morpho-word replacement (MR) is to be removed from the entries of Table 5.2. The cost s of SR in this case is (l1 · L/2) + (k · 8/2) + (k · 8/2), which is obtained by considering the relevant set of suffixes, viz. {oongaa, oongii, oge, ogii, egaa, egii, enge, engii}.
For all other combinations of verb morphological variations of the same group one more morpho-word replacement is to be added to the cost Table 5.2 in place of the suffix replacement cost s (as discussed in items 3 to 6 of Section 2.3.1). Here the costs of these two morpho-word replacements will vary according to the tense and verb form. For example, in case of present continuous to present continuous the relevant morpho-word sets are {hain, ho, hoon, hai} and {rahaa, rahii, rahe}. The average costs of these morpho-word replacements are (l1 · L/2) + (m · 4/2) + (m · 4/2) and (l1 · L/2) + (m · 3/2) + (m · 3/2), respectively. The cost for morpho-word replacements for the remaining 5 cases (e.g. future continuous to future continuous, past perfect to past perfect etc.) can be computed in a similar way by referring to the appropriate morpho-word sets.

Cost due to same tense different verb forms


There are in total 18 verb morphological variations (see Section 2.3.3). To keep our discussion simple we explain the adaptation cost calculations with the case explained in Section 2.3.3 under the heading same tense different verb forms. In particular, we discuss the case where the input sentence is in future indefinite, and the retrieved sentence is either in future continuous or future perfect.
Here the cost of verb morphological variations depends on three adaptation operations. One is suffix addition, and the other two are morpho-word deletions. The costs of these operations are as follows:
In the case of future indefinite the appropriate suffix set is {oongaa, oongii, oge, ogii, egaa, egii, enge, engii}. Hence the average cost of suffix addition is (l1 · L/2) + (k · 8/2). We denote it as s.
In case of future continuous, the two morpho-word deletions will be restricted to the sets {rahaa, rahii, rahe} and {hoongaa, hoongii, honge, hogaa, hogii, hoge}, respectively. However, if the retrieved sentence is in future perfect, the two morpho-word deletions are restricted to {chukaa, chukii, chuke} and {hoongaa, hoongii, honge, hogaa, hogii, hoge}, respectively.
The cost of morpho-word deletion corresponding to {rahaa, rahii, rahe} is (l1 · L/2) + (m · 3/2) + ε. We denote it as m1. The cost of morpho-word deletion is the same for the morpho-word set {chukaa, chukii, chuke}.
The cost of morpho-word deletion of {hoongaa, hoongii, honge, hogaa, hogii, hoge} is (l1 · L/2) + (m · 6/2) + ε, which we denote by m2.
Therefore, the total cost involved in adaptation from future continuous or future perfect to future indefinite is (s + m1 + m2). The cost will be the same irrespective of the variation in number, gender and person of the subject of the input as well as of the retrieved sentence.
For the reverse case (i.e. the input sentence is either future continuous or future perfect, and the retrieved sentence is future indefinite) the cost will be the sum of two morpho-word additions and one suffix deletion. For these adaptation operations the suffix set and the morpho-word sets are the same as in the above case. The individual costs of these adaptation operations may be calculated in the way explained in Section 5.3.1. Here the total cost will be the sum of the costs of these three adaptation operations, that is: ((l1 · L/2) + (m · 6/2) + δ + ε) + ((l1 · L/2) + (m · 3/2) + δ + ε) + ((l1 · L/2) + (k · 8/2)).
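These totals can be checked numerically. The constants below are illustrative placeholders (as in Section 5.3.1), with delta standing for the grammar-rule search cost and eps for the small elementary edit cost; nothing here is a calibrated value from the thesis.

```python
# Illustrative constants (not calibrated).
l1, m, k = 1.0, 1.0, 1.0
delta, eps = 50.0, 1.0
L = 6  # length of the retrieved Hindi sentence

def md_cost(set_size):              # morpho-word deletion
    return l1 * L / 2 + m * set_size / 2 + eps

def ma_cost(set_size):              # morpho-word addition
    return l1 * L / 2 + m * set_size / 2 + delta + eps

def sa_cost(n_suffixes):            # suffix addition
    return l1 * L / 2 + k * n_suffixes / 2

def sd_cost(n_suffixes):            # suffix deletion
    return l1 * L / 2 + k * n_suffixes / 2

# Future continuous/perfect -> future indefinite: s + m1 + m2
s = sa_cost(8)        # {oongaa, oongii, oge, ogii, egaa, egii, enge, engii}
m1 = md_cost(3)       # {rahaa, rahii, rahe} (or {chukaa, chukii, chuke})
m2 = md_cost(6)       # {hoongaa, hoongii, honge, hogaa, hogii, hoge}
forward = s + m1 + m2

# Reverse direction: two morpho-word additions and one suffix deletion.
reverse = ma_cost(6) + ma_cost(3) + sd_cost(8)
print(forward, reverse)
```

Under these placeholder constants the reverse direction is costlier, since each morpho-word addition incurs the grammar-rule cost delta that deletion avoids.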
In a similar way, the cost can be evaluated for the rest of the cases of verb morphological variations of same tense different verb forms. One may refer to Sections 2.3.3 and 5.3.1 to get the relevant adaptation operations and their costs, respectively.
The adaptation cost with respect to the other two groups (i.e. different tenses same verb form, and different tenses different verb forms) can be evaluated in a similar way with the help of the rule tables and sets of adaptation operations as discussed in Section 2.3.2 and Section 2.3.4. To avoid the stereotyped nature of discussion, we do not present all the other different cases in this report. However, we present below the adaptation cost table (Table 5.3) for verb morphological variation from present indefinite to past indefinite, which belongs to the group different tenses same verb form. These values are obtained by referring to the adaptation rule Table 2.4.
[In the original table each populated entry is the combined cost s + w, indexed by the gender-number-person category of the subject (M1S, F1S, M1P, F1P, M2S, F2S, M3S, M3P, F3S, F3P) of the input and of the retrieved sentence.]

Table 5.3: Adaptation Operations of Verb Morphological Variation Present Indefinite to Past Indefinite

Here the cost s denotes the cost of suffix replacement within {taa, te, tii}, which is (l1 · L/2) + (k · 3/2) + (k · 3/2), and w denotes the cost of morpho-word replacement from the morpho-word set {hoon, hai, ho, hain} to the morpho-word set {thaa, thii, the}, which is (l1 · L/2) + (m · 4/2) + (m · 3/2).

5.4.3 Cost Due to Subject/Object Functional Slot

In this subsection we discuss the adaptation cost mainly for three functional tags under the subject/object functional slot. These tags are genitive case (@GEN), pre-modifying adjective (@AN), and subject/object (@SUB/@OBJ). The relevant adaptation rules have been discussed in Section 2.5.

Cost due to adapting genitive case to genitive case


A transformation from genitive case to genitive case requires eleven adaptation operations as given in Table 2.8. Below we describe the cost for each of them. Note
that the genitive word can be a proper noun, or a noun, or a pronoun. We denote
this set by P.

1. The average cost of constituent word replacement from the set P with a proper noun. We denote this by w1. Note that in this case no dictionary search is required as proper nouns are not stored in any dictionary. Hence w1 is computed as (l1 · L/2) + (l2 · Lp/2).
2. The average cost of morpho-word replacement (MR) from {kaa, ke, kii} with itself. We denote this cost by w2. Since the number of morpho-words is 3, w2 may be formulated as (l1 · L/2) + (m · 3/2) + (m · 3/2).
3. The average cost of WR from the set P with a noun. This cost is denoted by w3. Note that in this case a noun dictionary search is necessary, for which the search time is 13.77 (see item 4 of Section 5.3). Further, to access the dictionary a cost (c · 10^5) is required. Hence the total cost is (l1 · L/2) + (l2 · Lp/2) + {(d · 13.77) + (c · 10^5)}.
4. The average cost of WR from the set P with a pronoun. This is denoted by w4. Imitating the case just mentioned above, the cost here may be formulated as (l1 · L/2) + (l2 · Lp/2) + {(d · 6.17) + (c · 10^5)}.
5. The average cost of morpho-word deletion from the set {kaa, ke, kii}. This cost is denoted by w5, which may be formulated simply as (l1 · L/2) + (m · 3/2) + ε.
6. The average cost of morpho-word addition from the set {kaa, ke, kii}. We denote this cost by w6, which is formulated as (l1 · L/2) + (m · 3/2) + δ + ε.
7. The average cost of suffix replacement for converting a noun into either an oblique noun form or a plural form (refer to Section 2.5.2 and Appendix A). We denote this cost by s1. Since the number of relevant suffixes is four, s1 may be computed as (l1 · L/2) + (k · 4/2) + (k · 4/2).
8. The average cost of suffix addition for converting a noun into either an oblique noun form or a plural form. This cost can be formulated in a way similar to item 7 above. Here the cost is (l1 · L/2) + (k · 5/2) + (k · 5/2), which we denote as s2.
9. The average cost of suffix addition from the set {kaa, ke, kii} is (l1 · L/2) + (k · 3/2). We denote it as s3.
10. The average cost of suffix deletion for converting an oblique noun form to a noun, or a plural to a singular (see Appendix A and Section 2.5.2). This cost is (l1 · L/2) + (k · 5/2) + ε. We denote it as s4.
11. The average cost of suffix replacement from the set {kaa, ke, kii}. We denote this cost by s5, which is formulated as (l1 · L/2) + (k · 3/2) + (k · 3/2).
The cost table corresponding to genitive case to genitive case is given in Table
5.4. It has been formulated in accordance with the adaptation rule Table 2.8.


Retd \ Input   <proper>                N                                           PRON
<proper>       (0 or ({w1} + {w2}))    w3 + {w2} + {s1 or s2}                      w4 + w5 + s3
N              w1 + {w1}               (0 or ({w3} + {w2} + {(s1 or s2 or s4)}))   w4 + w5 + s3
PRON           w1 + w6                 w3 + {s1 or s2} + w6                        (0 or s5 or (w4 + s3))

Table 5.4: Costs Due to Adapting Genitive Case to Genitive Case

Cost due to adapting subject/object to subject/object


We have considered four possible cases: a noun, a proper noun, a pronoun and a gerund form (PCP1) at the subject or object position (refer to Section 2.5.5). We denote this set by Q. All possible adaptation operations required in this case are listed in Table 2.11, which has been referred to for evaluating the cost of the possible variations in adapting subject/object to subject/object.

1. The average cost of constituent word replacement of the set Q by a noun. This cost is denoted by w1. In this case a noun dictionary search is required, and its search time is 13.77. Hence, w1 is computed as (l1 · L/2) + (l2 · Lp/2) + {(d · 13.77) + (c · 10^5)}.
2. The average cost of constituent word replacement from the set Q to a proper noun. This cost is denoted by w2. Note that in this case no dictionary search is required as proper nouns are not stored in any dictionary. Hence the cost w2 is computed as (l1 · L/2) + (l2 · Lp/2).
3. The average cost of constituent word replacement from the set Q to a pronoun. This cost is denoted as w3, which is formulated as (l1 · L/2) + (l2 · Lp/2) + {(d · 6.17) + (c · 10^5)} (in the same way as item 1 above).
4. The average cost of constituent word replacement from the set Q to a gerund (PCP1) is (l1 · L/2) + (l2 · Lp/2) + {(d · 12.08) + (c · 10^5)} (same as in item 1). Note that here a verb dictionary search is required, and its search time is 12.08. We denote this cost as w4.
5. For converting the singular form of a noun to the plural form, or vice versa (see Appendix A), any one of three different suffix operations is required: suffix replacement (SR), suffix addition (SA) and suffix deletion (SD). The average costs of these operations are:
   The cost of SR is (l1 · L/2) + (k · 4/2) + (k · 4/2). We denote it as s1.
   The cost of SA is (l1 · L/2) + (k · 3/2). We denote it as s2.
   The cost of SD is (l1 · L/2) + (k · 3/2). We denote it as s3.
6. The average cost of suffix addition of "na" in the verb of PCP1 form is (l1 · L/2). Note that here only one suffix is required in any of the cases; therefore, no search is required for deciding about the suffix. This cost is denoted by s4.

Table 5.5 gives the cost due to subject/object to subject/object changes pairwise.


Retd \ Input   N                                  <proper>   PRON        PCP1
N              (0 or ({w1} + {s1 or s2 or s3}))   w2         w3          w4 + s4
<proper>       w1 + {s1 or s2 or s3}              w2         w3          w4 + s4
PRON           w1 + {s1 or s2 or s3}              w2         (0 or w3)   w4 + s4
PCP1           w1 + {s1 or s2 or s3}              w2         w3          (0 or w4)

Table 5.5: Cost of Adaptation Due to Subject/Object to Subject/Object
Similarly, we have formulated the cost of adaptation for the pre-modifying adjective. In order to avoid the repetitive nature of the description here, we put the content in Appendix E.
Similarly, the cost of adaptation for the other sentence pattern components which have been discussed in Chapter 2 can be formulated. However, to avoid the stereotyped nature of discussion, we do not present all the other different cases in this report. The primary advantage of the above analysis is that it paves the way for using adaptation cost as a good yardstick for similarity measurement that may lead to efficient retrieval from the EBMT perspective. The following subsection describes how adaptation cost can be used for this purpose.

5.4.4 Use of Adaptation Cost as a Measure of Similarity

The input sentence may be compared with the example base sentences in terms of functional-morpho tags, their discrepancies may be measured, and the adaptation cost may be estimated using the formulae given above. The example base sentence having the minimum cost of adaptation may then be considered the most similar to the input sentence, and may be retrieved for generating the translation of the given input sentence. Below we compare the proposed scheme with some other similarity measurement schemes. In particular, we consider semantic similarity and syntactic similarity in a way similar to what Manning and Schutze (1999) prescribed for information retrieval.

5.5 The Proposed Approach vis-à-vis Some Similarity Measurement Schemes

5.5.1 Semantic Similarity

Semantic similarity depends on the similarity of words occurring in the two sentences
under consideration. Here, we used a purely word-based metric, and developed a vector space model as suggested in (Manning and Schutze, 1999). However, the weighting scheme has been modified (Gupta and Chatterjee, 2002) so that the scheme can be applied to sentences in a meaningful way. Here, each of the example
base sentences and the input are represented in a high-dimensional space, in which
each dimension of the space corresponds to a distinct word in the example base.
Similarity is calculated as the normalized dot product of the vectors. The method
is explained below.
Let Ej : j = 1, 2, ..., N be the English sentences in the example base, and let E0 be the input sentence in English. We denote E0 and each Ej as an n-dimensional vector in a real-valued space, where n is the number of distinct words W1, W2, ..., Wn in the example base. Thus Ej = (ej1, ej2, ..., ejn), for j = 0, 1, 2, ..., N. The similarity measure between E0 and an example base sentence Ej is defined here as:

    m(E0, Ej) = Σ_{i=1}^{n} e0i · eji                    (5.1)

This scheme computes how well the occurrences of word Wi (measured by e0i and eji) correlate in the input and the example base sentences. The coordinates eji are called word weights in the vector space model. The basic information used for word weighting is the word frequency (wji) and the sentence frequency (si).
For a word Wi the word frequency wji and the sentence frequency si are combined into a single word weight as

    eji = wji · ((N/si) − 1),   if wji ≥ 1;
    eji = 0,                    if wji = 0.               (5.2)

where i = 1, 2, ..., n and j = 0, 1, 2, ..., N. Here N is the total number of sentences in the example base. The term ((N/si) − 1) gives maximum weight to words that occur in only one sentence, whereas a word occurring in all the sentences would get zero weight (= (N/N) − 1 = 1 − 1 = 0). Here the word frequency wji and the sentence frequency si imply:
Word frequency wji is the number of times the word Wi occurs in the j-th sentence Ej. This indicates how salient a word is within a given sentence. The higher the word frequency, the more likely it is that the word is a good description of the content of the sentence.
Sentence frequency si is the number of sentences of the database in which the word Wi occurs. The sentence frequency is interpreted as an indicator of how informative the word is in the example base. In this respect we distinguish between two types of words: semantically focused words and semantically unfocused words. A semantically focused word is a word which gives meaning to a sentence. These semantically focused words enrich the vocabulary of the language, for example: verbs, nouns, adjectives and adverbs. Semantically unfocused words introduce structurally standard behaviour. These words are limited in number, predetermined by the grammar of the language, for example: prepositions, conjunctions, pronouns, auxiliary verbs etc.
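The weighting and matching just described can be sketched as follows. This is our own illustrative Python, not the thesis implementation: tokenization by lowercasing and stripping the final full stop is an assumption, and we normalize the dot product (as the text describes) so that scores lie in [0, 1].

```python
import math
from collections import Counter

def word_weights(sentences):
    """Vector-space weights e_ji = w_ji * (N/s_i - 1), following Eq. (5.2).
    For simplicity the statistics are computed over all given sentences
    (input plus example base together) -- an illustrative assumption."""
    N = len(sentences)
    tokenized = [s.lower().rstrip(".").split() for s in sentences]
    vocab = sorted({w for t in tokenized for w in t})
    # Sentence frequency s_i: number of sentences containing word w.
    s_freq = {w: sum(w in t for t in tokenized) for w in vocab}
    vectors = []
    for t in tokenized:
        wf = Counter(t)  # word frequency w_ji within this sentence
        vectors.append([wf[w] * (N / s_freq[w] - 1) for w in vocab])
    return vocab, vectors

def similarity(v0, vj):
    """Normalized dot product (cosine) of two weight vectors."""
    dot = sum(a * b for a, b in zip(v0, vj))
    n0 = math.sqrt(sum(a * a for a in v0))
    nj = math.sqrt(sum(b * b for b in vj))
    return dot / (n0 * nj) if n0 and nj else 0.0

base = ["Sita sings ghazals.", "Sita reads books.", "Ghazals were nice."]
vocab, vecs = word_weights(["Sita sings ghazals."] + base)
scores = [similarity(vecs[0], v) for v in vecs[1:]]
print(scores)
```

As expected, the example base sentence identical to the input scores 1.0, while sentences sharing only one word score much lower.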

Given an input sentence, this similarity may be used for retrieving an appropriate past example from the example base. In order to achieve that, the similarity of the input sentence is measured with each of the example base sentences. The one with the highest similarity score may be considered for retrieval.
We have experimented with two input sentences: "I work." and "Sita sings ghazals.". Tables 5.6 and 5.7 provide the best five matches for them, respectively.

Retrieved Sentences      Semantic Score
I do this work.          0.9852
I have this work.        0.9746
I will do this work.     0.9543
They work there.         0.7954
The hungry man work.     0.6834

Table 5.6: Best Five Matches by Using Semantic Similarity for the Input Sentence "I work."


Example Sentence                Semantic Score
Sita sings ghazals.             1.00
Ghazals were nice.              0.775
Sita reads books.               0.733
Sita is eating rice.            0.731
He has been singing ghazals.    0.701

Table 5.7: Best Five Matches by Using Semantic Similarity for the Input Sentence "Sita sings ghazals."

One may note the drawback of this scheme: the outcome depends significantly on the
content words, on the size of the example base, and on the occurrence of the words
across the sentences.

5.5.2  Syntactic Similarity

Syntactic similarity pertains to the similarity of the structures of the two sentences
under consideration. Let T_j be the tagged version of the English sentence E_j of the
example base, and let T_0 be the tagged version of the input sentence E_0. Here too,
every sentence T_j in the example base is expressed as a vector generated from the
structure of the sentence. A matching technique similar to that used for semantic
similarity has been applied to T_j and T_0 (instead of E_j and E_0, as discussed in
the earlier subsection). As a consequence, similarity measures are computed at the
structural level, and not at the word level.
The key question is whether all the components in determining the structural
similarity are of equal importance. We feel that the contributions of the constituent
words to the formation of the sentence are not all the same; in particular, sentences

having a similar structure (in terms of verb, auxiliary, adverb etc.) should have a
higher similarity value for a given input sentence. Having tried different weighting
schemes, we found that the one given in Table 5.8 provides the best result.

POS/syntactic role      Multiplier
Auxv/verb               20
Preposition             10
Adjective/adverb
Subject/object
Determiner/negative     0.1

Table 5.8: Weighting Scheme for Different POS and Syntactic Role

Table 5.9 and Table 5.10 give the similarity measures obtained over the example
base for the input sentences "I work." and "Sita sings ghazals." when the above
weighted syntactic similarity scheme is used.

Retrieved Sentences     Syntactic Score
I walk.                 1.00
I do this work.         0.971
I hear the parrot.      0.968
They walk.              0.942
They work there.        0.928

Table 5.9: Best Five Matches by Syntactic Similarity for the Input Sentence "I work."


Example Sentences       Syntactic Score
Sita sings ghazals.     1.000
Sita reads books.       1.000
Mohan eats mangoes.     1.000
Babies drink milk.      0.918
He reads history.       0.907

Table 5.10: Best Five Matches by Syntactic Similarity for the Input Sentence "Sita sings ghazals."

Note that here similarity of words is completely ignored, as the main emphasis
is laid on the similarity of tense. By resorting to a different weighting scheme one
can change the similarity measures to some extent.
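The weighted structural matching can be sketched as below: each sentence is reduced to its POS/syntactic-role tag sequence, tags are weighted by Table 5.8-style multipliers, and the resulting bags are compared by cosine similarity. The tag names, and the multipliers for the rows whose values are not specified, are illustrative assumptions, not the thesis's figures.

```python
import math

# Multipliers in the spirit of Table 5.8; the "adj"/"adv"/"subj"/"obj"
# values are placeholders.
WEIGHTS = {"auxv": 20, "verb": 20, "prep": 10,
           "adj": 5, "adv": 5, "subj": 2, "obj": 2,
           "det": 0.1, "neg": 0.1}

def tag_vector(tags):
    """Weighted bag-of-tags vector for a tagged sentence."""
    v = {}
    for t in tags:
        v[t] = v.get(t, 0.0) + WEIGHTS.get(t, 1.0)
    return v

def syntactic_similarity(tags_a, tags_b):
    """Cosine similarity of the weighted tag bags: word identity is
    ignored, only sentence structure contributes to the score."""
    a, b = tag_vector(tags_a), tag_vector(tags_b)
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

With this, "I work." and "I walk." (both subject + verb) score 1.0, while "They work there." (an extra adverb) scores slightly lower, mirroring the ordering in Table 5.9.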

5.5.3  A Proposed Approach: Cost of Adaptation Based Similarity

The above studies reveal that neither the semantic measure nor the syntactic measure
provides an effective scheme for calculating the similarity between two sentences. In
both cases the measurement score depends to a significant extent on the word weights,
which in turn depend on the sentences in the example base. Thus the schemes
become highly subjective. We, therefore, look for a method that provides a more
objective measurement of similarity. We consider the cost of adaptation for this
purpose, which is seen as the number of operations required for transforming a
retrieved translation example into the translation of a given input sentence. We
continue with the adaptation operations discussed in Section 5.3.1. The following
example illustrates how the functional-morpho tags of an input (IE) and a retrieved

example base sentence (RE) can be used for determining the appropriate adaptation
operations.
Input sentence (IE):              Ram is driving the car at a high speed.
Retrieved English sentence (RE):  He is sitting on the chair.
Retrieved Hindi sentence (RH):    wah kursii par baith rahaa hai
                                  (he) (chair) (on) (sit) (..ing) (is)

Table 5.11 gives the functional-morpho tags of the IE and the RE. To generate
the translation "ram bahut tezii se gaadii chalaa rahaa hai" of the input sentence,
the following adaptation operations are required.
IE: Ram is driving the car at a high speed.   RE: He is sitting on the chair.

Ram      @SUBJ <Proper> N SG Ram              he       @SUBJ PRON MASC SG3 he
is       @+FAUXV V PRES be                    is       @+FAUXV V PRES be
driving  @-FMAINV V PCP1 drive                sitting  @-FMAINV V PCP1 sit
the      @DN> ART the                         the      @DN> ART the
car      @OBJ N SG car                        ...
at       @ADVL PREP at                        on       @ADVL PREP on
a        @DN> ART a                           ...
high     @AN> A ABS high                      ...
speed    @<P N SG speed                       chair    @<P N SG chair

Table 5.11: Functional-morpho Tags for the Input English Sentence (IE) and the Retrieved English Sentence (RE)

(a) Whenever a functional tag along with the morpho tags matches in both sentences
but the corresponding words are different, a constituent word replacement
needs to be done in the retrieved Hindi translation. For example, "driving"
and "sitting" are both verbs in their present continuous form (@-FMAINV V
PCP1). But since the root verbs "drive" and "sit" are different, a constituent
word replacement is required in the retrieved Hindi translation RH. Therefore,
the scheme replaces "baith" with "chalaa"^5. In a similar way, "chair" and
"speed" have the same functional-morpho tag (@<P N SG). Hence, here too, the
scheme replaces "kursii" with "tez".
(b) If the functional tags match, but the corresponding morpho tags do not, then
either a constituent word replacement or some suffix modification (or both)
needs to be done to modify the retrieved Hindi translation. For example,
"Ram" and "he" are both subjects, but "Ram" is a proper noun while "he" is
a pronoun. Hence a constituent word replacement is required, and the scheme
replaces "wah" with "ram".
(c) Whenever a functional tag is present in IE but not in RE, the corresponding
word in Hindi has to be retrieved from an appropriate word dictionary and
added at the appropriate position in the Hindi sentence RH. For example, the
object "car", which comes before the preposition "at" in the IE sentence, does
not complement the preposition, whereas the object "chair", which comes after
the preposition "on" in the RE sentence, does complement the preposition.
Thus, the two objects "car" (@OBJ N SG) and "chair" (@<P N SG) differ in
their functional tags. Therefore a constituent word addition is required in the
Hindi sentence RH: the word "gaadii" has to be added, which serves as the
object before the Hindi verb "chalaa".
(d) Whenever a functional tag is present in RE, but is not present in IE, the
corresponding word in Hindi has to be deleted from the Hindi sentence RH.

^5 The Hindi translation of the word "drive" is to be taken from the verb dictionary.
(e) Other necessary adaptation operations, such as WA (bahut) and SA (tez →
tezii), can be identified in a similar way.
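Rules (a)-(d) above amount to aligning the two sentences on their functional tags and emitting an operation per mismatch. The sketch below illustrates this; the tuple encoding of the ENGCG output and the exact choice between replacement and suffix change are simplifying assumptions, not the thesis's implementation.

```python
def adaptation_ops(ie, re_):
    """Emit adaptation operations from functional-morpho tags, following
    rules (a)-(d).  A sentence is a list of (word, functional_tag,
    morpho_tags, root) tuples, one per word."""
    ops = []
    re_by_ft = {ft: (w, m, r) for (w, ft, m, r) in re_}
    ie_fts = {ft for (_, ft, _, _) in ie}
    for word, ft, morph, root in ie:
        if ft not in re_by_ft:
            ops.append(("WA", word))          # rule (c): add word from dictionary
            continue
        w2, m2, r2 = re_by_ft[ft]
        if m2 == morph and r2 != root:
            ops.append(("WR", r2, root))      # rule (a): replace constituent word
        elif m2 != morph:
            ops.append(("WR", w2, word))      # rule (b): replace and/or modify suffix
    for word, ft, _, _ in re_:
        if ft not in ie_fts:
            ops.append(("WD", word))          # rule (d): delete word from RH
    return ops

ie = [("Ram", "@SUBJ", "N SG", "Ram"), ("is", "@+FAUXV", "V PRES", "be"),
      ("driving", "@-FMAINV", "V PCP1", "drive"), ("car", "@OBJ", "N SG", "car"),
      ("speed", "@<P", "N SG", "speed")]
re_ = [("he", "@SUBJ", "PRON SG3", "he"), ("is", "@+FAUXV", "V PRES", "be"),
       ("sitting", "@-FMAINV", "V PCP1", "sit"), ("chair", "@<P", "N SG", "chair")]
print(adaptation_ops(ie, re_))
```

For this IE/RE pair the sketch emits the replacements of rules (a) and (b) and the addition of "car", matching the walk-through above.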

After the identification of all the adaptation operations required for adapting RE
to IE, we have calculated the cost of each of the adaptation operations. For this
purpose we have referred to the costs of the different operations listed in Section
5.3.1 and Section 5.4.
In order to apply cost of adaptation to design an appropriate retrieval scheme,
one needs a measurement of the different constants of proportionality (i.e. l1,
l2, d etc.) described in Section 5.3.1. Evidently, these constants depend upon
the underlying computing system. Hence in our discussion we want to keep them
independent of any particular platform. We further make a few assumptions in
order to keep the calculations relatively simple:

We assume that the linear search operations in the RAM are equally costly
irrespective of the size of each data record. Hence we assume that the constants
l1, l2, d, m, and k are all equal. Let them have a common value λ.

It has already been discussed in Section 5.3 that hard disc operations are
costlier than RAM operations by an order of 10^5. Hence we denote the constant
associated with retrieval from the external storage as c · 10^5, where c is a constant.

The μ and ε costs are treated as independent quantities. Here ε is a very small
quantity, with ε ≪ λ. On the other hand, μ is considered a large quantity, as
discussed in item 5 of Section 5.3.



Table 5.12 and Table 5.13 give the best five matches when the retrieval is made
by the cost of adaptation based scheme, using the same input sentences and the
same example base. The cost values here are measured according to the scheme
given in Section 5.4.
Retrieved Sentences                        Adaptation Cost
I have been working for four hours.        23λ + 4μ
I have not been working for four hours.    30λ + 5μ
This works.                                13.17λ + c · 10^5
The man works.                             13.67λ + c · 10^5
I walk.                                    16.27λ + c · 10^5

Table 5.12: Retrieval on the Basis of Cost of Adaptation Based Scheme for the Input Sentence "I work."

Retrieved Sentences                Adaptation Cost
Sita sings ghazals.
Sita sang ghazal.
He has been singing ghazals.       13λ + μ
Sita is singing melodious song.    22λ + c · 10^5 + 2μ
Sita reads books.                  32.85λ + 2(c · 10^5)

Table 5.13: Retrieval on the Basis of Cost of Adaptation Based Similarity for the Input Sentence "Sita sings ghazals."

To generate the translation "main kaam kartaa hoon" of the input sentence
"I work.", the adaptation operations required for adapting each of the sentences
given in Table 5.12 above are as follows:

For adapting "I have been working for four hours." → "main chaar ghante se
kaam kar rahaa hoon" to the input sentence, five operations are required. These
operations are: SA (kar → kartaa), WD (chaar), WD (ghante), WD (se) and
MD (rahaa). The total adaptation cost is therefore (5.5λ) + (4λ + μ) +
(4λ + μ) + (4λ + μ) + (5.5λ + μ) = 23λ + 4μ. One may refer to Section 5.3.1
for this computation.

In the case of adapting the second retrieved sentence "I have not been working for
four hours." → "main chaar ghante se kaam nahi kar rahaa hoon" to the input
sentence, at most six operations need to be done. These operations are:
SA (kar → kartaa), WD (chaar), WD (ghante), WD (se), WD (nahiin) and
MD (rahaa). Hence the total adaptation cost is (6λ) + (4.5λ + μ) + (4.5λ + μ) +
(4.5λ + μ) + (4.5λ + μ) + (6λ + μ) = 30λ + 5μ.

For adapting "This works." → "yah kaam kartaa hai" to the input sentence,
at most two operations, WR (yah → main) and MR (hai → hoon), are
required. The total adaptation cost is therefore (9.17λ + c · 10^5) + (2λ + 2λ) =
13.17λ + c · 10^5.

Similarly, one can identify the appropriate adaptation operations for adapting the
last two sentences to the input sentence.
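The tallies above can be kept mechanically. The sketch below records a cost symbolically as a coefficient triple over the RAM constant, the word-locating cost and the disc term of Section 5.3.1; the per-operation coefficients are the ones read off the first bullet above for that particular sentence pair, not a general table from the thesis, since in the thesis they vary with sentence length.

```python
# Symbolic adaptation-cost bookkeeping.  A cost is a triple (a, b, k)
# standing for a*lambda + b*mu + k*(c*10^5).
# Illustrative per-operation coefficients for the
# "I have been working for four hours." example:
OP_COST = {
    "SA": (5.5, 0, 0),   # suffix addition/alteration
    "WD": (4.0, 1, 0),   # constituent word deletion (incl. locating cost mu)
    "MD": (5.5, 1, 0),   # morpho-word deletion
    "WR": (9.17, 0, 1),  # constituent word replacement (one dictionary lookup)
    "MR": (4.0, 0, 0),   # morpho-word replacement
}

def total_cost(ops):
    """Sum the symbolic costs of a sequence of adaptation operations."""
    a = b = k = 0.0
    for op in ops:
        da, db, dk = OP_COST[op]
        a, b, k = a + da, b + db, k + dk
    return a, b, k

# Adapting "I have been working for four hours." to "I work.":
# SA(kar -> kartaa), WD(chaar), WD(ghante), WD(se), MD(rahaa)
print(total_cost(["SA", "WD", "WD", "WD", "MD"]))   # (23.0, 4.0, 0.0)
```

The printed triple reproduces the 23λ + 4μ total derived in the first bullet.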

We now consider the costs of adaptation for the best five sentences retrieved using
the semantic and syntactic similarity schemes, as given in Sections 5.5.1 and 5.5.2.


Retrieved Sentences Based on Semantic Similarity     Adaptation Cost
I do this work.                                      23.77λ + c · 10^5 + 2μ
I have this work.                                    25.4λ + c · 10^5 + μ + 3ε
I will do this work.                                 22.9λ + c · 10^5 + μ + 2ε
They work there.                                     21.17λ + c · 10^5 + μ
The hungry man work.                                 21.67λ + c · 10^5 + μ

Retrieved Sentences Based on Syntactic Similarity    Adaptation Cost
I walk.                                              16.27λ + c · 10^5
I do this work.                                      23.27λ + c · 10^5 + 2μ
I hear the parrot.                                   19.77λ + 2(c · 10^5)
They walk.                                           28.94λ + 2(c · 10^5)
They work there.                                     21.17λ + (c · 10^5) + μ

Table 5.14: Cost of Adaptation for the Retrieved Best Five Matches for the Input Sentence "I work." by Using Semantic and Syntactic Based Similarity Schemes

First we consider the input sentence "I work." Table 5.14 provides the costs of
adaptation of the best five matches under the semantic similarity and syntactic
similarity based measurement schemes. An examination of the adaptation costs
suggests that all five sentences retrieved by the semantic similarity based scheme
are costlier to adapt than all the sentences retrieved by the cost of adaptation
scheme (see Table 5.12). On the other hand, the sentence "I walk.", which is
retrieved as the best matching sentence under the syntactic similarity based scheme,
actually requires more computational effort than the best four sentences given by
the cost of adaptation based scheme (see Table 5.12).


Retrieved Sentences Based on Semantic Similarity     Adaptation Cost
Sita sings ghazals.
Ghazals were nice.                                   29.08λ + c · 10^5 + μ + ε
Sita reads books.                                    32.85λ + 2(c · 10^5)
Sita is eating rice.                                 46.85λ + 2(c · 10^5) + μ
He has been singing ghazals.                         13λ + μ

Retrieved Sentences Based on Syntactic Similarity    Adaptation Cost
Sita sings ghazals.
Sita reads books.                                    32.85λ + 2(c · 10^5)
Mohan eats mangoes.                                  39.85λ + 2(c · 10^5)
Babies drink milk.                                   39.85λ + 2(c · 10^5)
He reads history.                                    39.85λ + 2(c · 10^5)

Table 5.15: Cost of Adaptation for the Retrieved Best Five Matches for the Input Sentence "Sita sings ghazals." by Using Semantic and Syntactic Based Similarity Schemes

In a similar way, Table 5.15 provides the costs of adaptation for the best five
matches retrieved by the semantic and syntactic based schemes for the input
sentence "Sita sings ghazals." One may note the following by comparing Table
5.13 and Table 5.15.

"Sita sings ghazals." is retrieved as the best match by all three schemes
because it is already present in the example base.

The second best match under the semantic similarity based scheme, "Ghazals
were nice.", is actually very expensive to adapt, as its cost contains the term μ.
This term occurs since the sentence concerned is of a structure that is different
from that of the input sentence.

The sentences retrieved by the syntactic similarity based scheme are costlier
to adapt than the sentences retrieved by the cost of adaptation based scheme.

The above results clearly demonstrate the superiority of the proposed scheme
over the semantic and syntactic similarity based schemes.

5.5.4  Drawbacks of the Proposed Scheme

One major drawback of the proposed scheme is that, for each input sentence, the
scheme essentially boils down to evaluating the cost of adaptation for every sentence
in the example base. This makes retrieval from a large example base computationally
very expensive. On the other hand, the use of cost of adaptation as a potential
yardstick for measuring similarity is too strong an argument to be ignored with
respect to Example-Based Machine Translation. This, therefore, necessitates the
development of some filtration technique so that, given an input sentence, the
example base sentences that are difficult to adapt are discarded. The adaptation
scheme can then be applied only to the remaining sentences of the example base.
We have designed a systematic two-level filtration scheme for this purpose.
It is clear from the costs of the adaptation operations mentioned in Section 5.3.1
that constituent word addition and constituent word replacement are the costliest
adaptation operations in terms of computational cost, with the former being costlier
than the latter. Hence the filters are designed to retrieve those example base
sentences for which the adaptation to the given input sentence will require fewer
constituent word addition and constituent word replacement operations.
The two-level scheme works as follows.

In the first level, the algorithm retrieves sentences that are structurally similar
to the input sentence, thereby reducing the number of constituent word additions
in the adaptation of the retrieved example. Here functional tags (FTs) are used
to determine the structural similarity. We call this step measurement of structural
similarity.

In the second level, only the sentences passed by the first filter are considered for
further processing. Here the dissimilarity of each of these sentences with the input
sentence is measured. The lower the dissimilarity score of an example, the lower
will be its adaptation cost to generate the required translation. The dissimilarity is
measured on the basis of tense and of the POS tags along with their root words.
Henceforth, for notational convenience, we shall call these features the characteristic
features of a sentence. This step is called measurement of characteristic feature
dissimilarity.

The following examples illustrate the necessity of the two levels of the filtration
scheme. Let us consider the following two sentences:

A:  A beautiful girl is going to her home.
    ek (a)  sundar (beautiful)  ladkii (girl)  apne (own)  ghar (home)  jaa (go)  rahii (...ing)  hai (is)

B:  This home is very beautiful.
    yeh (this)  ghar (home)  bahut (very)  sundar (beautiful)  hai (is)

Even though there are two common words ("beautiful" and "home") between these two
sentences, adapting the translation of sentence B to generate the translation of A is
not an easy task because of their structural difference. Adaptation of the translation
of B to generate the translation of A requires eight adaptation operations: WR
(yeh → ek), WA (sundar), WR (ghar → ladkii), WA (ghar), WA (apne), WA (jaa),
MA (rahii), and WD (bahut). Hence the total cost of adaptation for adapting B
to A is 84.79λ + 4μ + 4(c · 10^5) + 7ε, by referring to Section 5.3.1.
Let us now consider another sentence:

C:  This girl is going to office.
    yeh (this)  ladkii (girl)  office (office)  jaa (go)  rahii (...ing)  hai (is)

This sentence also has two words ("girl" and "going") in common with sentence A.
But its adaptation to generate the translation of A is computationally less expensive
than the adaptation of B. In order to adapt C to A, only four adaptation operations
are required: WR (yeh → ek), WA (sundar), WR (office → ghar) and WA (apne).
The total cost of adaptation for adapting C to A is 46.08λ + 2μ + 3(c · 10^5) + 2ε.
Evidently, this cost is much less than the cost of adapting B to A computed above.
This happens because of the structural similarity, and the commonality of some
characteristic features, of sentence C with A.
The above discussion suggests that one of the filters alone is not sufficient. For
appropriate filtration both levels are required. The next section discusses the
proposed filtration scheme in detail.

5.6  Two-level Filtration Scheme

We use the following notation to describe the filtration scheme. Let L denote
a natural language (here English), and let e ∈ L denote an input sentence. S
denotes the example base, which is a finite subset of L, and d ∈ S is an example base
sentence. The following subsections discuss the above-mentioned levels of the filter.

5.6.1  Measurement of Structural Similarity

In this step, the aim is to filter the example base S to produce a subset of S whose
sentences are structurally similar to e. The example base is partitioned into
equivalence classes of sentences that have the same functional tags (e.g. subject,
object, verb etc.). This partitioned example base is filtered, and the classes that
are similar in structure to the equivalence class of the input sentence are identified.
Here too we have used the ENGCG parser for finding the functional tags (FTs).
Given a sentence x ∈ L, let φ(x) be the bag of functional tags present in x. We
have used the term "bag" in place of "set" as in a bag repetition of elements is
allowed. For example, if x is "My brother helps me in my studies." then φ(x) = {
@GN>, @SUBJ, @+FMAINV, @OBJ, @ADVL, @GN>, @<P}. Let F be the set
of the possible bags of functional tags for the language L.
Note that φ induces an equivalence relation on L. Two sentences e ∈ L and
e′ ∈ L are said to be equivalent (notationally, e E e′) if they have the same bag of
functional tags. Let [e] denote the equivalence class corresponding to the sentence
e, i.e. [e] = {e′ ∈ L | e E e′}. For example, the sentences "He drank milk.", "Sita eats
mangoes.", "They are playing football." and "Will Ram marry Sita?" are members of
the same equivalence class because all these sentences have the same functional tag
representation φ(·) = {@SUBJ, @+FMAINV, @OBJ}.
Since our attention is confined to the example base S, the function and the
equivalence classes are restricted to the set S. The restriction of φ to S is also
denoted by φ, and it induces an equivalence relation on S, whose equivalence classes
are again denoted by [d] for d ∈ S. Let S′ = {[d] | d ∈ S} be the partition of S into
equivalence classes.
For a given input sentence e and an example base sentence d ∈ S, |φ(e) ∩ φ(d)|
denotes the number of common FTs between [e] and [d]. Let m denote
max_{d ∈ S} |φ(e) ∩ φ(d)|, i.e. the maximum number of common FTs between φ(e)
and φ(d). From the partitioned example base S′ a new set Se′ is constructed such
that Se′ = {[d] : |φ(e) ∩ φ(d)| ≥ ⌈m/2⌉}. Here ⌈m/2⌉ denotes the smallest integer
greater than or equal to m/2. Thus, Se′ is constructed in such a manner that it
contains all those equivalence classes for which the number of common FTs is
between ⌈m/2⌉ and m; all the equivalence classes having fewer than ⌈m/2⌉ common
FTs are discarded. We claim that only sentences having a higher cost of adaptation
have been discarded in constructing the set Se′. The proof is given below:
Let n = |φ(e)|, i.e. the number of functional tags present in e is n; evidently,
n ≥ m. Let us now consider the examples left out of Se′ (i.e. the set S′ - Se′).
Of all the examples belonging to this set, the one that will have the least cost of
adaptation should have the following properties:

(a) It should have the maximum possible number of common FTs with e. We
assume that there exists a sentence with (⌈m/2⌉ - 1) FTs common with e.

(b) We further assume that for all these common FTs, the underlying words are
also the same as in e.

Therefore, to adapt any such sentence to generate a translation of e, the words
corresponding to all the other functional tags are to be added from dictionaries. This
means (n - (⌈m/2⌉ - 1)) constituent word additions are required. Therefore, the cost
of adaptation of any such sentence will be approximately^6 (n - ⌈m/2⌉ + 1) · WA,
where WA is the cost of constituent word addition, and is ((l1 · L/2) + (l2 · Lp/2) +
{(d · log2 D) + (c · 10^5)} + μ + ε). For details of this cost, see item 2 of Section 5.3.1.
Let us denote this cost by C1. This cost will certainly be more than the cost of
adaptation for a sentence having m common FTs with e, i.e. for the sentences
belonging to the equivalence classes of the set Se′ selected by the first filter. The
argument supporting this fact is as follows:
If all the words corresponding to the m common FTs are different from the input
sentence's constituent words, then the cost of adaptation will be approximately the
sum of m constituent word replacements and (n - m) constituent word additions,
i.e. m · WR + (n - m) · WA, where WR is the cost of constituent word replacement,
and is ((l1 · L/2) + (l2 · Lp/2) + {(d · log2 D) + (c · 10^5)}) (see items 2 and 3 of
Section 5.3.1). Therefore, WA = WR + μ + ε. We denote this cost by C2. Thus,
the value of C2 is n · WR + (n - m)(μ + ε). Now let us consider the difference
C1 - C2:

C1 - C2 = (n - ⌈m/2⌉ + 1) · WA - (n · WR + (n - m)(μ + ε))
        = (n - ⌈m/2⌉ + 1)(WR + μ + ε) - (n · WR + (n - m)(μ + ε))
        = (⌊m/2⌋ + 1)(μ + ε) - (⌈m/2⌉ - 1) · WR > 0,

since μ is greater than the cost of dictionary search, i.e. {(d · log2 D) + (c · 10^5)}
(see Section 5.3).
It may also be noted that the sentence having m common FTs will not necessarily
have the minimum cost of adaptation. For this, consider the cases:

The sentence having m common FTs has all different words. The approximate
cost will be the sum of the cost of m constituent word replacements and the cost
of (n - m) constituent word additions, i.e. m · WR + (n - m) · WA.

The sentence having ⌈m/2⌉ common FTs has all the same words. In this case the
approximate cost is that of (n - ⌈m/2⌉) constituent word additions, i.e.
(n - ⌈m/2⌉) · WA.

By a similar argument to that given above, it can be shown that the cost in the
latter case may be less than in the former. Hence the sentences having ⌈m/2⌉
common FTs cannot be discarded at this level.

^6 We have not added other costs like the suffix operation and morpho-word operation costs.
The significance of this filtration step is that, corresponding to any sentence x
that has been discarded by this filter, there is at least one sentence x′ that will be
considered for the next level of filtration, and the cost of adaptation of x′ is much
less than that of x.
The equivalence classes passed by the first filter are subjected to further analysis
in the second level of the filter, as described below.
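The first level can be sketched as follows, representing the FT bags as Python Counters and applying the ⌈m/2⌉ cut-off directly. This is a minimal illustration under those representation choices, not the thesis's implementation.

```python
from collections import Counter
from math import ceil

def ft_bag(tags):
    """The bag of functional tags of a sentence (repetitions kept)."""
    return Counter(tags)

def first_filter(example_fts, input_fts):
    """Keep the examples whose FT overlap with the input is at least
    ceil(m/2), m being the largest overlap found in the example base.
    example_fts maps each example sentence to its FT list."""
    phi_e = ft_bag(input_fts)
    # Bag intersection (&) takes the minimum count per tag.
    overlap = {d: sum((phi_e & ft_bag(t)).values())
               for d, t in example_fts.items()}
    m = max(overlap.values())
    return {d for d, o in overlap.items() if o >= ceil(m / 2)}
```

In practice the filter would operate on equivalence classes rather than individual sentences, but the cut-off logic is the same.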

5.6.2  Measurement of Characteristic Feature Dissimilarity

This filter arranges the sentences of the set Se′ on the basis of the characteristic
features (see Section 5.5.4) of a sentence. We have considered the following
characteristic features: POS with its root word, namely main verb (V), noun (N),
adverb (ADV), adjective (A), pronoun (PRON), determiner (DET), preposition (P),
gerund (PCP1) and participles (PCP1, PCP2), together with the tense and form of
the sentence. Note that here we have considered only those main verbs whose root
forms are other than "be" or "have". We stick to the notation provided by the
ENGCG parser (see Appendix B). For convenience of presentation, we denote the
above-mentioned ten characteristic features as p1(y), p2(y), ..., p10(y), where y is
the root word of the corresponding characteristic feature pi. For example, consider
the sentence "I am sitting on the old chair." This sentence has six characteristic
features: p5(I), p1(sit), p7(on), p4(old), p2(chair) and p10(present continuous).
Here the verb "sit" is the root form of the verb "sitting".
In the following, we define a dissimilarity measure so that the sentences belonging
to Se can be arranged in increasing order of dissimilarity score. Note that the
smaller the dissimilarity score, the lower the cost of adapting the corresponding
sentence to generate the translation of the input sentence e.
Let M be the set of the possible bags of characteristic features. We define a
mapping ψ : L → M such that ψ(x) = {pi(y) : pi(y) is a characteristic feature
of sentence x}. Let Se be the set of sentences of all the equivalence classes of Se′,
and let the restriction of ψ to Se also be denoted by ψ. Further, a mapping
Δ : L × Se → M is defined such that Δ(a, b) = {pi(y) : pi(y) ∈ ψ(a) and
pi(y) ∉ ψ(b)}, i.e. Δ(a, b) contains those characteristic features that are present in
ψ(a) but not in ψ(b), where a ∈ L and b ∈ Se.
For example, let the input sentence e be "The old man is sitting on the old chair.",
and let the sentence x from the example base be "He is sitting on my bed."
Here the characteristic feature bags are:

ψ(e) = {p4(old), p2(man), p1(sit), p7(on), p4(old), p2(chair), p10(present continuous)}
ψ(x) = {p5(he), p1(sit), p7(on), p5(I), p2(bed), p10(present continuous)}

Therefore, Δ(e, x) = {p4(old), p2(man), p4(old), p2(chair)}.



For e ∈ L, we define a dissimilarity function dis_e : Se → R by

    dis_e(d) = ( Σ_{pi(y) ∈ Δ(e,d)} wi ) + μ · (|φ(e)| - |φ(e) ∩ φ(d)|)        (5.3)

dis_e(d) gives the dissimilarity score of d ∈ Se with respect to e. Here μ is the
cost of finding the location of a new word, which has already been explained in item
5 of Section 5.3, and wi is the weight assigned to the characteristic feature pi(·).
The significance of the dissimilarity function dis_e(d) and of the weights wi is
explained below.
As mentioned earlier, two of the costliest operations from the adaptation point
of view are constituent word addition and constituent word replacement. Thus, the
dissimilarity measure is designed to focus on these two operations. The second term
in the above-mentioned measure corresponds to the approximate cost involved in
constituent word addition (finding the appropriate position). Further, it should be
noted that the cost of adaptation varies with the POS of the word to be added or
replaced. This is because this cost depends on the dictionary size of the POS
concerned: the bigger the dictionary, the longer the search time, and hence the
costlier the required operation. Thus, for the characteristic features pi(y), i = 1, 2,
..., 9, a weight wi is assigned depending on the respective dictionary size. Table 5.16
lists the weights of these characteristic features according to the search times of the
respective dictionaries.
Note that identification of tense and form (p10) cannot be done through
dictionary search; appropriate rules have to be developed for this purpose. In our
implementation, we have used 65 rules to take care of the sentences in our example
base. Therefore, the weight log2(65) = 6.02 is assigned to the characteristic
feature p10.

POS                         pi    Dictionary size    Weight, wi
Verb (V)                    p1    4330               log2(4330) = 12.08
Noun (N)                    p2    13953              log2(13953) = 13.77
Adverb (ADV)                p3    1027               log2(1027) = 10.00
Adjective (A)               p4    5449               log2(5449) = 12.41
Pronoun (PRON)              p5    72                 log2(72) = 6.17
Determiner (DET)            p6    72                 log2(72) = 6.17
Preposition (P)             p7    87                 log2(87) = 6.44
Gerund (PCP1)               p8    4330               log2(4330) = 12.08
Participles (PCP1, PCP2)    p9    4330               log2(4330) = 12.08

Table 5.16: Weights Used for Characteristic Features

Below we illustrate the significance of these weights in computing the dissimilarity
score between the input e and a sentence of the set Se. Let the input sentence e
be "This girl is my sister.", and let two sentences d1 and d2 from the example base
be:

d1: This boy is my brother.
d2: That girl is her sister.

For these sentences the characteristic feature bags are:

ψ(e) = {p6(this), p2(girl), p5(I), p2(sister), p10(simple present tense)}
ψ(d1) = {p6(this), p2(boy), p5(I), p2(brother), p10(simple present tense)}
ψ(d2) = {p6(that), p2(girl), p5(she), p2(sister), p10(simple present tense)}

Note that instead of the words "my" and "her", their root words "I" and "she",
respectively, have been considered above.
Therefore, Δ(e, d1) = {p2(girl), p2(sister)} and Δ(e, d2) = {p5(I), p6(this)}.


Thus, dis_e(dj) = ( Σ_{pi(y) ∈ Δ(e,dj)} wi ) + μ · |4 - 4| = Σ_{pi(y) ∈ Δ(e,dj)} wi
for j = 1, 2. It is to be noted that the contribution of the second term is zero for
both d1 and d2, since both these sentences have the same FTs as e.
Let us now consider two cases:

1. The weights corresponding to all features are the same, say wi = 1 for all
i = 1, 2, ..., 10. In this case, dis_e(dj) = 2 for j = 1, 2.

2. The weights are taken as given in Table 5.16. In this case, dis_e(d1) = w2 + w2
= 13.77 + 13.77 = 27.54 and dis_e(d2) = w5 + w6 = 6.17 + 6.17 = 12.34.

Note that in the first case the dissimilarity scores are the same. But from the
adaptation point of view, the cost involved in adapting d2 to e is much less than
that for d1. This is due to the fact that d1 has a determiner and a pronoun
characteristic feature in common with e, while d2 has two noun characteristic
features in common with e.
Since the search and access time for a dictionary depends upon the size of
the dictionary under consideration, in this context one has to look at the sizes of
the dictionaries concerned. It is a general observation that the noun dictionary
is much larger than the pronoun and determiner dictionaries.
For example, in our case the sizes are 14000, 70 and 72, respectively (see item 4
of Section 5.3). Consequently, retrieval from the noun dictionary is computationally
costlier than retrieval from the pronoun or determiner dictionaries. This fact is
not reflected if equal weights are assigned to each POS. Hence, in order to assign
priorities to the POS features in such a way that the dissimilarity score reflects the
approximate cost of adaptation, weights are assigned to each POS as given in Table
5.16.
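The weighted dissimilarity computation illustrated above can be sketched as follows. This is only an illustrative sketch, not the system's actual implementation: the feature bags and the noun, pronoun and determiner weights are taken from the example, while the function and variable names are ours and the weights of the remaining POS classes are omitted.

```python
from collections import Counter

def feature_diff(bag_e, bag_d):
    """Multiset of characteristic features of e that are absent from d."""
    return list((Counter(bag_e) - Counter(bag_d)).elements())

def dissimilarity(bag_e, bag_d, weights, ft_e, ft_d):
    """Sum of the weights of the differing features, plus the difference
    in the number of functional tags (zero in this example: |4 - 4|)."""
    diff = feature_diff(bag_e, bag_d)
    return sum(weights[pos] for pos, _ in diff) + abs(ft_e - ft_d)

# Noun (p2), pronoun (p5) and determiner (p6) weights as in Table 5.16;
# the tense feature p10 never differs in this example.
w = {"p2": 13.77, "p5": 6.17, "p6": 6.17, "p10": 0.0}

e  = [("p6", "this"), ("p2", "girl"), ("p5", "I"), ("p2", "sister"), ("p10", "pres")]
d1 = [("p6", "this"), ("p2", "boy"), ("p5", "I"), ("p2", "brother"), ("p10", "pres")]
d2 = [("p6", "that"), ("p2", "girl"), ("p5", "she"), ("p2", "sister"), ("p10", "pres")]

print(dissimilarity(e, d1, w, 4, 4))   # two noun features differ: 13.77 + 13.77
print(dissimilarity(e, d2, w, 4, 4))   # a pronoun and a determiner differ: 6.17 + 6.17
```

With equal weights (wi = 1) both calls return 2, reproducing case 1 above.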


Here the dissimilarity metric is so designed that the dissimilarity score is directly
proportional to the approximate cost of adaptation. Finally, the sentences in Se are
arranged in ascending order of dissimilarity score. A few best sentences are then
considered for the cost of adaptation based scheme, and the best one is retrieved as the
most similar to the given input sentence. In our experiments we have considered the
five best sentences produced by this two-level filtration scheme for evaluation of their
costs of adaptation.

5.7  Complexity Analysis of the Proposed Scheme

The above filtration scheme aims at improving the efficiency of the cost of
adaptation based scheme. This improvement can be observed by comparing the
worst-case complexities of the two algorithms: the cost of adaptation based scheme
without the two-level filtration, and the cost of adaptation based scheme after the
two-level filtration. These two similarity measurement schemes are denoted as A1 and
A2, respectively. Table 5.17 gives the notations for the different parameters used in the
analysis, and their maximum sizes with respect to our example base.
Parameters                                             Notations    Maximum size
Example base size                                      N            4000
Input sentence length                                  Le           10
Example base sentence length                           Ld           10
Morpho-tag length                                      LF           5
No. of equivalence classes of example base             |S′|         162
No. of retrieved equivalence classes in 1st filter     |S′e|        |S′|
No. of functional tags in input e                      |F(e)|       10
No. of functional tags in d                            |F(d)|       10

Table 5.17: Notation Used in the Complexity Analysis




In the algorithm A1, for each example base sentence d, the maximum effort
required to adapt the translation of d to the translation of the input sentence
e is, in the worst case, the number of comparisons required to identify the
adaptation operation(s). For an example base sentence d, the comparisons required
are as follows:

- First, the appropriate functional tag in d corresponding to each functional tag
in e is identified. This step requires a total of |F(e)| |F(d)| comparisons.
- Then, the morpho tags of all matching functional tags are compared, and
hence the maximum number of comparisons required is Le LF.

Therefore, the total number of comparisons for adapting d to e is given by C1,
where C1 = (|F(e)| |F(d)|) + (Le LF) = Le (Ld + LF). It may be noted that the
size of the functional tag bag is the same as the length of the sentence (i.e. |F(e)| =
Le and |F(d)| = Ld). Hence, the complexity of A1 over all example base sentences is
given by TA1 = N C1.
The complexity computation of the algorithm A2 requires a detailed analysis
of the two filters. In the first filter, the complexity depends on the number of
comparisons between the functional tags of the equivalence class [e] of the input and
[d] of the example base, where e ∈ L, d ∈ S. This value, in the worst case, is given
by C21 = |F(e)| |F(d)| = Le Ld. So, for the |S′| equivalence classes in total, the
complexity of the first filter is given by A21 = |S′| C21.
For the second filter, we need to work on the sentences of the equivalence classes
S′e retrieved from the first filter. Suppose there are Pi sentences in the ith equivalence
class, i = 1, 2, . . ., |S′e|. The sentences of all these equivalence classes together
constitute the set Se. For finding a characteristic feature, a maximum of two
comparisons between d and e are required: one for POS matching, and the other for
matching the root words of the matched POS. The number of POS tags and root
words can be at most equal to the length of a sentence. Thus the total number of
comparisons required is computed as follows:

- The POS tags of d and e are compared first, which makes the number of
comparisons Le Ld.
- Then the root words of d and e having the same POS are compared; this
requires Le comparisons.

Hence, the total number of comparisons required for POS and root-word matching
between e and d is Le (Ld + 1). Summing over all the sentences of Se, we get
the total complexity of the second filter as A22 = (Σ_{i=1}^{|S′e|} Pi) (Le (Ld + 1)) = |Se| (Le (Ld + 1)),
where Σ_{i=1}^{|S′e|} Pi ≤ N.
Finally, the cost of adaptation based scheme is applied on the top few sentences
of Se having the minimum dissimilarity scores. We have considered the set of the first
five sentences in our experiments. This makes the number of comparisons 5 C1.
Hence, the total complexity of the algorithm A2 is given by TA2 = A21 + A22 + 5 C1.
Comparing the time complexities of the algorithms A1 and A2:

TA2 / TA1 = (A21 + A22 + 5 C1) / (N C1)
          = [ |S′| (Le Ld) + |Se| (Le (Ld + 1)) + 5 Le (Ld + LF) ] / [ N (Le (Ld + LF)) ]
          ≤ (|S′|/N) [Ld / (Ld + LF)] + (|Se|/N) [(Ld + 1) / (Ld + LF)] + 5/N    (as |Se| ≤ |S| = N)

In the worst case, we assume that |Se| = |S| = N. Thus,

TA2 / TA1 = (|S′|/N) [Ld / (Ld + LF)] + [(Ld + 1) / (Ld + LF)] + 5/N

Putting in the values of N, |S′|, Ld and LF, the ratio becomes TA2 / TA1 = c, where c = 0.762.
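With the parameter values of Table 5.17, the worst-case ratio can be checked numerically. This is a direct transcription of the derivation; the function and variable names are ours.

```python
def worst_case_ratio(N, S_prime, L_d, L_F):
    """T_A2 / T_A1 with |Se| = |S| = N (worst case); the common factor L_e cancels."""
    return (S_prime / N) * (L_d / (L_d + L_F)) \
         + (L_d + 1) / (L_d + L_F) \
         + 5 / N

c = worst_case_ratio(N=4000, S_prime=162, L_d=10, L_F=5)
print(round(c, 3))  # → 0.762
```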

The above ratio shows that in the worst case the improvement achieved by the algorithm
A2 is about 25%, i.e. the cost of adaptation based scheme needs to be applied on
only about 75% of the sentences of the example base. But experimentally we found that for 500
different examples, which are not present in our example base, the improvement is
of the order of about 75%, which is quite a significant improvement. This variation
is mainly due to the fact that during our experiments the cardinality of Se was
found to be much less than N; the small ratio |Se|/N reduces the contribution of
(|Se|/N) [(Ld + 1) / (Ld + LF)], which is the main contributory term towards c.

The retrieval scheme has been developed with respect to simple sentences. However, if the input sentence is complex then its adaptation is not straightforward
(Dorr et al., 1998), (Hutchins, 2003), (Sumita, 2001), (Shimohata et al., 2003)
and (Rao et al., 2000).
Typically, complex sentences are characterized as sentences having more than
one clause, of which one is the main clause and the rest are subordinate clauses
(Wren, 1989), (Ansell, 2000). A relative clause is a type of subordinate clause in
which a relative adjective (who, which etc.) or relative adverb (when, where etc.)
is used as a connective. The clauses may be joined by some connective, but its
presence is not mandatory. However, in this work we consider complex sentences
with exactly one relative clause, and we further assume that the presence of the connective is
mandatory. Even with this simplifying assumption, we find that translating complex
sentences under an EBMT framework is relatively difficult. This is because English
complex sentences having the same connectives are often translated in different ways
in Hindi. Consequently, for a given complex English sentence, finding its suitable
match from the example base is difficult. And even when a match is found, its adaptation
may not be straightforward. The following section illustrates the above points.

5.8  Difficulties in Handling Complex Sentences

Here we first observe the following two points:

Even for complex sentences having the same connective (e.g. who, when, where,
which), the structure of the translations may vary. For illustration, consider the
four examples given below. Each of these English sentences may have at least
four possible variations depending on the position in which the Hindi connectives
are used. It may further be noticed that although the keywords of all these
four sentences are the same7 (subject to morphological variations), their translation
patterns vary according to the role of the connective and the role of the noun
modified by the relative clause. If the relative adjective who plays the role
of subject in the relative clause, then the Hindi relative adjective may be one
of jo, jis or jin, depending upon the tense and form (i.e. present
perfect, past indefinite or past perfect) of the main verb of the relative clause.
Items (A), (B), (C) and (D) below show the four sentences and their Hindi
translations.
(A) The policeman who chased the thief was tall.
wah sipaahii jo chor kaa piichhaa kartaa thaa, lambaa thaa
wah sipaahii, jis ne chor kaa piichhaa kiyaa lambaa thaa
jo sipaahii chor kaa piichhaa kartaa thaa wah lambaa thaa
7. policeman - sipaahii, thief - chor, to chase - piichaa karnaa, I - main, tall - lambaa, to know - jaannaa


jis sipaahii ne chor kaa piichhaa kiyaa wah lambaa thaa

(B) The thieves who the policeman chased were tall.


we chor, sipaahii jis kaa piichhaa kartaa thaa, lambe the
we chor, sipaahii ne jis kaa piichhaa kiyaa, lambe the
sipaahii jin choron kaa piichhaa karte the we lambe the
sipaahii ne jin choron kaa piichhaa kiyaa we lambe the

(C) I know the policemen who chased the thief.


main un sipaahiyoan ko jaantii hoon, jo chor kaa piichhaa karte the
main un sipaahiyoan ko jaantii hoon, jin ne chor kaa piichhaa kiyaa
jo sipaahii chor kaa piichhaa karte the main us ko jaantii hoon
jin sipaahiyoan ne chor kaa piichhaa kiyaa main un ko jaantii hoon

(D) I know the thief who the policemen chased.


main us chor ko jaantii hoon jis kaa sipaahii piichhaa karte the
main us chor ko jaantii hoon, jis kaa sipaahiyoan ne piichhaa kiyaa
sipaahii jis chor kaa piichhaa karte the main us ko jaantii hoon
jin sipaahiyoan ne chor kaa piichhaa kiyaa main us ko jaantii hoon

Although, in general, the structures of the Hindi translations of two complex
sentences having different connectives are different, certain parts of them may
still be similar. Hence an EBMT system may use this similarity in an effective
way. For example, consider the following two sentences and their translations,
which involve the following keywords: man - aadmii, is working - kaam kar
rahaa hai, in - mein, farmer - kisaan, said - kahaa, he - wah.


The man who is working in the field is a farmer.
jo aadmii khet mein kaam kar rahaa hai wah kisaan hai
This man said that he is a farmer.
iss aadmii ne kahaa ki wah kisaan hai
Despite the dissimilarity in their structures, one may notice that the part wah
kisaan hai is common to both translations. Typically this can happen if
the two complex sentences have some similar clauses. The above observation
also implies that sometimes a simple sentence may be helpful in generating
the translation of a complex sentence, or some of its parts.

The above discussion suggests that the retrieval and adaptation strategies for
complex sentences may need to take care of a large number of variations according
to each connective word and its usage. Creating the adaptation rules for, and
implementing, such a number of possibilities is not an easy task. To overcome this
problem, we propose a split and translate scheme for handling complex sentences
in an EBMT framework. The proposed scheme works as follows:

1. First it checks whether the input sentence is complex. If yes, then it executes
the following steps:
2. It splits the input sentence into two simple sentences, RC and MC, corresponding to the relative clause and the main clause of the complex sentence.
3. Using the cost of adaptation based scheme, it retrieves the sentences most similar to RC and MC. Let these retrieved sentences be denoted as R1 and R2,
respectively.



4. It generates the translations of RC and MC from the retrieved examples R1 and
R2.
5. The translation of the given complex sentence is generated by using the translations of RC and MC.
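The five steps above can be sketched as a small driver routine. This is only a structural sketch: is_complex, split, retrieve, adapt and combine are placeholders standing in for the splitting modules, the cost of adaptation based retrieval and the adaptation procedure described in this chapter.

```python
def translate(sentence, is_complex, split, retrieve, adapt, combine):
    """Split-and-translate scheme for complex sentences (steps 1-5)."""
    if not is_complex(sentence):                     # step 1
        return adapt(sentence, retrieve(sentence))   # simple-sentence path
    rc, mc = split(sentence)                         # step 2
    r1, r2 = retrieve(rc), retrieve(mc)              # step 3
    t_rc, t_mc = adapt(rc, r1), adapt(mc, r2)        # step 4
    return combine(t_rc, t_mc)                       # step 5
```

Injecting the components as parameters keeps the simple-sentence path and the split-and-translate path together in one place.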

In the following subsections we discuss some of the splitting rules required to
convert a complex sentence into simple sentences, and the adaptation procedure to
obtain the Hindi translation of the given complex sentence using the translations of
the split sentences.

5.9  Splitting Rules for Converting Complex Sentence into Simple Sentences

Various approaches have been suggested in the literature for the splitting of complex sentences. For example:

1. Furuse et al. (1998, 2001) proposed a technique where a sentence is split
according to sub-trees and partly constructed parse trees.
2. Takezawa (1999) recommended a technique based on word-sequence characteristics.
3. Doi and Sumita (2003) proposed two methods: Method-T and Method-N.
Method-T uses three criteria, viz. fault-length, partial-translation-count and
combined-reliability. Method-N, on the other hand, uses a pre-process splitting
method based on N-grams of POS subcategories.

Many approaches exist for splitting complex sentences (typically for English),
e.g. (Orasan, 2000), (Sang and Dejean, 2001) and (Clough, 2001). The technique
used by us is similar in nature to those proposed in (Leffa, 1998) and (Puscasu, 2004).
They suggest three ways in which a sentence can be segmented to the clause level:

(1) Starting with the first word in the sentence, and processing it from left to
right, word by word, until all the clauses are identified;
(2) Starting with formal indicators of subordination/coordination, and proceeding
until the end of the clause is found;
(3) Starting with the verb phrase, identifying the verb type, and locating its subject and complements.

In our approach, we have used the first two methods. We have developed heuristics
to split a complex sentence into two simple sentences, one related to the main clause
and the other to the relative clause. The advantage here is that both simple
sentences can now be translated independently using the retrieval and adaptation
procedures developed for dealing with simple sentences. For this work we made the
following assumptions about the input sentence:

- The sentence has only one relative clause, and a connective must be present.
The connectives that we have considered are when, where, whenever, wherever, who, which, whose, whom, whoever, whichever, that, whomever, what and
whatever.
- The algorithm makes use of the delimiter of the input sentence as well. We
illustrate this technique with respect to the delimiters . and ?.


- No wh-family word (e.g. who, which, when, where) should be present in the
main clause.

In the following subsections, we discuss the splitting rules for complex sentences
having any of the following connectives: when, where, whenever, wherever
and who. Since the splitting rules for some of these connectives are the same,
the following subsection considers the connectives when, where, whenever and
wherever together. The subsequent Subsection 5.9.2 discusses the splitting rule
for complex sentences having the connective who.

5.9.1  Splitting Rule for the Connectives when, where, whenever and wherever

This rule is explained using three modules.

Module 1
Module 1 identifies whether a given input sentence e is a complex sentence or not.
If e is complex, then the module identifies the position of the relative adverb, which
can be one of when, whenever, where or wherever. The algorithm considers
the two possible positions of the relative clause: either the relative clause is present
before the main clause, or it is present after the main clause. Depending upon the
position of the relative clause, the algorithm proceeds to Module 2 or Module 3.
Figure 5.1 provides a schematic view of this module.


- Let the input sentence be e and let e be e1, e2, . . ., en; where
  n is the length of the English sentence.
- Let the parsed version of e be denoted by f, and its bag of
  functional tags be denoted as {f1, f2, . . ., fn}, where fi is the
  functional-morpho tag corresponding to ei.
- For all ei ∈ e, let Root_e(ei) denote the root word corresponding
  to ei.

IF((f1 = @ADVL) AND (Root_e(e1) = "where" OR "when" OR "wherever"
   OR "whenever"))
THEN {
    IF(((f2 = @+FAUXV) AND (Root_e(e2) = "be" OR "do" OR
        "have" OR "can" OR "may" OR "shall" OR "will")) OR
       ((f2 = @+FMAINV) AND (Root_e(e2) = "be" OR "have")))
    THEN {
        PRINT "Simple sentence";
        EXIT;
    }
    ELSE {Print "Complex sentence"; GO TO Module 2;}
}
ELSE {
    j = 0;
    For (i = 2, 3, . . ., n)
    {
        IF(Root_e(ei) = "where" OR "when" OR "wherever"
           OR "whenever")
            j = i;
    }
    IF(j = 0)
    THEN {Print "Simple sentence";}
    ELSE {Print "Complex sentence"; GO TO Module 3;}
}

Figure 5.1: Schematic View of Module 1 for Identification of a Complex Sentence with
Connective when, where, whenever or wherever



The following two examples illustrate this module.
Example 1:
Let the input sentence e be Whenever you go to India, speak Hindi.. Its parsed
version f obtained using the ENGCG parser, is:
@ADVL ADV WH whenever, @SUBJ PRON PERS SG2/PL2 you,
@+FMAINV V PRES VFIN go, @ADVL PREP to, @<P <Proper>
N SG India, @+FMAINV V IMP speak, @OBJ <proper> N SG
Hindi <$.>
The length of the input sentence e is 7, and the bag of functional tags is {@ADVL,
@SUBJ, @+FMAINV, @ADVL, @<P, @+FMAINV, @OBJ}. Since f1 is @ADVL,
Root_e(e1)8 is whenever, and f2 is not @+FAUXV (f2 is @SUBJ), it is concluded
that the given input sentence e is complex, and the algorithm should proceed to
Module 2.

Example 2:
Consider another input sentence e, Will you bring anyone along when you return
from town?. Its parsed version f is:
@+FAUXV V AUXMOD VFIN will, @SUBJ PRON PERS SG2/PL2
you, @-FMAINV V INF bring, @OBJ PRON SG anyone, @ADVL
ADV ADVL along, @ADVL ADV WH when, @SUBJ PRON PERS
SG2/PL2 you, @+FMAINV V PRES VFIN return, @ADVL PREP
from , @<P N NOM SG town <$?>
The length of e is 10, and the bag of functional tags is {@+FAUXV, @SUBJ, @-FMAINV, @OBJ, @ADVL, @ADVL, @SUBJ, @+FMAINV, @ADVL, @<P}. Since
8. Root_e(e1) denotes the root word corresponding to e1.


f1 is not @ADVL, the module checks for the presence of any of the connectives when,
whenever, where or wherever in e. The connective when is present at the
6th position, i.e. j = 6. Hence Module 1 concludes that the given input sentence e is
complex, and for the splitting of e the algorithm should proceed to Module 3.
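Module 1's branching over (functional-tag, root-word) pairs can be sketched as follows. This is a simplified rendering of Figure 5.1: the tag strings follow the ENGCG output shown in the two examples, while classify and the return convention are ours.

```python
WH_ADVERBS = {"where", "when", "wherever", "whenever"}
AUX_ROOTS = {"be", "do", "have", "can", "may", "shall", "will"}

def classify(parsed):
    """parsed: list of (functional_tag, root_word) pairs for the sentence.
    Returns (kind, module, j); module is 2 or 3 for complex sentences,
    and j is the 1-based connective position in the Module 3 case."""
    f1, r1 = parsed[0]
    if f1 == "@ADVL" and r1 in WH_ADVERBS:
        f2, r2 = parsed[1]
        if (f2 == "@+FAUXV" and r2 in AUX_ROOTS) or \
           (f2 == "@+FMAINV" and r2 in {"be", "have"}):
            return ("simple", None, None)       # e.g. "Where is the book?"
        return ("complex", 2, None)             # relative clause leads
    j = 0
    for i, (_, root) in enumerate(parsed[1:], start=2):
        if root in WH_ADVERBS:
            j = i                               # position of the wh-adverb
    if j == 0:
        return ("simple", None, None)
    return ("complex", 3, j)                    # relative clause trails
```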

Module 2
If the relative adverb is the first word of the given input sentence e, then the sentence
is split in Module 2. Figure 5.2 gives a schematic view of this module, and Table
5.19 gives the typical sentence structures that can be handled by it. The
sentences handled by this module are characterized by the relative clause appearing
at the beginning, before the main clause. In this module, along with
the position of the relative clause, the position(s) of the subject(s) is used to split the
complex sentence. In the following, we assume the length of e to be n. The sub-steps of
this module are as follows:

- If the delimiter of the input sentence e is ?, or if the input sentence has
only one subject (the possible subject tags are @SUBJ and @F-SUBJ) and the
delimiter of the sentence is ., then the main verb (i.e. the @+FMAINV tag)
or the main auxiliary verb (i.e. the @+FAUXV tag) decides the splitting point. The
module looks for the second occurrence of the @+FMAINV or @+FAUXV tag9.
Let l be the word position where one of these two tags occurs. If one of the
above two cases holds, then the 2nd to (l-1)th words, and the lth to nth words,
of e constitute the two simple sentences, which are parts of the relative clause
and the main clause, respectively. We call these two simple sentences RC and
MC, respectively.

9. The ENGCG parser always assigns either the @+FMAINV or the @+FAUXV tag to the first
occurrence of a verb, whether it is a main verb or an auxiliary verb. All other verbs (main or
auxiliary) in the sentence are given either the @-FMAINV or the @-FAUXV tag.
- If the delimiter of the input sentence e is . and it has two subjects10, then
the position of the second subject slot decides the splitting
point. For this purpose, the pre-modifiers (i.e. determiner, article, pre-modifying
adjective, adverb etc.) of the second subject are identified. If the position of
the first pre-modifier of the second subject is k, then the 2nd to (k-1)th words
of e, and the kth to nth words of e, constitute the two simple
sentences. The first simple sentence (RC) is a part of the relative clause, and
the second simple sentence (MC) is the main clause.

When I saw the oxen they were pulling the plow.


jab maine bail dekhe, tab we hal khiinch rahe the
Whenever the woman eats too much, she gets sick.
jab bhii wah aurat bahut zyaadaa khaatii hai, bimaar ho jaatii hai
Whenever you go to India, speak Hindi.
jab bhii tum india jaate ho, hindi bolo
Where there is a cat, there is a dog.
jahaan billii hotii hai, wahaan kuttaa hotaa hai
Wherever I run, the little dog will follow me.
jahaan bhii main dauddataa hoon, chhothaa kuttaa mere piichhe jaaegaa
Table 5.19: Typical Examples of Complex Sentences with Connective when, where, whenever or wherever Handled by Module 2
10. The algorithm works for at most two clauses in a complex sentence; therefore, the maximum
number of subjects in the sentence is taken to be two.


Let K = number of @SUBJ or @F-SUBJ tags in the sentence e;   \* K = 1 or K = 2 *\

IF((delimiter of e = "?") OR (delimiter of e = "." AND K = 1))
THEN {
    l = 0;
    For (i = 2, 3, . . ., n)
    {
        IF(fi = @+FMAINV OR @+FAUXV)
            IF(l = 0)
            THEN l++;
            ELSE {l = i; Break;}
    }
    - The string e2, e3, . . ., e(l-1) constitutes a simple
      sentence (say RC), which is the relative clause;
    - The string el, e(l+1), . . ., en constitutes a simple
      sentence (say MC), which is the main clause;
    - The functional-morpho tags of RC are f2, f3, . . ., f(l-1);
    - The functional-morpho tags of MC are fl, f(l+1), . . ., fn;
    - Delimiter of RC is ".";
    IF(delimiter of e is "?")
    THEN {delimiter of MC is "?"}
    ELSE {delimiter of MC is "."}
}
ELSE {
    IF(delimiter of e = "." AND K = 2)
    {
        m = 0;
        For(i = 2 to n)
        {
            IF(fi = @SUBJ or @F-SUBJ)
                IF(m = 0)
                THEN m++;
                ELSE {m = i; Break;}
        }
    }
    \* Now the algorithm finds the attributes (pre-modifying adjective,
       determiner etc.) of the second subject *\
    k = m - 1;
    WHILE((k > 2) AND (fk = @N OR @DN> OR @NN> OR @GN>
          OR @AN> OR @QN> OR @AD-A>))
        k--;
    - The string ek, e(k+1), . . ., en constitutes the simple
      sentence (say MC), which is the main clause;
    - The string e2, e3, . . ., e(k-1) constitutes the simple
      sentence (say RC), which is the relative clause;
    - The functional-morpho tags of MC are fk, f(k+1), . . ., fn;
    - The functional-morpho tags of RC are f2, f3, . . ., f(k-1);
    - Delimiter of RC is ".";
    - Delimiter of MC is ".";
}

Figure 5.2: Schematic View of Module 2


Our discussion of Module 1 concluded with the remark that the complex sentence
given in Example 1 should be split using Module 2. We now continue with the
same example, Whenever you go to India, speak Hindi., to show how Module 2 splits
this sentence into two simple ones. In this example, the number of subjects is one,
i.e. K = 1, and the delimiter is .. The module now determines the
position of the second occurrence of the @+FMAINV or @+FAUXV tag, which is found
at the 6th position11, i.e. l = 6. Hence the input complex sentence is split
into simple sentences as follows:

- The 2nd to 5th words constitute the simple sentence RC, i.e. You go to India; its
delimiter is .. This is a part of the relative clause, and its functional-morpho
tags are:
@SUBJ PRON PERS SG2/PL2 you, @+FMAINV V PRES VFIN go,
@ADVL PREP to, @<P <Proper> N SG India <$.>.
- The 6th and 7th words constitute the simple sentence MC, i.e. Speak Hindi; its
delimiter is also .. This is a part of the main clause, and its functional-morpho
tags are: @+FMAINV V IMP speak, @OBJ <proper> N SG Hindi <$.>.
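The split at the second finite-verb tag used in this case can be sketched as follows. Positions are 1-based as in the text; split_module2 is our name for this step.

```python
def split_module2(words, tags):
    """Split at the second @+FMAINV/@+FAUXV tag (1-based position l):
    RC = e2 .. e(l-1), MC = el .. en."""
    finite = [i for i, t in enumerate(tags, start=1)
              if t in ("@+FMAINV", "@+FAUXV")]
    l = finite[1]                     # position of the second finite-verb tag
    return words[1:l - 1], words[l - 1:]

# "Whenever you go to India, speak Hindi."
words = ["whenever", "you", "go", "to", "India", "speak", "Hindi"]
tags  = ["@ADVL", "@SUBJ", "@+FMAINV", "@ADVL", "@<P", "@+FMAINV", "@OBJ"]
rc, mc = split_module2(words, tags)
```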

Module 3
If the relative adverb (or connective) is not the first word of the given input sentence,
then the sentence is split by this module. In this case, the relative clause is present
after the main clause, i.e. the relative clause is located towards the end of the sentence.
Let the position of the relative adverb (as identified in Module 1) be j. In this case,
the first j-1 words of e constitute the first simple sentence MC (which is the main
clause), and the (j+1)th to nth words constitute the second simple sentence RC (which
is a part of the relative clause). Module 3 is given in Figure 5.3. Table 5.20 gives
the typical sentence structures that can be handled by this module.

11. The second main verb in the given input sentence is speak.
Please do not talk to him when the carpenter is working.
jab barhaii kaam kar rahaa hai tab usse na boliye
Should you speak English when you go to India?
jab tum india jaate ho kyaa tumhe english bolnii chaahiye?
Visit us whenever you come here.
jab bhii tum yanhaa aate ho ham se milo
I will stop, where there are interesting spots in my journey.
jahaan bhii mere safar mein dilchasp jaghe hohii main rukungaa
Will you bring anyone along when you return from town?
jab tum shahar se waapis aate ho tab kyaa tum kisii ko saath laaoge?
Do you want to go wherever I go?
jahaan bhii main jaatii hoon wahaan kyaa tum jaanaa chaahte ho?
Table 5.20: Typical Examples of Complex Sentences with Connective when, where, whenever or wherever Handled by Module 3

According to Module 1, the rule given in Module 3 will split the complex sentence
discussed in Example 2, i.e. Will you bring anyone along when you return from town?.
As discussed in Module 1, for this input sentence e the value of j12 is 6. Hence the
splitting in this case is as follows:

12. j denotes the position of the relative adverb when.


\* j is the position of the relative adverb as obtained in Module 1 *\
- The string e1, e2, . . ., e(j-1) constitutes the simple sentence
  MC, which is the main clause;
- The string e(j+1), e(j+2), . . ., en constitutes the
  simple sentence RC, which is a part of the relative clause;
- The functional-morpho tags of MC are f1, f2, . . ., f(j-1);
- The functional-morpho tags of RC are f(j+1), f(j+2), . . ., fn;
- Delimiter of RC is always ".";
IF(delimiter of e is ".")
THEN {delimiter of MC is "."}
ELSE {delimiter of MC is "?"}

Figure 5.3: Schematic View of Module 3


- The first five words of e constitute a simple sentence, which is the main clause.
That is, the first simple sentence (denoted by MC) is Will you bring anyone
along. Since the delimiter of e is ?, the delimiter of MC is ?. Its functional-morpho
tags are:
@+FAUXV V AUXMOD VFIN will, @SUBJ PRON PERS SG2/PL2 you,
@-FMAINV V INF bring, @OBJ PRON SG anyone, @ADVL ADV ADVL
along <$?>.
- The 7th to 10th words constitute the other simple sentence RC, i.e. You return
from town, which is a part of the relative clause. The delimiter of RC is ..
Its functional-morpho tags are:
@SUBJ PRON PERS SG2/PL2 you, @+FMAINV V PRES VFIN return,
@ADVL PREP from, @<P N NOM SG town <$.>.
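Module 3's split at the relative adverb can be sketched likewise (1-based j as in the text; split_module3 is our name for the step):

```python
def split_module3(words, j, delimiter):
    """j: 1-based position of the relative adverb (from Module 1).
    MC = e1 .. e(j-1), RC = e(j+1) .. en; RC always ends with '.'."""
    mc = words[:j - 1]
    rc = words[j:]
    mc_delim = "." if delimiter == "." else "?"
    return (mc, mc_delim), (rc, ".")

# "Will you bring anyone along when you return from town?"
words = ["will", "you", "bring", "anyone", "along",
         "when", "you", "return", "from", "town"]
(mc, mc_delim), (rc, rc_delim) = split_module3(words, 6, "?")
```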

5.9.2  Splitting Rule for the Connective who

Here we discuss the algorithm for splitting complex sentences when the connective
is who. It should be noted that in this case the relative clause can occur either
embedded within the main clause, or after the main clause. In both cases,
there are two possible functional tags for the connective word who, i.e. @SUBJ and
@OBJ. The algorithm takes care of all these possibilities. It is divided
into four modules, which are given in Figures 5.4, 5.6, 5.7 and 5.8. Along with these
four modules, there is a subroutine SPLIT, given in Figure 5.5. A brief outline of
these modules and the subroutine SPLIT is as follows:

- Module 1 checks whether the given input sentence is complex or not. If
the sentence is complex with the connective who, then depending on the
position of the clause and the delimiter of the sentence it routes the algorithm
to the appropriate module.
- Module 2 splits those complex sentences in which the relative clause is embedded in the main clause, and the delimiter of the sentence is .. Table 5.21
provides the typical sentence structures considered in this module.
- The complex sentences in which the relative clause follows the main clause are
split in Module 3. Here also the delimiter of the sentence under consideration should be .. The sentence structures considered in this module are
exemplified in Table 5.22.
- Irrespective of the position of the relative clause, Module 4 splits those complex sentences for which the delimiter is ?. The examples given in Table 5.23
demonstrate the sentence structures considered in this module.

The algorithm for splitting those complex sentences in which the relative clause
is embedded in the main clause is given in the subroutine SPLIT. This subroutine
accepts two arguments: an integer x and a character y. x gives the splitting-point
position, and y provides the delimiter of the simple sentence that is
a part of the main clause.
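The clause re-assembly that SPLIT performs for an embedded relative clause (deleting e_j .. e_(x-1) from the sentence to form MC, and keeping the deleted span as RC) can be sketched as follows. split_embedded is our name for this step, and the example indices j and x are chosen by hand for the sentence shown.

```python
def split_embedded(words, j, x):
    """Remove the embedded relative clause e_j .. e_(x-1) (1-based j, x)
    to form the main-clause sentence MC; the removed span is RC."""
    mc = words[:j - 1] + words[x - 1:]
    rc = words[j - 1:x - 1]
    return mc, rc

# "The old man, who is working in the field, is a farmer."
words = ["the", "old", "man", "who", "is", "working",
         "in", "the", "field", "is", "a", "farmer"]
mc, rc = split_embedded(words, j=4, x=10)
```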

Those students, who want to learn Hindi, should study a lot.


jo vidyaarthii Hindii siikhnaa chaahte hain, unko bahut parhnaa chaahiye
un vidyaarthiyoan ko jo hindii siikhnaa chaahte hain bahut parhnaa
chaahiye
The old man, who is working in the field, is a farmer.
jo aadmii khet mein kaam kar rahaa hai, wah kisaan hai
wah aadmii jo khet mein kaam kar rahaa hai, kisaan hai
The dog who I chased was black.
jis kutte kaa maine piichhaa kiyaa, wah kaalaa hai
wah kuttaa, jis kaa maine piichhaa kiyaa, kaalaa hai
Table 5.21: Typical Complex Sentences with Relative Adverb who Handled by Module 2

I met the person who called me yesterday.


jis insaan ne mujhe kal pukaaraa, main us ko milaa
wah insaan, jis ne mujhe kal pukaaraa, main use milaa
She met the person who I called yesterday.
maine jis insaan ko kal pukaaraa, wah usko milii
wah insaan, maine jis ko kal pukaaraa, wah use milii
Table 5.22: Typical Complex Sentences with Relative Adverb who Handled by Module 3


Do you know the boy who chased the dog?


jo ladkaa kutte kaa piichhaa kartaa thaa kyaa tum usko jaante ho?
Do you know the boy who I chased?
kyaa tum us ladke ko jaante ho, jis kaa maine piichhaa kiyaa?
jis ladke kaa maine piichhaa kiyaa, kyaa tum us ko jaante ho?
Did not the man, who read the book, like it?
jis aadmi ne kitaab padhii kyaa usne yah pasand nahiin kii?
kyaa wah aadmii, jis ne kitaab padhii, yah pasand nahiin kii?
Did the man, who I know, like this book?
jis aadmii ko main jaantii hoon, kyaa wah yah kitaab pasand kartaa thaa?
kyaa wah aadmii jis ko main jaantii hoon, yah kitaab pasand kartaa thaa?
Table 5.23: Typical Complex Sentences with Relative Adverb who Handled by Module 4
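The pronoun substitution used by subroutine SPLIT, which replaces who by he, she or they according to the antecedent subject's gender and number, can be sketched as follows. replace_who is our name; the default-masculine rule follows the footnote in the figure.

```python
def replace_who(gender, number):
    """Pronoun (and its functional-morpho tag) that replaces 'who' in the
    detached relative clause, chosen from the antecedent subject's tags.
    A missing gender defaults to masculine, as in the figure."""
    if number == "PL3":
        return "they", "@SUBJ PRON PERS PL3"
    if (gender or "MASC") == "FEM":
        return "she", "@SUBJ PRON PERS FEM SG3"
    return "he", "@SUBJ PRON PERS MASC SG3"
```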

The following illustration explains how the above modules work.


Let the input sentence e be Those students, who want to learn Hindi, should study a
lot.. The parsed version f of this sentence is:
@DN> DEM that, @SUBJ N PL student, @SUBJ <Rel> PRON
WH SG/PL who, @+FMAINV V PRES VFIN want, @INFMARK>
INFMARK> to, @-FMAINV V INF learn, @OBJ <proper> N SG
Hindi, @+FAUXV V AUXMOD should, @-FMAINV V INF study,
@DN> ART a, @OBJ N SG lot <$.>


- Let the input sentence be e and let e be e1, e2, ..., en, where n is the length of the English sentence.
- Let the parsed version of e be denoted by f, and its bag of functional tags be denoted as {f1, f2, ..., fn}, where fi is the functional morpho tag corresponding to ei.
- For all ei in e, let Roote(ei) denote the root word corresponding to ei.

j = 0;                  \ j will store the position of the connective "who" \
For(i = 1 to n)
    IF(Roote(ei) = "who" AND <Rel> occurs in the morpho-tag of fi)
    THEN {Print "Complex sentence"; j = i;}
IF(j = 0)
THEN {Print "Simple sentence"; Exit;}

Flag = 0; p = 0;        \ p stores the position of @+FAUXV or @+FMAINV, if any
                          one of them occurs before the connective "who" \
For(i = j-1 to 1)
    IF(fi = @+FAUXV OR fi = @+FMAINV)
    THEN {Flag = 1; p = i; break;}
IF(Flag = 0)
THEN GO TO Module 2;
ELSE IF(delimiter of sentence = ".")
THEN GO TO Module 3;
ELSE IF(fp = @+FAUXV)
THEN GO TO Module 4;
ELSE Print "Sentence cannot be split";

Figure 5.4: Schematic View of Module 1 for Identification of Complex Sentence with
Connective who
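The routing logic of Figure 5.4 can be sketched in Python. This is a minimal illustrative rendering, not the thesis implementation: the representation of the parse as (root word, tag set) pairs and the function name module1 are assumptions made here.

```python
def module1(parse, delimiter):
    """Decide how a parsed sentence containing the connective "who" is split.

    parse is a list of (root_word, tag_set) pairs in sentence order;
    delimiter is the sentence-final punctuation.  Returns (route, j),
    where j is the 1-based position of "who"."""
    # Step 1: locate the connective "who" carrying the <Rel> morpho tag.
    j = 0
    for i, (root, tags) in enumerate(parse, start=1):
        if root == "who" and "<Rel>" in tags:
            j = i
            break
    if j == 0:
        return "simple sentence", 0

    # Step 2: is a finite verb tag (@+FAUXV/@+FMAINV) present before "who"?
    p = 0
    for i in range(j - 1, 0, -1):
        if parse[i - 1][1] & {"@+FAUXV", "@+FMAINV"}:
            p = i
            break

    # Step 3: route to the appropriate splitting module.
    if p == 0:
        return "module 2", j        # relative clause embedded in main clause
    if delimiter == ".":
        return "module 3", j        # declarative; relative clause at the end
    if "@+FAUXV" in parse[p - 1][1]:
        return "module 4", j        # interrogative complex sentence
    return "cannot split", j
```

For the sentence "Those students, who want to learn Hindi, should study a lot." this sketch returns ("module 2", 3), matching the worked example later in this section.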


SUBROUTINE SPLIT(int x, char y)
{
IF(fj ≠ @OBJ)           \ j stores the position of "who" as obtained in Module 1;
                          here the connective "who" has the @SUBJ tag \
THEN {
    - The constituent words e1 to e(j-1) concatenated with ex to en form the
      main clause, which is a simple sentence, denoted as MC;
    - The functional tags f1 to f(j-1) concatenated with fx to fn form the
      parsed output of MC;
    - ej to e(x-1) form the relative clause;
    - Replace ej with either "he", "she" or "they" depending on the gender and
      number of el, where l is such that fl = @SUBJ and 1 <= l <= j-1; also
      change the morpho functional tag fj to the corresponding tag of "he",
      "she" or "they".
    - The morpho functional tags of "he", "she" and "they" are
      @SUBJ PRON PERS MASC SG3 "he", @SUBJ PRON PERS FEM SG3 "she" and
      @SUBJ PRON PERS PL3 "they", respectively;
      \ If the parser does not specify the gender, then the gender of el is
        considered to be masculine. \
    - After this modification, ej to e(x-1) form the simple sentence denoted
      by RC;
}
ELSE                    \ Here fj = @OBJ \
{
    c = 0;
    For(i = j+1 to x-1)
        IF(fi = @-FMAINV OR fi = @+FMAINV)
            {c = i; break;}
    - Words e1 to e(j-1) concatenated with words ex to en form the main
      clause, which is a simple sentence, say MC;
    - The functional-morpho tags f1 to f(j-1) and fx to fn together form the
      parsed output of MC;
    - ej to e(x-1) form the relative clause;
    - The (j+1)th to (x-1)th words will form the simple sentence with the
      following modification:
      - A new word, either "him", "her" or "them", is placed after the cth
        word. The functional-morpho tag of this new word will be
        @OBJ PRON PERS MASC SG3 "he", @OBJ PRON PERS FEM SG3 "she" or
        @OBJ PRON PERS PL3 "they". The choice of this new word depends on the
        gender and number of el, where l is such that fl = @SUBJ, 1 <= l <= j-1.
        \ If the parser does not specify the gender, then the gender of el is
          considered to be masculine. \
    - After this modification, the (j+1)th to (x-1)th words form the simple
      sentence and we denote it as RC;
}
Delimiter of RC is ".";
Delimiter of MC is y;
Exit;
}
Figure 5.5: Schematic View of the SUBROUTINE SPLIT
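The @SUBJ branch of SUBROUTINE SPLIT can be sketched in Python as follows. This is a hypothetical sketch, not the thesis implementation: words and tags are modeled as parallel lists, only the pronoun-substitution branch is shown, and the masculine default gender argument mirrors the figure's convention.

```python
def split_subj(words, tags, j, x, gender="MASC"):
    """Split when the connective at 1-based position j carries @SUBJ; x is the
    first position after the relative clause.  Returns (MC, RC) word lists."""
    # Main clause: e1..e(j-1) concatenated with ex..en.
    mc = words[:j - 1] + words[x - 1:]
    # Antecedent el: last word before "who" carrying the @SUBJ functional tag.
    l = next(i for i in range(j - 1, 0, -1) if "@SUBJ" in tags[i - 1])
    number = "PL" if "PL" in tags[l - 1] else "SG"
    # Replace ej ("who") by a pronoun agreeing with el (masculine by default).
    pronoun = "they" if number == "PL" else ("she" if gender == "FEM" else "he")
    rc = [pronoun] + words[j:x - 1]
    return mc, rc
```

For "Those students, who want to learn Hindi, should study a lot." with j = 3 and x = 8, the sketch yields MC "those students should study a lot" and RC "they want to learn Hindi", as in the worked example of this section.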


\ This module will be executed when neither the @+FAUXV nor the @+FMAINV tag
  is present before "who" \
Let count = 0, k = 0;   \ k stores the position of the second occurrence of
                          the @+FAUXV or @+FMAINV tag \
For(i = j+1 to n)
{
    IF(fi = @+FAUXV OR fi = @+FMAINV)
        IF(count = 0)
        THEN count++;
        ELSE k = i;
    IF(k ≠ 0) break;
}
CALL SUBROUTINE SPLIT(k, ".")

Figure 5.6: Schematic View of Module 2
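The search for the second main-verb tag in Figure 5.6 can be sketched as below. This is an illustrative rendering only; the flat list-of-tags representation and the function name are assumptions.

```python
def second_main_verb(tags, j):
    """Return k, the 1-based position of the second @+FAUXV/@+FMAINV tag after
    the connective at position j (0 if there is no second occurrence)."""
    count, k = 0, 0
    for i in range(j + 1, len(tags) + 1):
        if tags[i - 1] in ("@+FAUXV", "@+FMAINV"):
            if count == 0:
                count += 1      # first main verb: belongs to the relative clause
            else:
                k = i           # second main verb: resumption of the main clause
                break
    return k
```

For the functional-tag bag of the worked example in this section, with j = 3, the sketch returns k = 8, as stated in the illustration.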

\ This module splits sentences in which the relative clause succeeds the
  main clause. \
IF(fj ≠ @OBJ)
THEN {
    - Words e1 to e(j-1) of the given input sentence e form the main clause,
      which is a simple sentence, say MC;
    - Functional-morpho tags f1 to f(j-1) of the parsed version of e give the
      parsed version of MC;
    - Words ej to en constitute the relative clause;
    \ The gender and number of the first occurrence of a word having any of
      the @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tags are determined below.
      This word is searched for from the (j-1)th word back to the first word
      of e. \
    NUMBER = null; GENDER = null;
    For(i = j-1 to 1)
    {
        IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR @PCOMPL-O)
        {
            GENDER = gender of the ith word;
            NUMBER = number of the ith word;
        }
        IF(NUMBER ≠ null) break;
    }
    IF(GENDER = null) GENDER = MASC;
    - ej is replaced with either "he", "she" or "they" depending on GENDER and
      NUMBER. The morpho-functional tag fj of the new ej is
      @SUBJ PRON PERS MASC SG3 "he", @SUBJ PRON PERS FEM SG3 "she" or
      @SUBJ PRON PERS PL3 "they".
    - Words ej to en constitute the other simple sentence, say RC;
    - The functional-morpho tags fj to fn form the parsed version of RC;
}
ELSE
{
    - Words e1 to e(j-1) form the main clause, which is a simple sentence,
      say MC;
    - Functional-morpho tags f1 to f(j-1) form the parsed version of MC;
    - Words ej to en form the relative clause;
    c = 0;              \ c stores the position of the first occurrence of a
                          word having the @+FMAINV or @-FMAINV tag. \
    For(i = j+1 to n)
        IF(fi = @+FMAINV OR fi = @-FMAINV)
            {c = i; break;}
    \ The gender and number of the first occurrence of a word having any of
      the @<P, @OBJ, @PCOMPL-O, @I-OBJ, @PCOMPL-S tags are determined below,
      exactly as in the branch above. \
    NUMBER = null; GENDER = null;
    For(i = j-1 to 1)
    {
        IF(fi = @<P OR @OBJ OR @I-OBJ OR @PCOMPL-S OR @PCOMPL-O)
        {
            GENDER = gender of the ith word;
            NUMBER = number of the ith word;
        }
        IF(NUMBER ≠ null) break;
    }
    IF(GENDER = null) GENDER = MASC;
    - A new word w, which can be either "him", "her" or "them", is placed
      after the cth word. The functional-morpho tag of w will be
      @OBJ PRON PERS MASC SG3 "he", @OBJ PRON PERS FEM SG3 "she" or
      @OBJ PRON PERS PL3 "they". The choice of w depends on GENDER and NUMBER.
    - Words e(j+1) to ec, followed by w, concatenated with the (c+1)th to nth
      words of e form the other simple sentence, call it RC;
    - Except for the functional-morpho tag of the (c+1)th word, the
      functional-morpho tags of the constituent words of RC are obtained from
      the functional-morpho tags of the corresponding words of e.
}
Delimiter of MC is ".";
Delimiter of RC is ".";
Exit;

Figure 5.7: Schematic View of Module 3
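The gender/number determination step shared by both branches of Figure 5.7 can be sketched as follows. The triple representation (tag set, gender, number) per word is an assumption made here for illustration; the thesis does not prescribe a data structure.

```python
CASE_TAGS = {"@<P", "@OBJ", "@I-OBJ", "@PCOMPL-S", "@PCOMPL-O"}

def antecedent_features(entries, j):
    """Scan from position j-1 back to 1 for the first word carrying one of the
    case tags, and return its (gender, number); gender defaults to MASC when
    the parser does not specify it."""
    gender = number = None
    for i in range(j - 1, 0, -1):
        word_tags, g, n = entries[i - 1]
        if word_tags & CASE_TAGS:
            gender, number = g, n
            if number is not None:
                break
    if gender is None:
        gender = "MASC"
    return gender, number
```

For "I met the person who called me yesterday." (who at position 5), the scan stops at "person" (tagged @OBJ, number SG, gender unspecified) and returns ("MASC", "SG").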


\ This module splits interrogative complex sentences \
r = 0;
For(i = p+1 to j-1)     \ p is as obtained from Module 1 \
{
    IF(fi = @-FMAINV AND f(i-1) = @INFMARK>) r = i;
    IF(r ≠ 0) break;
}
IF(r ≠ 0)               \ This implies that the relative clause follows the
                          main clause \
    {CALL Module 3a, where Module 3a is the same as Module 3 with one
     modification, i.e. the delimiter of MC is "?" instead of ".";}
\ The following will be performed when the relative clause is embedded within
  the main clause \
c = 0; c1 = 0; c2 = 0;
For(i = j+1 to n)
{
    IF((fi = @+FMAINV) OR (fi = @-FMAINV AND f(i-1) = @INFMARK>))
    {
        c++;
        IF(c = 1)
        THEN c1 = i;    \ c1 stores the position of the first occurrence of
                          the main verb after "who" \
        ELSE c2 = i;    \ c2 stores the position of the second occurrence of
                          the main verb after "who" \
    }
    IF(c2 ≠ 0) break;
}
\ The position of the auxiliary verb preceding the second main verb, if any,
  is determined below. \
s = 0;
For(i = c1+1 to c2-1)
{
    IF(fi = @-FAUXV) s = i;
    IF(s ≠ 0) break;
}
IF(s ≠ 0)
THEN CALL SUBROUTINE SPLIT(s, "?");
ELSE CALL SUBROUTINE SPLIT(c2, "?");
Figure 5.8: Schematic View of Module 4
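The final step of Figure 5.8, choosing the position at which SUBROUTINE SPLIT is called, can be sketched as below. The flat tag-list representation and the function name are assumptions made here for illustration.

```python
def module4_split_point(fts, c1, c2):
    """Between the two main-verb positions c1 < c2, return the position of an
    @-FAUXV auxiliary if one is present; otherwise split at c2 itself."""
    for i in range(c1 + 1, c2):
        if fts[i - 1] == "@-FAUXV":
            return i        # auxiliary precedes the second main verb
    return c2               # no auxiliary: split at the second main verb
```

The returned position is the first argument of the SPLIT call, with "?" as the delimiter of MC.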


The length of the input sentence is 11, and the bag of functional tags is {@DN>, @SUBJ, @SUBJ, @+FMAINV, @INFMARK>, @-FMAINV, @OBJ, @+FAUXV, @-FMAINV, @DN>, @OBJ}. Since Roote(e3) is "who" and its morpho tags contain <Rel>, the module decides that the given input sentence e is a complex sentence with the connective who. To identify the position of the relative clause in e, the presence of the @+FAUXV or @+FMAINV tag is checked in the first two words. Neither of these two functional tags is present in the first two words (which are "those" and "students"). Thus Flag is set to 0, indicating that the relative clause is embedded within the main clause. Hence Module 1 concludes that the given sentence e will be split by Module 2.
For separating the main and the relative clauses, Module 2 first locates the position k of the second occurrence of the @+FAUXV or @+FMAINV tag in the parsed version (f) of e. Since neither the @+FAUXV nor the @+FMAINV tag is present in the first two words, and the third word is "who", the algorithm checks the tags of the 4th to the 11th word to determine the value of k. The value of k is found to be 8: the @+FMAINV tag occurs at the 4th position, and the @+FAUXV tag at the 8th position of the sentence e.
Since the functional tag of the connective who is @SUBJ, the module gives the following output:

The first two words concatenated with the 8th to 11th words constitute the simple sentence (MC), which is also the main clause. Thus, the first simple sentence is "Those students should study a lot.". The delimiter of MC is ".". The parsed version of MC is obtained from the FTs of the corresponding words in the parsed version f of e. Thus the parsed version of MC is


@DN> DEM that, @SUBJ N PL student, @+FAUXV V AUXMOD should, @-FMAINV V INF study, @DN> ART a, @OBJ N SG lot <$.>

The words from the 3rd position to the 7th position of the input sentence e form the relative clause, i.e. "who want to learn Hindi". Now the 3rd word is replaced with "they", as the 2nd word (i.e. "students") has the functional tag @SUBJ and its number is plural (PL). Also, since the gender of "students" is not specified, the gender of "they" is assumed to be masculine. Thus the other simple sentence RC is "They want to learn Hindi." and its delimiter is ".". The parsed version of RC is:

@SUBJ PRON PERS PL3 they, @+FMAINV V PRES VFIN want, @INFMARK> INFMARK> to, @-FMAINV V INF learn, @OBJ <proper> N SG Hindi <$.>

Similarly, the splitting of other structures of complex sentences with the connective who can be carried out using the above-mentioned modules. It should be noted that the algorithm cannot deal with those interrogative complex sentences for which the root form of the main verb of the main clause is "be". In this type of sentence the identification of the main clause and the relative clause is relatively more complicated. For example, consider the complex sentence "Is the man who was reading the book in the library upstairs?". Its parsed version is:
@+FMAINV V PRES be, @DN> ART the, @SUBJ N SG man, @SUBJ <Rel> PRON WH SG/PL who, @+FAUXV V PAST be, @-FMAINV PCP1 read, @DN> ART the, @OBJ N SG book, @ADVL PREP in, @DN> ART the, @<P N SG library, @ADVL ADV ADVL upstairs <$?>


In the above sentence, "in the library" is a prepositional phrase (<PP>) and "upstairs" is an adverb. Since the root form of the main verb of the main clause is "be", it can take either "upstairs", or "in the library" along with "upstairs", as its predicative(s)13. Thus, the main clause can be "Is the man in the library upstairs?" or "Is the man upstairs?". The relative clause will also vary accordingly. Hence, in this situation, formulating the splitting rules is not achievable using this parsing scheme. The same problem occurs for other variations of this type of sentence (e.g. "Is this the man who saw you with the binoculars?"). Thus, these types of sentences are not handled in this report.

We have developed algorithms for splitting complex sentences using other connectives also. However, these rules are not discussed in this report, in order to avoid the repetitive nature of the discussion. The following subsection discusses the adaptation procedure for obtaining translations of the input complex sentences using the split simple sentences RC and MC.

5.10  Adaptation Procedure for Complex Sentence

In the following subsections, we discuss the adaptation procedure to obtain the translation of complex sentences having any of the following connectives: when, where, whenever, wherever and who.

13 The predicative of a sentence whose main verb has the root form "be" can be any one (or a combination) of a subjective complement, a prepositional phrase or an adverb.


5.10.1  Adaptation Procedure for Connectives when, where, whenever and wherever

Since the Hindi translation patterns of complex sentences having one of the connectives when, where, whenever or wherever are the same, the adaptation procedure for such complex sentences is discussed collectively. Table 5.24 gives the translations of the above-mentioned connectives, and Table 5.25 provides the possible structures of English sentences and their Hindi translations for these connectives (refer to Tables 5.19 and 5.20 for examples of these sentence patterns). Since the correlative adverb is frequently not indicated in the Hindi translation of complex sentences having any of the above-mentioned connectives (Bender, 1961), (Kachru, 1980), the correlative adverb is given in {}.
English Relative Adverb    Hindi Relative Adverb    Hindi Correlative Adverb
when                       jab                      tab
where                      jahaan                   vahaan
whenever                   jab bhii                 tab
wherever                   jahaan bhii              vahaan

Table 5.24: Hindi Translation of Relative Adverbs

The adaptation procedure for generating the translation of the complex sentences under consideration is discussed below. Suppose R1 and R2 are the sentences with the least cost of adaptation that have been retrieved from the example base corresponding to RC and MC, respectively. The steps of this procedure are as follows:
1. Adapt the translation of R1 to the translation of RC.
2. Adapt the translation of R2 to the translation of MC.


3. Add one morpho-word (i.e. the corresponding Hindi relative adverb; refer to Table 5.24) at the beginning of the translation of RC. Another morpho-word (i.e. the corresponding Hindi correlative adverb; refer to Table 5.24) may be added at the beginning of the translation of MC.
4. Concatenate the (modified) translations of RC and MC.
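The four steps above can be sketched in Python, using the adverb pairs of Table 5.24. The translations of RC and MC are assumed to be already adapted Hindi strings; the function name is hypothetical.

```python
# Relative/correlative adverb pairs from Table 5.24.
RELATIVE = {"when": ("jab", "tab"), "where": ("jahaan", "vahaan"),
            "whenever": ("jab bhii", "tab"), "wherever": ("jahaan bhii", "vahaan")}

def assemble(connective, rc_hindi, mc_hindi, correlative=True):
    """Steps 3-4: prepend the relative adverb to RC, optionally prepend the
    correlative adverb to MC, and concatenate the two clauses."""
    rel, cor = RELATIVE[connective]
    mc = (cor + " " + mc_hindi) if correlative else mc_hindi
    return rel + " " + rc_hindi + " " + mc
```

With the data of Illustration 1 (Section 5.11.1), assemble("when", "tum india jaate ho", "tum ko hindi bolnii chaahiye") yields "jab tum india jaate ho tab tum ko hindi bolnii chaahiye".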

Complex English Sentence Pattern:
<Relative clause with connective [when, where, whenever or wherever]> &<Main clause>.
OR
<Main clause> &<Relative clause with connective [when, where, whenever or wherever]>.

Complex Hindi Sentence Pattern:
Connective when:      jab &<Hindi translation of RC> &{tab} &<Hindi translation of MC>
Connective where:     jahaan &<Hindi translation of RC> &{vahaan} &<Hindi translation of MC>
Connective whenever:  jab bhii &<Hindi translation of RC> &{tab} &<Hindi translation of MC>
Connective wherever:  jahaan bhii &<Hindi translation of RC> &{vahaan} &<Hindi translation of MC>

Table 5.25: Patterns of Complex Sentences with Connectives when, where, whenever and wherever

It may be noted that the total cost involved in generating the translation of a given complex sentence depends on the cost of adapting the translations of R1 and R2 to the translations of RC and MC, respectively. This is because the cost involved in one (or two) morpho-word additions (required in step 3) is a fixed constant: relative and correlative adverbs always occur at the beginning of the Hindi translations of the RC and MC sentences (refer to Table 5.25), so no search is required to find the correct position for the morpho-word in the Hindi sentence. Further, the cost of concatenating these two translations is also a fixed constant. Assuming that the costs of adapting the translations of R1 and R2 to the translations of RC and MC are c1 and c2, respectively, the total cost involved in generating the translation of the given complex sentence is c1 + c2 plus these fixed morpho-word addition and concatenation costs.

5.10.2  Adaptation Procedure for Connective who

This section discusses the adaptation procedure for complex sentences having the connective who. It may be noted that there are many variations in sentence structure with this connective (refer to Table 5.19). For illustration, we consider the sentence pattern given in Table 5.26, in which the connective who plays the role of the subject in the relative clause of the English sentence.

Two different Hindi translation patterns may be noted corresponding to the above-mentioned English sentence pattern. In the first pattern, the relative adjective jo occurs at the beginning of the relative clause, whereas in the other pattern jo occurs before the subject slot of the main clause. The noun in the main clause which the relative clause modifies14 is represented by wah or we depending upon the number of the noun (Bender, 1961), (Kachru, 1980).

14 For the sentences under consideration, this noun is the subject of the main clause.

Complex English Sentence Pattern:
<Subject slot of Main clause> &(Relative clause with connective [who]) &(Main clause, without subject slot).
(e.g. The man who is reading a book is nice.)

Complex Hindi Sentence Pattern:
(wah or we) &<Translation of subject slot of MC> &jo &(Translation of RC, without subject slot) &(Translation of MC, without subject slot)
(e.g. wah aadmii jo kitaab padh rahaa hai achchhaa hai)
OR
jo &<Translation of subject slot of MC> &(Translation of RC, without subject slot) &(wah or we) &(Translation of MC, without subject slot)
(e.g. jo aadmii kitaab padh rahaa hai wah achchhaa hai)

Table 5.26: Patterns of Complex Sentence with Connective who

The adaptation procedure for generating the translation of the complex sentences under consideration is discussed below. Suppose R1 and R2 are the sentences with the least cost of adaptation that have been retrieved from the example base corresponding to RC and MC, respectively. The steps of this procedure are as follows:

1. Adapt the translation of R1 to the translation of RC. The subject slot of R1


is not adapted for RC as it is to be deleted while formulating the translation
of the given complex sentence.
2. Adapt the translation of R2 to the translation of MC.
3. Depending upon the required translation pattern, add two appropriate morpho-words to RC and/or MC. The first morpho-word to be added is taken from the set {wah, we}, and the other morpho-word is jo. The positions of the morpho-words in the two patterns are given below:

For the first pattern, the morpho-word jo is added at the beginning of the translation of RC and, depending on the number of the subject of MC, the morpho-word wah or we is added at the beginning of the translation of MC.

For the second pattern, the morpho-word jo is added at the beginning of the translation of MC and, depending on the number of the subject of MC, a morpho-word (either wah or we) is added after the subject slot of the translation of MC.
4. Combine the (modified) translations of RC and MC. For both translation patterns, the translation of RC is embedded in the translation of MC after the
subject slot.
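The assembly of the two Hindi patterns of Table 5.26 can be sketched as follows. The decomposition of the clause translations into a subject slot and a remainder is assumed to be available from the adaptation step; the function name and arguments are hypothetical.

```python
def assemble_who(subject_slot, rc_rest, mc_rest, plural=False, pattern=1):
    """Assemble the two Hindi patterns of Table 5.26 for the connective "who".

    subject_slot: Hindi translation of the subject slot of MC;
    rc_rest / mc_rest: translations of RC / MC without their subject slots."""
    dem = "we" if plural else "wah"
    if pattern == 1:
        # (wah|we) <subject> jo <RC without subject> <MC without subject>
        return " ".join([dem, subject_slot, "jo", rc_rest, mc_rest])
    # jo <subject> <RC without subject> (wah|we) <MC without subject>
    return " ".join(["jo", subject_slot, rc_rest, dem, mc_rest])
```

With the example of Table 5.26, the first pattern yields "wah aadmii jo kitaab padh rahaa hai achchhaa hai" and the second yields "jo aadmii kitaab padh rahaa hai wah achchhaa hai".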

The cost involved in generating the translation of the complex sentences discussed above is as follows:

1. Cost of adapting the translation of R1 to the translation of RC. Let this cost
be c1 . In this case, the cost involved for adapting the translation of the subject
slot is not included.
2. Cost of deletion of subject slot from the translation of RC. Let us denote this
cost by w.
3. Cost of adapting the translation of R2 to the translation of MC. Let this cost
be denoted as c2 .
4. Cost of the two morpho-word additions, for both Hindi translation patterns:

For the first translation pattern, the cost of adding the two morpho-words is a fixed constant (refer to Section 5.3). Here no dictionary search is required, as the morpho-words may be stored in some readily accessible location; and since these morpho-words are always added at the beginning of the translations of RC and MC, no search is required to determine the correct position for the morpho-word addition.

For the second Hindi translation pattern, the position after the subject slot must be located, so the cost of adding the two morpho-words additionally includes a term L/2, where L is the length of the translation of MC.

5. Cost of combining the translations of RC and MC, which for both translation patterns includes a term L/2. Here too, L is the length of the translated sentence of MC.

Thus the total cost involved for the two translation patterns is the sum of all the above-mentioned costs. The two simple sentences R1 and R2 are retrieved from the example base for generating the translation of the given complex sentence so as to minimize the total cost of adaptation.

We have formulated the adaptation procedure for other complex sentence structures having the connective who in a similar way. However, due to the similar nature of the discussion, we do not elaborate on them in this report.

The above-discussed adaptation procedures are illustrated in the following section. In particular, we show, for a given complex sentence, how the scheme retrieves two similar simple sentences from the example base that can be used to generate the translation of the input complex sentence.
5.11  Illustrations

The adaptation procedures for complex sentences are explained using two illustrations.

5.11.1  Illustration 1

Suppose the input sentence is "You should speak Hindi when you go to India.". Its parsed version is:

@SUBJ PRON PERS SG2/PL2 you, @+FAUXV V AUXMOD should, @-FMAINV V INF speak, @OBJ <Proper> N SG Hindi, @ADVL ADV WH when, @SUBJ PRON PERS SG2/PL2 you, @+FMAINV V PRES go, @ADVL PREP to, @<P <Proper> N SG India <$.>
The algorithm for splitting complex sentences (see Figure 5.1 and Figure 5.2) results in two simple sentences RC and MC, as given below.

RC: You go to India.
    @SUBJ PRON PERS SG2/PL2 you, @+FMAINV V PRES go, @ADVL PREP to, @<P <Proper> N SG India <$.>

MC: You should speak Hindi.
    @SUBJ PRON PERS SG2/PL2 you, @+FAUXV V AUXMOD should, @-FMAINV V INF speak, @OBJ <Proper> N SG Hindi <$.>

The five most similar sentences for RC and MC, obtained by applying the cost of adaptation based scheme, are given in Table 5.27 and Table 5.28, respectively.

Retrieved Sentences for RC     Cost of Adaptation
You go to school.
Ram has gone to school.        25.67+105 c
You are coming from India.     29.58+105 c+2
You will not go to India.      9.5+ +2
They will go to Kanpur.        24.67+ + 105 c+

Table 5.27: Five Most Similar Sentences for RC "You go to India." Using Cost of Adaptation Based Scheme

Retrieved Sentences for MC     Cost of Adaptation
He should speak English.       10.67+c105
The boy should study Hindi.    28.25+2c105
You should speak.              8.5++
You can speak Hindi.           15.5++
He can speak English.          30.67+c105 ++

Table 5.28: Five Most Similar Sentences for MC "You should speak Hindi." Using Cost of Adaptation Based Scheme

To obtain the translations of RC and MC, we consider the first sentence of Table 5.27 and of Table 5.28, respectively. Thus, R1 is "You go to school." and R2 is "He should speak English.". The Hindi translations of these sentences are:

Translation of R1: tum vidyaalay jaate ho
                   (you) (school) (go)

Translation of R2: us ko english bolnii chaahiye
                   (he) (English) (speak) (should)


The translations of R1 and R2 are adapted to generate the translations of RC and MC, respectively. Thus, the translations of RC and MC are:

Translation of RC: tum india jaate ho
                   (you) (India) (go)

Translation of MC: tum ko hindi bolnii chaahiye
                   (you) (Hindi) (speak) (should)

The morpho-words jab and tab are to be added at the beginning of the Hindi translations of RC and MC, respectively. After this modification, these two sentences are concatenated. Hence the desired translation of the given input sentence is "jab tum india jaate ho tab tum ko hindi bolnii chaahiye".

5.11.2  Illustration 2

Let us consider another input sentence, "The student who wants to learn Hindi should study this book.", and its parsed version:

@DN> ART the, @SUBJ N SG student, @SUBJ <Rel> PRON WH SG/PL who, @+FMAINV V PRES want, @INFMARK> INFMARK> to, @-FMAINV V INF learn, @OBJ <proper> N SG Hindi, @+FAUXV V AUXMOD should, @-FMAINV V INF study, @DN> DEM this, @OBJ N SG book <$.>

After applying the algorithm for splitting complex sentences (refer to Figure 5.4 and Figure 5.6), two simple sentences RC and MC are obtained. These are as follows:

RC: He wants to learn Hindi.
    @SUBJ PRON PERS SG3 he, @+FMAINV V PRES want, @INFMARK> INFMARK> to, @-FMAINV V INF learn, @OBJ <proper> N SG Hindi <$.>

MC: The student should study this book.
    @DN> ART the, @SUBJ N SG student, @+FAUXV V AUXMOD should, @-FMAINV V INF study, @DN> DEM this, @OBJ N SG book <$.>

The five most similar sentences for RC and MC are given in Table 5.29 and Table 5.30, respectively. One point to be noted here is that the cost of obtaining the translation of the subject slot of RC is not considered in Table 5.29.
Retrieved Sentences for RC              Cost of Adaptation
He likes to learn Hindi.                17.58+c105
Ram wants to teach a student.           23.08+c105
The student wants to play football.     23.08+c105
The student wants to study this book.   28.08+c105+
He is learning Hindi.                   33.08+ c105 ++ 2

Table 5.29: Five Most Similar Sentences for RC "He wants to learn Hindi." Using Cost of Adaptation Based Scheme

Retrieved Sentences for MC              Cost of Adaptation
The student wants to study this book.   24+ +2
The student should listen this poem.    37.85 +2(c105)
The student studies books.              20.17+c105 +2+
The boy should study Hindi.             48.17+3(c105)++
The student wants to play football.     56.52+ 3(c105)+2+3

Table 5.30: Five Most Similar Sentences for MC "The student should study this book." Using Cost of Adaptation Based Scheme

For all the possible combinations of sentences given in Tables 5.29 and 5.30, the cost of adaptation involved in generating the translation of the input sentence is calculated in the way explained in Section 5.10. The minimum cost of adaptation, which is (17.58+c105) + (3+) + (24++2) + (0.5+2) + (3++) = 48.08 + c105 + 2 + 6, is obtained for the following sentences:

R1: He likes to learn Hindi.
R2: The student wants to study this book.

Hence, after generating the translations of RC (i.e. wah (he) hindi (Hindi) siikhanaa (to learn) chaahtaa hai (likes)) and MC (i.e. vidyarthii (student) ko yah (this) kitaab (book) padhnii (study) chaahiye (should)), and appending the relative adjective jo and the appropriate personal pronoun from the set {we, wah} at the beginning of the translations of RC and MC, respectively, the translation of the given input sentence is "wah vidyarthii ko jo hindi siikhanaa chaahtaa hai yah kitaab padhnii chaahiye".

5.12  Concluding Remarks

In this chapter we have examined a technique for evaluating similarity between sentences. This is required for effective retrieval of past examples in order to facilitate efficient EBMT. However, we observed that, with respect to EBMT, similarity may have to be defined in a different way. Since the key focus of EBMT is adaptation, we define the cost of adaptation as a measure of similarity between sentences. According to this definition, a sentence d is said to be similar to a given input sentence e if the adaptation of d to generate the translation of e is computationally less expensive. We showed the results obtained by applying some of the existing retrieval techniques,


based on syntax and semantics, that are used in text retrieval (Manning and Schutze, 1999), (Gupta and Chatterjee, 2002). These results have been compared with the results of the cost of adaptation based scheme, and the comparison shows the superiority of the proposed scheme over the syntactic and semantic based schemes. The proposed scheme works on simple sentences, and for measuring the cost of adaptation, the adaptation operations described in Chapter 2 have been used.

One apparent drawback of this scheme is that it needs to compare the input sentence with all the sentences of the example base. This makes the process computationally very expensive. Hence one needs to filter out irrelevant sentences, and then evaluate the cost of adaptation on a smaller set of sentences.

In this respect, we have proposed a two-level filtration scheme for measuring dissimilarity. The filtration scheme works in the following two steps:

1. Measuring surface similarity which is based on functional tags (FTs).


2. Measuring characteristic feature dissimilarity. The dissimilarity is measured
on the basis of tense and POS tag along with its root word.
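The overall retrieval pipeline, two-level filtration followed by the cost of adaptation measure, can be sketched as below. This is an illustrative sketch only: the three scoring functions are assumed to be supplied by the system (functional-tag surface similarity, tense/POS/root-word dissimilarity, and the adaptation-cost measure of Chapter 2), and the fraction of examples kept after filtration is a made-up parameter.

```python
def retrieve_best(query, example_base, surface_sim, feature_dissim,
                  adaptation_cost, keep=0.25):
    """Two-level filtration followed by cost-of-adaptation ranking (sketch)."""
    # Level 1: rank by functional-tag surface similarity and keep a fraction.
    pool = sorted(example_base, key=lambda ex: surface_sim(query, ex),
                  reverse=True)
    pool = pool[:max(1, int(len(pool) * keep))]
    # Level 2: rank the survivors by characteristic-feature dissimilarity.
    pool.sort(key=lambda ex: feature_dissim(query, ex))
    # The expensive cost of adaptation is evaluated only on this reduced set.
    return min(pool, key=lambda ex: adaptation_cost(query, ex))
```

The design point is that the cheap filters shrink the candidate set before the costly adaptation-cost comparison is applied, which is exactly the saving reported below.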

The lower the dissimilarity score of an example, the lower its adaptation cost for generating the required translation. Finally, the cost of adaptation based scheme is applied to the selected set of sentences provided by the filtration scheme. The advantage of this filtration scheme is that it reduces the number of example-base sentences that are to be analysed for evaluating the cost of adaptation. In the worst case it reduced this number by 25%. However, as we repeated our experiments with 500 different sentences, we found that the average reduction in the number of sentences subjected to evaluation of the cost of adaptation is about 75%. The proposed

scheme, however, cannot be applied to complex sentences straightaway. This is because adaptation with respect to complex sentences is more difficult, owing to the more complicated structure of complex sentences in both English and Hindi. Consequently, we suggest that complex sentences may first be split into simple sentences. Then the adaptation cost based scheme may be applied to retrieve the best matches for each of the simple sentences. These retrieved translations may then be adapted to generate the translations of the simple sentences, and these translations may then be combined using linguistic rules to generate the translation of the input complex sentence. The novelty of the scheme is that it gives an algorithmic way of handling complex sentences under EBMT as well.
In this work we have dealt with complex sentences with a main clause and one relative clause. We have developed heuristics to first determine whether a sentence is complex. We use observations from our example base of complex sentences to validate the heuristics used in the sentence splitting algorithm. In this report, we have discussed algorithms for splitting complex sentences for five connectives: who, when, where, wherever and whenever. We have also developed splitting rules for other connectives (e.g. which, whom, whose, whoever, whichever), but to avoid a similar type of discussion, we have not discussed all of them in this report. Finally, we have shown the adaptation procedure for adapting given complex sentences with any of the above-mentioned connectives. In particular, we showed, for a given complex sentence, how the scheme retrieves two similar simple sentences from the example base that can be used to generate the translation of the input complex sentence.

Chapter 6

Discussions and Conclusions

6.1  Goals and Motivation

The primary goal of this research is to study various aspects of designing an EBMT
system for translation from English to Hindi. It may be observed that in todays
world a lot of information is being generated around the world in various fields. However, since most of this information is in English, it remains out of reach of people at
large for whom English is not the language of communication. This is particularly
true for a country like India, where the population size is more than a billion, yet
only about 3% of the population can understand English. As a consequence, an increasing demand for developing machine translation systems from English to various
languages of the Indian subcontinent is being felt very strongly. However, the development
of MT systems typically demands the availability of a large volume of computational
resources, which is currently not available for these languages in general. Moreover,
generating such a large volume of computational resources (which may comprise an
extensive rule base, a large volume of parallel corpora, etc.) is not an easy task.
The EBMT approach, on the other hand, is less demanding on computational resources,
making it more feasible to implement for these languages.
In this respect, we further observed that although a few English to
Hindi MT systems are available online, the quality of the translations produced by them
is not always up to a satisfactory level. This prompted us to investigate in detail the
various difficulties that one may face while developing an MT system from English
to Hindi. We feel that the studies made in this research will be helpful not only for
Hindi, but also for other languages that are major regional languages of the Indian subcontinent, and at the same time prominent minority languages of other countries
(e.g. the U.K.). Although an increasing demand for MT systems from English to these
languages is clearly evident, the development of the necessary computational resources is
still at a very rudimentary stage.


In this research we studied different aspects of designing an English to Hindi
EBMT system in great detail. In particular, we concentrated on finding suitable
solutions for the following aspects:
a) Development of an efficient retrieval and adaptation scheme.
b) Study of divergence for English to Hindi translation, and how translation divergence can be effectively handled within an EBMT framework.
c) How to handle complex sentences, which are in general considered to be
difficult to deal with in an MT system.

6.2 Contributions Made by This Research

Development of an efficient adaptation scheme. Efficient adaptation of past
examples is a major aspect of an EBMT system. Even an efficient similarity measurement scheme and a fairly large example base cannot, in general, guarantee an
exact match for a given input sentence. As a consequence, the need arises for an
efficient and systematic adaptation scheme for modifying a retrieved example, and
thereby generating the required translation. In this work we developed an adaptation strategy consisting of ten different adaptation operations. A study of the adaptation techniques suggested in different EBMT systems indicates that
these techniques work primarily at the word level. However, with respect to English and
Hindi, we observe that both languages depend heavily on suffixes for carrying out
morphological variations. We further observed that adaptation may often be simpler
and computationally less expensive if the adaptation scheme focuses on suffixes as
well. Similarly, we observed that the declensions of Hindi verbs, nouns
and adjectives often depend on some auxiliary words, called morpho-words. Using morpho-words also makes adaptation efficient and computationally cheaper. The
above observations motivated us to design an adaptation scheme comprising nine different basic operations, besides copy, to perform addition, deletion or replacement
of constituent words, morpho-words and suffixes. Successive application of these
operations helps in adapting the translation of a retrieved sentence to generate the
translation of a given input. Another advantage of using these operations is that
their algorithmic nature enables one to estimate the computational cost of each of
them. We used this estimation to design a novel similarity measurement
scheme, as explained below.
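The suffix- and morpho-word-level operations described above can be sketched as follows. This is only an illustration: the function names, the two unit operations shown and the Hindi word forms are simplified assumptions, not the exact operation set or transliteration scheme used in this work.

```python
# Illustrative sketch of adaptation operations that work on suffixes and
# morpho-words rather than on whole words.  The operation names and the
# Hindi forms below are hypothetical, for illustration only.

def replace_suffix(word: str, old: str, new: str) -> str:
    """Suffix replacement, e.g. khelataa -> khelatii (gender agreement)."""
    return word[:-len(old)] + new if word.endswith(old) else word

def replace_morpho_word(tokens: list, old: str, new: str) -> list:
    """Morpho-word replacement, e.g. hai -> thii (present -> past)."""
    return [new if t == old else t for t in tokens]

# Adapt "khelataa hai" ((he) plays) towards "khelatii thii" ((she) used to play):
tokens = ["khelataa", "hai"]
tokens[0] = replace_suffix(tokens[0], "taa", "tii")  # suffix replacement
tokens = replace_morpho_word(tokens, "hai", "thii")  # morpho-word replacement
print(" ".join(tokens))  # khelatii thii
```

Because each such operation is algorithmic, a fixed computational cost can be attached to it, which is what makes prior cost estimation possible.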
Retrieval and Adaptation. However good an adaptation scheme is, its performance is seriously hindered if the example that it attempts to adapt is not
similar enough to the input sentence. But there is no unique way of defining similarity between sentences: depending upon the application, the definition of similarity may
vary. In this work we proposed a scheme for defining similarity from the adaptation
perspective. We say that a sentence S1 is similar to another sentence S2 if adapting the translation of S1 to generate the translation of S2 is
computationally inexpensive. The lower the cost, the greater the similarity.
In this work we have provided appropriate models for the prior estimation of the cost of
adaptation. This cost depends not only on the number of basic operations to be performed, but also on the functional slots on which the operations are applied. A thorough
analysis of adaptation costs for different phrasal structures within various functional
slots (e.g., subject, object, verb), and also for different sentence types (e.g., affirmative, negative, interrogative), has been carried out, and models for estimating these
computational costs have been designed.


We have carried out experiments on retrieval using the proposed scheme, and also
with other major similarity measurement schemes, namely schemes based on commonality of words and similarity of syntax. These experiments clearly established
the superiority of the proposed scheme over the others.
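As a minimal sketch, adaptation-cost-based retrieval amounts to choosing the example whose translation is estimated to be cheapest to adapt. The unit-cost bag-of-words model below is a placeholder assumption; the models developed in this work weight each operation by its type and by the functional slot it applies to.

```python
# A toy sketch of adaptation-cost-based retrieval: pick the example with
# the lowest estimated adaptation cost.  The cost model here (one unit per
# word that must be added or deleted) is a deliberate simplification.

def adaptation_cost(input_sent: str, example_sent: str) -> int:
    a, b = input_sent.split(), example_sent.split()
    shared = len(set(a) & set(b))
    return len(a) + len(b) - 2 * shared  # words needing add/delete/replace

def retrieve(input_sent: str, example_base: list) -> str:
    return min(example_base, key=lambda ex: adaptation_cost(input_sent, ex))

base = ["ram eats a mango", "sita reads a book", "mohan writes a letter"]
print(retrieve("ram eats an apple", base))  # ram eats a mango
```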
Study of Divergence. Divergence comes as a major hindrance for EBMT.
As per Dorr, divergence occurs when structurally similar sentences of the source
language do not translate into sentences that are similar in structure in the target
language. In this work we have studied the different types of divergence that may
occur in the context of English to Hindi translation. Our findings are compared with
the divergence types reported for translations among European languages,
for which divergence has been studied extensively. Through this research, we have
been able to discover three new types of divergence that have so far not been reported
with respect to European languages. Altogether we have been able to characterize
seven divergence types that are prevalent in the English to Hindi context.
In order to deal with divergence within an EBMT system we have proposed
the following. We developed algorithms for determining, from an English sentence
and its Hindi translation, whether the translation involves any of the seven types
of divergence. These algorithms help one to partition the example base into
different parts depending upon whether a translation is normal or involves
any of the divergences. This partitioning of the example base is essential for designing
an appropriate retrieval scheme that deals with divergences efficiently.
We have further developed a corpus-evidence based scheme that enables the
system to take a prior decision on whether the translation of a given input sentence
will involve divergence. Depending on the decision of the scheme, similar
sentences are retrieved from an appropriate part of the example base.


Dealing with Complex Sentences. One of the major difficulties in developing
any MT system is to design an appropriate scheme for handling complex sentences.
With respect to an EBMT system we have observed that formulating appropriate
adaptation and retrieval policies for complex sentences is not straightforward. In
order to resolve this problem we proposed a split and translate technique to handle
complex sentences. We have developed heuristics to identify whether a sentence
is complex. We use observations from our example base of complex sentences to
validate the heuristics used in the sentence splitting algorithm. This work is based
on sentences that have at most one subordinate clause.
The split algorithm generates two simple sentences out of a given complex sentence
based on its main and relative clauses. These simple sentences are then translated
individually using the proposed approach. We have developed further heuristics
that combine these individual translations to generate the translation of the given
complex sentence.
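A much-simplified, token-level sketch of the split step is given below. The connective list follows the five connectives discussed in this work, but the clause-boundary heuristic used here (split exactly at the connective) is a deliberate simplification of the clause-structure heuristics actually developed.

```python
# Toy sketch of the split step: a sentence with exactly one of the listed
# connectives is treated as complex and split into a main clause and a
# subordinate clause.  Illustrative only; real splitting uses clause-
# structure heuristics, not a plain token scan.

CONNECTIVES = {"who", "when", "where", "wherever", "whenever"}

def split_complex(sentence: str):
    tokens = sentence.rstrip(".").split()
    hits = [i for i, t in enumerate(tokens) if t.lower() in CONNECTIVES]
    if len(hits) != 1:          # not complex, or more than one clause
        return None
    i = hits[0]
    main = " ".join(tokens[:i])
    subordinate = " ".join(tokens[i + 1:])
    return main, subordinate

print(split_complex("I met the boy who won the prize"))
# ('I met the boy', 'won the prize')
```

The two parts would then be completed into stand-alone simple sentences, translated individually, and recombined.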
One major difficulty that we faced while doing this research was that no suitable
English-Hindi parallel corpus was available online at that time. The data
required for this work should be properly aligned at both the sentence and word
levels. The example base of about 4,500 sentences used for this work has been created
and aligned manually. These sentences have been collected by scrutinizing about
30,000 translation pairs collected from various sources such as story books, recipes,
government notices, etc. For the efficient working of the proposed EBMT system, the size
of the example base should be increased. Although a huge volume of parallel data is now available
online, e.g. the EMILLE data and various websites having both English and Hindi versions
(e.g., www.iitd.ac.in, www.statebankofindia.com), this data is not aligned. In order
to use these resources effectively, proper alignment techniques for English to Hindi
alignment should be developed.

6.3 Possible extensions

The work carried out in this research may be extended in several directions:

- In this work, various adaptation procedures have been studied and developed for
particular sentence structures (and their components). These are the patterns that
are predominantly found among the sentences present in our example base.
Many other variations in sentence structure are possible which have not been
discussed in this work. Adaptation rules may be developed for such sentence
patterns.
- Although we have dealt with complex sentences, we imposed certain restrictions in this work on the structure of complex sentences. The splitting and
adaptation rules have been developed for these restricted structures only. The
proposed split and translate technique needs to be extended to complex sentences having more complicated structures. Further, we have left compound
sentences out of our discussion. Strategies are to be developed for dealing with
compound sentences as well.
- With respect to English to Hindi translation, seven types of divergence have been
identified. Although this finding is based on our analysis of about 30,000
sentences, the possibility of other divergence types cannot be completely
ruled out. A more detailed study of English-Hindi parallel corpora is required to
identify other divergence types, if any.
- The robustness of the scheme proposed for taking a prior decision about the possible divergence types in the translation of a given input sentence (refer to Chapter 4)
depends on the PSD/NSD. These dictionaries contain the proper senses of words,
and are created manually. To automate the construction of the PSD/NSD, an
appropriate Word Sense Disambiguation technique should be developed and
applied.
- Prior identification of Possessional Divergence is not discussed in Chapter 4.
This is because possessional divergence may be associated with
a large number of variations in the properties of the subject, object, premodifier of the object, etc., which are not governed by a simple set of rules.
The hypernyms (according to WordNet 2.0) of these words need to be analyzed
and compared to arrive at any conclusion regarding prior identification of this
divergence type.
- The divergence identification algorithms (discussed in Chapter 3) depend on the FT
and SPAC of the English sentence and its Hindi translation. For an English
sentence, this knowledge is extracted from the parsed version, which is obtained
from parsers available online. But no such resource is available for Hindi.
Thus, in our work we parsed and obtained the FT and SPAC of Hindi sentences
manually. For practical applications of the proposed algorithms, a Hindi parser
is needed to obtain the required information from Hindi sentences.

6.4 Epilogue

There are many issues pertaining to MT that have not been dealt with in this work.
Arguably, the two most important of them are:
- Study of pre-editing and post-editing requirements
- How to evaluate the quality of the translation given by an MT system

Both of these pertain to a full-fledged working MT system, and hence do
not fall within the purview of the work reported here. We include a brief
description of these two topics here so that future work on English-Hindi
EBMT may be directed to take care of these issues too.

6.4.1 Pre-editing and Post-editing

Pre-editing is the process of identifying and, where necessary, editing the source
text prior to translation so that any sentences (segments) of text that the machine
will have problems with are highlighted and removed. In other words, pre-editing
builds up, from a given text in its existing form (e.g. paragraph form), new text
data that the MT system is able to handle. The pre-editing required
varies according to the requirements of the MT system.
In the case of our study on EBMT systems, we have also done some pre-editing according to the requirements of the retrieval, adaptation and divergence identification
problems. Firstly, we have assumed that our original data is aligned sententially,
i.e. one source language sentence corresponds to one target language sentence. For
the retrieval and adaptation procedures, we have added the parsed version of the
source sentence, which is based on morpho-functional tags, along with word alignment information at the root level. This minimum information is stored in
our example base for carrying out adaptation, converting complex sentences into
simple sentences, and measuring similarity for effective retrieval. In Chapter 1, we
have provided Figure 1.1 showing an example record of the example base. However, the
algorithms for divergence identification require both FTs and SPAC (see Appendix
B) information for the parallel corpus. Pre-editing has to be done accordingly.
The task of post-editing is to edit, modify and/or correct text that has been
processed by a machine translation system from a source
language into a target language. In other words, post-editing corrects the output of the
MT system to an agreed standard, e.g. amending the style of the output sentences,
or making any minimal amendments that will render the text more readable.
As we have not developed a full MT system as such, post-editing is not directly
relevant to this work. But we still point out some situations where post-editing
can be useful while designing an EBMT system.
In the case of an EBMT system, post-editing may be required in the following form. The
desired translation of the input sentence is generated by adapting the translation
of the most similar example. Sometimes it may happen that even after adaptation
the system does not produce a translation that is grammatically correct in the
target language. This may be because of insufficient morpho-syntactic information
or grammar rules that the system uses while carrying out the adaptation task. In
this situation one has to correct the translation as required. Another
situation where post-editing can be useful is when the system does not have a sufficient
number of words in the dictionary. Typically, in these cases the MT system provides
a transliteration of these words in the target language. Post-editing is useful in these
cases too. The amount of post-editing required on the output provides a good
yardstick for measuring the output quality of an MT system.


6.4.2 Evaluation Measures of Machine Translation

The implementation of an MT system can be considered successful only if the translation produced by the system is of acceptable quality. This automatically raises the issue of how to evaluate the quality of the output produced by a system. In recent years, various methods have been proposed to automatically evaluate
machine translation quality. Typically, these methods take the help of a reference translation of some pre-selected test data. The reference translation is also known
as the gold-standard translation. By comparing the output produced by the system
under consideration (with respect to the pre-selected test data) with the reference
translation, an estimate of the possible discrepancy is arrived at. This in turn gives a
measure of the translation quality of the said system. Examples of such methods are
Word Error Rate (WER), Position-independent word Error Rate (PER) (Tillmann
et al., 1997), and multi-reference Word Error Rate (mWER) (Nießen et al., 2000).
Below we describe the above-named methods.

- WER: The word error rate is based on the Levenshtein distance. It is computed as the minimum number of substitution, insertion and deletion operations that have to be performed to convert the generated sentence into the
reference sentence.
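The WER computation described above can be sketched as a standard dynamic-programming Levenshtein distance over words, normalized by the reference length:

```python
# Word Error Rate: minimum substitutions, insertions and deletions needed
# to convert the system output into the reference, divided by the number
# of words in the reference.

def wer(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(hyp)][len(ref)] / len(ref)

print(wer("radha eats a mango", "radha eats the mango"))  # 0.25
```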
- PER: A major shortcoming of the WER is the fact that it requires a perfect
word order. In order to overcome this problem, the position-independent word
error rate (PER) was introduced as an additional measure. It compares the words
in the two sentences without taking the word order into account. Words that
have no matching counterparts are counted as substitution errors, missing
words are deletion errors and additional words are insertion errors. Evidently,
the PER provides a lower bound for the WER.
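PER can be sketched as a bag-of-words comparison. The exact error count below (unmatched reference words plus surplus hypothesis words) is one common formulation, assumed here for illustration:

```python
# Position-independent Error Rate: the two sentences are compared as bags
# of words, ignoring order, so PER is a lower bound for the WER.

from collections import Counter

def per(hypothesis: str, reference: str) -> float:
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    matched = sum((hyp & ref).values())          # multiset intersection
    n_hyp, n_ref = sum(hyp.values()), sum(ref.values())
    # unmatched reference words plus surplus hypothesis words
    errors = (n_ref - matched) + max(0, n_hyp - n_ref)
    return errors / n_ref

print(per("aama radha khaatii", "radha aama khaatii"))  # 0.0 (order ignored)
```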


- mWER: Often there exist many possible correct translations of a sentence.
The WER and the PER compare the produced translation to only one given
reference, which might be insufficient due to variance in syntax. Thus, a set
of reference translations for each test sentence is built. For each translation
hypothesis, the Levenshtein distance to the most similar reference sentence in
this set is calculated. This yields a more reliable error measure, and is a lower
bound for the WER.
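mWER simply takes the minimum word-level edit distance over the reference set; a self-contained sketch:

```python
# Multi-reference WER: Levenshtein distance to the closest reference in a
# set of alternative translations, normalized by that reference's length.

def edit_distance(a: list, b: list) -> int:
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def mwer(hypothesis: str, references: list) -> float:
    hyp = hypothesis.split()
    return min(edit_distance(hyp, r.split()) / len(r.split())
               for r in references)

refs = ["radha eats the mango", "radha is eating the mango"]
print(mwer("radha eats the mango", refs))  # 0.0
```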

In later years, n-gram-based schemes have been proposed to evaluate translation quality. The most prominent among them are the BLEU (Bilingual Evaluation
Understudy) (Papineni et al., 2001) and NIST (National Institute of Standards and
Technology) (Doddington, 2002) scores. All these criteria try to approximate human assessment, and often achieve an astonishing degree of correlation with human
subjective evaluation of fluency[1] and adequacy[2]. These methods work as follows.

- BLEU: This scheme has been proposed by IBM (Papineni et al., 2001). It is
based on the notion of modified n-gram precision, for which all candidate n-gram counts are collected. The geometric mean of the n-gram precisions of various
lengths between a hypothesis and a set of reference translations is computed.
This score is multiplied by a brevity penalty (BP) factor to penalize too-short
translations. Therefore,
[1] A fluent sentence is one that is well-formed grammatically, contains correct spellings,
adheres to common use of terms, is intuitively acceptable, and can be sensibly interpreted by a
native speaker.
[2] The judge is presented with the gold-standard translation, and should evaluate how much of
the meaning expressed in the gold-standard translation is also expressed in the target translation.

                 BLEU = BP · exp( (1/N) · Σ_{n=1}^{N} log pn )

Here pn denotes the precision of the n-grams in the hypothesis translation, and N
denotes the number of n-gram lengths considered, usually with n ∈ {1, 2, 3, 4}. Papineni et
al. (2001) state that BLEU captures adequacy[3] as well as fluency[4]. BLEU is
an accuracy measure, while the above-mentioned measures are error measures.
A disadvantage of the BLEU score is that longer n-grams dominate over shorter
n-grams, and it cannot match corresponding (sub)parts of the hypothesis to the
reference.
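The formula can be sketched in code as follows. This single-reference, unsmoothed version with uniform weights is only an illustration of the definition above, not the official BLEU implementation (which supports multiple references):

```python
# Compact sketch of BLEU: modified (clipped) n-gram precision, geometric
# mean over n-gram lengths, and a brevity penalty.  Single reference,
# uniform weights, no smoothing -- illustration only.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, N: int = 4) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    log_p = 0.0
    for n in range(1, N + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum((h & r).values())   # counts clipped by the reference
        total = max(1, sum(h.values()))
        if clipped == 0:
            return 0.0                    # no smoothing in this sketch
        log_p += math.log(clipped / total)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_p / N)

print(round(bleu("radha eats the mango", "radha eats the mango"), 2))  # 1.0
```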
- NIST: This score was proposed by the National Institute of Standards and Technology in 2002. It reduces the effect of longer n-grams. This criterion computes
an arithmetic average over n-gram counts instead of a geometric mean, and multiplies it by a BP factor that penalizes short sentences. Both NIST and BLEU
are accuracy measures, and thus larger values reflect better translation quality.

Each of the above schemes focuses on certain aspects of translation. No single
one of them can be considered the best from all perspectives. Hence translation
quality is typically expressed in terms of four scores, viz. WER, PER, BLEU and
NIST.
Designing a full-fledged MT system is an enormous task. Many approaches have
been proposed, and many techniques have been pursued. However, no consensus
has yet been reached regarding the best way of designing a system. In this work
we have made contributions to various aspects of MT. Some of them are specific
[3] Matches of shorter n-grams (n = 1, 2, ...) capture adequacy.
[4] Matches of longer n-grams (n = 3, 4, ...) capture fluency.

to EBMT, while some others, such as the study of divergences, are relevant for other
paradigms as well. We hope that the contributions made in this thesis will be useful
for designing an English to Hindi MT system, and also for many other language pairs
at large.


Appendix A


A.1 English and Hindi Language Variations

The English and Hindi languages are of two different origins, so a study of their general
structural properties is necessary. In this discussion, some of the basic concepts of
translation from English to Hindi are briefly outlined. Some of the general structural
properties of English and Hindi (Kachru, 1980; Kellogg and Bailey, 1965; Singh,
2003; Quirk and Greenbaum, 1976) are described below.

Sentence Pattern: The basic sentence pattern in English is Subject (S) Verb
(V) Object (O), whereas it is SOV in Hindi. Consider for example Radha
eats mango: here Radha is the subject, eats is the verb and mango is the
object, so the words occur in the order SVO. But in Hindi it becomes

    radha (S)    aama (O)    khaatii hai (V)
    (Radha)      (mango)     (eats)

Order of Words in a Sentence: English is a positional language and therefore
has a (relatively) fixed word order. Relations between the various components of a
sentence are mainly shown by the relative positions of the components. For example,
Radha watches the sparrows.
is very different from
The sparrows watch Radha.
Hindi has a (relatively) free word order. Relations between the various components of a
sentence are mainly shown by inflecting the components. A change in the position
of components normally changes the emphasis of an utterance, not its
basic meaning.

For example:

    radha      chidiyaan     dekhatii hai
    (Radha)    (sparrows)    (watches)

has the same meaning as

    chidiyaan     radha      dekhatii hai
    (sparrows)    (Radha)    (watches)

The above-mentioned differences are structural differences between English and Hindi.
Other differences lie in the part-of-speech properties of the two languages.
These discrepancies are as follows:

Noun: Hindi nouns are affected by gender, number and case endings (Kellogg
and Bailey, 1965). These are as follows:
1. Gender: English has four genders - masculine (MASC), feminine (FEM),
common and neuter - whereas Hindi has only two: masculine and feminine. The neuter gender of Sanskrit (the origin of the Indian languages)
has vanished from Hindi as well as from the closely related languages.
2. Number: Like English, Hindi also has two numbers - singular and plural.
Some possible suffixes for singular-to-plural conversion in Hindi
are as follows (Kellogg and Bailey, 1965):

    Singular                   Plural
    ladkaa - boy (MASC)        ladke - boys
    ghar - house (MASC)        ghar - houses (no change)
    kapadaa - cloth (MASC)     kapade - clothes
    ladkii - girl (FEM)        ladkiyaan - girls
    kakshaa - class (FEM)      kakshayen - classes
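These singular-to-plural patterns can be encoded as a toy rule set. The function below is drawn only from the five examples above; it is not a complete Hindi pluralizer, and the gender-conditioned rules are a simplification:

```python
# Toy pluralization rules induced only from the five examples above.
# Real Hindi pluralization has many more patterns and exceptions.

def pluralize(noun: str, gender: str) -> str:
    if gender == "MASC":
        if noun.endswith("aa"):
            return noun[:-2] + "e"        # ladkaa -> ladke
        return noun                       # ghar -> ghar (no change)
    if noun.endswith("ii"):
        return noun[:-2] + "iyaan"        # ladkii -> ladkiyaan
    if noun.endswith("aa"):
        return noun[:-2] + "ayen"         # kakshaa -> kakshayen
    return noun

print(pluralize("kapadaa", "MASC"))  # kapade
```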
3. Case ending: There are eight case endings in Hindi, which are given below
in Table A.2. All of these are appended to the oblique form of the noun,
where such a form exists. There are some rules for forming oblique nouns.
Some of them are as follows:
(a) Masculine singular nouns ending in aa change it into e when a
case ending is added: e.g. ladkaa + ne → ladke ne. Nouns ending
in other vowels do not undergo such changes, e.g. ghar ko, daaku
kaa.
(b) If a noun (masculine or feminine) ends in a (silent), it is changed into
aon in the plural when a case ending is added. For example: in the
house → ghar mein, while in the houses → gharon mein.
Note that normally the plural of ghar is ghar, but it changes to
gharon in the above example because of the case ending.
The particles added to the oblique form of a noun are commonly
called postpositions.
    Case                Case-endings
    Nominative case     ne
    Accusative case     ko
    Agent case          se (by, with and through)
    Dative case         ko (to), ke liye (for), ke waaste
    Ablative case       se (from, since)
    Possessive case     kaa, ke, kii
    Locative case       mein, par (in, on)
    Vocative case       he, ajii, are

Table A.2: Different Case Endings in Hindi

No postposition is used with the nominative and the vocative. Here we will discuss
three cases: the nominative, accusative and possessive. The other cases work the same
as the corresponding English case endings. These cases are as follows:
1. Nominative case: The subject of a sentence takes the nominative sign
ne only when its predicate is a transitive verb in a past tense (past
indefinite, present perfect or past perfect). The use of this case is to
make a noun or pronoun act as the subject of a verb. In that case, the verb agrees
with the object in gender and number. For example,
    Ram narrated a story.
    ram ne    kahaanii    sunaayii
    (Ram)     (story)     (narrated)

    The farmer has sowed the seeds.
    kisaan ne    biij       boyee hain
    (farmer)     (seeds)    (sowed has)

Here in these two examples the objects of the translated sentences are kahaanii
and biij. The number and gender of these nouns are singular
feminine and plural masculine, respectively.
2. Accusative case: ko is the sign of this case and it is generally added
only to animate objects. Sometimes it is also added to inanimate objects, either to intensify the effect or to express a special significance. For
example:

    The boy beats the dog.
    ladkaa    kutte ko    martaa hai
    (boy)     (dog)       (beats)

3. Possessive case: The signs of this case are kaa, ke and kii. These
words are used with a noun according to the gender, number and case ending
of the following noun. This case ending has already been discussed in
detail in Section 2.5.2 of Chapter 2.
Preposition-Postposition: In English, a preposition occurs before the noun, e.g.
on the table, in the box. But in Hindi it occurs after the noun (e.g.
meja (table) par (on) - on the table), and hence it may be called a postposition instead of a preposition. However, for ease of understanding we shall
call them prepositions only.
Article: As Hindi has no articles, the distinction indicated in English by the definite and indefinite articles cannot always be expressed in Hindi. Thus ghodha
may be either a horse or the horse; istriyaan may be women or the
women. The indefinite article may sometimes be rendered by the numeral
aka (one), or the indefinite pronouns koyii (any) and kuchh (some).

A.2 Verb Morphological and Structure Variations

Every language has its own grammar rules. In other words, the same sentence
may be described by different grammatical categories depending on the language concerned.
For example, consider the English sentence He will be sleeping at the moment. Its
translation in Hindi is wah iss samay so rahaa hogaa. As per English grammar
rules, the verb phrase is in the future tense and progressive (or continuous) aspect,
but the verb phrase of the Hindi sentence comes under the definite potential type
of mood according to Hindi grammar. For the translation work, we have followed the
English grammar categorization of verb phrase structure (Quirk and Greenbaum,
1976), which involves different combinations of tense, aspect and mood.
To understand English to Hindi verb structure, the conjugation of the root verb in Hindi
is presented in the following subsection.

A.2.1 Conjugation of Root Verb

Verb morphological variations in Hindi depend on four aspects: tense and form of
the sentence, gender of the subject, person of the subject and number of the subject.
All these variations affect the root verb of a sentence. Since there are three tenses
(i.e. Present, Past and Future) and four forms (i.e. Indefinite, Continuous, Perfect,
and Perfect Continuous), in all one can have 12 different conjugations. In Hindi,
these conjugations are realized using suffixes attached to the root verbs, and/or
adding some auxiliary verbs, which we call Morpho-Words (MW). Table A.3 gives
the total number of morphological words and suffixes in Hindi, for all the tenses and
their forms.
Tense form    Present Tense               Past Tense                 Future Tense

Indefinite    Suffix: taa, tii, te        Suffix: taa, tii, te       Suffix: oongaa, oongii,
              MW: hoon, hai, ho, hain     MW: thaa, thii, the        oge, ogii, egaa, egii,
                                                                     enge, engii

Continuous    MW: rahaa, rahe, rahii,     MW: rahaa, rahe, rahii,    MW: rahaa, rahe, rahii,
              hoon, hai, ho, hain         thaa, the, thii            hoongaa, hoongii, hoonge,
                                                                     hogaa, hogii, hoge

Perfect       MW: hoon, hai, hain, ho,    MW: thaa, the, thii,       MW: hoongaa, hoongii,
              chukaa, chukii, chuke       chukaa, chukii, chuke      hoonge, hogaa, hogii,
                                                                     hoge, chukaa, chukii,
                                                                     chuke

Perfect       Same as Continuous          Same as Continuous         Same as Continuous
Continuous

Table A.3: Suffixes and Morpho-Words for Hindi Verb Conjugations


The above suffixes and morpho-words in the present perfect, past indefinite and
past perfect are used for the literal translation of a sentence. Actually, the conjugation of
the root verb uses aa, e and ii. It has been observed from Table
A.3 that the suffixes {taa, te, tii} are added to the root form in the past indefinite tense.
According to the tense form, the morpho-words {thaa, the, thii}, {chukaa, chukii,
chuke} and {hoon, hai, ho, hain} are added after the main verb of the sentence.
Another possible way of expressing these three tenses and forms in Hindi is that,
in place of the above-mentioned suffixes, a conjugation of the verb is used that
is different from those of the tenses and forms discussed earlier. The morpho-words
{thaa, the, thii} or {hoon, hai, ho, hain} are then added, depending upon the tense, towards
the end of the sentence.
Some rules for these conjugations of verbs are as follows (Sastri and Apte, 1968):
- If the root of the verb ends in a (silent), lengthen it to aa in the masculine
singular and change it into e for the masculine plural; in the feminine singular it
becomes ii and in the feminine plural iin. For example, the verb play (khel) is in Hindi khelaa (masculine singular), khelii (feminine singular),
khele (masculine plural) and kheliin (feminine plural).
- If the root ends in aa or oo, yaa is added, which changes according
to the aa, ai and ii rule[1]. Sometimes e is used in place of ye, and
ii and iin in place of yii and yiin, respectively. For example, for the
verb come (aa): in the masculine aayaa (singular) and aaye or aae
(plural), and in the feminine aayii or aaii (singular) and aayiin or aaiin
(plural).
[1] The aa, ai, ii Rule (Sastri and Apte, 1968): Masculine words ending in aa form their
plurals by changing the aa into e and their feminines by changing aa into ii.
- If the verb root ends in uu, change it into u and add aa and e in the
masculine and ii and iin in the feminine. For example, for the verb touch
(chhuu): in the masculine chhuaa and chhue, and in the feminine chhuii
and chhuiin.

These rules are defined as the PCP verb form rules.
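The three PCP rules can be sketched directly in code. The transliteration conventions and rule coverage below follow only the examples given above:

```python
# Sketch of the PCP (participle) conjugation rules described above, keyed
# on how the verb root ends.  Coverage is limited to the rules and
# transliterations shown in the text.

def pcp(root: str, gender: str, number: str) -> str:
    suffix = {("M", "sg"): "aa", ("M", "pl"): "e",
              ("F", "sg"): "ii", ("F", "pl"): "iin"}[(gender, number)]
    if root.endswith(("aa", "oo")):
        # 'yaa' is added, then inflected: aayaa, aaye, aayii, aayiin
        return root + "y" + suffix
    if root.endswith("uu"):
        return root[:-2] + "u" + suffix   # chhuu -> chhuaa, chhue, ...
    return root + suffix                  # khel -> khelaa, khele, ...

print(pcp("khel", "F", "pl"))  # kheliin
```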


The above-mentioned morpho-words and suffixes constitute the morphological transformations in Hindi of the English verb group. Table A.4 provides some conjugations of
the verb write, illustrating the knowledge systematized in Table A.3.
English Sentence             Gender   Tense                 Hindi Sentence

I am writing a letter.       M/F      Present continuous    main patr likh rahaa hoon /
                                                            main patr likh rahii hoon
You write a letter.          M/F      Present indefinite    tum patr likhte ho /
                                                            tum patr likhtii ho
I write a letter.            M/F      Present indefinite    main patr likhtaa hoon /
                                                            main patr likhtii hoon
He (She) was writing a       M/F      Past continuous       wah patr likh rahaa thaa /
letter.                                                     wah patr likh rahii thii
We will write a letter.      M/F      Future indefinite     hum patr likhenge /
                                                            hum patr likhengii
Sita wrote a letter.         F        Past indefinite       Sita ne patr likhaa

Table A.4: Verb Morphological Changes From English to Hindi Translation
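The pattern of Table A.4 can be captured in a small table-driven sketch. This is our own illustration, not code from the thesis; only the tense/gender combinations shown in the table are encoded, the forms correspond to the specific subject of each table row, and the dictionary and function names are hypothetical.

```python
# Verb-group forms for the root likh (write), keyed by (tense, gender),
# following the rows of Table A.4; "m"/"f" = masculine/feminine.
LIKH_FORMS = {
    ("present continuous", "m"): "likh rahaa hoon",
    ("present continuous", "f"): "likh rahii hoon",
    ("present indefinite", "m"): "likhtaa hoon",
    ("present indefinite", "f"): "likhtii hoon",
    ("past continuous", "m"): "likh rahaa thaa",
    ("past continuous", "f"): "likh rahii thii",
    ("future indefinite", "m"): "likhenge",
    ("future indefinite", "f"): "likhengii",
}

def hindi_sentence(subject, obj, tense, gender):
    """Assemble subject + object + conjugated verb group (Hindi SOV order)."""
    return f"{subject} {obj} {LIKH_FORMS[(tense, gender)]}"

print(hindi_sentence("main", "patr", "present continuous", "m"))
# main patr likh rahaa hoon
```

The point of the sketch is that the suffix and morpho word are selected jointly by tense and gender, exactly as the table organizes them.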

A similar discussion can be made for the passive verb form as well. The passive form can be formulated for transitive verbs only.
The morphological variation depends on the gender and number of the object of the active form of the sentence, which is essentially the subject in the passive form (Sastri and Apte, 1968). The subject of the active form occurs in the passive form in the instrumental case, marked by by, whose Hindi equivalent is se, ke duwaraa or duwaraa. In the passive form, the changes in the main verb follow the rules of the PCP form of the verb, as discussed in Section A.1. Moreover, an extra verb jaa is introduced after the main verb, and the suffixes given in Table A.3 are added to this additional verb instead of to the main verb of the sentence. The morpho words are added after the conjugation of the verb jaa. Consider the following pair of examples:
We add sugar to milk.

ham    dudh    mein   shakkar   daalte hain
(we)   (milk)  (in)   (sugar)   (add)

Sugar is added to milk by us.

dudh    mein   shakkar   hamare   duwaraa   daalii jatii hai
(milk)  (in)   (sugar)   (us)     (by)      (is added)
The first example is in the active form and the second example in the passive form. The verb morphological changes are according to the above discussion.
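The passive construction described above can be sketched as follows. This is our own illustration, not the thesis's procedure: the helper tables cover only the present indefinite tense, the function and argument names are hypothetical, and agreement is driven by the object, as the text states.

```python
def passivize(obj, obj_gender, obj_number, agent, verb_root):
    """Active -> passive sketch for a present-indefinite Hindi sentence.

    The main verb takes the PCP suffix agreeing with the object (the
    passive subject); the extra verb jaa carries the tense suffix and the
    morpho word; the agent is marked by duwaraa.
    """
    pcp = {("m", "sg"): "aa", ("m", "pl"): "e",
           ("f", "sg"): "ii", ("f", "pl"): "iin"}
    jaa = {("m", "sg"): "jataa hai", ("m", "pl"): "jate hain",
           ("f", "sg"): "jatii hai", ("f", "pl"): "jatii hain"}
    key = (obj_gender, obj_number)
    main_verb = verb_root + pcp[key]   # e.g. daal + ii -> daalii
    return f"{obj} {agent} duwaraa {main_verb} {jaa[key]}"

print(passivize("shakkar", "f", "sg", "hamare", "daal"))
# shakkar hamare duwaraa daalii jatii hai
```

Extending the sketch to other tenses would mean replacing the jaa table with the full suffix and morpho-word inventory of Table A.3.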

Appendix B

B.1 Functional Tags

In this work we have used the ENGCG parser1 for parsing the English sentences. Most of the FTs that are relevant for this work are obtained directly from the parser. Descriptions of these FTs are given below:
@+FAUXV        Finite Auxiliary Predicator (e.g. He can read.)
@-FAUXV        Nonfinite Auxiliary Predicator (e.g. She may have read.)
@+FMAINV       Finite Main Predicator (e.g. He reads.)
@-FMAINV       Nonfinite Main Predicator (e.g. She has read.)
@SUBJ          Subject (e.g. He reads.)
@F-SUBJ        Formal Subject (e.g. There was some argument about that. It is raining.)
@N             Title (e.g. King George and Mr. Smith)
@DN>           Determiner (e.g. He read the book.)
@NN>           Premodifying Noun (e.g. The car park was full.)
@AN>           Premodifying Adjective (e.g. The blue car is mine.)
@QN>           Premodifying Quantifier (e.g. He had two sandwiches and some coffee.)
@GN>           Premodifying Genitive (e.g. My car and Bill's bike are blue.)
@AD-A>         Premodifying Ad-Adjective (e.g. She is very intelligent.)
@OBJ           Object (e.g. She read a book.)
@PCOMPL-S      Subject Complement (e.g. He is a fool.)
@I-OBJ         Indirect Object (e.g. He gave Mary a book.)
@ADVL          Adverbial (e.g. She came home late. She is in the car.)
@<NOM-OF       Postmodifying of (e.g. Five of you will pass.)
@<NOM-FMAINV   Postmodifying Nonfinite Verb (e.g. He has the licence to kill. John is easy to please.)
@<AD-A         Postmodifying Ad-Adjective (e.g. This is good enough.)
@INFMARK>      Infinitive Marker (e.g. John wants to read.)
@<P-FMAINV     Nonfinite Verb as Complement of Preposition (e.g. This is a brush for cleaning.)
@CC            Coordinator (e.g. John and Bill are friends.)
@CS            Subordinator (e.g. If John is there, we shall go, too.)
@NEG           Negative Particle (e.g. It is not funny.)
@<P            Other Complement of Preposition (e.g. He is in the car.)

1 http://www.lingsoft.fi/cgi-bin/engcg

Each FT is prefixed by @, in contradistinction to other types of tags. Some tags include an angle bracket, < or >. The angle bracket indicates the direction in which the head of the word is to be found.
Some of the functional tags that are required for the divergence identification algorithms are not directly given by the available parsers. These FTs are Adjunct (A), Predicative Adjunct (PA) and Verb Complement (VC) (refer to Appendix C for their definitions). We have formulated rules for obtaining these FTs by using the information available in the morpho tags of the underlying sentence.
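The idea of deriving the missing FTs from morpho tags can be sketched as below. The actual rules are formulated in the thesis but not reproduced here; the token representation, the copula list and the decision conditions are our own simplifying assumptions.

```python
def refine_advl(tokens):
    """Relabel ENGCG's @ADVL tag as A (adjunct), PA (predicative adjunct)
    or VC (verb complement), using morpho-tag information.

    tokens: list of (word, morpho_tags, ft) triples -- an assumed format.
    """
    copulas = {"am", "is", "are", "was", "were", "be", "been"}
    main_verbs = [(w, tags) for w, tags, ft in tokens if ft == "@+FMAINV"]
    is_copular = any(w.lower() in copulas for w, _ in main_verbs)
    is_intransitive = any("<SV>" in tags for _, tags in main_verbs)
    out = []
    for word, tags, ft in tokens:
        if ft == "@ADVL":
            if is_copular:
                ft = "PA"          # complement of a copula
            elif is_intransitive and "PREP" in tags:
                ft = "VC"          # PP completing an intransitive verb
            else:
                ft = "A"           # ordinary adjunct
        out.append((word, tags, ft))
    return out

# "We depend on him." -- the PP headed by "on" completes the intransitive verb
toks = [("We", ["PRON"], "@SUBJ"), ("depend", ["V", "<SV>"], "@+FMAINV"),
        ("on", ["PREP"], "@ADVL"), ("him", ["PRON"], "@<P")]
print(refine_advl(toks)[2][2])  # VC
```

The verb-feature tags (<SV>, <SVO>, ...) listed in Section B.2 are what make this kind of refinement possible without re-parsing.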

B.2 Morpho Tags

Each morpho tag is followed by a short description and an example.


Part-of-speech tags

A          adjective (small)
ADV        adverb (soon)
CC         coordinating conjunction (and)
CS         subordinating conjunction (that)
DET        determiner (any)
INFMARK>   infinitive marker (to)
INTERJ     interjection (hooray)
N          noun (house)
NEG-PART   negative particle (not)
NUM        numeral (two)
PCP1       -ing form (writing)
PCP2       -ed/-en form (written, decided)
PREP       preposition (in)
PRON       pronoun (this)
V          verb (write)

Features for adjectives

ABS    absolute form (good)
CMP    comparative form (better)
SUP    superlative form (best)

Features for adverbs

ABS    absolute form (much)
CMP    comparative form (sooner)
SUP    superlative form (fastest)
WH     wh-adverb (when)
ADVL   adverb always used as an adverbial (in)

Features for determiners

<**CLB>   clause boundary (which)
<Def>     definite (the)
<Indef>   indefinite (an)
<Quant>   quantifier (some)
ABS       absolute form (much)
ART       article (the)
CENTRAL   central determiner (this)
CMP       comparative form (more)
DEM       demonstrative determiner (that)
GEN       genitive (whose)
NEG       negative form (neither)
PL        plural (few)
POST      postdeterminer (much)
PRE       predeterminer (all)
SG        singular (much)
SG/PL     singular or plural (some)
SUP       superlative form (most)
WH        wh-determiner (whose)

Features for nouns

<Proper>   proper (Jones)
GEN        genitive case (people's)
PL         plural (cars)
SG         singular (car)
SG/PL      singular or plural (means)

Features for numerals

<Fraction>   fraction (two-thirds)
CARD         cardinal numeral (four)
ORD          ordinal numeral (third)
SG           singular (one-eighth)
PL           plural (three-eighths)

Features for pronouns

<**CLB>       clause boundary (who)
<Comp-Pron>   compound pronoun (something)
<Interr>      interrogative (who)
<Quant>       quantitative pronoun (some)
<Refl>        reflexive pronoun (themselves)
<Rel>         relative pronoun (which)
ABS           absolute form (much)
CMP           comparative form (more)
DEM           demonstrative pronoun (those)
FEM           feminine (she)
GEN           genitive (our)
MASC          masculine (he)
NEG           negative form (none)
PERS          personal pronoun (you)
PL            plural (fewer)
PL1           1st person plural (us)
PL2           2nd person plural (yourselves)
PL3           3rd person plural (them)
RECIPR        reciprocal pronoun (each=other)
SG            singular (much)
SG/PL         singular or plural (some)
SG1           1st person singular (me)
SG2           2nd person singular (yourself)
SG2/PL2       2nd person singular or plural (you)
SG3           3rd person singular (it)
SUP           superlative form (most)
WH            wh-pronoun (who)
SUBJ          a pronoun in the nominative that is always used as a subject (he)

Features for prepositions

<CompPP>   multi-word preposition (in=spite=of)

Features for verbs

<SV>      intransitive (go)
<SVO>     monotransitive (open)
<SVOO>    ditransitive (give)
<SVC/A>   copular with adjective complement (plead)
<SVC/N>   copular with noun complement (become)
AUXMOD    modal auxiliary (can)
IMP       imperative (go)
INF       infinitive (be)
PAST      past tense (wrote)
PRES      present tense (sings)

Appendix C

C.1 Definitions of Some Non-typical Functional Tags and SPAC Structures

The definitions of some non-typical functional tags that we have used in our algorithms are given below.
1. Adjunct (A): An adjunct is a type of adverbial indicating the circumstances
of the action. Adjuncts may be obligatory or optional. They express such
relations as time, place, manner, reason, condition, i.e. they are answers to
the questions where, when, how and why.
For example:
He lives in Brazil.
She was walking slowly.
Here, in Brazil is the adjunct as it gives the answer to where.
2. Predicative Adjunct (PA): If the copula (linking verb) is present, and it allows an adverbial as complementation, then the complementation is called a predicative adjunct.
For example:
The children are at the zoo.
The party will be at nine o'clock.
The two eggs are for you.
The party will be tonight.
Here, the prepositional phrases and the adverb tonight are examples of predicative adjuncts.

3. Adjective Complementation by Prepositional Phrase (SC C): Some predicative adjectives require complementation by a prepositional phrase. Such a prepositional phrase is called the postmodifier of the adjective complement. Note that the preposition may be specific to a particular adjective.
For example:
Mary is fond of music. (Here, of music is the SC C.)
He was angry (with Mary). (Here, the parentheses indicate that the complement is optional.)
4. Verb Complement (VC): Sometimes prepositional phrases may act as the complementation of a verb. We use the generalized term verb complement to denote them. This happens when the main verb of the sentence is intransitive. In the case of transitive and ditransitive verbs, direct and/or indirect objects are used to complete the sense of the sentence; these are not considered here under verb complement. We have used their actual names, object and indirect object.
For example:
We depend on him. (Here, on him is the verb complement.)
We want a treat. (direct object)
He gave a pen (direct object) to me (indirect object).
Although many English parsers are available on-line, none of them gives full information on both FT and SPAC. In order to glean both types of information, we had to combine the outputs of two parsers. In particular, we have used the following two parsers:


1. ENGCG parser1 for FTs

2. Memory-Based Shallow Parser (MBSP)2 (Buchholz, 2002; Daelemans et al., 1996) for SPAC
Obtaining the FTs in a given sentence is necessary for successful implementation of the algorithms. Some of the functional tags that are required for the divergence identification algorithms are not directly given by the available parsers. These FTs are Adjunct (A), Predicative Adjunct (PA), Postmodifier of the Subjective Complement (SC C), and Verb Complement (VC). We have formulated rules for obtaining these FTs by using the information available in the morpho tags of the underlying sentence.
The SPAC structure has been taken from MBSP. No specific rules are required to capture the SPAC structure of a sentence. However, we made small structural changes so that it can be manipulated by the program easily. For example, consider the sentence The student is weak in his studies. MBSP gives the following output:
[NP the/DT student/NN NP] [VP is/VBZ VP] [ADJP weak/JJR ADJP] {PNP [Prep in/IN Prep] [NP his/PRP$ studies/NNS NP] PNP}.
Since all this information is not required by our identification algorithm, we have somewhat simplified the representation. Thus, in our notation the above tag information is represented as:
[NP [the/DT student/N]] [VP [is/V]] [ADJP [weak/Adj]] [PP in/IN [NP his/PRP$ studies/N]]
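The simplification step can be sketched with two regular-expression rewrites: dropping the chunk label that MBSP repeats before each closing bracket, and coarsening the POS tags. This is our own reconstruction of the idea, not the thesis's actual code; the coarse tag map below covers only the tags appearing in the example sentence.

```python
import re

def simplify_mbsp(chunked):
    """Simplify MBSP chunk output toward the thesis's compact notation."""
    coarse = {"NN": "N", "NNS": "N", "VBZ": "V", "JJR": "Adj"}
    # remove the chunk label repeated before each closing bracket: " NP]" -> "]"
    s = re.sub(r"\s+[A-Z]+\]", "]", chunked)
    # coarsen each word/TAG pair; unknown tags (DT, PRP$, ...) are kept as-is
    s = re.sub(r"/([A-Z$]+)",
               lambda m: "/" + coarse.get(m.group(1), m.group(1)), s)
    return s

print(simplify_mbsp("[NP the/DT student/NN NP] [VP is/VBZ VP]"))
# [NP the/DT student/N] [VP is/V]
```

The extra level of bracketing in the thesis's notation ([NP [the/DT student/N]]) is not reproduced by this sketch; it only illustrates the label-stripping and tag-coarsening part of the transformation.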
For Hindi, no parser is available on-line. Following the English parser information, we have tagged the Hindi sentences for our work, i.e., we have used the same FTs and the same SPAC structure for Hindi.

1 http://www.lingsoft.fi/cgi-bin/engcg
2 http://pi0657.kub.nl/cgi-bin/tstchunk/demo.pl

Appendix D

D.1 Semantic Similarity

Semantic similarity between two words is computed on the basis of their semantic distance (sd) (Stetina et al., 1998), as follows:

sim(a, b) = 1 - (sd(a, b))^2

The semantic similarity score lies between 0 and 1. The semantic distance (Stetina et al., 1998) between two words, say a and b, is computed as follows.

Semantic Distance for Nouns and Verbs

sd(a, b) = (1/2) * ((Ha - H)/Ha + (Hb - H)/Hb)

Here, Ha is the depth of the hypernyms of a, Hb is the depth of the hypernyms of b, and H is the depth of their nearest common ancestor.
Semantic Distance for Adjectives and Adverbs

sd(a, b) = 0 for the same adjectival synsets (including synonymy)
sd(a, b) = 0 for the synsets in the antonym relation ant(a, b)
sd(a, b) = 0.5 for the synsets in the same similarity cluster and the antonym relation ant(a, b)
sd(a, b) = 1 for all other synsets
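The noun/verb formulas above can be checked with a direct transcription. The function names and the use of raw depth counts are our own; in practice the depths Ha, Hb and H come from the WordNet hypernym hierarchy.

```python
def semantic_distance(depth_a, depth_b, depth_common):
    """sd(a, b) for nouns/verbs from hypernym-hierarchy depths.

    depth_a / depth_b: depths of the hypernym chains of a and b;
    depth_common: depth of their nearest common ancestor (H in the text).
    """
    return 0.5 * ((depth_a - depth_common) / depth_a
                  + (depth_b - depth_common) / depth_b)

def semantic_similarity(depth_a, depth_b, depth_common):
    sd = semantic_distance(depth_a, depth_b, depth_common)
    return 1 - sd ** 2

# Identical words: the common ancestor sits at full depth, so sd = 0, sim = 1.
print(semantic_similarity(5, 5, 5))   # 1.0
# Distant words: the common ancestor is near the root, so sim falls toward 0.
print(semantic_similarity(4, 6, 1))
```

Note that sd stays in [0, 1], so squaring it makes the similarity score fall off slowly for near-synonyms and sharply for distant words.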

Appendix E

E.1 Cost Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective

Here, the transformation from a pre-modifier adjective to a pre-modifier adjective requires fourteen adaptation operations. Below we describe the cost of each of them. Note that the pre-modifier word can be either ABS or A(PCP1) or A(PCP2) (see Section 2.5.4). We denote this set by R.

1. The average cost of word replacement from the set R to ABS. This cost is denoted by w1. Note that in this case an adjective dictionary search is necessary, for which the search time is 12.41 (see item 4 of Section 5.3). Hence, the total average cost may be computed as (l1 * L2) + (l2 * Lp/2) + {(d * 12.41) + (c * 10^5)}.
2. The average cost of word replacement from the set R to either A(PCP1) or A(PCP2). We denote this cost as w2. Here also, a dictionary search is required. Note that the adjective forms A(PCP1) and A(PCP2) are derived from the verb part of speech; therefore, in this case the dictionary search time is 12.08. Hence the total average cost is (l1 * L2) + (l2 * Lp/2) + {(d * 12.08) + (c * 10^5)}.

3. The average cost of morpho-word addition from the set {huaa, huye, huii}. This cost is denoted as w3. Since there are three morpho words in total, the average cost may be formulated as (l1 * L2) + (m * 3/2).
4. The average cost of morpho-word deletion from the set {huaa, huye, huii}. We denote it as w4. This average cost is evaluated as (l1 * L2) + (m * 3/2).


5. The average cost of suffix replacement from the set {aa, e, ii} is (l1 * L2) + (k * 3/2) + (k * 3/2). We denote it as s1.
6. The average cost of suffix addition from the set {taa, te, tii}. This cost is denoted as s2, which is computed to be (l1 * L2) + (k * 3/2).
7. The average cost of suffix replacement in the verb form of A(PCP2) by using the PCP form of the verb (see Appendix A). We denote it as s3. Hence, the total average cost is (l1 * L2) + (k * 6/2) + (k * 6/2).
8. The average cost of suffix addition in the verb form of A(PCP2) by using the PCP form of the verb is (l1 * L2) + (k * 8/2). We denote this cost as s4.
9. We denote the average cost of suffix replacement from the set {taa, te, tii} as s5, which is formulated as (l1 * L2) + (k * 3/2) + (k * 3/2).
10. The average cost of suffix replacement from the set {aa, ye, ii} is (l1 * L2) + (k * 3/2) + (k * 3/2). We denote it as s6.
11. The average cost of suffix replacement from the suffix set {taa, te, tii} to any of the suffixes required for the verb form of A(PCP2) (using the PCP verb form rule, see Appendix A). We denote it as s7. Since the number of suffixes required for the verb form of A(PCP2) is fourteen, the average cost of this operation may be formulated as (l1 * L2) + (k * 14/2) + (k * 3/2).

12. The average cost of suffix replacement from any of the suffixes required for the verb form of A(PCP2) to {taa, te, tii} is (l1 * L2) + (k * 3/2) + (k * 14/2) (here also, as in item 11 above). We denote it as s8.

13. The average cost of suffix replacement for the verb form of A(PCP2) to the verb form of A(PCP2) (using the PCP verb form rule, see Appendix A) is (l1 * L2) + (k * 8/2) + (k * 8/2). This cost is denoted by s9.
14. The average cost of suffix addition for the verb form of A(PCP2) to the verb form of A(PCP2). We denote it as s10, which may be formulated as (l1 * L2) + (k * 6/2) (in a similar manner, as in item 13 above).
Table E.1 gives the cost of pairwise modification from pre-modifier adjective to pre-modifier adjective, by referring to the adaptation rules of Table 2.10.
Input ->     ABS                      A(PCP1)                                  A(PCP2)
Retd
ABS          0 or s1 or (w1 + {s1})   w2 + s2                                  w2 + w3 + (s3 or s4)
A(PCP1)      w2 + {s1} + w4           0 or ({w2} + {s5} +                      w2 + s7 + {s6}
                                      ((s7 + {s6}) or {s6}))
A(PCP2)      w2 + {s1} + w4           (s8 + {s6}) or (w2 + s8 + {s6})          0 or ({w2} + {s6} +
                                                                               {s9 or s10})

Table E.1: Costs Due to Adapting Pre-modifier Adjective to Pre-modifier Adjective


Bibliography

Ansell, M.: 2000, English Grammar: Explanations and Exercises, second edn, http://www.fortunecity.com/bally/durrus/153/gramdex.html.

Arnold, D. and Sadler, L.: 1990, Theoretical basis of MiMo, Machine Translation 5(3), 195-222.

Bender, E.: 1961, Hindi Grammar and Reader, University of Pennsylvania Press, University of Pennsylvania South Asia Regional Studies, Philadelphia, Pennsylvania.

Bennett, W. S.: 1990, How much semantics is necessary for MT systems?, Proceedings of the Third International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Linguistics Research Center, The University of Texas, Austin, TX, pp. 261-269.

Bharati, A., Sriram, V., Krishna, A. V., Sangal, R. and Bendre, S.: 2002, An algorithm for aligning sentences in bilingual corpora using lexical information, International Conference on Natural Language Processing, Mumbai.

Brown, P. F.: 1990, A statistical approach to Machine Translation, Computational Linguistics 16(2), 79-85.
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., Lafferty, J. D. and Mercer, R. L.: 1992, Analysis, statistical transfer, and synthesis in machine translation, Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Montreal, Canada, pp. 83-100.

Brown, P. F., Pietra, S. A. D., Pietra, V. J. D. and Mercer, R. L.: 1993, The mathematics of statistical Machine Translation: parameter estimation, Computational Linguistics 19(2), 263-311.

Brown, P., Lai, J. C. and Mercer, R. L.: 1991, Aligning sentences in parallel corpora, Proc. of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, pp. 169-176.

Brown, R. D.: 1996, Example-Based Machine Translation in the Pangloss system, Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark, pp. 169-174.

Brown, R. D.: 1999, Adding linguistic knowledge to a lexical Example-Based Translation System, Proceedings of the Eighth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99), Chester, UK, pp. 22-32.

Brown, R. D.: 2000, Automated generalization of translation examples, Proceedings of the Eighteenth International Conference on Computational Linguistics, pp. 125-131.

Brown, R. D.: 2001, Transfer-rule induction for Example-Based Translation, Proceedings of the MT Summit VIII Workshop on Example-Based Machine Translation, Santiago de Compostela, Spain, pp. 1-11.
Buchholz, S.: 2002, Memory-Based Grammatical Relation Finding, PhD thesis, Tilburg University, Netherlands.

Carl, M. and Hansen, S.: 1999, Linking translation memories with Example-Based Machine Translation, Proceedings of Machine Translation Summit VII, Singapore, pp. 617-624.

Carl, M. and Way, A.: 2003, Advances in Example-Based Machine Translation, Series: Text, Speech and Language Technology, Vol. 21, Kluwer Academic Publishers, Netherlands.

Chatterjee, N.: 2001, A statistical approach to similarity measurement for EBMT, Proceedings of STRANS-2001, IIT Kanpur, pp. 122-131.

Choueka, Y., Conley, E. S. and Dagan, I.: 2000, A comprehensive bilingual word alignment system: Accommodating disparate languages: Hebrew and English, in J. Véronis (ed.), Parallel Text Processing, Kluwer Academic Publishers, Dordrecht.

Clough, P.: 2001, A Perl program for sentence splitting using rules, www.ayre.ca/library/cl/files/sentenceSplitting.ps.

Collins, B.: 1998, Example-Based Machine Translation: an Adaptation-Guided Retrieval Approach, PhD thesis, University of Dublin, Trinity College.

Collins, B. and Cunningham, P.: 1996, Adaptation guided retrieval in EBMT: A case-based approach to Machine Translation, EWCBR, pp. 91-104.

Daelemans, W., Zavrel, J., Berck, P. and Gillis, S.: 1996, MBT: A memory-based part of speech tagger-generator, in E. Ejerhed and I. Dagan (eds), Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14-27.

Dave, S., Parikh, J. and Bhattacharya, P.: 2002, Interlingua based English-Hindi Machine Translation and language divergence, Journal of Machine Translation (JMT) 17.

Doi, T. and Sumita, E.: 2003, Input sentence splitting and translating, HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, pp. 104-110.

Dorr, B. J.: 1993, Machine Translation: A View from the Lexicon, MIT Press, Cambridge, MA.

Dorr, B. J., Jordan, P. W. and Benoit, J. W.: 1998, A survey of current paradigms in Machine Translation, Technical Report LAMP-TR-027, UMIACS-TR-98-72, CS-TR-3961, University of Maryland, College Park, USA.

Dorr, B. J., Pearl, L., Hwa, R. and Habash, N. Y. A.: 2002, DUSTer: A method for unraveling cross-language divergences for statistical word level alignment, Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA.

Fung, P. and McKeown, K.: 1996, A technical word- and term-translation aid using noisy parallel corpora across language groups, The Machine Translation Journal, Special Issue on New Tools for Human Translators, pp. 53-87.

Furuse, O., Yamada, S. and Yamamoto, K.: 1998, Splitting long or ill-formed input for robust spoken-language translation, Proceedings of the Thirty-Sixth Annual Meeting of the ACL and Seventeenth International Conference on Computational Linguistics, pp. 421-427.

Gale, W. A. and Church, K. W.: 1991a, Identifying word correspondences in parallel texts, Proceedings of the Fourth DARPA Workshop on Speech and Natural Language, Morgan Kaufmann Publishers, Inc., pp. 152-157.

Gale, W. A. and Church, K. W.: 1991b, A program for aligning sentences in bilingual corpora, ACL 91, Berkeley, CA, pp. 177-184.

Gale, W. and Church, K.: 1993, A program for aligning sentences in bilingual corpora, Computational Linguistics 19(1), 75-102.

George, D.: 2002, Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, Proceedings of the ARPA Workshop on Human Language Technology.

Germann, U.: 2001, Building a Statistical Machine Translation system from scratch: How much bang for the buck can we expect?, ACL 2001 Workshop on Data-Driven Machine Translation, Toulouse, France.

Goyal, S., Gupta, D. and Chatterjee, N.: 2004, A study of Hindi translation patterns for English sentences with have as the main verb, Proceedings of the International Symposium on MT, NLP and Translation Support Systems: iSTRANS-2004, CDEC and IIT Kanpur, Tata McGraw-Hill, New Delhi, pp. 46-51.

Grishman, R. and Kosaka, M.: 1992, Combining rationalist and empiricist approaches to Machine Translation, Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Montreal, Canada, pp. 263-274.
Gupta, D. and Chatterjee, N.: 2002, Study of similarity and its measurement for English to Hindi EBMT, Proceedings of STRANS-2002, IIT Kanpur.

Gupta, D. and Chatterjee, N.: 2003a, Divergence in English to Hindi Translation: Some studies, International Journal of Translation 15, 5-24.

Gupta, D. and Chatterjee, N.: 2003b, Identification of divergence for English to Hindi EBMT, Proceedings of the MT SUMMIT IX, New Orleans, LA, pp. 141-148.

Gupta, D. and Chatterjee, N.: 2003c, A morpho-syntax based adaptation and retrieval scheme for English to Hindi EBMT, Proceedings of the Workshop on Computational Linguistics for the Languages of South Asia: Expanding Synergies with Europe, Budapest, Hungary, pp. 23-30.

Güvenir, H. A. and Cicekli, I.: 1998, Learning translation templates from examples, Information Systems 23, 353-363.

Habash, N.: 2003, Generation-Heavy Hybrid Machine Translation, PhD thesis, University of Maryland, College Park.

Habash, N. and Dorr, B. J.: 2002, Handling translation divergences: Combining statistical and symbolic techniques in generation-heavy Machine Translation, Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA.

Han, C.-h., Benoit, L., Martha, P., Owen, R., Kittredge, R., Korelsky, T., Kim, N. and Kim, M.: 2000, Handling structural divergences and recovering dropped arguments in a Korean/English machine translation system, Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas, AMTA-2000, Cuernavaca, Mexico.
Hutchins, J.: 2003, Machine translation: general overview, The Oxford Handbook of Computational Linguistics, Oxford University Press, pp. 501-511.

Jain, R.: 1995, HEBMT: A Hybrid Example-Based Approach for Machine Translation (Design and Implementation for Hindi to English), PhD thesis, I.I.T. Kanpur.

Kachru, Y.: 1980, Aspects of Hindi Grammar, Manohar Publications, New Delhi.

Kellogg, R. S. and Bailey, T. G.: 1965, A Grammar of the Hindi Language, Routledge and Kegan Paul Ltd., London.

Kit, C., Pan, H. and Webster, J.: 2002, Example-Based Machine Translation: A New Paradigm, Translation and Information Technology, Chinese U of HK Press, pp. 57-78.

Leffa, V. J.: 1998, Clause processing in complex sentences, Proceedings of the First International Conference on Language Resources and Evaluation, Vol. 1, pp. 937-943.

Loomis, M. E. S.: 1997, Data Management and File Structures, second edn, Prentice Hall of India Private Limited, New Delhi.

Manning, C. and Schütze, H.: 1999, Foundations of Statistical Natural Language Processing, The MIT Press, MA.

McEnery, A. M., Oakes, M. P. and Garside, R.: 1994, The use of approximate string matching techniques in the alignment of sentences in parallel corpora, in A. Vella (ed.), The Proceedings of Machine Translation: 10 Years On, University of Cranfield.

McTait, K.: 2001, Translation Pattern Extraction and Recombination for Example-Based Machine Translation, PhD thesis, Centre for Computational Linguistics, Department of Language Engineering, UMIST.

Nagao, M.: 1984, A framework of a mechanical translation between Japanese and English by analogy principle, Artificial and Human Intelligence, North-Holland, pp. 173-180.

Nießen, S., Och, F. J., Leusch, G. and Ney, H.: 2000, An evaluation tool for machine translation: Fast evaluation for machine translation research, Proceedings of the Second Int. Conf. on Language Resources and Evaluation (LREC), Athens, Greece, pp. 39-45.

Nirenburg, S.: 1993, Example-Based Machine Translation, Proceedings of the Bar Ilan Symposium on Foundations of Artificial Intelligence, Bar Ilan University, Israel.

Nirenburg, S., Grannes, D. and Domashnev, K.: 1993, Two approaches to matching in Example-Based Machine Translation, Proceedings of TMI-93, Kyoto, Japan.

Oard, D. W.: 2003, The surprise language exercises, ACM Transactions on Asian Language Processing 2(2), 79-84.

Orasan, C.: 2000, A hybrid method for clause splitting in unrestricted English texts, Proceedings of ACIDCA 2000, Monastir, Tunisia.

Papineni, K. A., Roukos, S., Ward, T. and Zhu, W.-J.: 2001, Bleu: a method for automatic evaluation of machine translation, Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY.

Piperidis, S., Boutsis, S. and Papageorgiou, H.: 2000, From sentences to words and clauses, Parallel Text Processing, Kluwer Academic Publishers, Dordrecht.

Puscasu, G.: 2004, A multilingual method for clause splitting, Proceedings of CLUK 2004, Birmingham, UK, pp. 199-206.

Quirk, R. and Greenbaum, S.: 1976, A University Grammar of English, English Language Book Society, Longman.

Rao, D.: 2001, Human aided Machine Translation from English to Hindi: The MaTra project at NCST, Proceedings of the Symposium on Translation Support Systems, STRANS-2001, I.I.T. Kanpur.

Rao, D., Mohanraj, K., Hegde, J., Mehta, V. and Mahadane, P.: 2000, A practical framework for syntactic transfer of compound-complex sentences for English-Hindi Machine Translation, Proceedings of the Conf. on Knowledge Based Computer Systems, National Centre for Software Technology, Mumbai, pp. 343-354.

Resnik, P. and Yarowsky, D.: 2000, Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation, Natural Language Engineering 5(2), 113-133.

Sang, E. F. T. K. and Déjean, H.: 2001, Introduction to the CoNLL-2001 shared task: Clause identification, Proceedings of CoNLL-2001, Toulouse, France, pp. 53-57.

Sangal, R.: 2004, Shakti: IIIT-Hyderabad machine translation system (experimental), http://shakti.iiit.net/shakti/.

Sastri, S. and Apte, B.: 1968, Hindi Grammar, Dakshina Bharat Hindi Prachar Sabha, Madras, India.

Sato, S.: 1992, CTM: An Example-Based Translation aid system, Proc. of COLING-1992, pp. 1259-1263.

Shimohata, M., Sumita, E. and Matsumoto, Y.: 2003, Retrieving meaning-equivalent sentences for Example-Based Rough Translation, HLT-NAACL 2003 Workshop: Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, pp. 50-56.

Shiri, S., Bond, F. and Takahashi, Y.: 1997, A hybrid rule and Example-Based method for Machine Translation, Proceedings of the 4th Natural Language Processing Pacific Rim Symposium: NLPRS-97, Phuket, Thailand, pp. 49-54.

Singh, S. B.: 2003, English-Hindi Translation Grammar, first edn, Prabhat Prakashan, 4/19 Asaf Ali Road, New Delhi.

Sinha, R. and Jain, A.: 2003, AnglaHindi: An English to Hindi Machine Translation system, Proceedings of the MT SUMMIT IX, New Orleans, LA, pp. 23-27.

Sinha, R. M. K., Jain, R. and Jain, A.: 2002, An English to Hindi machine aided translation system based on ANGLABHARTI technology: ANGLAHINDI, I.I.T. Kanpur, http://anglahindi.iitk.ac.in/translation.htm.

Somers, H.: 1997, Machine Translation and minority languages, Translating and the Computer 19: Papers from the Aslib Conference, London.

Somers, H.: 1998, Further experiments in bilingual text alignment, International Journal of Corpus Linguistics 3, 115-150.

Somers, H.: 1999, Review article: Example-Based Machine Translation, Machine Translation 14, 113-158.

Somers, H.: 2001, EBMT seen as case-based reasoning, MT Summit VIII Workshop on Example-Based Machine Translation, Santiago de Compostela, Spain, pp. 56-65.

Stetina, J., Kurohashi, S. and Nagao, M.: 1998, General word sense disambiguation method based on a full sentential context, Proceedings of the COLING-ACL Workshop on Usage of WordNet in Natural Language Processing, Montreal, Canada.

Sumita, E.: 2001, Example-Based Machine Translation using DP-matching between word sequences, Proc. of the ACL 2001 Workshop on Data-Driven Methods in Machine Translation, pp. 1-8.

Sumita, E. and Iida, H.: 1991, Experiments and prospects of Example-Based Machine Translation, Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, USA, pp. 185-192.

Sumita, E., Iida, H. and Kohyama, H.: 1990, Translating with examples: A new approach to Machine Translation, TMI-1990, pp. 203-212.

Sumita, E. and Tsutsumi, Y.: 1988, A translation aid system using flexible text retrieval based on syntax matching, Proceedings of TMI-88, CMU, Pittsburgh.

BIBLIOGRAPHY
Takezawa, T.: 1999, Transformation into meaning-ful chunks by dividing or connecting utterance units, Journal of Natural Language Processing 6(2).
Thurmair, G.: 1990, Complex lexical transfer in METAL, Proceedings of the Third
International Conference on Theoretical and Methodological Issues in Machine
Translation of Natural languages, Linguistics research center, The University of
Texas, Austin, TX, pp. 91107.
Tillmann, C., Vogel, S., Ney, H., Zubiaga, A. and Sawaf, H.: 1997, Accelerated
DP-based search for statistical translation, European Conference on Speech
Communication and Technology, Rhodes, Greece, pp. 2667-2670.
Uchida, H. and Zhu, M.: 1998, The Universal Networking Language (UNL) specifications,
version 3.0, Technical report, United Nations University, Tokyo,
http://www.unl.unu.edu/unlsys/unl/unls30.doc.
Veale, T. and Way, A.: 1997, Gaijin: A template-driven bootstrapping approach to
Example-Based Machine Translation, International Conference on Recent Advances
in Natural Language Processing, Tzigov Chark, Bulgaria, pp. 239-244.
Vikas, O.: 2001, Technology development for Indian languages, Proceedings of the
Symposium on Translation Support Systems STRANS-2001, IIT Kanpur.
Watanabe, H.: 1992, A similarity-driven transfer system, Proceedings of the 14th
COLING, pp. 770-776.
Watanabe, H., Kurohashi, S. and Aramaki, E.: 2000, Finding structural correspondences
from bilingual parsed corpus for Corpus-Based Translation, Proceedings
of COLING-2000, Saarbrücken, Germany.
Wiederhold, G.: 1987, File Organization for Database Design, McGraw-Hill Inc., New
York, USA.
Wren, P., Martin, H. and Rao, N.: 1989, High School English Grammar, S. Chand
& Co. Ltd., New Delhi.
Wu, D.: 1995, Large-scale automatic extraction of an English-Chinese translation
lexicon, Machine Translation 9(3-4), 285-313.

About the Author


Ms. Deepa Gupta was born on July 5, 1977. She obtained a Bachelor's degree
in Mathematics (Honours) from L.B. College, University of Delhi, in 1997, with an
overall score of 73.00%. She completed her post-graduation in Mathematics in 1999
from the Indian Institute of Technology Delhi with a C.G.P.A. of 7.70. She joined the
Ph.D. programme of the Department of Mathematics at IIT Delhi in July 1999 as a
Junior Research Fellow, and in July 2001 she was promoted to Senior Research Fellow.
During her research tenure she participated in several conferences and published
seven research papers in national and international journals and conference
proceedings. She can be contacted at gupta deepa@rediffmail.com or deepag iitd@yahoo.com.

List of Publications
Published Paper(s) in Journal

1. Gupta D. and Chatterjee N. 2003. Divergence in English to Hindi Translation:
Some Studies, International Journal of Translation, Vol. 15, pp. 5-24.

Published Papers in Conference


1. Gupta D. and Chatterjee N. 2001. Study of Divergence for Example Based
English-Hindi Machine Translation, In the Proc. of STRANS-2001, IIT Kanpur,
pp. 132-139.

2. Gupta D. and Chatterjee N. 2002. Study of Similarity and its Measurement
for English to Hindi EBMT, In the Proc. of STRANS-2002, IIT Kanpur.

3. Gupta D. and Chatterjee N. 2002. A Systematic Adaptation Scheme for
English-Hindi Example-Based Machine Translation, In the Proc. of STRANS-2002,
IIT Kanpur.
4. Gupta D. and Chatterjee N. 2003. A Morpho-Syntax based Adaptation
and Retrieval Scheme for English to Hindi EBMT, In the Proc. of Workshop
on Computational Linguistics for the Languages of South Asia: Expanding
Synergies with Europe, Budapest, Hungary, pp. 23-30.
5. Gupta D. and Chatterjee N. 2003. Identification of Divergence for English
to Hindi EBMT, In the Proc. of MT SUMMIT IX, New Orleans, LA, pp. 141-148.
6. Goyal S., Gupta D. and Chatterjee N. 2004. A Study of Hindi Translation
Patterns for English Sentences with Have as the Main Verb, In the Proc.
of the International Symposium on MT, NLP and Translation Support Systems:
iSTRANS-2004, New Delhi, pp. 46-51.
Communicated Paper(s)
To the international journal Machine Translation, Kluwer Academic Publishers:
1. FT and SPAC Based Algorithm for Identification of Divergence from Parallel
Aligned Corpus for English to Hindi EBMT (Co-author: Dr. Niladri Chatterjee)
2. Will Sentences have Divergence upon Translation?: A Corpus-Evidence based
Solution for Example Based Approach (Co-authors: Dr. Niladri Chatterjee and
Shailly Goyal)

Honours and Awards


Tata Infotech Research Fellowship of Rs. 12,000, during July 2002 to April 2004.

Vice-President of the Mathematics Society, Indian Institute of Technology Delhi,
during April 2001 to April 2003.

Secured 85.04 percentile in the Graduate Aptitude Test in Engineering (GATE).
GATE is a national-level entrance test for higher education in the fields of
technology and basic sciences, conducted by India's premier research institutes,
IISc Bangalore and the IITs.

College topper in B.A. (Hons), IInd year, 1995-1996.