By
Ankit Aggarwal (03d05009)
Guided by
Dr. Pushpak Bhattacharyya
ROADMAP
What is Transliteration?
Existing Approaches
Theory of Syllables
Phonemes
What are Syllables?
Syllable Structure
Syllabification
Maximal Onset Principle
Sonority Hierarchy
Constraints
Implementation
Conclusion & Future Work
References
WHAT IS TRANSLITERATION?
Practice of transcribing a word or text written in one writing system
into another writing system.
E.g., ‘school’ will be transliterated to ‘skUla’.
Different from translation:
‘school’ will be translated to ‘pazSaalaa’ (‘paathashaala’)
Why Transliteration?
Information is available in only a select number of languages.
Effective knowledge transfer across linguistic groups requires bringing down
language barriers.
Transliteration plays an important role in cross-lingual applications.
PROBLEM STATEMENT
Given a word (either Hindi or English) written in English language
script, we have to:
Rule-based
Hand-crafted rules are applied to the source-language input to generate
words of the target language.
Statistical
Statistics, rather than hand-crafted rules, play the main role in generating the target word.
RULE-BASED APPROACHES
STATISTICAL APPROACHES
OUR APPROACH: THEORY OF SYLLABLES
A framework of our approach:
1. A large parallel corpus of names in both English and Hindi is taken.
2. To prepare the training data, names are syllabified
(automatically/manually/both).
3. Next, we store the probability with which any Hindi syllable string is mapped
to any English syllable string.
4. Now, given any new word (test data) written in English, we use the
automatic syllabification of Step 2 to syllabify it.
5. Then, we use the Viterbi algorithm to find the three most probable transliterated
words with their corresponding probabilities.
6. If the probability difference between the first and second is too large, we output
only the first transliterated word; otherwise we output all three.
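Steps 4 to 6 can be sketched roughly as follows. This is a minimal illustration, not the presented system: the `MAPPING` table, its syllables and its probabilities are hypothetical, the names `k_best` and `select_output` and the `gap` threshold are assumptions, and the search is a simplified k-best dynamic programme that scores each syllable independently, whereas a full Viterbi decoder would also use transition probabilities between target syllables.

```python
import heapq

# Hypothetical mapping probabilities P(Hindi syllable | English syllable).
# Syllables and values are illustrative only, not taken from any corpus.
MAPPING = {
    "an": {"ana": 0.7, "ena": 0.3},
    "kit": {"kiTa": 0.6, "kita": 0.4},
}

def k_best(syllables, mapping, k=3):
    """Return up to k (word, probability) candidates, most probable first."""
    beams = [("", 1.0)]
    for syl in syllables:
        # Assumption: an unseen syllable is copied through with probability 1.
        options = mapping.get(syl, {syl: 1.0})
        expanded = [(word + tgt, p * q)
                    for word, p in beams
                    for tgt, q in options.items()]
        # Keep only the k most probable partial transliterations.
        beams = heapq.nlargest(k, expanded, key=lambda pair: pair[1])
    return beams

def select_output(candidates, gap=0.3):
    """Step 6: output only the best candidate if it clearly beats the runner-up."""
    if len(candidates) < 2 or candidates[0][1] - candidates[1][1] > gap:
        return candidates[:1]
    return candidates

candidates = k_best(["an", "kit"], MAPPING)
print(select_output(candidates))
```

Because the per-syllable scores are multiplied independently here, keeping the best k partial words at each step is enough to recover the best k complete words.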
English Phonology
The number of speech sounds in English varies from dialect to dialect.
Longman Dictionary: 24 consonant phonemes (c.p.), 23 vowel phonemes
(v.p.), additionally 2 c.p. & 4 v.p. for foreign words.
American Heritage Dictionary: 25 c.p., 18 v.p., additionally 1 c.p. & 5 v.p.
for foreign words.
CONSONANT PHONEMES
25 consonant phonemes found in most dialects of English.
Grouped into six categories (on the basis of their
sonority level, stress, manner of pronunciation, etc.):
Plosive: Produced by stopping the airflow in the vocal tract (the cavity
where sound is filtered).
S ≡ Syllable, O ≡ Onset
R ≡ Rhyme, N ≡ Nucleus
Co ≡ Coda
SYLLABLE STRUCTURE: EXAMPLES
‘word’
‘sprint’
SYLLABLE STRUCTURE: EXAMPLES
‘may’: No Coda.
‘opt’: No Onset.
‘air’: No Coda, No Onset.
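The onset/nucleus/coda split in these examples can be sketched in code. This is an illustrative sketch only: the function name `split_syllable`, the small `VOWELS` set and the phonemic transcriptions used below are assumptions made for the example, not part of the presentation.

```python
# Vowel phonemes needed for the five example words only (assumed transcriptions).
VOWELS = {"ɪ", "ɜː", "eɪ", "ɒ", "eə"}

def split_syllable(phonemes):
    """Split one syllable, given as a list of phonemes, into (onset, nucleus, coda)."""
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not vowel_positions:
        raise ValueError("a syllable needs a vowel nucleus in this sketch")
    first, last = vowel_positions[0], vowel_positions[-1]
    # Consonants before the nucleus form the onset, those after it the coda.
    return phonemes[:first], phonemes[first:last + 1], phonemes[last + 1:]

print(split_syllable(["s", "p", "r", "ɪ", "n", "t"]))  # 'sprint'
print(split_syllable(["m", "eɪ"]))                     # 'may': empty coda
print(split_syllable(["eə"]))                          # 'air': empty onset and coda
```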
SYLLABLE STRUCTURE
Light Syllable: A syllable which is open and ends in a short vowel.
General Description – CV.
Example, ‘air’.
Working: ‘constructs’
Consonant sequence: n-s-t-r
Either ‘con structs’ OR ‘cons tructs’ OR ‘const ructs’ OR ‘constr ucts’.
Since ‘str’ is the longest part of this sequence that can serve as the onset of a
syllable, the correct syllabification is ‘con structs’.
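The ‘constructs’ example can be reproduced with a small sketch of the Maximal Onset Principle. It works on spelling rather than phonemes to stay close to the example above, and the `LEGAL_ONSETS` set is a deliberately tiny, hypothetical subset of English onsets; `split_cluster` and `syllabify` are names chosen for illustration.

```python
import re

# Illustrative subset of legal onsets; a real system would need the full inventory.
LEGAL_ONSETS = {"r", "tr", "str"}

def split_cluster(cluster, legal_onsets=LEGAL_ONSETS):
    """Split a consonant cluster into (coda, onset), making the onset as long as possible."""
    for i in range(len(cluster)):
        if cluster[i:] in legal_onsets:
            return cluster[:i], cluster[i:]
    return cluster, ""  # no legal onset: the whole cluster stays in the coda

def syllabify(word, legal_onsets=LEGAL_ONSETS, vowels="aeiou"):
    """Syllabify a lowercase word, splitting each word-internal consonant cluster."""
    parts = re.findall(f"[{vowels}]+|[^{vowels}]+", word)
    syllables, current = [], ""
    for i, part in enumerate(parts):
        is_cluster = part[0] not in vowels
        if is_cluster and 0 < i < len(parts) - 1:
            coda, onset = split_cluster(part, legal_onsets)
            syllables.append(current + coda)
            current = onset
        else:
            # Vowel groups, and word-initial or word-final clusters, join the current syllable.
            current += part
    syllables.append(current)
    return syllables

print(syllabify("constructs"))
```

Trying the longest candidate onset first is what makes the split maximal: ‘nstr’ is rejected, ‘str’ is accepted, and the remaining ‘n’ becomes the coda of the first syllable.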
SONORITY HIERARCHY
Sonority: A perceptual property referring to the loudness of a sound
relative to that of other sounds with the same length.
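As a rough illustration of how relative sonority can be used, the sketch below assigns each phoneme a level and picks the most sonorous phoneme as the syllable peak. The ordering (plosives lowest, vowels highest) follows the commonly cited hierarchy; the numeric levels, the `PHONEME_CLASS` table and the function names are assumptions made for this example.

```python
# Higher number = more sonorous. The numeric values are illustrative assumptions.
SONORITY = {
    "plosive": 1, "fricative": 2, "nasal": 3,
    "liquid": 4, "glide": 5, "vowel": 6,
}

# Phoneme classes for the example word 'sprint' only (assumed transcription).
PHONEME_CLASS = {
    "s": "fricative", "p": "plosive", "r": "liquid",
    "ɪ": "vowel", "n": "nasal", "t": "plosive",
}

def sonority_profile(phonemes):
    """Map each phoneme to its sonority level."""
    return [SONORITY[PHONEME_CLASS[p]] for p in phonemes]

def nucleus_index(phonemes):
    """Return the position of the most sonorous phoneme (the syllable peak)."""
    profile = sonority_profile(phonemes)
    return profile.index(max(profile))

sprint = ["s", "p", "r", "ɪ", "n", "t"]
print(sonority_profile(sprint), nucleus_index(sprint))
```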
Syllabic:
Both the onset and the coda are optional (as seen previously).
/j/ at the end of an onset (/pj/, /bj/, /tj/, /dj/, /kj/, /fj/, /vj/, /θj/, /sj/, /zj/,
/hj/, /mj/, /nj/, /lj/, /spj/, /stj/, /skj/) must be followed by /uː/ or /ʊə/.
Long vowels and diphthongs are not followed by /ŋ/.
/ʊ/ is rare in syllable-initial position.
Stop + /w/ is excluded before /uː, ʊ, ʌ, aʊ/.
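Three of these constraints can be checked mechanically once a syllable has been split into onset, nucleus and coda. The sketch below is illustrative only: it assumes the long vowel in the first and last rules is /uː/, and the vowel inventory, the stop set and the function name `constraint_violations` are assumptions rather than part of the presentation.

```python
# Assumed inventories for illustration; dialects differ.
LONG_VOWELS_AND_DIPHTHONGS = {
    "iː", "uː", "ɑː", "ɔː", "ɜː",
    "eɪ", "aɪ", "ɔɪ", "əʊ", "aʊ", "ɪə", "eə", "ʊə",
}
STOPS = {"p", "b", "t", "d", "k", "g"}

def constraint_violations(onset, nucleus, coda):
    """Return the constraints violated by a syllable (onset/coda are phoneme lists)."""
    problems = []
    if onset and onset[-1] == "j" and nucleus not in {"uː", "ʊə"}:
        problems.append("onset ending in /j/ not followed by /uː/ or /ʊə/")
    if nucleus in LONG_VOWELS_AND_DIPHTHONGS and coda[:1] == ["ŋ"]:
        problems.append("long vowel or diphthong followed by /ŋ/")
    if (len(onset) >= 2 and onset[-1] == "w" and onset[-2] in STOPS
            and nucleus in {"uː", "ʊ", "ʌ", "aʊ"}):
        problems.append("stop + /w/ before /uː, ʊ, ʌ, aʊ/")
    return problems

print(constraint_violations(["k", "j"], "uː", ["t"]))  # 'cute': no violation
print(constraint_violations(["t", "w"], "ʌ", []))      # excluded sequence
```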
IMPLEMENTATION
CONCLUSION & FUTURE WORK
Conclusion
Examined the English-to-Hindi transliteration problem.
Explored various techniques used for transliteration between other language
pairs.
Studied the syllabification-based approach.
Observed results with an accuracy of 99%.
Future Work
To reach the complete goal, the following still need to be worked
on:
1. We need to syllabify the parallel names in Devanagari script as well.
2. The system will have to be trained to estimate the probability of occurrence of
English and Hindi syllable pairs.
3. The trained system will then be used to transliterate any new word provided.
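The training step in item 2 could, under simple assumptions, amount to relative-frequency estimation over aligned syllable pairs. The sketch below assumes the English and Hindi syllables have already been aligned one to one; the function name `estimate_mapping` and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_mapping(pairs):
    """Estimate P(Hindi syllable | English syllable) from aligned (english, hindi) pairs."""
    counts = defaultdict(Counter)
    for english, hindi in pairs:
        counts[english][hindi] += 1
    # Normalise the counts for each English syllable into probabilities.
    return {
        english: {hindi: n / sum(targets.values()) for hindi, n in targets.items()}
        for english, targets in counts.items()
    }

# Hypothetical aligned syllable pairs, for illustration only.
pairs = [("an", "ana"), ("an", "ana"), ("an", "ena"), ("kit", "kiTa")]
print(estimate_mapping(pairs))
```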