Translation of Unknown Words in Low Resource Languages
Biman Gujral, Huda Khayrallah, and Philipp Koehn of Johns Hopkins University presented this talk at AMTA 2016 on translating unknown (out-of-vocabulary, OOV) words in low-resource settings. The work generates translation candidates for OOV words in Hindi→English and Uzbek→English translation using transliteration, Levenshtein distance, and word embeddings. It then selects among the candidates inside a phrase-based MT system (Moses), either with a language model or with a secondary phrase table.
Presentation Transcript
Translation of Unknown Words in Low Resource Languages
  Biman Gujral, Huda Khayrallah, and Philipp Koehn
  Johns Hopkins University
  31 October 2016
  This talk was presented at AMTA 2016.
  It is based on this paper: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.pdf
  bib: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.bib
Out of Vocabulary Words (OOVs)
  Hindi→English: It , graphic , Polo T , , , and bright embroidered jackets etc are included.
  Uzbek→English: Quvayt o'yinga how ko'ryapmiz with the preparation.
Gujral, Khayrallah, Koehn 3
Goals
  Generate candidates for each OOV
  Select the best one
How big is this problem?
Data
                Hindi→English       Uzbek→English
  Source        WMT14               LORELEI
  Domain        News                News, Wikipedia, social media
  Training      274k sentences      55k sentences
  Test          2.5k sentences      1k sentences
                ~1 OOV/sentence     ~4 OOVs/sentence
                ~5% OOVs            ~20% OOVs
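The per-sentence and per-token OOV figures on the Data slide can be computed along the following lines. This is a minimal sketch that assumes whitespace-tokenized sentences and counts OOV tokens rather than types; the function name `oov_stats` and the toy sentences are illustrative and do not come from the talk.

```python
from typing import List, Sequence, Tuple


def oov_stats(train: Sequence[List[str]], test: Sequence[List[str]]) -> Tuple[float, float]:
    """Return (OOV tokens per test sentence, fraction of test tokens that are OOV)."""
    vocab = {tok for sent in train for tok in sent}  # every token seen in training
    total_tokens = sum(len(sent) for sent in test)
    oov_tokens = sum(1 for sent in test for tok in sent if tok not in vocab)
    per_sentence = oov_tokens / len(test) if test else 0.0
    rate = oov_tokens / total_tokens if total_tokens else 0.0
    return per_sentence, rate


# Toy data (made up for illustration).
train = [["we", "play", "games"], ["they", "play", "music"]]
test = [["we", "play", "polo"], ["they", "googled", "music", "games"]]

per_sentence, rate = oov_stats(train, test)
print(f"{per_sentence:.2f} OOVs/sentence, {rate:.1%} of tokens")
```

As a sanity check on the slide's numbers, about 1 OOV per sentence at about 5% of tokens implies sentences of roughly 20 tokens, and the same holds for about 4 OOVs per sentence at about 20%.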
OOV Examples
  Names: Huda
  Misspellings: grammer/grammar
  Inflections: play/plays/playing
  Borrowed words: Halloween
  Reinflected borrowings: skirts; Googlear (to Google)
  Content words: speculation
Distribution of OOVs
  [Pie chart. Categories: Named Entities, Borrowed Words, Source Content Words, Misspellings & Typos, Acronyms, Reinflected Borrowings, Numbers & Punctuation. Slice labels shown: 7%, 36%, 22%, 29%.]
MT System
  Moses (Koehn et al. 2007)
  Phrase-based
  Large English language model: WMT English 07-12
Methods
  Transliteration
  Levenshtein distance
  Word Embeddings
Transliteration
  Examples: Huda, Halloween
  Unsupervised Moses mode (Durrani et al. 2014)
  Character translation model
  Incorporate larger English language model
  Uzbek is already written in Latin script, keep original spelling
  Generate 1 candidate
Levenshtein distance
  Examples: grammer/grammar, play/plays
  Minimum number of:
    insertions
    deletions
    substitutions
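The definition above (minimum number of insertions, deletions and substitutions) corresponds to the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation, not code from the talk; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so each row is as short as possible
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match at no cost)
            ))
        previous = current
    return previous[-1]


# The slide's own examples.
print(levenshtein("grammer", "grammar"))  # one substitution
print(levenshtein("play", "plays"))       # one insertion
```

Only two rows of the table are kept at a time, so memory grows with the shorter word rather than with the product of the two lengths.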
Levenshtein distance
  Example: qilyapmiz / qilyapsiz (Levenshtein distance = 1), both translated as "doing"
  Find source words with distance ≤ 2 from OOV
  Use their English translation as translation candidate
  Generate 18 candidates on average
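The lookup described on this slide can be sketched as follows. The sketch assumes that qilyapmiz is the OOV and qilyapsiz the known word, that the threshold is "at most 2", and that known source words and their translations are available as a plain dictionary. The names `within_distance` and `levenshtein_candidates` and the second lexicon entry are illustrative assumptions, not part of the talk.

```python
from typing import Dict, List


def within_distance(a: str, b: str, max_dist: int) -> bool:
    """True if the Levenshtein distance between a and b is at most max_dist."""
    if abs(len(a) - len(b)) > max_dist:
        return False  # the length difference alone already exceeds the budget
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        if min(current) > max_dist:
            return False  # row minima never decrease, so the result cannot recover
        previous = current
    return previous[-1] <= max_dist


def levenshtein_candidates(oov: str, lexicon: Dict[str, List[str]], max_dist: int = 2) -> List[str]:
    """Collect English translations of known source words close to the OOV."""
    candidates: List[str] = []
    for source_word, translations in lexicon.items():
        if source_word != oov and within_distance(oov, source_word, max_dist):
            for translation in translations:
                if translation not in candidates:  # keep first occurrence only
                    candidates.append(translation)
    return candidates


lexicon = {
    "qilyapsiz": ["doing"],  # pair shown on the slide
    "kitob": ["book"],       # filler entry added for illustration
}
print(levenshtein_candidates("qilyapmiz", lexicon))
```

A linear scan over the vocabulary is the simplest option. For a large vocabulary, an index such as a trie or BK-tree would avoid comparing the OOV against every source word.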
Word Embeddings
  Example word cluster: rumors, doubts, rumours, suspicions, misgivings, worry, speculation, worried
Word Embeddings
  Word2vec (Mikolov et al. 2013)
    monolingual corpora
  Multilingual word vectors (Faruqui & Dyer 2014)
    monolingual vectors
    alignments
    Canonical Correlation Analysis (CCA)
  Generates 20 candidates
Word Embeddings
  English Text → Word2Vec → English Vectors
  Hindi Text → Word2Vec → Hindi Vectors
  English Vectors + Hindi Vectors → CCA → English Projection Matrix, Hindi Projection Matrix
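For reference, the CCA step in this diagram can be written in its standard textbook form. The symbols X, Y, a, b, A and B are notation introduced here, not taken from the slides. X and Y are assumed to hold the Hindi and English vectors of aligned word pairs, one pair per row, with each column mean-centred.

```latex
% X: n x d_1 matrix of Hindi vectors, Y: n x d_2 matrix of English vectors.
% Row i of X and row i of Y belong to the same aligned word pair.
\[
  (a^{*}, b^{*}) \;=\; \operatorname*{arg\,max}_{a,\,b}\;
  \frac{a^{\top} C_{xy}\, b}
       {\sqrt{a^{\top} C_{xx}\, a}\;\sqrt{b^{\top} C_{yy}\, b}},
  \qquad
  C_{xy} = \tfrac{1}{n} X^{\top} Y,\quad
  C_{xx} = \tfrac{1}{n} X^{\top} X,\quad
  C_{yy} = \tfrac{1}{n} Y^{\top} Y .
\]
% Stacking the top-k direction pairs column-wise gives the two projection
% matrices A (Hindi) and B (English); the projected vectors are
\[
  X^{*} = X A, \qquad Y^{*} = Y B .
\]
```

After projection, Hindi and English vectors live in a shared space, so distances between them become meaningful.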
Word Embeddings
  English Vectors * English Projection Matrix = Projected English Vectors
  Hindi Vectors * Hindi Projection Matrix = Projected Hindi Vectors
  Hindi OOV → KNN over Projected English Vectors → English Candidates
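The KNN step in this diagram can be sketched as a nearest-neighbour search in the shared space. The sketch assumes cosine similarity as the closeness measure, which the slides do not specify. The two-dimensional vectors are made-up numbers and the function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest_english(query: Sequence[float],
                    english_vectors: Dict[str, Sequence[float]],
                    k: int = 20) -> List[str]:
    """Return the k English words whose projected vectors are closest to the query."""
    ranked = sorted(english_vectors,
                    key=lambda word: cosine(query, english_vectors[word]),
                    reverse=True)
    return ranked[:k]


# Toy vectors that are assumed to be already projected into the shared space.
projected_english = {
    "rumours": [0.9, 0.1],
    "doubts": [0.8, 0.3],
    "jacket": [0.0, 1.0],
}
projected_hindi_oov = [1.0, 0.0]

print(nearest_english(projected_hindi_oov, projected_english, k=2))
```

The default `k=20` mirrors the 20 candidates mentioned for the embedding method. Exact search is fine for a sketch, while a large vocabulary would normally call for vectorized or approximate search.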
Word Embeddings
  Ranked list (10 slots): 1) doubts 2) rumours 3) suspicions 4) misgivings 5) worry 6) worried 7) speculation
  Ranked list (10 slots): 1) rumor 2) crore
  Words shown: rumors, rumor, crore, doubts, rumours, suspicions, misgivings, worry, speculation, worried
Integration
  Pipeline: Hindi OOVs → (Transliteration, Levenshtein, Embeddings) → English Candidates → Select → English Translation
  Selection options: Language Model, Phrase table
Language Model
  Large English language model
  XML markup in Moses (Koehn & Haddow, 2009)
  Selection occurs during decoding
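Moses accepts inline XML that supplies translation options for a source span, in the form `<np translation="a||b" prob="0.5||0.5">word</np>`, when the decoder is run with an `-xml-input` mode such as `exclusive`. The sketch below builds such markup for one OOV token. The uniform probabilities, the helper name `mark_oov` and the second candidate are assumptions for illustration; the slides do not say how the authors set these values.

```python
from typing import List
from xml.sax.saxutils import escape, quoteattr


def mark_oov(token: str, candidates: List[str], tag: str = "np") -> str:
    """Wrap an OOV token in Moses-style XML listing its candidate translations."""
    if not candidates:
        return escape(token)  # nothing to offer: pass the token through unchanged
    prob = 1.0 / len(candidates)  # assumption: all candidates equally likely
    translations = "||".join(candidates)
    probs = "||".join(f"{prob:.4f}" for _ in candidates)
    return (f"<{tag} translation={quoteattr(translations)} "
            f"prob={quoteattr(probs)}>{escape(token)}</{tag}>")


# "doing" comes from the slides; "making" is a made-up second candidate.
print(mark_oov("qilyapmiz", ["doing", "making"]))
```

With equal probabilities on every option, the choice among candidates is left to the rest of the model, in particular the large English language model, which matches the slide's note that selection occurs during decoding.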
Phrase Table
  Secondary Phrase Table only includes OOVs
  Features:
    Method
    Word Vector Distance
    Levenshtein distance
    Inverse frequency in Monolingual corpus
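One way to picture the secondary phrase table is as a list of `source ||| target ||| scores` lines, the general layout of a Moses phrase table, with one score per feature named above. The encoding below is an assumption made for illustration: a one-hot method indicator, zeros for features that do not apply, and `1 / (1 + count)` as the inverse frequency. The slides do not give the authors' actual encoding, and a real Moses table may need these values rescaled, for example to avoid zeros.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

METHODS = ("transliteration", "levenshtein", "embeddings")


@dataclass
class Candidate:
    source: str                              # the OOV source word
    target: str                              # proposed English translation
    method: str                              # which generator proposed it
    vector_distance: Optional[float] = None  # set by the embedding method
    edit_distance: Optional[int] = None      # set by the Levenshtein method


def features(c: Candidate, mono_counts: Dict[str, int]) -> List[float]:
    """Hypothetical feature vector: method one-hot, distances, inverse frequency."""
    method_indicator = [1.0 if c.method == m else 0.0 for m in METHODS]
    vec = c.vector_distance if c.vector_distance is not None else 0.0
    lev = float(c.edit_distance) if c.edit_distance is not None else 0.0
    inv_freq = 1.0 / (1 + mono_counts.get(c.target, 0))
    return method_indicator + [vec, lev, inv_freq]


def phrase_table_line(c: Candidate, mono_counts: Dict[str, int]) -> str:
    """Format one candidate as a 'source ||| target ||| scores' line."""
    scores = " ".join(f"{x:.4f}" for x in features(c, mono_counts))
    return f"{c.source} ||| {c.target} ||| {scores}"


mono_counts = {"doing": 99}  # made-up monolingual corpus count
cand = Candidate("qilyapmiz", "doing", "levenshtein", edit_distance=1)
print(phrase_table_line(cand, mono_counts))
```

Keeping these entries in a separate table means the feature weights for OOV candidates can be tuned independently of the main phrase table.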
Results
Oracle
  Upper bound on how well a selection method can do given current generation methods
  Select word from list of candidates that is in the reference
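The oracle described here can be sketched in a few lines. The sketch assumes tokenized references and exact string matching. The function name `oracle_select` and the `fallback` argument (what to output when no candidate matches) are illustrative choices, not details from the talk.

```python
from typing import List


def oracle_select(candidates: List[str], reference_tokens: List[str], fallback: str) -> str:
    """Pick the first candidate that occurs in the reference, else the fallback."""
    reference = set(reference_tokens)
    for candidate in candidates:
        if candidate in reference:
            return candidate
    return fallback  # no generated candidate appears in the reference


reference = "there is speculation about the deal".split()  # made-up reference
print(oracle_select(["doubts", "rumours", "speculation"], reference, fallback="<oov>"))
```

Because the oracle looks at the reference, it is only a diagnostic: it shows how much a perfect selection step could gain from the candidates currently generated.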
BLEU - Uzbek
  [Bar chart. y-axis: BLEU, 8 to 10.5. Bars: Baseline, Language Model, Phrase Table, Oracle.]
BLEU - Hindi
  [Bar chart. y-axis: BLEU, 11 to 14. Bars: Baseline, Transliteration, Language Model, Phrase Table, Oracle.]
Beyond BLEU
  Goals:
    generate candidates for each OOV: How well can we generate translation candidates?
    select the best one: How well can we select from the translation candidates?
Coverage
  How well can we generate translation candidates?
  Was one of the candidates generated by this method in the reference?
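Read literally, coverage is the share of OOVs for which at least one generated candidate appears in the reference. A minimal sketch under that reading, assuming one candidate list and one tokenized reference sentence per OOV occurrence; the function name and the toy data are illustrative.

```python
from typing import List


def coverage(candidates_per_oov: List[List[str]], references: List[List[str]]) -> float:
    """Fraction of OOVs with at least one candidate present in the reference."""
    if not candidates_per_oov:
        return 0.0
    hits = sum(1 for candidates, reference in zip(candidates_per_oov, references)
               if set(candidates) & set(reference))
    return hits / len(candidates_per_oov)


# Two OOV occurrences with made-up candidates and references.
candidates_per_oov = [["doubts", "speculation"], ["crore"]]
references = [["much", "speculation"], ["no", "match", "here"]]
print(coverage(candidates_per_oov, references))
```

Computing this separately per method, and once for the union of all candidate lists, gives the per-method and "All" bars reported for coverage.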
Coverage - Uzbek
  [Bar chart. y-axis: coverage, 0% to 60%. Bars: Levenshtein, Embeddings, Transliteration (copy), All.]
Coverage - Hindi
  [Bar chart. y-axis: coverage, 0% to 60%. Bars: Levenshtein, Embeddings, Transliteration, All.]
Accuracy
  How well can we select from the translation candidates?
  Is the word we selected in the reference?
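Accuracy, as defined here, checks only the single word that was finally chosen for each OOV. A minimal sketch, assuming one selected word and one tokenized reference sentence per OOV occurrence; the function name and the toy data are illustrative.

```python
from typing import List


def accuracy(selected: List[str], references: List[List[str]]) -> float:
    """Fraction of OOVs whose selected translation occurs in the reference."""
    if not selected:
        return 0.0
    hits = sum(1 for word, reference in zip(selected, references)
               if word in set(reference))
    return hits / len(selected)


# Made-up selections for three OOV occurrences.
selected = ["speculation", "crore", "doing"]
references = [["much", "speculation"], ["no", "match", "here"], ["what", "are", "you", "doing"]]
print(accuracy(selected, references))
```

When the selected word always comes from the generated candidate list, accuracy cannot exceed the coverage of that list, which is why the oracle marks the ceiling for any selection method.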
Accuracy - Uzbek
  [Bar chart. y-axis: accuracy, 0% to 60%. Bars: Baseline, Language Model, Phrase Table, Oracle.]
Accuracy - Hindi
  [Bar chart. y-axis: accuracy, 0% to 60%. Bars: Baseline, Transliteration, Language Model, Phrase Table, Oracle.]
Conclusion & Future Work
  Generate quality translations
  Selection does not perform as well
  Improved selection methods
  More sophisticated embedding projection
  Analysis of what methods work on which types of OOVs
Acknowledgement
  This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0113. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).
Contact
  Biman Gujral, Huda Khayrallah, and Philipp Koehn
  {bgujral1, huda, phi}@jhu.edu
  Johns Hopkins University
References
  Durrani, Haddow, Koehn, and Heafield. (2014). Edinburgh's Phrase-Based Machine Translation Systems for WMT-14. Workshop on Statistical Machine Translation.
  Koehn, Hoang, Birch, Callison-Burch, Federico, Bertoldi, Cowan, Shen, Moran, Zens, Dyer, Bojar, Constantin, and Herbst. (2007). Moses: Open source toolkit for Statistical Machine Translation. ACL Interactive Poster and Demonstration Sessions.
  Koehn and Haddow. (2009). Edinburgh's submission to all tracks of the WMT2009 Shared Task with Reordering and Speed Improvements to Moses. Workshop on Statistical Machine Translation.
  Faruqui and Dyer. (2014). Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of EACL.