Translation of Unknown Words in Low Resource Languages

Slide Note

Biman Gujral, Huda Khayrallah, and Philipp Koehn of Johns Hopkins University presented this talk at AMTA 2016 on translating unknown (out-of-vocabulary, OOV) words in low-resource languages. The work generates English translation candidates for OOV words in Hindi-English and Uzbek-English translation using transliteration, Levenshtein distance, and word embeddings. It then selects among the candidates inside a phrase-based MT system (Moses), using either a large English language model or a secondary phrase table.




Presentation Transcript

Translation of Unknown Words in Low Resource Languages
Biman Gujral, Huda Khayrallah, and Philipp Koehn
Johns Hopkins University
31 October 2016
This talk was presented at AMTA 2016.
It is based on this paper: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.pdf
bib: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.bib


Out of Vocabulary Words (OOVs)
Hindi-English: It , graphic , Polo T , , , and bright embroidered jackets etc are included.
Uzbek-English: Quvayt o'yinga how ko'ryapmiz with the preparation.
Gujral, Khayrallah, Koehn 3

Goals
Generate candidates for each OOV
Select the best one

How big is this problem?

Data
Hindi-English (WMT14; news): training 274k sentences; test 2.5k sentences, ~1 OOV/sentence, ~5% OOVs
Uzbek-English (LORELEI; news, Wikipedia, social media): training 55k sentences; test 1k sentences, ~4 OOVs/sentence, ~20% OOVs
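The per-sentence and percentage OOV figures on this slide can be computed along the following lines. This is a minimal sketch, assuming whitespace tokenization and a vocabulary taken directly from the training side; the function name `oov_stats` and the toy sentences are illustrative and not from the talk.

```python
def oov_stats(train_sentences, test_sentences):
    """Return (OOV token rate, average OOVs per test sentence).

    Assumes whitespace tokenization; a real setup would use the same
    tokenizer as the MT system.
    """
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    total_tokens = 0
    total_oovs = 0
    for sent in test_sentences:
        tokens = sent.split()
        total_tokens += len(tokens)
        total_oovs += sum(1 for tok in tokens if tok not in vocab)
    rate = total_oovs / total_tokens if total_tokens else 0.0
    per_sentence = total_oovs / len(test_sentences) if test_sentences else 0.0
    return rate, per_sentence


# Toy data, for illustration only.
train = ["the cat sat", "the dog ran"]
test = ["the cat ran fast", "a dog sat"]
rate, per_sentence = oov_stats(train, test)
print(f"OOV rate: {rate:.1%}, OOVs per sentence: {per_sentence:.1f}")
```

In the toy data, "fast" and "a" are the only test tokens missing from the training vocabulary.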

OOV Examples
Names: Huda
Misspellings: grammer/grammar
Inflections: play/plays/playing
Borrowed words: Halloween
Reinflected borrowings: skirts; Googlear (to Google)
Content words: speculation

Distribution of OOVs (pie chart)
Categories: Named Entities, Borrowed Words, Source Content Words, Misspellings & Typos, Acronyms, Reinflected Borrowings, Numbers & Punctuation
Percentage labels shown: 7%, 36%, 22%, 29%

MT System
Moses (Koehn et al. 2007)
Phrase Based
Large English language model: WMT English 07-12

Methods
Transliteration
Levenshtein distance
Word Embeddings

Transliteration
Examples: Huda, Halloween
Unsupervised Moses mode (Durrani et al. 2014)
Character translation model
Incorporate larger English language model
Uzbek is already written in Latin script, keep original spelling
Generate 1 candidate

Levenshtein distance
Examples: grammer/grammar, play/plays
Minimum number of: insertions, deletions, substitutions

Levenshtein distance
Example: qilyapmiz (doing) and qilyapsiz (doing), Levenshtein distance = 1
Find source words with distance 2 from OOV
Use their English translation as translation candidate
Generate 18 candidates on average
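The candidate generation described on these two slides can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the `lexicon` dictionary (known source word to English translations) is a hypothetical stand-in for whatever the MT system's translation model provides, and the threshold is read here as "at most 2", since the comparison symbol appears to have been lost from the slide text.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def levenshtein_candidates(oov, lexicon, max_distance=2):
    """Return sorted (distance, translation) pairs from known words near the OOV.

    max_distance=2 is an assumption about how the slide's threshold is meant.
    """
    found = set()
    for source_word, translations in lexicon.items():
        d = levenshtein(oov, source_word)
        if d <= max_distance:
            found.update((d, t) for t in translations)
    return sorted(found)


# Hypothetical toy lexicon using the slide's Uzbek example.
lexicon = {"qilyapsiz": ["doing"], "kitob": ["book"]}
print(levenshtein("grammer", "grammar"))             # 1
print(levenshtein_candidates("qilyapmiz", lexicon))  # [(1, 'doing')]
```

A linear scan over the whole source vocabulary is fine for a sketch; a larger vocabulary would likely call for an index such as a trie or a BK-tree.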

Word Embeddings
Example words: rumors, doubts, rumours, suspicions, misgivings, worry, speculation, worried

Word Embeddings
Word2vec (Mikolov et al. 2013): monolingual corpora
Multilingual word vectors (Faruqui & Dyer 2014): monolingual vectors, alignments, Canonical Correlation Analysis (CCA)
Generates 20 candidates

Word Embeddings (diagram)
English Text, via Word2Vec, gives English Vectors; Hindi Text, via Word2Vec, gives Hindi Vectors
CCA over the two vector sets gives an English Projection Matrix and a Hindi Projection Matrix

Word Embeddings (diagram)
English Vectors * English Projection Matrix = Projected English Vectors
Hindi Vectors * Hindi Projection Matrix = Projected Hindi Vectors
KNN between Projected Hindi Vectors and Projected English Vectors gives English Candidates
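The lookup step in this diagram (project both sides, then take the nearest English neighbours of a Hindi word) can be sketched in plain Python. The sketch assumes the two projection matrices have already been learned with CCA, and it uses cosine similarity as the neighbour metric, which the slides do not specify. All names, vectors, and matrices below are toy values invented for illustration.

```python
import math


def project(vector, matrix):
    """Multiply a row vector by a projection matrix given as a list of rows."""
    return [sum(v * row[k] for v, row in zip(vector, matrix))
            for k in range(len(matrix[0]))]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_english(oov_vector, hindi_proj, english_vectors, english_proj, k=20):
    """Return the k English words closest to the projected source-side vector."""
    query = project(oov_vector, hindi_proj)
    scored = [(cosine(query, project(vec, english_proj)), word)
              for word, vec in english_vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Toy 2-dimensional example; real vectors would come from word2vec.
hindi_proj = [[0.0, 1.0], [1.0, 0.0]]    # hypothetical learned matrix (swaps axes)
english_proj = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical learned matrix (identity)
english_vectors = {
    "rumors": [0.9, 0.1],
    "doubts": [0.7, 0.3],
    "crore": [0.0, 1.0],
}
oov_vector = [0.0, 1.0]  # vector of a Hindi word unseen in the parallel data
print(nearest_english(oov_vector, hindi_proj, english_vectors, english_proj, k=2))
```

The default `k=20` mirrors the "Generates 20 candidates" figure on the earlier slide. Note that this only works for OOVs that appear in the monolingual corpus, since they need a word vector.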


Word Embeddings (ranked candidate lists, positions 1 to 10)
First list: 1) doubts 2) rumours 3) suspicions 4) misgivings 5) worry 6) worried 7) speculation
Second list: 1) rumor 2) crore

Integration (diagram)
Hindi OOVs go through Transliteration, Levenshtein, and Embeddings to produce English Candidates
Select among the English Candidates to get the English Translation

Integration
Language Model
Phrase table

Language Model
Large English language model
XML markup in Moses (Koehn & Haddow, 2009)
Selection occurs during decoding
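As a rough sketch of what passing candidates through XML markup might look like, the helper below wraps an OOV token in a tag carrying alternative translations. The attribute layout (`translation` and `prob` values separated by `||`) follows my understanding of the Moses XML input format and should be checked against the Moses documentation. The function name `mark_oov`, the placeholder token, and the uniform probabilities are assumptions; the slides do not say how the candidates were weighted.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_oov(word, candidates):
    """Wrap an OOV token in XML markup listing candidate translations.

    candidates: list of (translation, probability) pairs.
    """
    translations = "||".join(t for t, _ in candidates)
    probs = "||".join(f"{p:g}" for _, p in candidates)
    # quoteattr adds the surrounding quotes and escapes special characters.
    return (f"<np translation={quoteattr(translations)} "
            f"prob={quoteattr(probs)}>{escape(word)}</np>")


# Uniform probabilities are an assumption made for this illustration.
marked = mark_oov("OOVWORD", [("doubts", 0.5), ("rumours", 0.5)])
print(marked)
```

With markup of this kind the decoder sees every candidate, and the language model score decides which one fits the surrounding English context.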

Phrase Table
Secondary Phrase Table only includes OOVs
Features:
Method
Word Vector Distance
Levenshtein distance
Inverse frequency in Monolingual corpus
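One way to picture a secondary phrase table entry is a line in the `source ||| target ||| scores` layout used by Moses phrase tables, with one score per feature listed on the slide. The encoding below (one-hot method indicator, raw distances, `1 / (1 + count)` for inverse frequency) is purely an assumption for illustration; the slides do not say how the authors encoded or scaled these values, and a real table would need whatever encoding the decoder expects.

```python
METHODS = ["transliteration", "levenshtein", "embeddings"]


def phrase_table_line(oov, candidate, method, vector_distance,
                      edit_distance, mono_count):
    """Format one hypothetical secondary phrase table entry.

    Score order: one indicator per method, word vector distance,
    Levenshtein distance, inverse monolingual frequency.
    """
    method_flags = [1.0 if method == m else 0.0 for m in METHODS]
    inverse_freq = 1.0 / (1 + mono_count)  # +1 avoids division by zero
    features = method_flags + [vector_distance, float(edit_distance), inverse_freq]
    scores = " ".join(f"{x:.4f}" for x in features)
    return f"{oov} ||| {candidate} ||| {scores}"


line = phrase_table_line("qilyapmiz", "doing", "levenshtein",
                         vector_distance=0.0, edit_distance=1, mono_count=3)
print(line)
```

Because the features are ordinary phrase table scores, their weights can be tuned together with the rest of the system rather than set by hand.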

Results

Oracle
Upper bound on how well a selection method can do given current generation methods
Select word from list of candidates that is in the reference
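The oracle described here can be sketched in a few lines. The function name, the fallback behaviour (keeping the original OOV token when no candidate matches), and the example data are assumptions for illustration.

```python
def oracle_select(candidates, reference_tokens, fallback):
    """Pick the first candidate that occurs in the reference translation.

    If no candidate matches, return the fallback (for example, the
    untranslated OOV token itself).
    """
    for candidate in candidates:
        if candidate in reference_tokens:
            return candidate
    return fallback


reference = set("there is speculation about the match".split())
print(oracle_select(["doubts", "rumours", "speculation"], reference, "OOVWORD"))
print(oracle_select(["doubts", "rumours"], reference, "OOVWORD"))
```

Since the oracle looks at the reference, it is only a diagnostic: it shows how much a perfect selector could gain from the candidates the generation methods currently produce.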

BLEU - Uzbek (bar chart; y-axis from 8 to 10.5; bars: Baseline, Language Model, Phrase Table, Oracle)

BLEU - Hindi (bar chart; y-axis from 11 to 14; bars: Baseline, Transliteration, Language Model, Phrase Table, Oracle)

Beyond BLEU
Goals:
Generate candidates for each OOV: How well can we generate translation candidates?
Select the best one: How well can we select from the translation candidates?

Coverage
How well can we generate translation candidates?
Was one of the candidates generated by this method in the reference?

Coverage - Uzbek (bar chart; y-axis from 0% to 60%; bars: Levenshtein, Embeddings, Transliteration (copy), All)

Coverage - Hindi (bar chart; y-axis from 0% to 60%; bars: Levenshtein, Embeddings, Transliteration, All)

Accuracy
How well can we select from the translation candidates?
Is the word we selected in the reference?
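The two measures, coverage for generation and accuracy for selection, can be written side by side. This is a sketch assuming one reference token set per OOV occurrence; the function names and toy data are illustrative.

```python
def coverage(candidate_lists, references):
    """Fraction of OOVs for which at least one candidate is in the reference."""
    if not candidate_lists:
        return 0.0
    hits = sum(1 for cands, ref in zip(candidate_lists, references)
               if any(c in ref for c in cands))
    return hits / len(candidate_lists)


def accuracy(selected, references):
    """Fraction of OOVs whose selected translation is in the reference."""
    if not selected:
        return 0.0
    hits = sum(1 for word, ref in zip(selected, references) if word in ref)
    return hits / len(selected)


# One entry per OOV occurrence (toy data).
refs = [{"there", "is", "speculation"},
        {"they", "are", "doing", "well"},
        {"a", "bright", "jacket"}]
cands = [["doubts", "speculation"], ["doing"], ["coat"]]
chosen = ["doubts", "doing", "coat"]
print(coverage(cands, refs), accuracy(chosen, refs))
```

When the selected word always comes from the candidate list, accuracy cannot exceed coverage, so the gap between the two isolates the loss due to selection.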

Accuracy - Uzbek (bar chart; y-axis from 0% to 60%; bars: Baseline, Language Model, Phrase Table, Oracle)

Accuracy - Hindi (bar chart; y-axis from 0% to 60%; bars: Baseline, Transliteration, Language Model, Phrase Table, Oracle)

Conclusion & Future Work
Generate quality translations
Selection does not perform as well
Improved selection methods
More sophisticated embedding projection
Analysis of what methods work on which types of OOVs

Acknowledgement
This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0113. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).

Translation of Unknown Words in Low Resource Languages
Biman Gujral, Huda Khayrallah, and Philipp Koehn
{bgujral1, huda, phi}@jhu.edu
Johns Hopkins University

References
Durrani, Haddow, Koehn, and Heafield. (2014). Edinburgh's Phrase-Based Machine Translation Systems for WMT-14. Workshop on Statistical Machine Translation.
Koehn, Hoang, Birch, Callison-Burch, Federico, Bertoldi, Cowan, Shen, Moran, Zens, Dyer, Bojar, Constantin, and Herbst. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. ACL Interactive Poster and Demonstration Sessions.
Koehn and Haddow. (2009). Edinburgh's Submission to All Tracks of the WMT2009 Shared Task with Reordering and Speed Improvements to Moses. Workshop on Statistical Machine Translation.
Faruqui and Dyer. (2014). Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of EACL.
