Translation of Unknown Words in Low Resource Languages

Biman Gujral, Huda Khayrallah, and Philipp Koehn of Johns Hopkins University presented this talk at AMTA 2016 on translating unknown (out-of-vocabulary, OOV) words in low-resource settings. The study covers Hindi-to-English and Uzbek-to-English translation with a phrase-based Moses MT system. It generates translation candidates for each OOV using transliteration, Levenshtein distance, and word embeddings, then selects among them with a language model or a secondary phrase table.

  • Translation
  • Unknown Words
  • Low Resource Languages
  • MT System
  • Language Processing

Uploaded on Feb 17, 2025 | 0 Views


Presentation Transcript


  1. Translation of Unknown Words in Low Resource Languages. Biman Gujral, Huda Khayrallah, and Philipp Koehn, Johns Hopkins University, 31 October 2016. This talk was presented at AMTA 2016. It is based on this paper: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.pdf (bib: http://www.cs.jhu.edu/~huda/papers/gujral2016AMTA.bib)

  2. Translation of Unknown Words in Low Resource Languages. Biman Gujral, Huda Khayrallah, and Philipp Koehn, Johns Hopkins University, 31 October 2016.

  3. Out of Vocabulary Words (OOVs). Hindi→English: "It [...], graphic [...], Polo T [...], [...], [...], and bright embroidered jackets etc are included." Uzbek→English: "Quvayt o'yinga how ko'ryapmiz with the preparation."

  4. Goals: generate candidates for each OOV; select the best one.

  5. How big is this problem?

  6. Data. Hindi→English (WMT14; news): training 274k sentences; test 2.5k sentences, ~1 OOV/sentence, ~5% OOVs. Uzbek→English (LORELEI; news, Wikipedia, social media): training 55k sentences; test 1k sentences, ~4 OOVs/sentence, ~20% OOVs.
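
The OOV figures on this slide can be read as simple counts over the test set. The sketch below is a minimal illustration, assuming whitespace tokenization and a token-level OOV rate (a test token counts as OOV if its type never appears in the training source side). The function name `oov_stats` and the toy sentences are hypothetical and not from the talk.

```python
def oov_stats(train_sentences, test_sentences):
    """Return (OOV token rate, average OOVs per test sentence).

    Assumes whitespace tokenization; a token is OOV if it never
    occurs in the training sentences.
    """
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    total = 0
    oov = 0
    for sent in test_sentences:
        for tok in sent.split():
            total += 1
            if tok not in vocab:
                oov += 1
    rate = oov / total if total else 0.0
    per_sentence = oov / len(test_sentences) if test_sentences else 0.0
    return rate, per_sentence


# Toy data (placeholder tokens, not real corpus text).
train = ["a b c", "b d"]
test = ["a x c", "y d"]
rate, per_sentence = oov_stats(train, test)
```

Here two of the five test tokens are unseen, so the rate is 0.4 and there is one OOV per sentence on average.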

  7. OOV Examples. Names: Huda. Misspellings: grammer/grammar. Inflections: play/plays/playing. Borrowed words: Halloween. Reinflected borrowings: skirts; Googlear → to Google. Content words: speculation.

  8. Distribution of OOVs [pie chart; slice labels: Named Entities, Borrowed Words, Source Content Words, Misspellings & Typos, Acronyms, Reinflected Borrowings, Numbers & Punctuation; shares shown: 7%, 36%, 22%, 29%]

  9. MT System. Moses (Koehn et al. 2007); phrase-based; large English language model (WMT English 07-12).

  10. Methods

  11. Methods: transliteration, Levenshtein distance, word embeddings.

  12. Transliteration (examples: Huda, Halloween). Unsupervised Moses mode (Durrani et al. 2014); character translation model; incorporate larger English language model. Uzbek is already written in Latin script, so the original spelling is kept. Generates 1 candidate.

  13. Levenshtein distance (examples: grammer/grammar, play/plays). Minimum number of insertions, deletions, and substitutions.

  14. Levenshtein distance. Find source words with distance 2 from the OOV and use their English translations as translation candidates. Generates 18 candidates on average. Example: qilyapmiz and qilyapsiz (Levenshtein distance = 1), both shown with the translation "doing".
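
The Levenshtein step can be sketched as follows. This is a minimal illustration, not the authors' code: the `lexicon` dictionary (known source word to English translations) and the `max_distance` parameter are assumptions, and treating the threshold as an inclusive upper bound is one reading of the slide.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_candidates(oov, lexicon, max_distance=2):
    """Collect English translations of known source words that lie
    within max_distance edits of the OOV (threshold is an assumption)."""
    out = []
    for source_word, translations in lexicon.items():
        if levenshtein(oov, source_word) <= max_distance:
            out.extend(translations)
    return out


# Hypothetical toy lexicon; a real one would come from the trained MT system.
lexicon = {"qilyapsiz": ["doing"], "kitob": ["book"]}
found = levenshtein_candidates("qilyapmiz", lexicon)
```

With this toy lexicon, only `qilyapsiz` is close enough to the OOV `qilyapmiz`, so its translation becomes the single candidate.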

  15. Word Embeddings [word cluster: rumors, doubts, rumours, suspicions, misgivings, worry, speculation, worried]

  16. Word Embeddings. Word2vec (Mikolov et al. 2013): monolingual corpora. Multilingual word vectors (Faruqui & Dyer 2014): monolingual vectors, alignments, Canonical Correlation Analysis (CCA). Generates 20 candidates.

  17. Word Embeddings [diagram]: English Text → Word2Vec → English Vectors; Hindi Text → Word2Vec → Hindi Vectors; CCA over both vector sets → English Projection Matrix and Hindi Projection Matrix.

  18. Word Embeddings [diagram]: English Projection Matrix * English Vectors → Projected English Vectors; Hindi Projection Matrix * Hindi Vectors → Projected Hindi Vectors; KNN over the projected vectors → English Candidates.
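
The projection-and-KNN step in the diagram can be sketched in plain Python. This is an illustrative sketch under several assumptions: the projection matrices are taken as given (learning them with CCA is not shown), the `*` in the diagram is read as a matrix-vector product, and cosine similarity is used as the neighbor metric. The function names and the toy 2-dimensional vectors are hypothetical.

```python
import math


def project(matrix, vector):
    """Multiply a projection matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_english(oov_vector, source_proj, english_vectors, english_proj, k=20):
    """Project the OOV's source-side vector and every English vector into
    the shared space, then return the k most similar English words."""
    query = project(source_proj, oov_vector)
    scored = [(cosine(query, project(english_proj, vec)), word)
              for word, vec in english_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by word
    return [word for _, word in scored[:k]]


# Toy setup: identity projections and made-up vectors.
identity = [[1.0, 0.0], [0.0, 1.0]]
english_vectors = {
    "rumors": [1.0, 0.1],
    "doubts": [0.8, 0.6],
    "crore": [0.0, 1.0],
}
top2 = nearest_english([1.0, 0.0], identity, english_vectors, identity, k=2)
```

In practice the two projection matrices would differ and would be estimated from aligned word pairs, as described on the preceding slides.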

  19. Word Embeddings [word cluster: rumors, doubts, rumours, suspicions, misgivings, worry, speculation, worried]

  20. Word Embeddings. Ranked candidates: 1) doubts 2) rumours 3) suspicions 4) misgivings 5) worry 6) worried 7) speculation 8) 9) 10). [word cluster: rumors, doubts, rumours, suspicions, misgivings, worry, speculation, worried]

  21. Word Embeddings. Ranked candidates: 1) 2) 3) 4) 5) 6) 7) 8) 9) 10). [word cluster: rumors, rumor, crore, doubts, rumours, suspicions, misgivings, worry, speculation, worried]

  22. Word Embeddings. First ranked list: 1) doubts 2) rumours 3) suspicions 4) misgivings 5) worry 6) worried 7) speculation 8) 9) 10). Second ranked list: 1) rumor 2) crore 3) 4) 5) 6) 7) 8) 9) 10).

  23. Integration

  24. Integration [diagram]: Hindi OOVs → Transliteration, Levenshtein, Embeddings → English Candidates → Select → English Translation.

  25. Integration: language model, phrase table.

  26. Language Model. Large English language model; XML markup in Moses (Koehn & Haddow, 2009); selection occurs during decoding.

  27. Phrase Table. Secondary phrase table that only includes OOVs. Features: method, word vector distance, Levenshtein distance, inverse frequency in monolingual corpus.

  28. Results

  29. Oracle. Upper bound on how well a selection method can do given current generation methods: select the word from the list of candidates that is in the reference.
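
The oracle selection described on this slide can be sketched as a lookup against the reference. This is a minimal sketch, assuming a tokenized reference and exact string matching; the function name `oracle_select` and the example sentence are hypothetical, and returning the first matching candidate is an arbitrary tie-breaking choice.

```python
def oracle_select(candidates, reference_tokens):
    """Return the first candidate that appears in the reference
    translation, or None if no candidate matches."""
    reference = set(reference_tokens)
    for candidate in candidates:
        if candidate in reference:
            return candidate
    return None


# Toy example: the candidate list is in ranked order.
candidates = ["doubts", "rumours", "speculation"]
reference = "the speculation grew".split()
picked = oracle_select(candidates, reference)
```

Because the oracle consults the reference, it is only usable for evaluation, which is why the slide treats it as an upper bound rather than a selection method.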

  30. BLEU - Uzbek [bar chart; y-axis: BLEU, 8 to 10.5; bars: Baseline, Language Model, Phrase Table, Oracle]

  31. BLEU - Hindi [bar chart; y-axis: BLEU, 11 to 14; bars: Baseline, Transliteration, Language Model, Phrase Table, Oracle]

  32. Beyond BLEU. Goals: generate candidates for each OOV (How well can we generate translation candidates?); select the best one (How well can we select from the translation candidates?).

  33. Coverage. How well can we generate translation candidates? Was one of the candidates generated by this method in the reference?
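
Coverage as defined here can be sketched as a per-OOV hit rate. The sketch assumes the metric is averaged over OOV occurrences and that matching is exact on tokens; the talk does not spell out either detail. The name `coverage` and the toy data are hypothetical.

```python
def coverage(items):
    """Fraction of OOVs for which at least one generated candidate
    appears in the reference.

    items: list of (candidates, reference_tokens) pairs, one per OOV.
    """
    if not items:
        return 0.0
    hits = sum(1 for cands, ref in items if set(cands) & set(ref))
    return hits / len(items)


# Toy data: the first OOV is covered, the second is not.
items = [
    (["doubts", "speculation"], ["the", "speculation", "grew"]),
    (["rumor", "crore"], ["prices", "fell"]),
]
cov = coverage(items)
```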

  34. Coverage - Uzbek [bar chart; y-axis: 0% to 60%; bars: Levenshtein, Embeddings, Transliteration (copy), All]

  35. Coverage - Hindi [bar chart; y-axis: 0% to 60%; bars: Levenshtein, Embeddings, Transliteration, All]

  36. Accuracy. How well can we select from the translation candidates? Is the word we selected in the reference?
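
Accuracy differs from coverage in that it scores the single word chosen for each OOV rather than the whole candidate list. A minimal sketch, assuming per-OOV averaging and exact token matching (both assumptions), with the hypothetical name `selection_accuracy`:

```python
def selection_accuracy(selections):
    """Fraction of OOVs whose selected translation appears in the reference.

    selections: list of (selected_word, reference_tokens) pairs, one per OOV.
    A selected_word of None (nothing chosen) counts as a miss.
    """
    if not selections:
        return 0.0
    correct = sum(1 for word, ref in selections
                  if word is not None and word in set(ref))
    return correct / len(selections)


# Toy data: one correct selection, one wrong, one OOV left untranslated.
selections = [
    ("speculation", ["the", "speculation", "grew"]),
    ("doubts", ["the", "speculation", "grew"]),
    (None, ["prices", "fell"]),
]
acc = selection_accuracy(selections)
```

Since the selected word is drawn from the candidate list, accuracy computed this way cannot exceed the coverage of the same candidates.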

  37. Accuracy - Uzbek [bar chart; y-axis: 0% to 60%; bars: Baseline, Language Model, Phrase Table, Oracle]

  38. Accuracy - Hindi [bar chart; y-axis: 0% to 60%; bars: Baseline, Transliteration, Language Model, Phrase Table, Oracle]

  39. Conclusion & Future Work. Conclusions: the generation methods produce quality translations; selection does not perform as well. Future work: improved selection methods; more sophisticated embedding projection; analysis of which methods work on which types of OOVs.

  40. Acknowledgement. This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0113. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).

  41. Translation of Unknown Words in Low Resource Languages. Biman Gujral, Huda Khayrallah, and Philipp Koehn, {bgujral1, huda, phi}@jhu.edu, Johns Hopkins University.

  42. References
      Durrani, Haddow, Koehn, and Heafield. (2014). Edinburgh's Phrase-Based Machine Translation Systems for WMT-14. Workshop on Statistical Machine Translation.
      Koehn, Hoang, Birch, Callison-Burch, Federico, Bertoldi, Cowan, Shen, Moran, Zens, Dyer, Bojar, Constantin, and Herbst. (2007). Moses: Open source toolkit for Statistical Machine Translation. ACL Interactive Poster and Demonstration Sessions.
      Koehn and Haddow. (2009). Edinburgh's submission to all tracks of the WMT2009 Shared Task with Reordering and Speed Improvements to Moses. Workshop on Statistical Machine Translation.
      Faruqui and Dyer. (2014). Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of EACL.
