Insights into Dutch-to-Afrikaans Conversion Techniques

Explore the development of a Dutch-to-Afrikaans convertor, focusing on the closely-related nature of the two languages and the potential for recycling existing technologies to aid in resource-scarce language processing. Various methods like bootstrapping and algorithm optimization are discussed, along with the complex concept of closely-relatedness among languages. The study highlights a hypothesis proposing that recycling software applicable to a well-resourced language, like Dutch, may expedite the development process for a resource-scarce language, such as Afrikaans.


Uploaded on Sep 18, 2024 | 0 Views


Presentation Transcript


  1. Some Thoughts on a Dutch-to-Afrikaans Convertor. Gerhard B van Huyssteen, Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria (gvhuyssteen@csir.co.za); Suléne Pilon, Centre for Text Technology (CTexT), North-West University, Potchefstroom (sulene.pilon@nwu.ac.za)

  2. Overview: Context/-cept; Closely-relatedness; Afrikaans and Dutch (orthographic level, morphosyntactic level, lexical level); Dutch-to-Afrikaans convertor (D2AC); Evaluation; Future work. Antwerp; 2010/02/02

  3. Some context. Broad question in development of HLTs: How can we fast-track the development of resources for resource-scarce languages? Various methods: bootstrapping, algorithm optimisation, active learning, etc. Our approach: recycling of existing technologies.

  4. Basic hypothesis: If languages L1 and L2 are similar enough, then it should be easier and quicker to recycle software applicable to L1 than to rewrite it from scratch for L2 [Rayner et al., 1997]. L1 = well-resourced language (e.g. Dutch); L2 = resource-scarce language (e.g. Afrikaans).

  6. A word on closely-relatedness. Difficult to define closely-relatedness; no definitive, objective measures yet. Distinguish between [Hajič et al., 2000]: language variants (considered one language); closely-related languages (similarity: morphology/lexis); very close languages (similarity: morphology/syntax/lexis); related languages (shared origin; no similarities necessarily). Languages shown as examples: Flemish, Dutch, Afrikaans, German, Swedish.

  8. Example: Die kind is mooi. Die<PRON> kind<N> is<V> mooi<ADJ>; Die<DET> kind<N> is<V> mooi<ADJ>.

  9. Problems: potential low accuracy on real-world data; could still be tedious/expensive.

  10. Solution: convert text. Potential low accuracy on real-world data -> potential higher accuracy. Could still be tedious/expensive -> automated: quick and cheap.

  11. Example: Hy speel met die vis / Hij speelt met de vis. Hij<PRON> speelt<V> met<P> de<DET> vis<N>; Hy<PRON> speel<V> met<P> die<DET> vis<N>.

  12. Focus: Hy speel met die vis / Hij speelt met de vis.

  13. This research: Hy speel met die vis / Hij speelt met de vis. Aim: proof-of-concept for conversion; not on recycling yet. Focus on grapheme, morphosyntactic and lexical levels only. Focus on concept, rather than on linguistics. Would be more difficult to convert Afrikaans to Dutch, but: discover linguistic issues for A2DC.

  14. Dutch and Afrikaans as very close/closely related languages. See paper/hand-out for some differences and similarities between Dutch and Afrikaans on: graphological (orthographic and morphosyntactic) level; lexical level. Note some important concepts: identical cognates; non-identical cognates; false friends; non-cognates.

  15. ...for non-linguists. Identical cognates: graphologically identical word-forms, due to linguistic inheritance (man; kind; vis; mooi; kelder; olie; straat). Non-identical cognates: etymologically related, but differ graphologically in a systematic way (speel vs speelt/spelen/speeld/speelde/speelden). False friends: same word-form, different meanings, due to semantic broadening/narrowing or referent changes (Du. Hij heeft de trein amper gehaald vs Afr. Hy het die trein amper gehaal). Non-cognates: same meaning, graphologically unrelated (morsjors vs knoeipot; seekoei vs nijlpaard; maalvleis vs gehakt).

  16. D2AC: Overview. Implemented as a series of Perl scripts. Input (List.D2AC.Ndl.txt): tokenised text; no spelling errors, proper names, acronyms, abbreviations. Output (List.D2AC.Afr.txt): tokens with tags. <<D2ALex>>: converted as false friend/non-cognate (sometimes also non-identical cognate); <<AfrLex>>: converted as identical cognate; <<Translated>>: translated by grapheme or morpheme rules; <<Untranslated>>: input text could not be converted.
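
The four output tags suggest a cascade of lookups and rules. A minimal Python sketch of such a cascade follows; it is not the authors' Perl code. The lookup order (D2ALex, then AfrLex, then rules), the toy lexicon contents, and the name `convert_token` are all assumptions made for illustration.

```python
import re

# Toy stand-ins for the resources named on the slide (contents are illustrative).
D2A_LEX = {"mop": "grap", "slijterij": "drankwinkel"}  # bilingual list
AFR_LEX = {"man", "kind", "vis", "met", "speel"}       # monolingual Afrikaans list
RULES = [(r"lijk$", "lik"), (r"sch", "sk")]            # morpheme and grapheme rules


def convert_token(token):
    """Return (afrikaans_form, tag) for one Dutch token.

    The order of the three stages is an assumption, not taken from the slides.
    """
    if token in D2A_LEX:
        return D2A_LEX[token], "D2ALex"
    if token in AFR_LEX:
        return token, "AfrLex"
    converted = token
    for pattern, replacement in RULES:
        converted = re.sub(pattern, replacement, converted)
    if converted != token:
        return converted, "Translated"
    return token, "Untranslated"


for word in ["mop", "vis", "vriendelijk", "xyz"]:
    print(word, convert_token(word))
```

A token that no stage changes keeps its Dutch form and receives the Untranslated tag, which matches the slide's description of that tag.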

  17. D2AC: Lexicons. D2ALex.txt: bilingual list covering false friends and non-cognates (sometimes also non-identical cognates). Sample entries: hoes duvetoortreksel; mop grap; slijterij drankwinkel; pad pad//padda; sliep het geslaap. Compiled manually (internet/language learning books); 2,738 Dutch entries.
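
The sample entries hint at a simple line format. The Python sketch below reads such lines under two assumptions that the slide does not state explicitly: the first whitespace-separated field is the Dutch word and the rest is the Afrikaans equivalent, and `//` separates alternative translations. The function name `parse_d2alex` is hypothetical.

```python
def parse_d2alex(lines):
    """Map each Dutch headword to a list of Afrikaans alternatives.

    Assumed format: "<dutch> <afrikaans>[//<afrikaans>...]", one entry per line.
    """
    lexicon = {}
    for line in lines:
        dutch, _, afrikaans = line.strip().partition(" ")
        if not dutch or not afrikaans:
            continue  # skip blank or incomplete lines
        lexicon[dutch] = [alt.strip() for alt in afrikaans.split("//")]
    return lexicon


entries = [
    "hoes duvetoortreksel",
    "mop grap",
    "slijterij drankwinkel",
    "pad pad//padda",
    "sliep het geslaap",
]
lexicon = parse_d2alex(entries)
print(lexicon["pad"])    # two alternatives
print(lexicon["sliep"])  # a multi-word Afrikaans equivalent
```

Keeping the alternatives as a list leaves the choice between them (for example pad vs padda) to a later step.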

  18. D2AC: Lexicons. AfrLex.txt: monolingual list covering identical cognates. Sample entries: man; kind; vis; met; speel. Full lexicon of Afrikaanse Speltoetser 3.0; 385,599 entries.

  19. D2AC: Conversion rules. MorphRules.txt: used by MorphModule.pm; handles systematic differences on morphosyntactic level; only morphs and allomorphs included; easy format for non-programmers. Simple regular expressions in format: "lijk$ lik"; "atie$ asie". Currently 90 rules.
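
To show how a two-column rule file like this could be used, here is a small Python sketch. It assumes each line holds a regular expression and its replacement separated by whitespace, and it applies only the first rule that matches. The first-match policy and the names `load_rules` and `apply_first_matching` are assumptions, not details from the slides.

```python
import re

# The two sample rules from the slide, in the assumed file format.
MORPH_RULES_TEXT = """\
lijk$ lik
atie$ asie
"""


def load_rules(text):
    """Parse 'pattern replacement' lines into compiled (regex, replacement) pairs."""
    rules = []
    for line in text.splitlines():
        if line.strip():
            pattern, replacement = line.split()
            rules.append((re.compile(pattern), replacement))
    return rules


def apply_first_matching(word, rules):
    """Apply the first rule whose pattern matches; leave the word alone otherwise."""
    for pattern, replacement in rules:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word


rules = load_rules(MORPH_RULES_TEXT)
print(apply_first_matching("vriendelijk", rules))
print(apply_first_matching("informatie", rules))
```

With a first-match policy the order of lines in the file decides the outcome whenever two patterns fit the same word, which is one way rule-ordering errors can arise.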

  20. D2AC: Conversion rules. G2GRules.txt: used by G2GModule.pm; converts Dutch graphemes to Afrikaans graphemes; only rules that apply on sub-morphemic level (e.g. clusters of vowels and consonants). Simple regular expressions in format: "auw ou"; "sch sk". Currently 55 rules.
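
Because grapheme rules are not anchored to a word edge, several of them can fire inside one word. The Python sketch below applies every rule in sequence, everywhere in the word. Cumulative application is an assumption, and `apply_g2g` is a hypothetical name; only the two rules themselves come from the slide.

```python
import re

# The two sample grapheme rules from the slide.
G2G_RULES = [("auw", "ou"), ("sch", "sk")]


def apply_g2g(word, rules=G2G_RULES):
    """Apply every grapheme rule in order, at every position where it matches."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word


print(apply_g2g("blauw"))
print(apply_g2g("school"))
```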

  21. Evaluation. Two evaluation experiments: word-level evaluation; sentence-level evaluation. Some tweaking done since previous version: clean-up and refinement of G2GRules and MorphRules; revision of rule-ordering; adding of additional rules; enlargement of D2ALex; and slightly larger AfrLex became available. Also more comprehensive evaluation.

  22. Experiment 1. Aim: intended use is orthographic conversion; evaluate translation accuracy on word-level. 500 words randomly extracted from the 5,000 most frequently occurring words in CGN; removed proper names, etc.; topped-up from CGN frequency list. Resulting 500 words translated manually.

  23. Experiment 1: Results

      Tag               # tags assigned   # correct tags
      <D2ALex>                       60               60
      <AfrLex>                      138              138
      <Translated>                  174              161
      <Untranslated>                128                0
      TOTAL                         500              359

      Accuracy: 71.8%. Precision (conversion rules): 92.5%.
      Examples of incorrect translations: Du. blijk 'seems' -> Afr. blik 'tin/glance'; Du. trokken 'pulled' -> Afr. trokke 'trucks'.
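
The two summary figures follow directly from the counts on this slide. As a quick check, assuming accuracy means correct tags over all 500 tokens and rule precision means correct over assigned within the Translated class, the arithmetic can be sketched in Python:

```python
# (assigned, correct) per tag, copied from the results slide.
results = {
    "D2ALex": (60, 60),
    "AfrLex": (138, 138),
    "Translated": (174, 161),
    "Untranslated": (128, 0),
}

total_assigned = sum(assigned for assigned, _ in results.values())
total_correct = sum(correct for _, correct in results.values())

accuracy = total_correct / total_assigned
translated_assigned, translated_correct = results["Translated"]
rule_precision = translated_correct / translated_assigned

print(f"accuracy: {accuracy:.1%}")              # 359 / 500
print(f"rule precision: {rule_precision:.1%}")  # 161 / 174
```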

  24. Experiment 1: Error Analysis

      Cause                # of <Untranslated>   % of <Untranslated>
      Flemish word                           1                 0.78%
      Not in MorphRules                      8                 6.25%
      Rule-ordering                         27                21.09%
      Not in D2ALex                         42                32.81%
      Ambiguity of -en                      50                39.06%
      TOTAL                                128                  100%

  25. Experiment 1: Error Analysis (table as on slide 24). Rule-ordering: Du. dochters should have been translated as Afr. dogters 'daughters' by the same rule that translates Du. achtien to Afr. agtien 'eighteen'. However, dochters is left untranslated, while achtien is translated correctly.

  26. Experiment 1: Error Analysis (table as on slide 24). Not in D2ALex: conjugation of strong verbs (e.g. Du. viel vs. Afr. geval 'fell'); other inflections (e.g. plurals, participles, degrees of comparison, etc.). Need to be expanded paradigmatically.

  27. Experiment 1: Error Analysis (table as on slide 24). Ambiguity of -en: retain -en: Du. rijkswapen vs. Afr. rykswapen 'state insignia'; delete -n: Du. vluchtelingen vs. Afr. vlugtelinge 'refugees'; delete -en: Du. verschijnen vs. Afr. verskyn 'appear'; replace -en with -s: Du. kinderen vs. Afr. kinders 'children'.
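
The four outcomes for -en mean that no single suffix rule can be right for every word. One hypothetical way to explore the problem, not described in the slides, is to generate all four candidate endings and keep those found in an Afrikaans word list. The Python sketch below does this with a toy lexicon; `en_candidates` and `resolve` are illustrative names, and grapheme conversion is left out.

```python
def en_candidates(word):
    """Return the four possible treatments of a final -en listed on the slide."""
    if not word.endswith("en"):
        return [word]
    stem = word[:-2]
    return [
        word,        # retain -en
        stem + "e",  # delete -n
        stem,        # delete -en
        stem + "s",  # replace -en with -s
    ]


def resolve(word, afrikaans_lexicon):
    """Keep only the candidates attested in the (toy) Afrikaans lexicon."""
    return [c for c in en_candidates(word) if c in afrikaans_lexicon]


toy_lexicon = {"kinders", "boek", "boeke"}
print(resolve("kinderen", toy_lexicon))  # one survivor
print(resolve("boeken", toy_lexicon))    # two survivors: still ambiguous
```

The second call shows why a lexicon filter alone would not settle every case: both the singular or verb form and the plural can be valid Afrikaans words.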

  28. Experiment 1: Conclusion. 71.8% accurate on word-level translation: promising results, but need to be improved. D2ALex.txt: refine and expand; conversion modules: refine and expand. Ambiguity of -en will be problematic for A2DC.

  29. Experiment 2. Aim: D2AC not intended as Dutch-Afrikaans MT system; very rudimentary experiment on sentence level; get an impression of validity of D2AC (given Dutch-Afrikaans Google Translate). 200 Dutch sentences (1,967 words) taken from Metis II evaluation data; translated to Afrikaans by 4 human translators; used as reference translations to calculate BLEU scores.

  30. Experiment 2: Results

      Measure                  D2AC      GT
      % of 1-gram matches     58.88   71.96
      % of 2-gram matches     29.36   46.13
      % of 3-gram matches     16.76   32.88
      % of 4-gram matches      9.59   23.05
      BLEU                     0.22    0.39

      Unsurprisingly, GT outperforms D2AC. Low 3- and 4-gram matches for D2AC and GT: importance of syntactic conversion.
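
The n-gram match percentages are the building blocks of BLEU. The Python sketch below computes clipped (modified) n-gram precision for one sentence against one or more references. It is a simplified illustration: full BLEU also combines the four precisions geometrically and applies a brevity penalty, and the slide's figures are corpus-level, so this does not reproduce them. The example sentences are made up from words used elsewhere in the deck.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against a list of references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "hy speelt met die vis".split()   # one word left unconverted
references = ["hy speel met die vis".split()]
for n in range(1, 5):
    print(n, modified_precision(candidate, references, n))
```

A single wrong token removes one unigram match but breaks every longer n-gram that spans it, which is why the higher-order percentages fall off so quickly.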

  31. Experiment 2: Error analysis. D2AC and GT: syntactic conversion (word order, double negation, etc.); large number of words left untranslated; not all translation alternatives are available; compounds translated ineffectively. GT: Du. vervoersituaties -> Afr. vervoer situasies.

  32. Future work. Refine and extend bilingual translation list (D2ALex.txt); refine and re-order rules; additional rules; iteration (keeping greediness in mind); automatic rule discovery, using Default&Refine; automated disambiguation; development of A2DC; incorporate GT from start; experiment with technology recycling.

  33. Acknowledgements. National Research Foundation (FA207041600015). Kirsten Arnauts, Veronique de Gres, and Shanna Pettens; Martin Puttkammer, Carla-Mari van den Heever, and Daan Wissing.
