Impact of Learner Errors on Automatic Part-of-Speech Tagging with CLAWS


This study investigates the impact of learner errors on the performance of the CLAWS tagging system when tagging Estonian learner English. Learner language, characterized as dynamic and variable, plays a crucial role in the accuracy of part-of-speech tagging. Factors influencing errors in learner language tagging include unknown words, differing part-of-speech distributions, and characteristic sequences. The CLAWS7 system, known for its high accuracy in tagging native English texts, faces challenges when handling learner errors in the Estonian context.


Uploaded on Oct 05, 2024 | 0 Views



Presentation Transcript


  1. AUTOMATIC PART-OF-SPEECH TAGGING WITH CLAWS7: IMPACT OF LEARNER ERRORS
     Liina Tammekänd (Lecturer of English and Linguistics)
     Reeli Torn-Leesik (Lecturer of English and Linguistics)
     27.04.2023

  2. Aim and research question
     The aim of this study is to investigate the impact of learner errors on automatic part-of-speech tagging.
     Which types of learner errors have a marked impact on the performance of the CLAWS tagging system when tagging Estonian learner English?

  3. Learner language
     Learner language (interlanguage) (Selinker 1972, Selinker and Rutherford 1992, Corder 1981) is a language system that learners build on the linguistic input from the language they are learning.
     Characteristics:
     - not a steady-state product but dynamic in nature;
     - exhibits variation.

  4. Learner corpora in Estonia
     - The Estonian Interlanguage Corpus of Tallinn University (EIC)
     - The Estonian learner language corpus of the University of Tartu
     - The Tartu Learner Corpus of Spanish as a L3+

  5. POS-tagging and CLAWS7
     A POS-tagger assigns each word a tag that identifies the part-of-speech category that the word belongs to and collects other grammatical category information regarding it without input from the user.
     CLAWS7 uses statistical calculations and has an accuracy of 95-98% in tagging native English texts (Garside 1996, UCREL Team 1996).

  6. POS-tagging learner language
     Factors causing errors in POS-tagging learner language (Nagata et al. 2018):
     - many unknown words caused either by spelling or grammar errors;
     - different POS distributions (concentrate as a verb vs concentrate as a noun);
     - characteristic POS-sequences (Aarts and Granger 1998).
     Learner language tagging accuracy is 95% (De Haan 2000, van Rooy and Schäfer 2002, 2003 as cited in van Rooy 2015).

  7. CLAWS7 and tagging Estonian learner English
     Tammekänd and Torn-Leesik (2022) tested the suitability of the CLAWS7 automatic POS-tagging system for tagging the Estonian learner English corpus (TCELE).
     The error rate was 4.01%.
     Problems: the tagger
     - assigned incorrect tags to determiners, adverbs, general adverbs, and singular common nouns;
     - successfully assigned general noun and verb tags but experienced problems when attempting to analyse words at a more granular level;
     - had difficulties in differentiating between nouns and adverbs, as well as between conjunctions and adverbs;
     - has no separate tag for this/that in the (relative) pronominal function.

  8. Earlier research (1)
     Errors in texts produced by language learners are divided into two broad categories:
     - spelling errors: typing (keyboard mistakes), spacing, and capitalisation errors;
     - language errors: the learner's morphological, syntactic and lexical errors
     (de Haan 2002, van Rooy and Schäfer 2002, Mizumoto and Nagata 2017, Nagata et al. 2018)

  9. Earlier research (2)
     Tagging unknown and known words may lead to errors (Mizumoto and Nagata 2017, Nagata et al. 2018).
     Differences in research designs may lead to different conclusions:
     - Mizumoto and Nagata (2017): spelling errors pose a major difficulty to automatic POS-tagging;
     - Van Rooy and Schäfer (2002): language errors have a higher impact on the tagger's performance.

  10. Material and method (1)
     Tartu Corpus of Estonian Learner English
     - essays written as part of the University of Tartu's English Language and Literature BA programme entrance exam
     - 75,818 words
     - essays of up to 250-300 words based on a short journalistic text
     - CEFR B2

  11. Material and method (2)
     Error as deviation from the norms of the target language (Ellis 1994)
     The steps of analysis:
     1. A TCELE sample of 24,812 words (92 essays) was POS-tagged using the CLAWS7 tag set.
     2. Learner errors in the sample were manually identified by the authors of the paper.
     3. Based on previous research (de Haan 2002, van Rooy and Schäfer 2002, Mizumoto and Nagata 2017, Nagata et al. 2018), errors were classified into two main groups, and an error taxonomy was created.
     4. POS-tagged and error-tagged samples were collated and compared to map correlations between learner errors and tagger errors.
     5. Learner error taxa correlated with a notable increase in the tagger's error rate were identified.
     6. Hypotheses were formulated concerning the possible impact of learner errors on tagger errors.
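Steps 4 and 5 amount to joining two annotation layers token by token and counting, per learner-error group, how often the tagger went wrong. The following is a minimal Python sketch of that idea, not the authors' procedure. The record layout (`form`, `claws_tag`, `expected_tag`, `learner_error`) and the function name are assumptions made for illustration, and the four records are toy data built from forms quoted on the result slides, with expected tags supplied by hand.

```python
from collections import Counter

def tagger_error_rate_by_group(tokens):
    """Return, per learner-error group, the share of erroneous forms
    that received a tag other than the expected one."""
    totals = Counter()
    mistagged = Counter()
    for tok in tokens:
        group = tok["learner_error"]
        if group is None:
            continue  # token carries no learner error, so it is not counted
        totals[group] += 1
        if tok["claws_tag"] != tok["expected_tag"]:
            mistagged[group] += 1
    return {group: mistagged[group] / totals[group] for group in totals}

# Toy records (hypothetical layout, hand-assigned expected tags).
records = [
    {"form": "Litertaure", "claws_tag": "NP1", "expected_tag": "NN1", "learner_error": "spelling"},
    {"form": "i", "claws_tag": "MC1", "expected_tag": "PPIS1", "learner_error": "spelling"},
    {"form": "lead", "claws_tag": "VVN", "expected_tag": "VVN", "learner_error": "language"},
    {"form": "has", "claws_tag": "VHZ", "expected_tag": "VHZ", "learner_error": None},
]

rates = tagger_error_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of erroneous forms mistagged")
```

Keeping the learner-error label and the two tags on the same record makes the comparison a single pass over the sample; a real collation would also need to align tokenisation between the two layers.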

  12. Learner error taxonomy (1): spelling errors
     Subcategory: example
     - Omission of a hyphen: "Whether it is an artistic work of fiction or a real life experience," (correct: "a real-life experience").
     - Extra hyphen: "When studying literature through-out the years," (correct: "throughout").
     - Nonword: "Litertaure has been around for hundreds of years" (correct: "literature").
     - Real word: "It also played a huge role it their entertainment" (correct: "in").
     - Space merging: "the negativity towards it stems from not getting to do it out of freewill," (correct: "free will").
     - Extra space: "becoming book worms" (correct: "bookworms").
     - Capitalisation: "... but i believe that the negativity towards it" (correct: "I").
     - Compound spelling error: "And id say the general consecion on the role of literature has stayed the same." (correct: "I'd").

  13. Learner error taxonomy (2): language errors
     Subcategory: example
     - Verb's grammatical category: "In the past literature has been regarded /.../" (correct: "was regarded").
     - Verb pattern errors: "I definitely see the creative community be more active," (correct: "being").
     - Genitive construction errors: "Which in todays currency is about 40$," (correct: "in today's currency").
     - Noun phrase errors: "the basic knowledges among us" (correct: "knowledge").
     - Quantifier errors: "There are less and less libraries" (correct: "fewer and fewer").
     - Article errors: "I am of an opinion" (correct: "of the opinion").
     - Pronoun errors: "Not too long ago, there was a time where most people couldn't even read." (correct: "when").
     - Adjective and adverb errors: "publishing a book has never been more easier." (correct: "easier").

  14. Learner error taxonomy (3): language errors
     Subcategory: example
     - Preposition errors: "to build a stronger foundation to the world we live in now" (correct: "for").
     - Conjunction errors: "that is a huge reason for the change in attitude towards literature, technology" (correct: "literature and technology").
     - Sentence structure errors: "Though literature has moved on from being physical to being more online, people tend to not have as much interest in it, as it had a hundred years ago" (correct: "tend not to have").
     - Derivation errors: "was perceived as merely entertainment." (correct: "mere").
     - Lexical choice errors: "can be seen in the numbers of people who have a literary degree." (correct: "literature degree").
     - Miscellaneous errors: "100 years ago the of literature was considered a universal language" (correct: "ago literature").

  15. Results (1)
     Categories and number of learner errors in the TCELE sample:

     Categories         No of errors
     Language errors    560
     Spelling errors    118
     TOTAL              678

     The percentage of language errors that defied the tagger's analysis and were attributed an incorrect POS tag was relatively low (16 of 560 errors, or 2.8%). As to spelling errors, roughly every fifth such error (26 of 118, or 22%) resulted in a word that was wrongly tagged.

  16. Results (2)
     Tagging errors were deemed to have been caused by learner errors when the tagger assigned the wrong POS-tag to the learner's erroneous form. If the learner's form is a real English word (although incorrect in the context) and the tagger tagged it as such, this is not considered a tagging error, as in (1) and (2).
     (1) I_PPIS1 think_VV0 that_DD1 literature_NN1 is_VBZ very_RG important_JJ to_II a_AT1 persons_NN2 life_NN1 ,_, because_CS literature_NN1 nurtures_NN2 and_CC helps_VVZ our_APPGE creativity_NN1 flow_VVI .
     (2) Nowadays_RT the_AT study_NN1 of_IO literature_NN1 has_VHZ once_RR21 again_RR22 reclaimed_VVN it_PPH1 's_VBZ rightful_JJ place_NN1 in_II both_DB2 academia_NN1 and_CC with_IW the_AT general_JJ public_NN1 ._.
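Examples (1) and (2) use a horizontal layout in which each token is written as `word_TAG`. As a small sketch of how such a line can be read into word and tag pairs for comparison, assuming whitespace-separated tokens with the tag after the last underscore; `parse_tagged` is an illustrative helper name, not part of CLAWS:

```python
def parse_tagged(line):
    """Split a horizontal word_TAG line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore so that punctuation tokens
        # such as ",_," and "._." are handled like any other token.
        word, sep, tag = token.rpartition("_")
        if not sep:
            continue  # token without a tag, e.g. a stray full stop
        pairs.append((word, tag))
    return pairs

example = "to_II a_AT1 persons_NN2 life_NN1 ,_, because_CS literature_NN1"
pairs = parse_tagged(example)
for word, tag in pairs:
    print(f"{word}\t{tag}")
```

In this fragment of example (1), `persons` receives `NN2`. The form is a real English plural, so by the criterion on the slide this does not count as a tagging error even though the learner intended a genitive.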

  17. Results (3)
     Within the category of language errors, the most frequent subcategories:
     - sentence structure: 112
     - the use of articles: 105
     - prepositions: 80
     - verb categories of tense, mood, number and voice: 75
     Although the numbers of errors in these subcategories are relatively high, the resulting forms predominantly still received the correct POS-tag.
     (3) This_DD1 has_VHZ lead_VVN to_II the_AT downfall_NN1 of_IO the_AT quality_NN1 of_IO literature_NN1 nowadays_RT.

  18. Results (4)
     In the category of spelling errors, nonwords (42) and omitted hyphens (31) were the most frequent correlates of tagger errors, but the tagger returned a markedly higher number of contextually incorrect tags in the case of the missing hyphen.
     (4) Whether_CSW it_PPH1 is_VBZ an_AT1 artistic_JJ work_NN1 of_IO fiction_NN1 or_CC a_AT1 real_JJ life_NN1 experience_NN1.
     In the case of nonwords, only 5 out of 42 instances led to a tagging error (12%).
     (5) Litertaure_NP1 has_VHZ been_VBN around_RP for_IF hundreds_NNO2 of_IO years_NNT2 ._.

  19. Results (5)
     Although the number of space merger errors is small (6), every third one (33.3%) correlates with a tagging error.
     (6) Especially_RR books_VVZ that_CST are_VBR atleast_NN1 one_MC1 hundred_NNO years_NNT2 old_JJ
     There were 11 capitalisation errors, 5 of which affected the tagger's recognition of the resulting form.
     (7) of_IO hatred_NN1 for_IF having_VHG to_TO study_VVI the_AT artform_NN1 ,_, but_CCB i_MC1 believe_VV0 that_CST the_AT negativity_NN1 towards_II it_PPH1
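The per-subcategory shares quoted on the result slides can be recomputed directly from the counts. A short sketch, under two stated assumptions: the nonword figures (5 of 42) come from the preceding slide, and the count of 2 mistagged space mergers is inferred from "every third one" of 6 rather than stated outright.

```python
# (mistagged, total) per spelling-error subcategory.
counts = {
    "nonword": (5, 42),
    "space merging": (2, 6),    # 2 is inferred from "every third one" of 6
    "capitalisation": (5, 11),
}

def share(mistagged, total):
    """Percentage of learner errors in a subcategory that were mistagged."""
    return 100 * mistagged / total

rates = {name: round(share(m, t), 1) for name, (m, t) in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

On these counts, capitalisation errors mislead the tagger most often in relative terms, although the absolute numbers are small.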

  20. Conclusion
     The data allowed learner errors to be classified into two major groups: language errors and spelling errors.
     The total number of learner errors in the sample was 678: 560 language errors and 118 spelling errors.
     Only 16 (2.8%) of the 560 language errors were misanalysed by the tagger.
     The tagger was misled by 26 (22%) of the 118 spelling errors.
     Correcting learner spelling errors would increase the tagging accuracy of the CLAWS7 automatic POS-tagging system.
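One way to picture the final point is a normalisation pass that runs before the text is sent to the tagger. The sketch below is only an illustration and not a method described in the study: the `CORRECTIONS` table is a hypothetical hand-made mapping built from forms quoted on the slides, and the rule for lowercase "i" is deliberately naive (it would also capitalise the "i" in "i.e.").

```python
import re

# Hypothetical correction table, filled with forms quoted on the slides.
CORRECTIONS = {
    "litertaure": "literature",
    "atleast": "at least",
    "freewill": "free will",
}

def normalise(text):
    """Apply simple spelling corrections word by word before tagging."""
    def fix(match):
        word = match.group(0)
        if word == "i":
            return "I"  # standalone lowercase first-person pronoun
        replacement = CORRECTIONS.get(word.lower())
        if replacement is None:
            return word  # unknown form: leave untouched
        if word[0].isupper():
            # Keep the capital of a sentence-initial form.
            replacement = replacement[0].upper() + replacement[1:]
        return replacement
    return re.sub(r"[A-Za-z]+", fix, text)

print(normalise("Litertaure has been around for hundreds of years ."))
print(normalise("but i believe that books are atleast one hundred years old"))
```

A fixed lookup table only covers errors that have already been identified by hand; whether such a step raises tagging accuracy on unseen essays would have to be measured separately.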

  21. References
     Aarts, J., Granger, S. 1998. Tag sequences in learner corpora: A key to interlanguage grammar and discourse. Learner English on Computer. S. Granger (ed.). London: Routledge, 132-141. https://doi.org/10.4324/9781315841342-10
     Corder, P. 1981. Error Analysis and Interlanguage. Oxford: Oxford University Press. https://doi.org/10.3138/cmlr.40.4.649
     de Haan, P. 2000. Tagging non-native English with the TOSCA ICLE tagger. Corpus Linguistics and Linguistic Theory. Language and Computers 33. Ch. Mair, M. Hundt (eds.). Amsterdam: Rodopi, 69-79. https://doi.org/10.1163/9789004490758_007
     Ellis, R. 1994. The Study of Second Language Acquisition. Oxford: Oxford University Press.
     Garside, R. 1996. The robust tagging of unrestricted text: the BNC experience. Using corpora for language research: Studies in the Honour of Geoffrey Leech. J. Thomas and M. Short (eds.). London: Longman, 167-180.
     Mizumoto, T., Nagata, R. 2017. Analyzing the Impact of Spelling Errors on POS-Tagging and Chunking in Learner English. Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications, 54-58. https://aclanthology.org/W17-5909 (31.01.2023)
     Nagata, R., Mizumoto, T., Kikuchi, Y., Kawasaki, Y., Funakoshi, K. 2018. A POS tagging model designed for learner English. Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text. W. Xu, A. Ritter, T. Baldwin, A. Rahimi (eds.). Brussels: Association for Computational Linguistics, 39-48. https://doi.org/10.18653/v1/W18-6106
     Selinker, L. 1972. Interlanguage. International Review of Applied Linguistics 10 (3), 209-231. https://doi.org/10.1515/iral.1972.10.1-4.209
     Selinker, L., Rutherford, W. E. 1992. Rediscovering Interlanguage. London: Routledge. https://doi.org/10.4324/9781315845685
     Tammekänd, L., Torn-Leesik, R. 2022. POS-tagging Tartu Corpus of Estonian Learner English with CLAWS7. Estonian Papers in Applied Linguistics 18, 263-278. https://doi.org/10.5128/ERYa18.15
     UCREL Team 1996. A Post-Editor's Guide to Claws7 Tagging. http://www.natcorp.ox.ac.uk/docs/claws7.html (15.01.2023)
     van Rooy, B., Schäfer, L. 2002. The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies 20 (4), 325-335. https://doi.org/10.2989/16073610209486319
     van Rooy, B. 2015. Annotating learner corpora. The Cambridge Handbook of Learner Corpus Research. S. Granger, G. Gilquin, F. Meunier (eds.). Cambridge: Cambridge University Press, 79-106. https://doi.org/10.1017/CBO9781139649414.005

  22. Thank you for listening!
