Exploring Idioms and Conversational Routines in Dutch Language

Delve into the treatment of idioms and conversational routines in Woordcombinaties as discussed by Carole Tiberius and Lut Colman. The project focuses on challenges, proposed solutions, and extensions related to lemmatisation, variations, and data models. Explore a reference work for understanding words in context and discover various word combinations in Dutch. Gain inspiration from related language learning and lexical computing resources.


Uploaded on Sep 17, 2024 | 0 Views


Presentation Transcript


  1. The treatment of idioms and conversational routines in Woordcombinaties. Carole Tiberius and Lut Colman (carole.tiberius@ivdnt.org, lut.colman@ivdnt.org). Keel ja keelekasutajad / Language: The User in Focus, 27-28 April 2023, Tallinn.

  2. Introduction to the project Woordcombinaties. Treatment of idioms and conversational routines in Woordcombinaties: current; envisaged extensions. Challenges faced: lemmatisation of idioms and conversational routines; variations and extensions. Proposed solutions: changes to the data model; enable subentries in the DWS (dictionary writing system).

  3. Woordcombinaties (https://woordcombinaties.ivdnt.org). Reference work for using and understanding words in context. Resource for teaching and testing materials. For advanced learners of Dutch and native speakers. Verbs (1), nouns (2).

  4. Combination: any meaningful and statistically relevant combination of words with spaces, excluding compounds and combinations not formed in Dutch. Types: collocation (offer + support, advice, service, opportunity); multiword lexical unit (yellow spot); idiom (hand something to someone on a silver platter); proverb (all good things come to an end); formula (have a nice day!); (valency) pattern (something takes place vs someone takes his place).

  5. Not included: compounds without spaces, e.g. appeltaart ("apple pie"), kruidje-roer-me-niet ("touch-me-not", plant/touchy person); loan combinations, e.g. ad hoc, ad fundum, out of the blue.

  6. Inspiration. SkELL: Sketch Engine for Language Learning (Lexical Computing, Brno); PDEV: Pattern Dictionary of English Verbs (Patrick Hanks, Wolverhampton); E-VALBU: Das elektronische Valenzwörterbuch deutscher Verben (IDS, Mannheim); CVVD: Contrastive Verb Valency Dictionary (Contragram, Gent); SweCcn: Swedish Constructicon of Språkbanken (Gothenburg); StringNet Navigator 4.0 (David Wible, Taiwan); Combinatiewoordenboek de Kleijn (Piet de Kleijn); etc.

  7. TNE & CPA (Patrick Hanks). TNE, Theory of Norms and Exploitations: words have meaning potentials rather than meanings; meanings are evoked by context; norms are normal phraseological patterns, the main carriers of meaning; exploitations are creative uses of normal patterns. CPA, Corpus Pattern Analysis: a corpus-driven technique for mapping meaning onto words in context; Sketch Engine supports annotation of usage patterns in corpus samples; a pattern is a form-meaning pair; slots in patterns are populated by semantic types from an ontology ([[Human]], [[Animal]], [[Furniture]], ...); semantic types are categories of lexical sets (collocates), e.g. [[Furniture]]: table, chair, bed, ...
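
The idea of a pattern as a form-meaning pair whose slots are filled by semantic types can be illustrated with a minimal Python sketch. It is not the CPA or Sketch Engine implementation: the `Pattern` class, the `ONTOLOGY` dictionary, the [[Human]] lexical set and the example verb are all assumptions made for illustration. Only the [[Furniture]] set (table, chair, bed) comes from the slide.

```python
from dataclasses import dataclass

# Hypothetical mini-ontology: semantic type -> lexical set (collocates).
# Only the Furniture set is taken from the slide; the Human set is invented.
ONTOLOGY = {
    "Furniture": {"table", "chair", "bed"},
    "Human": {"teacher", "child", "referee"},
}


@dataclass(frozen=True)
class Pattern:
    """A pattern as a form-meaning pair with typed argument slots."""
    verb: str
    subject_type: str
    object_type: str
    meaning: str

    def matches(self, subject: str, verb: str, obj: str) -> bool:
        # A clause fits the pattern when the verb is identical and both
        # arguments belong to the lexical sets of the required types.
        return (
            verb == self.verb
            and subject in ONTOLOGY.get(self.subject_type, set())
            and obj in ONTOLOGY.get(self.object_type, set())
        )


# Illustrative pattern (not taken from PDEV or Woordcombinaties).
move = Pattern(
    verb="move",
    subject_type="Human",
    object_type="Furniture",
    meaning="[[Human]] changes the position of [[Furniture]]",
)
```

With these assumed sets, `move.matches("child", "move", "chair")` holds, while swapping subject and object does not, because "chair" is not in the [[Human]] set.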

  8. Tools & methodology. Corpus: approx. 200 million tokens; mainly newspaper and web; material from the Netherlands and Belgium; parsed with the Alpino parser. Patterns: CPA, Corpus Pattern Analysis (P. Hanks), a corpus-driven approach, corpus-based in Woordcombinaties; ontology to label argument positions and sets of collocates with a semantic type (Hanks, E. Ježek); SKEMA editor, Sketch Engine Manual Annotation (V. Baisa). Examples & combinations: GDEX, Good Dictionary Examples; TBL, TickBox Lexicography; word sketches. SwingLex: Dictionary Writing System (in-house).

  10. Quickscan, Patterns, Collocates

  11. Linked to INT Spelling Database

  12. (whistle) (arbiter) (bullet) (public) (ref) (referee) (wind) (bird)

  15. Idioms and conversational routines: current status. Included at the microstructural level: patterns; combinations. Only idioms, proverbs and formulas idiomatic.

  16. Idioms and conversational routines: envisaged. Treatment at macrostructural level. Specific search options: search for idioms based on image categories (e.g. body parts, food) and sense categories (e.g. have a property), e.g. een vinger in de pap hebben ("have a finger in the pie"); search for conversational routines based on speech act (e.g. greeting), e.g. goedemorgen ("good morning"). + two tabs.

  17. Conversational routines. Pragmatic meaning; pragmatic search. Access through predefined lists of speech acts. I want to: ask information; give information; apologise; express emotion (anger, joy, doubt, surprise); ... Example: Hoe laat vertrekt/gaat de trein naar x? ("What time does the train for x depart?")
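
Access through predefined lists of speech acts amounts to a lookup from a speech-act label to the routines filed under it. The sketch below is a hypothetical illustration in Python, not the Woordcombinaties search implementation: `SPEECH_ACT_INDEX` and `routines_for` are invented names, and filing the train question under "ask information" is an assumption based on the slide layout. The two routines themselves are the examples from the slides.

```python
# Hypothetical index: speech act -> conversational routines.
# The two routines are the slide examples; the assignment of the train
# question to "ask information" is an assumption.
SPEECH_ACT_INDEX = {
    "greeting": ["goedemorgen"],
    "ask information": ["Hoe laat vertrekt/gaat de trein naar x?"],
}


def routines_for(speech_act: str) -> list:
    """Return the routines listed under a speech act (empty if unknown)."""
    # Normalise case and whitespace so that 'Greeting ' finds 'greeting'.
    key = " ".join(speech_act.lower().split())
    return list(SPEECH_ACT_INDEX.get(key, []))
```

A speech act with no routines recorded yet, such as "apologise" in this toy index, simply returns an empty list.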

  18. Challenge: lemmatisation of MWEs (multiword expressions). What is the canonical form? Which lemma form do we use? Cf. the task in the UniDive COST action on harmonizing lemmatisation rules (for words and MWEs) and lexical features across languages.

  19. Why is this challenging? Variation, e.g. een schat van een baby, kind, man, vrouw ("a gem of a baby, child, man, woman"); iemand naar zijn hand zetten ("force someone to one's will"); iemand of iets naar zijn hand zetten ("force someone or something to one's will"); iets naar je hand zetten ("force something to your will"). Even more difficult for constructional idioms, e.g. refl. zich + RESULT + V: zich naar/rot/ziek lachen ("split one's sides laughing").

  20. Canonical forms of MWEs in computational approaches. DUCAME (Dutch Canonicalised Multiword Expressions): dd:[die] vlieger zal 0niet opgaan ("that's (simply) not on"). PARSEME:

  21. Canonical forms of MWEs in the lexicographic literature. There are no ready-made solutions in lexicography for representing the different types of variation of idioms. [...] Idioms must be presented in their full form and in their usual constructions, i.e. the syntactic valency of the idiom must be shown (e.g. look at/see sth through the rose-tinted glass). However, it is also important not to include too much context, as the idiom should not appear to be more restricted contextually than it actually is. Note that adding this information to the lemma form of MWEs is not in line with lemmatisation practices for words, where syntactic valency is not normally part of the lemma form. (Svensén 2009:199) Harras and Proost's (2002:289) Citation Form Maxim: Idioms should basically be entered in their basic or canonical form. This means that the citation form should contain only general pronouns like someone or somebody and something. VP-idioms should basically be entered in the infinitive form of the head verb. Where deviations from the canonical citation form are required, these should be in accordance with the following submaxims: (1) The citation form must indicate as many restrictions as possible. [...] (2) Morphological restrictions should also be indicated by the citation form. [...] (3) The citation form should not be too restrictive. [...]

  22. Canonical forms of MWEs in the lexicographic literature. Vrbinc and Vrbinc (2016) emphasise that variation in MWEs should be included in a way that is least ambiguous and most user-friendly so that users are made aware of the possible alternatives.

  23. Lemmatisation of MWEs in Woordcombinaties. Goal: to find a balance between complexity and applicability but maintain readability. Complexity: Die Hand darauf/dadrauf/auf das/ein Versprechen geben ("to give one's hand on it/sth./on a promise", quoted from Ermakova et al. 2022:854). Applicability: DUCAME: dd:[die] vlieger zal 0niet opgaan. PARSEME:

  24. Lemmatisation of MWEs in Woordcombinaties. In Woordcombinaties a human-friendly lemma form will be complemented with a pattern form, which is compatible with more NLP-oriented work. Preliminary guidelines I: MWEs are entered in their canonical form, e.g. infinitive form for verbal MWEs. Variable but obligatory arguments, and variable parts of arguments and complements, are indicated by means of dummies (e.g. iemand de ogen openen, "open someone's eyes") or other generic forms such as zijn (e.g. zijn gezicht laten zien, "show one's face") and zich (e.g. zich op de vlakte houden, "not commit oneself").
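
To make the role of dummies concrete, here is a minimal Python sketch that separates the variable slots of a lemma form from its fixed components. It is only an illustration under stated assumptions: the `DUMMIES` set and both function names are hypothetical, the slides do not specify what the pattern form looks like, and a purely token-based check is naive (zijn, for instance, is also the Dutch verb "to be").

```python
# Hypothetical set of dummy/generic forms mentioned on the slide.
DUMMIES = {"iemand", "iets", "zijn", "zich"}


def split_slots(lemma: str) -> list:
    """Pair every token of a lemma form with a flag: True = variable slot."""
    return [(token, token in DUMMIES) for token in lemma.split()]


def fixed_components(lemma: str) -> list:
    """Return only the lexically fixed tokens of a lemma form."""
    return [token for token, is_slot in split_slots(lemma) if not is_slot]
```

Applied to the slide examples, `fixed_components("iemand de ogen openen")` keeps de, ogen and openen, and `fixed_components("zich op de vlakte houden")` keeps op, de, vlakte and houden.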

  25. Lemmatisation of MWEs in Woordcombinaties. Preliminary guidelines II: A fixed order of components is followed as much as possible: e.g. place and direction complements in verbal MWEs will usually occur before the verb and fixed prepositions after it (e.g. in de bres springen voor iemand of iets, "throw oneself into the breach for someone or something"). Articles are not included in the lemma form of noun MWEs (e.g. blinde vink, "some meat") except for those cases where the article is part of the construction, e.g. een schat van een kind ("a gem of a child"). Canonical forms, variants and lexical realisations of constructional MWEs will be lemmatised separately and linked.

  26. Lemmatisation of MWEs in Woordcombinaties. Preliminary guidelines III: Negation of MWEs: lemmatised separately and linked, e.g. geen kaas gegeten hebben van iets vs. kaas gegeten hebben van iets ("not have a clue about something" vs. "have a clue about something"). Extensions of MWEs are still subject to further research (cf. e.g. Ermakova et al. 2022): de gordiaanse knoop ("the Gordian knot"); de gordiaanse knoop doorhakken ("cut the Gordian knot"); de gordiaanse knoop ontwarren ("disentangle the Gordian knot").
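
"Lemmatised separately and linked" can be pictured as two independent entries that point at each other. The Python sketch below is a hypothetical illustration, not the Woordcombinaties data model: the `MweEntry` class, the `link` helper and the relation names `negated_form` and `positive_form` are invented for this example. The lemmas and glosses are those on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class MweEntry:
    """One separately lemmatised MWE with named links to related entries."""
    lemma: str
    gloss: str
    links: dict = field(default_factory=dict)  # relation name -> list of lemmas


def link(a: MweEntry, b: MweEntry, relation: str, inverse: str) -> None:
    """Record a bidirectional link: a --relation--> b and b --inverse--> a."""
    a.links.setdefault(relation, []).append(b.lemma)
    b.links.setdefault(inverse, []).append(a.lemma)


positive = MweEntry("kaas gegeten hebben van iets", "have a clue about something")
negative = MweEntry("geen kaas gegeten hebben van iets", "not have a clue about something")

# Relation names are assumptions chosen for readability.
link(positive, negative, relation="negated_form", inverse="positive_form")
```

Keeping the link bidirectional means a user who lands on either form can reach the other, which is one way to read the "linked" requirement in the guideline.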

  27. Subentries. Idioms and conversational routines are edited as subentries, and links can be added to these subentries in the DWS. Main entry: add, open, delete subentry.

  30. Subentries and link to GiGaNT-Molex. Lemma forms are linked to Molex (the spelling database). The individual components of the MWE are also linked to their respective lemmas in Molex. houden => own ID 105952 een vinger aan de pols 61767 87148 ("have/keep a finger on the pulse")
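
The two kinds of link described here (the MWE lemma form with its own ID, plus one link per component) can be sketched as a small Python structure. This is an assumption-laden illustration rather than the actual SwingLex or GiGaNT-Molex schema: the class, field and method names are hypothetical, the full lemma form een vinger aan de pols houden is inferred from the slide, and every ID below is a placeholder, not a real Molex identifier.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Subentry:
    """An MWE subentry with its own ID and per-component lemma links."""
    lemma_form: str
    own_id: int                    # placeholder, not a real Molex ID
    component_ids: Dict[str, int]  # component form -> placeholder lemma ID

    def unlinked_components(self) -> List[str]:
        # Components of the lemma form that have no lemma link yet.
        return [t for t in self.lemma_form.split() if t not in self.component_ids]

    def is_fully_linked(self) -> bool:
        return not self.unlinked_components()


entry = Subentry(
    lemma_form="een vinger aan de pols houden",
    own_id=1,
    component_ids={"een": 2, "vinger": 3, "aan": 4, "de": 5, "pols": 6, "houden": 7},
)
```

A check such as `entry.is_fully_linked()` could serve as a simple editorial validation step before a subentry is published.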

  31. Data model: idioms and conversational routines

  32. Preliminary conclusions. Record occurrences of idioms and conversational routines in a systematic way in Woordcombinaties. Full treatment of idioms and conversational routines: separate module. More, and more diverse, corpus data needed. Woordcombinaties: a unique point of access for anyone who wants to learn more about Dutch phraseology.

  33. On behalf of the Woordcombinaties team: Lut Colman, Jan Niestadt, Carole Tiberius
