Idioms and Conversational Routines in Dutch Language

 
Carole Tiberius and Lut Colman
 
 
Keel ja keelekasutajad / Language: The User in Focus
27.-28. aprill 2023 / 27-28 April 2023
Tallinn
 
1
 
The treatment of idioms and conversational routines in Woordcombinaties
 
Introduction to the project Woordcombinaties
Treatment of idioms and conversational routines in Woordcombinaties
o Current
o Envisaged extensions
Challenges faced:
o Lemmatisation of idioms and conversational routines
o Variations and extensions
Proposed solutions:
o Changes to the data model
o Enable subentries in the DWS
 
 
2
 
 
https://woordcombinaties.ivdnt.org
Reference work for using and understanding words in context
Resource for teaching and testing materials
Advanced learners of Dutch; native speakers
Verbs (1), nouns (2)
 
3
 
 
Combination: any meaningful and statistically relevant combination of words with spaces, excluding compounds and combinations not formed in Dutch:

o Collocation: offer + support, advice, service, opportunity
o Multiword lexical unit: yellow spot
o Idiom: hand something to someone on a silver platter
o Proverb: all good things come to an end
o Formula: have a nice day!
o (Valency) pattern: something takes place vs someone takes his place
someone
 
 
4
 
Not included

o Compounds without spaces: appeltaart (‘apple pie’), kruidje-roer-me-niet (‘touch-me-not’, ‘plant/touchy person’)
o Loan combinations: ad hoc, ad fundum, out of the blue
 
5
 
Inspiration

SkELL: Sketch Engine for Language Learning (Lexical Computing, Brno)
PDEV: Pattern Dictionary of English Verbs (Patrick Hanks, Wolverhampton)
E-VALBU: Das elektronische Valenzwörterbuch deutscher Verben (IDS, Mannheim)
CVVD: Contrastive Verb Valency Dictionary (Contragram, Gent)
SweCcn: Swedish Constructicon of Språkbanken (Gothenburg)
StringNet Navigator 4.0 (David Wible, Taiwan)
Combinatiewoordenboek de Kleijn (Piet de Kleijn)
Etc.
 
6
 
TNE & CPA (Patrick Hanks)
 
TNE: Theory of Norms and Exploitations
Words have meaning potentials rather than meanings
Meanings are evoked by context
Norms are ‘normal’ phraseological patterns as main carriers of meaning
Exploitations are creative uses of normal patterns
CPA: Corpus Pattern Analysis
Corpus-driven technique for mapping meaning onto words in context
Sketch Engine supports annotation of usage patterns in corpus samples
Pattern: form-meaning pair
Slots in patterns populated by semantic types from an ontology:
 [[Human]], [[Animal]], [[Furniture]],…
Semantic types are categories of lexical sets (collocates):
[[Furniture]]: table, chair, bed, …
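To make the notion of a pattern concrete, here is a minimal Python sketch. It is not the CPA or Sketch Engine implementation. The example pattern `[[Human]] sits on [[Furniture]]`, the contents of `ONTOLOGY` other than the [[Furniture]] set, and the names `Pattern`, `slot_type` and `matches` are assumptions made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy ontology: semantic type -> lexical set (collocates).
# Only the Furniture set is taken from the slide; the rest is invented.
ONTOLOGY = {
    "Human": {"teacher", "child", "referee"},
    "Animal": {"dog", "bird"},
    "Furniture": {"table", "chair", "bed"},
}


@dataclass(frozen=True)
class Pattern:
    """A form-meaning pair; slots in the form are written as [[SemanticType]]."""
    form: tuple
    meaning: str


def slot_type(token: str) -> Optional[str]:
    """Return the semantic type name if the token is a [[Type]] slot, else None."""
    if token.startswith("[[") and token.endswith("]]"):
        return token[2:-2]
    return None


def matches(pattern: Pattern, words: list) -> bool:
    """Check whether a word sequence instantiates the pattern."""
    if len(words) != len(pattern.form):
        return False
    for token, word in zip(pattern.form, words):
        sem_type = slot_type(token)
        if sem_type is None:
            # Fixed lexical item: must match literally.
            if token != word:
                return False
        elif word not in ONTOLOGY.get(sem_type, set()):
            # Slot: the word must belong to the lexical set of the type.
            return False
    return True


# Hypothetical pattern, for illustration only.
sits_on = Pattern(("[[Human]]", "sits", "on", "[[Furniture]]"),
                  "be seated on a piece of furniture")

print(matches(sits_on, ["child", "sits", "on", "chair"]))
print(matches(sits_on, ["child", "sits", "on", "bird"]))
```

A real ontology would be hierarchical and a real matcher would work on parsed corpus lines rather than flat word lists; the sketch only shows how a slot restricts its fillers to a lexical set.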
 
 
7
 
Tools & methodology

Examples & Combinations

o GDEX: Good Dictionary Examples
o TBL: TickBox Lexicography
o Word sketches
o SwingLex Dictionary Writing System (in-house)

Patterns

o CPA: Corpus Pattern Analysis (P. Hanks). Corpus-driven approach > corpus-based in Woordcombinaties
o Ontology: label argument positions and sets of collocates with a semantic type (Hanks, E. Ježek)
o SKEMA-editor: Sketch Engine Manual Annotation (V. Baisa)
 
 
 
8
 
Corpus
Approx. 200 million tokens
Mainly newspaper and web
Material from the Netherlands and Belgium
parsed with Alpino parser
 
9
 
o
Quickscan
Patterns
Collocates
 
10
 
11
 
Linked to INT Spelling Database
 
12
 
Glosses shown on the slide: whistle, arbiter, bullet, public, ref, referee, bird, wind
 
13
 
14
 
Idioms and conversational routines: current status
 
Included at the microstructural level
 
15
 
Patterns
 
Combinations
 
idiomatic
 
Only idioms, proverbs and formulas
 
Idioms and conversational routines: envisaged

Treatment at macrostructural level
Specific search options
o Search for idioms based on image categories (e.g. body parts, food) and sense categories (e.g. have a property): een vinger in de pap hebben (‘have a finger in the pie’)
o Search for conversational routines based on speech act (e.g. greeting): goedemorgen (‘good morning’)
 
16
 
+ two tabs
 
Conversational routines
Pragmatic meaning → pragmatic search
 
Access through predefined lists of speech acts
 
I want to
o ask information
o give information
o apologise
o express emotion
o ...

Hoe laat vertrekt/gaat de trein naar x? (‘What time does the train for x depart?’)
 
17
 
anger
joy
doubt
surprise
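
The pragmatic search described on this slide can be sketched as a lookup from a speech act to the routines that realise it. This is only an illustration under assumptions: the names `Routine`, `ROUTINES` and `search`, and the idea of storing a speech act as a path such as ("express emotion", "joy"), are hypothetical and not taken from the Woordcombinaties data model. The two routines are the examples given on the slides.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Routine:
    form: str
    gloss: str
    # Hypothetical encoding: a path through the predefined speech-act lists,
    # e.g. ("express emotion", "joy").
    speech_act: tuple


ROUTINES = [
    Routine("goedemorgen", "good morning", ("greeting",)),
    Routine("Hoe laat vertrekt/gaat de trein naar x?",
            "What time does the train for x depart?",
            ("ask information",)),
]


def search(speech_act: tuple) -> list:
    """Return routines whose speech-act path starts with the requested path."""
    depth = len(speech_act)
    return [r for r in ROUTINES if r.speech_act[:depth] == speech_act]


print([r.form for r in search(("greeting",))])
print(search(("express emotion",)))
```

Matching on a path prefix means that a search for ("express emotion",) would also return routines filed under a subcategory such as anger or joy, once such routines are recorded.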
 
Challenge
 
Lemmatisation of MWEs
o What is the canonical form?
o Which lemma form do we use?

Cf. task in UniDive COST action on harmonizing lemmatisation rules (for words and MWEs) and lexical features across languages
 
18
 
Why is this challenging?
 
Variation
E.g. een schat van een baby, kind, man, vrouw (‘a gem of a baby, child, man, woman’)

iemand naar zijn hand zetten (‘force someone to one’s will’)
iemand of iets naar zijn hand zetten (‘force someone or something to one’s will’)
iets naar je hand zetten (‘force something to your will’)

Even more difficult for constructional idioms
E.g. refl zich + RESULT + V
zich naar/rot/ziek lachen (‘split one’s sides laughing’)
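
One way to see how quickly variation multiplies is to expand a slash notation such as zich naar/rot/ziek lachen into its individual realisations. The function name `expand_variants` and the convention that a slash only separates single-word alternatives are assumptions for this sketch; alternatives written with of (as in iemand of iets) or with commas are not handled.

```python
from itertools import product


def expand_variants(form: str) -> list:
    """Expand slash-separated single-word alternatives into full variants."""
    # Each whitespace-separated token becomes a list of one or more options.
    slots = [token.split("/") for token in form.split()]
    # The Cartesian product of the options gives every realisation.
    return [" ".join(choice) for choice in product(*slots)]


for variant in expand_variants("zich naar/rot/ziek lachen"):
    print(variant)
```

Each additional variable position multiplies the number of realisations, which is part of why a single canonical lemma form is hard to choose.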
 
19
 
Canonical forms of MWEs in computational approaches

o DUCAME: Dutch Canonicalised Multiword Expressions
dd:[die] vlieger zal 0niet opgaan (‘that’s (simply) not on’)
o PARSEME:
 
20
 
Canonical forms of MWEs in the lexicographic literature

There are no ready-made solutions in lexicography for representing the different types of variation of idioms. […] Idioms must be presented in their full form and in their usual constructions, i.e. the syntactic valency of the idiom must be shown (e.g. ‘look at/see sth through the rose-tinted glass’). However, it is also important not to include too much context, as the idiom should not appear to be more restricted contextually than it actually is.

Note that adding this information to the lemma form of MWEs is not in line with lemmatisation practices for words, where syntactic valency is not normally part of the lemma form.
(Svensén 2009:199)

Harras and Proost’s (2002:289) Citation Form Maxim:
Idioms should basically be entered in their basic or canonical form. This means that the citation form should contain only general pronouns like someone or somebody and something. VP-idioms should basically be entered in the infinitive form of the head verb. Where deviations from the canonical citation form are required, these should be in accordance with the following submaxims:
(1) The citation form must indicate as many restrictions as possible. […]
(2) Morphological restrictions should also be indicated by the citation form. […]
(3) The citation form should not be too restrictive. […]
 
21
 
Canonical forms of MWEs in the lexicographic literature

Vrbinc and Vrbinc (2016) emphasise that variation in MWEs should be included in a way that is least ambiguous and most user-friendly so that users are made aware of the possible alternatives.
 
22
 
Lemmatisation of MWEs in Woordcombinaties

Goal: to find a balance between complexity and applicability but maintain readability

Complexity
Die Hand darauf/dadrauf/auf das/ein Versprechen … geben (‘to give one’s hand on it/sth./on a promise…’; quoted from Ermakova et al. 2022:854).

Applicability
DUCAME: dd:[die] vlieger zal 0niet opgaan.
PARSEME:
 
23
 
Lemmatisation of MWEs in Woordcombinaties

In Woordcombinaties a human-friendly lemma form will be complemented with a pattern form, which is compatible with more NLP-oriented work.

Preliminary guidelines I:
MWEs are entered in their canonical form, e.g. infinitive form for verbal MWEs.

Variable but obligatory arguments, and variable parts of arguments and complements, are indicated by means of dummies (e.g. iemand de ogen openen ‘open someone’s eyes’) or other generic forms such as zijn (e.g. zijn gezicht laten zien ‘show one's face’) and zich (e.g. zich op de vlakte houden ‘not commit oneself’).
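
The slides do not specify what the pattern form looks like. Purely as an illustration of how a human-friendly lemma form with dummies could be turned into something more machine-oriented, the sketch below marks the generic forms as labelled slots. The labels in `DUMMIES`, the bracket notation and the function name `to_pattern_form` are all assumptions, not the Woordcombinaties format.

```python
# Generic forms that appear in the example lemma forms on the slides.
# The slot labels on the right are hypothetical.
DUMMIES = {
    "iemand": "SOMEONE",
    "iets": "SOMETHING",
    "zijn": "POSS",
    "zich": "REFL",
}


def to_pattern_form(lemma: str) -> list:
    """Replace dummy tokens in a lemma form by labelled slots."""
    return [f"[{DUMMIES[token]}]" if token in DUMMIES else token
            for token in lemma.split()]


print(to_pattern_form("iemand de ogen openen"))
print(to_pattern_form("zich op de vlakte houden"))
```

A plain token lookup like this is too naive for real data: zijn is also the verb ‘to be’, so a realistic conversion would need part-of-speech information to decide whether a token is a dummy.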
 
24
 
Lemmatisation of MWEs in Woordcombinaties

Preliminary guidelines II:
A fixed order of components is followed as much as possible: e.g. place and direction complements in verbal MWEs will usually occur before the verb and fixed prepositions after it (e.g. in de bres springen voor iemand of iets ‘throw oneself into the breach for someone or something’).

Articles are not included in the lemma form of noun MWEs (e.g. blinde vink ‘some meat’), except where the article is part of the construction, as in een schat van een kind (‘a gem of a child’).

Canonical forms, variants and lexical realisations of constructional MWEs will be lemmatised separately and linked.
 
25
 
Lemmatisation of MWEs in Woordcombinaties

Preliminary guidelines III:
Negation of MWEs: lemmatised separately and linked, e.g. geen kaas gegeten hebben van iets vs. kaas gegeten hebben van iets (‘not have a clue about something’ vs. ‘have a clue about something’)

Extensions of MWEs … still subject to further research (cf. e.g. Ermakova et al. 2022)
de gordiaanse knoop (‘the Gordian knot’)
de gordiaanse knoop doorhakken (‘cut the Gordian knot’)
de gordiaanse knoop ontwarren (‘disentangle the Gordian knot’)
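
"Lemmatised separately and linked" can be pictured as two independent entries that point at each other. The sketch below is a hypothetical in-memory model: the class names `MweEntry` and `Lexicon` and the relation label "negation" are assumptions, and the actual data model in the DWS may differ.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class MweEntry:
    lemma: str
    gloss: str
    # Relation name -> lemmas of the linked entries (hypothetical structure).
    links: dict = field(default_factory=dict)


class Lexicon:
    def __init__(self) -> None:
        self.entries = {}

    def add(self, lemma: str, gloss: str) -> MweEntry:
        """Create a separate entry for one lemma form."""
        entry = MweEntry(lemma, gloss)
        self.entries[lemma] = entry
        return entry

    def link(self, lemma_a: str, lemma_b: str, relation: str) -> None:
        """Link two separately lemmatised entries in both directions."""
        for source, target in ((lemma_a, lemma_b), (lemma_b, lemma_a)):
            self.entries[source].links.setdefault(relation, set()).add(target)


lexicon = Lexicon()
lexicon.add("kaas gegeten hebben van iets", "have a clue about something")
lexicon.add("geen kaas gegeten hebben van iets", "not have a clue about something")
lexicon.link("kaas gegeten hebben van iets",
             "geen kaas gegeten hebben van iets",
             "negation")

print(lexicon.entries["geen kaas gegeten hebben van iets"].links)
```

The same linking mechanism could carry other relation labels, for example for variants of a constructional MWE, but which relations are needed is a design decision the slides leave open.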
 
26
 
Subentries
 
Idioms and conversational routines are edited as subentries and
links can be added to these subentries in the DWS.
 
27
 
add, open, delete subentry
 
main entry
 
 
28
 
add subentry
 
 
29
 
Subentries and link to GiGaNT-Molex

Lemma forms are linked to Molex (the spelling database).
The individual components of the MWE are also linked to their respective lemmas in Molex.

een vinger aan de pols houden (‘have/keep a finger on the pulse’) => own ID
IDs shown on the slide: 87148, 61767, 105952
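
The two-level linking described here (the MWE has its own ID, and each component points to a lemma in Molex) can be sketched as a small data structure. Everything in the sketch is hypothetical: the class names, the field names and especially the ID values, which are placeholders and not real Molex identifiers. The slide does not make clear which of its IDs belongs to which component, so none of them is reused here.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    form: str
    molex_id: Optional[int]  # None while the component is not yet linked


@dataclass(frozen=True)
class MweLemma:
    mwe_id: int
    components: tuple

    @property
    def lemma(self) -> str:
        """The human-readable lemma form, built from the components."""
        return " ".join(c.form for c in self.components)

    def unlinked(self) -> list:
        """Forms of the components that still lack a Molex link."""
        return [c.form for c in self.components if c.molex_id is None]


# Placeholder IDs only; they do not correspond to real database entries.
entry = MweLemma(
    mwe_id=1,
    components=(
        Component("een", None),
        Component("vinger", 1001),
        Component("aan", None),
        Component("de", None),
        Component("pols", 1002),
        Component("houden", 1003),
    ),
)

print(entry.lemma)
print(entry.unlinked())
```

Keeping the component links next to the MWE's own ID lets an editor check at a glance which parts of a subentry still need to be connected to the spelling database.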
 
30
 
Data model idioms and conversational routines
 
31
 
Preliminary conclusions
 
Record occurrences of idioms and conversational routines in a systematic way in Woordcombinaties
Full treatment of idioms and conversational routines in a separate module
More, and more diverse, corpus data needed

Woordcombinaties: a unique point of access for anyone who wants to learn more about Dutch phraseology.
 
32
 
33
 
 
On behalf of the Woordcombinaties team:
Lut Colman, Jan Niestadt, Carole Tiberius

Uploaded on Sep 17, 2024 | 0 Views




Contact: carole.tiberius@ivdnt.org, lut.colman@ivdnt.org

