Automatic Knowledge Acquisition in Lexicography Survey

Survey – WG3 ENeL
Automatic Knowledge Acquisition for Lexicography
Carole Tiberius, Institute for Dutch Lexicology, Leiden, the Netherlands
Kris Heylen, University of Leuven, Belgium
Simon Krek, „Jožef Stefan“ Institute, Ljubljana, Slovenia
Purpose of the survey
Create an inventory of different types of automatic knowledge
acquisition which are currently used within the framework of
lexicographical projects
Automatic Acquisition of Knowledge
Knowledge (data) which
is automatically obtained from corpora of authentic language
use (both synchronic and diachronic);
forms either the input for lexicographers (who further
inspect and edit the data) or is included as is in the published
dictionary (possibly marked as being knowledge which has
been automatically derived from corpus data).
Types of Automatically Acquired Knowledge
(Candidate) Lemma list
Overall Lemma Frequency information
Form variation
(e.g. irregular morphology, orthographic variants)
Example sentences
(cf. Vienna COST workshop)
Multiword expressions
(i.e. sequences of words with some unpredictable properties such as "to count
somebody in" or "to take a haircut", ranging  from collocations and phrasal verbs,
(pragmatic) frozen expressions (e.g. of course, good morning) to traditional idioms,
proverbs etc.)
Types of Automatically Acquired Knowledge …
Neologisms
Definitions
Translation Equivalents
Knowledge Rich Contexts
(i.e. in terminography, a sort of hybrid of a good example and a definition, illustrating the
meaning characteristics of a term, but not being a formal definition.)
Lexical-semantic relations
(e.g. synonyms, antonyms, hypernyms)
Word senses
Grammatical patterns
(e.g. word profiles, valency)
Linguistic labels
(domain/ region/ dialect/ register/ style/ time/ slang and jargon/ attitude/ offensive terms)
General
Web address: https://www.1ka.si/a/60608
Questions: 134 (variables: 134)
Pages: 18
Completed: 45
Partially completed: 6
Total valid: 51
All units in database: 196
First entry: 13. 4. 2014
Last entry: 1. 5. 2015
Coverage
Positions
Automatic Knowledge Acquisition
Other types of AKA
Prioritizing lemmas (Denmark - Society for Danish Language and Literature)
Word formation information (France - Université de Franche-Comté, Besançon)
Semantic relations (France - Université de Franche-Comté, Besançon)
Termhood probability (France - Université de Franche-Comté, Besançon)
Selectional preferences (Hungary - Research Institute for Linguistics of the Hungarian Academy of Sciences)
Discourse markers (Denmark / France - Aarhus University, Business and Social Sciences, Department of Business Communication; Université de Bourgogne, Maison des Sciences de l'Homme)
Meeting Programme: http://www.elexicography.eu/working-groups/working-group-3/wg3-meetings/wg3-herstmonceux-2015/
Q: Is the automatically acquired knowledge directly integrated in the published dictionary without human intervention?
Integrated without human intervention:
Lemma lists
Frequency information
Example sentences
Translation equivalents
Lexical-semantic relations
Integrated with human intervention:
Form variation
MWE
Neologisms
Knowledge Rich Contexts
Word senses
Grammatical patterns
Linguistic labels
Q: How do the lexicographers judge the quality of the automatically acquired knowledge?
Lemma lists                   4
Frequency information         4
Form variation                3
Example sentences             4
MWEs                          3
Neologisms                    3
Definitions                   3
Translation equivalents       3-4
Knowledge Rich Contexts       3-4
Lexical-semantic relations    3
Word senses                   3-4
Grammatical patterns          4
Linguistic labels             3
Wishes/ Comments
Automatic extraction of contrastive data, annotated syntactically and semantically.
There is a huge need for methods and tools for these tasks if EU languages are to be supported with high-quality dictionaries published by EU institutions or publishers; otherwise US IT giants will dominate the future. For publishers the rights are important, as the model is changing from licensing/royalty models to ownership models. Ownership is also an important issue for public institutions that might publish for free. There will be a certain degree of skepticism about these methods and tools, and it will be hard to convince the community about the quality and ROI.
... We use a LOT of knowledge acquisition, but it is not strictly applied to lexicography yet.
Wishes/ Comments
My work focuses on definitions. I have developed systems for extracting encyclopedic definitions from encyclopedic text, for extracting hypernyms from definitions, and for learning taxonomies from free text using the previous systems. Part of my work also focuses on harvesting semantic relations from the web, and disambiguating them where possible. For my PhD work, I would like to have a system that, given a set of documents which belong to a certain domain, is able to identify candidate definitions and score them according to their relevance to the document in which they are included, the corpus to which the document belongs, and finally the domain to which that corpus belongs.
Several researchers have applied types of automatic knowledge acquisition in their individual research, e.g. to generate candidate lemma lists (Bratanić, Ostroški Anić and Radišić 2010, Aviation English Terms and Collocations (An alphabetical checklist). Zagreb: Sveučilište u Zagrebu, Fakultet prometnih znanosti.), to extract terms and collocations or to extract frequency information (Stojanov and Vučić, 2012. Korpusnojezikoslovna obradba tekstova Sportskih novosti. N-gramsko modeliranje dohvaćanja podataka i vizualizacija. Filologija 59, 103-129).
Wishes/ Comments
We use Sketch Engine functions (Word List, Collocates, Frequency) to analyze concordances, e.g. in order to find form variations (irregular morphology, orthographic variants). We also used the Word Sketch function to extract collocations.
Domain sensitivity is a crucial lexicographical parameter, and therefore automated processes can't be developed and exploited in the same way as in lexicography for general purposes. The survey doesn't seem to include innovative aspects of the analysis and representation of specialised knowledge as such, so full automation still has a long way to go.
Since we are working on an academic monolingual dictionary, the level of corpus AAK is quite satisfactory for us. For such a dictionary it is always important to leave room for deeper semantic investigation.
More for word sense disambiguation and definition extraction.
AKA per institution
Basque country - Elhuyar Foundation
Lemma list
Frequency information
Example sentences (experimental level)
Multiword expressions
Neologisms
Translation equivalents
Grammatical Patterns (experimental level)
Elhuyar Hiztegiak (http://hiztegiak.elhuyar.org). Basque-Spanish dictionary
ZTH-Dictionary of Science and Technology (zthiztegia.elhuyar.org)
Laneki Hiztegia (http://jakinbai.eu/hiztegia)
Automotive Dictionary (http://www.automotivedictionary.net/) (en, es and eu terms)
Ihobe Hiztegia environmental dictionary (intranet)
CAF railway dictionary (intranet)
On-going projects: Osakidetza (Basque Health System); Social work (provincial governments of Araba, Bizkaia and Gipuzkoa)
Belgium - KU Leuven
Lemma List
Frequency information
Corpus support to third party lexicographic publication on Belgian Dutch:   "Typisch
Vlaams. 4000 Woorden en uitdrukkingen" [Typical Flemish. 4000 words and
expressions] http://www.davidsfonds.be/publisher/edition/detail.phtml?id=3540
Translation equivalents
TermWise: Resources for Specialized Language Use
http://www.cs.kuleuven.be/groups/liir/projects.php?project=177
Bulgaria
Institute for Bulgarian Language
Lemma List
Frequency Information
Neologisms
Dictionary of Bulgarian Language
Lexical-semantic relations
Bulgarian WordNet; Dictionary of Bulgarian Language
Czech Republic
Masaryk University, Faculty of Arts
Lemma List
Low-cost ontology development, paper ->
http://is.muni.cz/repo/966117/gwc2012.pdf
Word senses
Currently in a pilot phase: we are trying to create a new semantic network based on a combination of manually annotated data, which are confirmed automatically against a corpus. This testing process can also be used for extending the dictionary.
Czech Republic
NLP Centre, Faculty of Informatics, Masaryk University
Lemma List
Thesaurus for Geography Domain
Frequency Information
DEB dictionary browser
Example sentences
Czech Sign Language dictionary
Grammatical patterns
Verbalex, verb valency lexicon
Czech Republic - Lexical Computing
Lemma List
Frequency Information
Form variation
Example sentences
Multiword expressions
Neologisms
DIACRAN
Definitions
Experimental
Translation equivalents
Lexical-semantic relations
Distributional thesaurus in Sketch Engine
Word senses
Clustering of word sketches
Grammatical patterns
Linguistic labels
(deliveries to publishers, IT companies)
Denmark
Society for Danish Language and Literature
Other
We use an experimental mix of many of the methods mentioned above to check existing dictionary entries and to select and prioritize new ones. We do not use these methods consistently and thoroughly, as this survey suggests; that does not fit our dictionary-writing process.
Estonia
Institute of the Estonian Language
Lemma List
Frequency Information
Example sentences
Estonian Collocations Dictionary
France
Université de Franche-Comté, Besançon
Lemma List
Frequency Information
Form Variation
Definitions
Lexical-semantic relations
Grammatical patterns
Sensunique project (already finished)
http://tesniere.univ-fcomte.fr/sensunique.html
http://www.station-sensunique.fr/
France
Université de Franche-Comté, Besançon
Other
The Sensunique platform extracts or calculates from the corpora information about: a) the functional category of composed candidate terms (e.g. Noun for stem cells); b) the Head and Expansion of a composed candidate term (e.g. cells is the Head and stem is the Expansion of stem cells); c) different associations between candidate terms (e.g. inclusion: cells is totally included in stem cells; partial association: stem cells is partially associated with dendritic cells); d) information relating to termhood probability. The information extracted from corpora is enriched with information retrieved from selected external resources (e.g. existing terminology databases), such as definitions, variants and semantic classes.
Germany
Institut für Deutsche Sprache, Abteilung Lexik
Lemma List
Frequency Information
Example sentences
elexiko http://www.ids-mannheim.de/lexik/elexiko.html
Multiword expressions
Usuelle Wortverbindungen http://www.ids-mannheim.de/lexik/uwv.html
http://wvonline.ids-mannheim.de/
Greece
Institute for Language and Speech Processing, Athena RIC
Lemma List
Frequency Information
Multiword expressions
Polytropon Project: Conceptual Dictionary of Modern Greek (under development). Fotopoulou, A. and Giouli, V. From "Ekfrasis" to Polytropon. Towards a dictionary of the Modern Greek Language Conceptually organised. Paper accepted at the International Conference in Greek Linguistics (in Greek).
The Greek High School Dictionary. Giouli, V., Gavrilidou, M., Lambropoulou, P. 2008. The Greek High School Dictionary: Description and issues. In Proceedings of the XIII Euralex International Congress (EURALEX 2008). July 2008, Barcelona, Spain.
eMiLang Project. Vakalopoulou, A., Giouli, V., Giagkou, M., and Efthimiou, E. 2011. Online Dictionaries for immigrants in Greece: Overcoming the Communication Barriers. In Proceedings of the 2nd Conference "Electronic Lexicography in the 21st century: new Applications for New users" (eLEX2011), Bled, Slovenia, 10-12 November 2011.
Translation equivalents
INTERA Project (http://www.elda.fr/en/projects/archived-projects/intera/). Gavrilidou, M., Labropoulou, P., Desipri, E., Giouli, V., Antonopoulos, V. & Piperidis, S. (2004). Building parallel corpora for eContent professionals. In COLING 2004. Geneva.
Hungary
Research Institute for Linguistics of the Hungarian
Academy of Sciences
Lemma List
Frequency Information
Translation equivalents
EFNILEX 2008-2012 http://www.nytud.hu/depts/corpus/efnilex.html; 2014-2015 http://corpus.nytud.hu/efnilex-vect/
Lexicographers do not yet directly use the results, which are still at the research stage.
Multiword expressions
Grammatical patterns
Sass, Bálint and Pajzs, Júlia. FDVC - Creating a Corpus-driven Frequency Dictionary of Verb Phrase Constructions for Hungarian. In: Sylviane Granger, Magali Paquot (Eds.) eLexicography in the 21st century: New challenges, new applications. Proceedings of eLex 2009, Louvain-la-Neuve, 22-24 October 2009. Cahiers du CENTAL 7. Presses universitaires de Louvain, 2010, p. 263-272.
Lexicographers manually added corpus-based examples to the verb phrase constructions.
Other
Extending Hungarian WordNet With Selectional Preference Relations
Italy
European Academy of Bolzano/Bozen (EURAC)
Word senses
For the ELDIT project (www.eurac.edu/eldit) in an experimental study
Italy
University of Bologna, University of Pisa
Multiword expressions
Grammatical patterns
CombiNet - Word Combinations in Italian (http://combinet.humnet.unipi.it/). We use the broad term "Word combinations" because we target both MWEs (e.g. phrasal lexemes, idioms, collocations) and more abstract combinatorial information (e.g. argument structure patterns, subcategorization frames, and selectional preferences).
Netherlands
Instituut voor Nederlandse Lexicologie
Lemma List
Frequency Information
Example sentences (work in progress)
Neologisms (work in progress)
Grammatical patterns (work in progress)
Linguistic labels (work in progress)
Algemeen Nederlands Woordenboek (ANW) http://anw.inl.nl/show?page=help#overhetANW
Schoonheim, Tanneke and Rob Tempelaars (2014), 'Algemeen Nederlands Woordenboek (ANW), A Dictionary of Contemporary Dutch'. In: www.elexicography.eu/wp-content/uploads/2014/11/Bled-ANW-2014.pdf
Schoonheim, Tanneke and Rob Tempelaars (2010), 'Dutch Lexicography in Progress, The Algemeen Nederlands Woordenboek (ANW)'. In: Anne Dykstra and Tanneke Schoonheim (eds.), Proceedings of the XIV Euralex International Congress. Ljouwert, Fryske Akademy/Afûk, 179 (abstract; full text on the accompanying CD-ROM).
http://www.euralex.org/elx_proceedings/Euralex2010/059_Euralex_2010_3_SCHOONHEIM TEMPELAARS_Dutch Lexicography in Progress_the Algemeen Nederlands Woordenboek_ANW.pdf
Poland
Institute of the Polish Language PAS
AKA types not further specified in survey
Poland
Institute of the Polish Language at the Polish Academy
of Sciences (IJP PAN)
Frequency Information
Form variation
Example sentences
Neologisms
Word senses
Grammatical patterns
Linguistic labels
Great Dictionary of Polish, www.wsjp.pl
Multiword expressions
Great Dictionary of Polish, www.wsjp.pl (idioms, proverbs, scientific multiword terms,
other discontinuous textual units - so called functional units).
Portugal
Centro de Linguística da Universidade de Lisboa
Lemma List
Reference Corpus of Contemporary Portuguese http://www.clul.ul.pt/en/resources/183-reference-corpus-of-contemporary-portuguese-crpc
Frequency Information
Multifunctional computational lexicon of contemporary Portuguese http://www.clul.ul.pt/en/research-teams/194-multifunctional-computational-lexicon-of-contemporary-portuguese
Example sentences
Dicionário da Academia das Ciências de Lisboa
Multiword expressions
Word combinations in the Portuguese language http://www.clul.ul.pt/en/research-teams/187-combina-pt-word-combinations-in-portuguese-language
Lexical-semantic relations
Portugal
Centro de Linguística da Universidade Nova de Lisboa
Faculdade de Ciências Sociais e Humanas
Lemma List
Frequency Information
Form variation
Example sentences
Multiword expressions
Neologisms
Translation equivalents
Knowledge rich contexts
Lexical-semantic relations
Word senses
Grammatical patterns
For LSP
Slovakia
Ľ. Štúr Institute of Linguistics, Slovak Academy of
Sciences
Lemma List
Form variation
Handbook of Slovak Nouns http://slovniky.korpus.sk/?d=noundb
Example sentences
Handbook of Slovak Nouns http://slovniky.korpus.sk/?d=noundb
Parallel Corpora Phrases (en-sk, cs-sk, bg-sk) http://slovniky.korpus.sk/
Frequency Information
Handbook of Slovak Nouns http://slovniky.korpus.sk/?d=noundb
Dictionary of Contemporary Slovak http://www.juls.savba.sk/oddelenie_sucas_lexikografie_vyskumna_cinnost.html
Slovak-Czech Dictionary http://gacr311.ujc.cas.cz/web/
Multiword expressions
Dictionary of Slovak Collocations http://vronk.net/wicol
Translation equivalents
Phrases from parallel corpora (en-sk, cs-sk, bg-sk) http://slovniky.korpus.sk/
Grammatical patterns
Slovak Valency Dictionary (internal database, no URL yet)
Slovenia
University of Ljubljana, Faculty of Arts; Trojina, Institute for Applied Slovene Studies; Jožef Stefan Institute
Lemma List
Frequency Information
Multiword expressions
Grammatical patterns
Linguistic labels
Communication in Slovene: http://eng.slovenscina.eu
Slovene Lexical Database: http://eng.slovenscina.eu/spletni-slovar/leksikalna-baza
Sloleks - morphological lexicon: http://eng.slovenscina.eu/sloleks/opis
Termis: http://www.termis.fdv.uni-lj.si/index-en.html
Form Variation
Communication in Slovene: http://eng.slovenscina.eu
Orthography Guide: http://eng.slovenscina.eu/portali/slogovni-prirocnik
Spain
Universidade da Coruña and Real Academia Galega
Example sentences
Neologisms
Definitions
Lexical-semantic relations
Word senses
Linguistic labels
   Spanish-Galician Dictionary of the Royal Galician Academy
   No publications on the automatic acquisition of knowledge
Spain
University Institute for Applied Linguistics
(Pompeu Fabra University)
Lemma list
Frequency information
Example sentences
Grammatical patterns
   Terminus 2.0, a web application for corpus and terminology management http://terminus.iula.upf.edu
Neologisms
   Buscaneo http://obneo.iula.upf.edu/buscaneo/
Sweden
University of Gothenburg, Dpt. of Swedish, Språkbanken
Lemma lists: 1. Kelly (http://spraakbanken.gu.se/eng/kelly); 2. Academic Wordlist (AO, http://spraakbanken.gu.se/eng/forskning/akademiska-ordlistor); 3. SVALex (ongoing, target Swedish as a second language lexicon)
Form variation: 1. Diabase (http://spraakbanken.gu.se/eng/forskning/diabase); 2. Mathir (http://spraakbanken.gu.se/eng/mathir)
MWEs: Constructicon (http://spraakbanken.gu.se/eng/sweccn)
Example sentences: HitEx (http://spraakbanken.gu.se/larka/larka_hitex_index.html)
Definitions: Semantic Interoperability and Data Mining in Biomedicine (http://cordis.europa.eu/project/rcn/71155_en.html)
Lexical-semantic relations: SweFN++ (http://spraakbanken.gu.se/eng/swefn)
Word senses: Distributional Methods to Represent the Meaning of Frames and Constructions (http://spraakbanken.gu.se/eng/corpsem)
Grammatical patterns: Culturomics (http://spraakbanken.gu.se/eng/culturomics)
Linguistic labels: 1. A Swedish vocation list; 2. ongoing PhD thesis on automatic readability classification of texts and sentences; 3. Semantics in Storytelling in Swedish Fiction (list of relations, named entity recognition, aliases)
Switzerland
École Polytechnique Fédérale de Lausanne
Example sentences:
   Kamusi Global Online Living Dictionary http://kamusi.org
Denmark / France
Aarhus University, Business and Social Sciences,
Department of Business Communication;
Université de Bourgogne, Maison des Sciences de
l'Homme
Lemma list
Form variation
Example sentences
Multiword expressions
Neologisms
Knowledge rich contexts
Oenolex, wine dictionary
Other
Discourse markers acquisition and discourse interaction markers

Presentation Transcript



  8. Positions: lexicographer, researcher, software developer, computational linguist, NLP researcher, terminologist, (associate) professor, project manager/director, PhD student

  9. Automatic Knowledge Acquisition Q: Do you or your institution use a form of automatic knowledge acquisition within a lexicographic project(s)? Answers (frequency): 1 (YES): 36; 2 (NO): 14; Valid: 50

  10. [Chart; category labels: Knowledge Rich Contexts, Definitions, Other, Linguistic labels, Word senses, Lexical Semantic Relations, Translation Equivalents, Neologisms, Form variation, MWEs, Grammatical Patterns, Example sentences, Frequency Information, Lemma list]


  13. Q: Do you or your institution use automatic knowledge acquisition for generating a candidate lemma list? Answers (frequency): 1 (YES): 23; 2 (NO): 9; Valid: 32

  14. Q: Do you or your institution use automatic knowledge acquisition to extract frequency information, e.g. overall lemma frequency information? Answers (frequency): 1 (YES): 23; 2 (NO): 9; Valid: 32

  15. Q: Do you or your institution use automatic knowledge acquisition to extract information on form variation, e.g. irregular morphology, orthographic variants? Answers (frequency): 1 (YES): 11; 2 (NO): 21; Valid: 32

  16. Q: Do you or your institution use automatic knowledge acquisition to extract example sentences (cf. Vienna workshop)? Answers (frequency): 1 (YES): 18; 2 (NO): 13; Valid: 31

  17. Q: Do you or your institution use automatic knowledge acquisition to extract multiword expressions (i.e. sequences of words with some unpredictable properties such as "to count somebody in" or "to take a haircut", ranging from collocations and phrasal verbs, (pragmatic) frozen expressions (e.g. of course, good morning) to traditional idioms, proverbs etc.)? Answers (frequency): 1 (YES): 14; 2 (NO): 15; Valid: 29

  18. Q: Do you or your institution use automatic knowledge acquisition to extract neologisms? Answers (frequency): 1 (YES): 10; 2 (NO): 19; Valid: 29

  19. Q: Do you or your institution use automatic knowledge acquisition to extract definitions? Answers (frequency): 1 (YES): 5; 2 (NO): 23; Valid: 28

  20. Q: Do you or your institution use automatic knowledge acquisition to extract translation equivalents? Answers (frequency): 1 (YES): 8; 2 (NO): 19; Valid: 27

  21. Q: Do you or your institution use automatic knowledge acquisition to extract knowledge rich contexts (i.e. a sort of hybrid of a good dictionary example and a definition in the sense that is extracted from a corpus, illustrates the meaning of a term, but it is not a formal definition)? Answers (frequency): 1 (YES): 2; 2 (NO): 26; Valid: 28

  22. Q: Do you or your institution use automatic knowledge acquisition to extract lexical-semantic relations (e.g. synonyms, antonyms, hypernyms)? Answers (frequency): 1 (YES): 7; 2 (NO): 21; Valid: 28

  23. Q: Do you or your institution use automatic knowledge acquisition to extract word senses? Answers (frequency): 1 (YES): 7; 2 (NO): 21; Valid: 28

  24. Q: Do you or your institution use automatic knowledge acquisition to extract grammatical patterns (e.g. word profiles, valency)? Answers (frequency): 1 (YES): 16; 2 (NO): 12; Valid: 28

  25. Q: Do you or your institution use automatic knowledge acquisition to extract linguistic labels (e.g. domain/region/dialect/register/style/time/slang and jargon/attitude/offensive terms)? Answers (frequency): 1 (YES): 7; 2 (NO): 21; Valid: 28

  26. Q: Is Q: Is the automatically acquired knowledge directly the automatically acquired knowledge directly integrated in the published dictionary without integrated in the published dictionary without human human intervention intervention? ? Integrated without human intervention: Lemma lists Frequency information Example sentences Translation equivalents Lexical-semantic relations

  27. Q: Is the automatically acquired knowledge directly Q: Is the automatically acquired knowledge directly integrated in the published dictionary without integrated in the published dictionary without human human intervention intervention? ? Integrated with human intervention: Form variation MWE Neologisms Knowledge Rich Contexts Word senses Grammatical patterns Linguistic labels

  28. Q: How Q: How do the lexicographers judge the quality do the lexicographers judge the quality of the automatically acquired knowledge? of the automatically acquired knowledge? Lemma lists 4 Frequency information 4 Form variation 3 Example sentences 4 MWEs 3 Neologisms 3 Definitions 3 Translation equivalents 3 - 4 Knowledge Rich Contexts 3 4 Lexical-semantic relations 3 Word senses 3 4 Grammatical patterns 4 Linguistic labels 3

  29. Wishes/ Comments automatic extraction of contrastive data annotated syntactic and semantically There is a huge need for methods and tools for these tasks - if EU languages shall be supported with high quality dictionaries published by EU institutions or publishers - otherwise US IT giants will dominate the future. For publishers the rights are important, as the model is changing from licensing/royalty models to ownership models. But also for public institutions, that might publish for free, the ownership is an important issue. There will be a certain degree of skepticism about these methods and tools, and it will be hard to convince the community about the quality and ROI. ... We use a LOT of knowledge acquisition, but it is not strictly applied to lexicography yet.

  30. Wishes/ Comments My work focuses on definitions. I have developed systems for extracting encyclopedic definitions from encyclopedic text, also for extracting hypernyms from definitions, and for learning taxonomies from free text using the previous systems. Part of my work also focuses in harvesting semantic relations from the web, and disambiguating them where possible. For my PhD work, I would like to have a system that given a set of documents which belong to a certain domain, is able to identify candidate definitions and score them according to their relevance to the document in which they are included, the corpus to which document belongs, and finally the domain to which such corpus belong. Several researchers have applied types of automatic knowledge acquisition in their individual research, e.g. to generate candidate lemma lists (Bratani , Ostro ki Ani and Radi i 2010, Aviation English Terms and Collocations (An alphabetical checklist). Zagreb: Sveu ili te u Zagrebu, Fakultet prometnih znanosti.), to extract terms and collocations or to extract frequency information (Stojanov and Vu i , 2012. Korpusnojezikoslovna obradba tekstova Sportskih novosti. N-gramsko modeliranje dohva anja podataka i vizualizacija. Filologija 59, 103-129).

  31. Wishes/ Comments We use Sketch Engine functions (Word List, Collocates, Frequency) to analyze concordances, e.g. in order to find form variations (irregular morphology, orthographic variants). We also used function Word Sketch for extraction of collocations. Domain sensitivity is a crucial lexicographical parameter and therefore, automated processes can't be developed and exploited in the same way as in lexicography for genreral purposes. The survey doesn't seem to include innovative aspects on the analysis and representation of specialised knowledge as such, so full automation has still a long way to go. Since we are working in the academic monolingual dictionary the level of the corpus AAK is for us quite satisfying. In the case of such dictionary it is always important to leave a space for a deeper semantic investigation. more for word sense disambiguation and definition extraction

  32. AKA AKA per per institution institution

  33. Basque country - Elhuyar Foundation Lemma list Frequency information Example sentences (experimental level) Multiword expressions Neologisms Translation equivalents Grammatical Patterns (experimental level) Elhuyar Hiztegiak (http://hiztegiak.elhuyar.org). Basque-Spanish dictionary ZTH-Dictionary of Science and Technology (zthiztegia.elhuyar.org) Laneki Hiztegia (http://jakinbai.eu/hiztegia) Automotive Dictionary (http://www.automotivedictionary.net/) (en, es and eu terms) Ihobe Hiztegia environmental dictionary (intranet) CAF railway dictionary (intranet) on-going projects: Osakidetza (Basque Health System); Social work (provincial governments of Araba, Bizkaia and Gipuzkoa)

  34. Belgium - KU Leuven Lemma List Frequency information Corpus support to third party lexicographic publication on Belgian Dutch: "Typisch Vlaams. 4000 Woorden en uitdrukkingen" [Typical Flemish. 4000 words and expressions] http://www.davidsfonds.be/publisher/edition/detail.phtml?id=3540 Translation equivalents TermWise: Resources for Specialized Language Use http://www.cs.kuleuven.be/groups/liir/projects.php?project=177

  35. Bulgaria Institute for Bulgarian Language Lemma List Frequency Information Neologisms Dictionary of Bulgarian Language Lexical-semantic relations Bulgarian WordNet; Dictionary of Bulgarian Language

  36. Czech Republic Masaryk University, Faculty of Arts Lemma List Low-cost ontology development, paper -> http://is.muni.cz/repo/966117/gwc2012.pdf Word senses Currently in pilot: we are trying to create a new semantic network based on a combination of manually annotated data that are confirmed automatically against a corpus. This testing process can also be used for extending the dictionary.

  37. Czech Republic NLP Centre, Faculty of Informatics, Masaryk University Lemma List Thesaurus for Geography Domain Frequency Information DEB dictionary browser Example sentences Czech Sign Language dictionary Grammatical patterns Verbalex, verb valency lexicon

  38. Czech Republic - Lexical Computing Lemma List Frequency Information Form variation Example sentences Multiword expressions Neologisms DIACRAN Definitions Experimental Translation equivalents Lexical-semantic relations Distributional thesaurus in Sketch Engine Word senses Clustering of word sketches Grammatical patterns Linguistic labels (deliveries to publishers, IT companies)

  39. Denmark Society for Danish Language and Literature Other We use an experimental mix of many of the methods mentioned above to check existing dictionary entries and to select and prioritize new ones. We do not use these methods as consistently and thoroughly as this survey suggests; that does not fit our dictionary-writing process.

  40. Estonia Institute of the Estonian Language Lemma List Frequency Information Example sentences Estonian Collocations Dictionary

  41. France Université de Franche-Comté, Besançon Lemma List Frequency Information Form Variation Definitions Lexical-semantic relations Grammatical patterns Sensunique project (already finished) http://tesniere.univ-fcomte.fr/sensunique.html http://www.station-sensunique.fr/

  42. France Université de Franche-Comté, Besançon Other The Sensunique platform extracts or calculates from the corpora information about: a) the functional category of compound candidate terms (e.g. Noun for stem cells); b) the Head and Expansion of a compound candidate term (e.g. cells is the Head and stem is the Expansion of stem cells); c) different associations between candidate terms (e.g. inclusion: cells is totally included in stem cells; partial association: stem cells is partially associated with dendritic cells); d) information on termhood probability. The information extracted from corpora is enriched with information retrieved from selected external resources (e.g. existing terminology databases), such as definitions, variants, and semantic classes.
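The head/expansion split and the two association types described on this slide can be illustrated with a small sketch. This is not the Sensunique implementation, which the slide does not describe: the function names are hypothetical, the head is assumed to be the last token (a simplification that only suits English noun compounds), and "partial association" is assumed to mean sharing at least one token without one term containing the other.

```python
from itertools import combinations


def head_and_expansion(term):
    """Split a multiword term into (head, expansion).

    Assumption: the head is the last token, as in English noun compounds.
    """
    tokens = term.split()
    return tokens[-1], " ".join(tokens[:-1])


def is_included(short, long):
    """True if all tokens of `short` occur contiguously inside `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def association(a, b):
    """Classify the relation between two candidate terms.

    Returns "inclusion", "partial", or None (no shared token).
    """
    if is_included(a, b) or is_included(b, a):
        return "inclusion"
    if set(a.split()) & set(b.split()):
        return "partial"
    return None


# Candidate terms taken from the examples on the slide.
terms = ["cells", "stem cells", "dendritic cells"]
relations = {(a, b): association(a, b) for a, b in combinations(terms, 2)}
```

With the slide's examples, `head_and_expansion("stem cells")` gives `("cells", "stem")`, and `relations` marks `cells` / `stem cells` as inclusion and `stem cells` / `dendritic cells` as partial.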

  43. Germany Institut für Deutsche Sprache, Abteilung Lexik Lemma List Frequency Information Example sentences elexiko http://www.ids-mannheim.de/lexik/elexiko.html Multiword expressions Usuelle Wortverbindungen http://www.ids-mannheim.de/lexik/uwv.html http://wvonline.ids-mannheim.de/

  44. Greece Institute for Language and Speech Processing, Athena RIC Lemma List Frequency Information Multiword expressions Polytropon Project: Conceptual Dictionary of Modern Greek. (Under development) Fotopoulou, A. and Giouli, V. From "Ekfrasis" to Polytropon. Towards a dictionary of the Modern Greek Language Conceptually organised. Paper accepted at the International Conference in Greek Linguistics (in Greek). The Greek High School Dictionary Giouli, V., Gavrilidou, M., Lambropoulou, P. 2008. The Greek High School Dictionary: Description and issues. In Proceedings of the XIII Euralex International Congress (EURALEX 2008). July 2008, Barcelona, Spain. eMiLang Project Vakalopoulou, A., Giouli, V., Giagkou, M., and Efthimiou, E. 2011. Online Dictionaries for immigrants in Greece: Overcoming the Communication Barriers. In Proceedings of the 2nd Conference Electronic Lexicography in the 21st century: new Applications for New users (eLEX2011), Bled, Slovenia, 10-12 November 2011. Translation equivalents INTERA Project http://www.elda.fr/en/projects/archived-projects/intera/ Gavrilidou, M., Labropoulou, P., Desipri, E., Giouli, V., Antonopoulos, V. & Piperidis, S. (2004). Building parallel corpora for eContent professionals. In COLING 2004. Geneva.

  45. Hungary Research Institute for Linguistics of the Hungarian Academy of Sciences Lemma List Frequency Information Translation equivalents EFNILEX 2008--2012 http://www.nytud.hu/depts/corpus/efnilex.html 2014--2015 http://corpus.nytud.hu/efnilex-vect/ Lexicographers do not yet use the results directly; they are still at the research stage. Multiword expressions Grammatical patterns Sass, Bálint and Pajzs, Júlia. FDVC -- Creating a Corpus-driven Frequency Dictionary of Verb Phrase Constructions for Hungarian. In: Sylviane Granger, Magali Paquot (Eds.) eLexicography in the 21st century: New challenges, new applications. Proceedings of eLex 2009, Louvain-la-Neuve, 22-24 October 2009. Cahiers du CENTAL 7. Presses universitaires de Louvain, 2010, pp. 263-272 Lexicographers manually added corpus-based examples to the verb phrase constructions. Other Extending Hungarian WordNet With Selectional Preference Relations

  46. Italy European Academy of Bolzano/Bozen (EURAC) Word senses For the ELDIT project (www.eurac.edu/eldit) in an experimental study

  47. Italy University of Bologna, University of Pisa Multiword expressions Grammatical patterns CombiNet - Word Combinations in Italian (http://combinet.humnet.unipi.it/) We use the broad term "Word combinations" because we target both MWEs (e.g. phrasal lexemes, idioms, collocations) and more abstract combinatorial information (e.g. argument structure patterns, subcategorization frames, and selectional preferences).

  48. Netherlands Instituut voor Nederlandse Lexicologie Lemma List Frequency Information Example sentences (work in progress) Neologisms (work in progress) Grammatical patterns (work in progress) Linguistic labels (work in progress) Algemeen Nederlands Woordenboek (ANW) http://anw.inl.nl/show?page=help#overhetANW Schoonheim, Tanneke and Rob Tempelaars (2014), Algemeen Nederlands Woordenboek (ANW), A Dictionary of Contemporary Dutch. In: www.elexicography.eu/wp-content/uploads/2014/11/Bled- ANW-2014.pdf Schoonheim, Tanneke and Rob Tempelaars (2010), 'Dutch Lexicography in Progress, The Algemeen Nederlands Woordenboek (ANW)'. In: Anne Dykstra and Tanneke Schoonheim (eds.), Proceedings of the XIV Euralex International Congress. Ljouwert, Fryske Akademy/Afûk, 179 (abstract), full text on the accompanying CD-ROM. http://www.euralex.org/elx_proceedings/Euralex2010/059_Euralex_2010_3_SCHOONHEIM TEMPELAARS_Dutch Lexicography in Progress_the Algemeen Nederlands Woordenboek_ANW.pdf

  49. Poland Institute of the Polish Language PAS AKA types not further specified in survey

  50. Poland Institute of the Polish Language at the Polish Academy of Sciences (IJP PAN) Frequency Information Form variation Example sentences Neologisms Word senses Grammatical patterns Linguistic labels Great Dictionary of Polish, www.wsjp.pl Multiword expressions Great Dictionary of Polish, www.wsjp.pl (idioms, proverbs, scientific multiword terms, and other discontinuous textual units, the so-called functional units).
