Advisory Board Meeting Summary in Sofia


The advisory board meeting held in Sofia, Bulgaria, discussed language resources, processing modules, and integration strategies. Topics included text archives, annotated corpora, bilingual corpora, dictionaries, and ontology processing. The meeting also covered tools such as analyzers, document extraction modules, and ontology processing systems, as well as the BulTreeBank project.

  • Language Resources
  • Processing Modules
  • Text Archives
  • Bilingual Corpora
  • Ontology Processing

Uploaded on Mar 02, 2025



Presentation Transcript


  1. Kiril Simov and Petya Osenova. Advisory Board Meeting, 4-5 July 2019, Sofia, Bulgaria

  2. • BLARK (Basic Language Resource Kit - Steven Krauwer) for CLaDA-BG
     • Current State
     • Integration of Language Resources
     • Language Processing Modules
     • Conclusions

  3. 1. Text Archive for Bulgarian (minimum 100 million words), searchable on the Internet
     2. Text Archive for Bulgarian with example uses of all the running words in a corpus of 25 billion words
     3. Morphologically annotated corpus (1 million words)
     4. Syntactically annotated corpus (1 million running words)
     5. Semantically annotated corpus with ontological and fact information (1 million words)
     6. Representative bilingual corpora for some of the European language pairs
     7. Representative sets of data from main dictionaries
     8. Representative sets of data from specialized electronic dictionaries

  4. 9. Representative sets of data from bi- and multilingual dictionaries
     10. Representative lists of names in Bulgarian
     11. Representative corpus of transcribed speech
     12. Test sets for the basic technologies for information access
     13. Information extraction
     14. Document extraction
     15. Question answering
     16. Historical Language Resources
     17. Collection of learning objects, annotated with semantic information, etc.

  5. 1. Morphological, shallow syntactic, deep syntactic and semantic analyzers
     2. Modules for document extraction
     3. Creation of dictionaries and access to them
     4. Structured analyses of language data, named entities and other information extraction
     5. System for concordance preparation (a concordance is a table in which the word is viewed in its left and right context)
     6. Modules for ontology processing, etc.
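
The definition of a concordance on slide 5 (each occurrence of a word shown with its left and right context) can be illustrated with a minimal keyword-in-context sketch. This is not the project's concordance system: the function names `tokenize` and `concordance`, the `width` parameter, and the regular-expression tokenizer are all assumptions made for illustration, and the second sample sentence is invented (the first is the example from slide 10).

```python
import re

def tokenize(text):
    # Naive tokenizer (assumption): runs of word characters.
    # \w is Unicode-aware in Python 3, so Cyrillic text is handled too.
    return re.findall(r"\w+", text)

def concordance(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) row per occurrence."""
    rows = []
    key = keyword.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == key:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# First sentence is from slide 10; the second is an invented illustration.
text = "The host of the party was Toma Sprostranov. The party ended late."
rows = concordance(tokenize(text), "party", width=2)

for left, kw, right in rows:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} | {kw} | {right}")
```

Each row is one line of the concordance table; a real system would typically also index the corpus rather than scan it linearly for every query.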

  6. • BulTreeBank in the Universal Dependencies representation - several versions
     • BulTreeBank POS Corpus
     • BulTreeBank WordNet (BTB-WN) version 1
     • Bulgarian CLEF Corpus - Question Answering and Information Retrieval: a corpus of about 20 million news items, a set of 1000 questions with marked answers, and 50 topics for IR
     • BulTreeBank Frequency List
     • Bulgarian National Reference Corpus BulTreeBank. About 70 million running words
     • BulTreeBank Stopword List
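
As a rough illustration of how two of the resources on slide 6 relate, a frequency list can be derived from a corpus by counting tokens, and a stopword list can be used to filter it. The sketch below is an assumption-laden toy, not how the BulTreeBank lists were built: `frequency_list` is a hypothetical helper, the tokenizer is a plain regular expression, and the stopword set is made up for the example.

```python
import re
from collections import Counter

def frequency_list(text, stopwords=frozenset()):
    """Return (word, count) pairs, most frequent first, skipping stopwords."""
    tokens = (t.lower() for t in re.findall(r"\w+", text))
    return Counter(t for t in tokens if t not in stopwords).most_common()

# Sentence from slide 10; the stopword set is a toy, not the BulTreeBank list.
sample = "The host of the party was Toma Sprostranov"
stop = {"the", "of", "was"}
freq = frequency_list(sample, stop)
print(freq)
```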

  7. • Tokenizer for Bulgarian
     • Morphosyntactic tagger for Bulgarian
     • Named Entity Recognizer for Bulgarian
     • Lemmatizer for Bulgarian
     • Dependency parser for Bulgarian
     • Word Sense Disambiguation tool for Bulgarian (knowledge-based WSD)

  8. WebCLaRK
     • Bulgarian National Reference Corpus - www.webclarg.org
     • Political speech corpus: www.political.webclark.org
     • Corpus of documents in the domain of giving in the sphere of education: www.dar.webclark.org
     WebCLaRK web-based concordance over the

  9. HPSG-based Treebank of Bulgarian
     Sentences are extracted from:
     • Bulgarian grammar books: 1750 sentences
     • Randomly selected sentences and paragraphs: 2027 sentences
     • Complete news articles, other texts: 11 300 sentences
     The annotation is constituent-based, but the head is marked (implicitly or explicitly).
     There are annotations of Named Entities, Coreference, and Ellipses.

  10. The host of the party was Toma Sprostranov.

  11. Three different ways:
      (1) by manual translation of English synsets from the Core WordNet subset of Princeton WordNet (PWN)
      (2) by identification of senses used in BTB that have then been mapped to the conceptual structure of PWN
      (3) by sense extension, which includes the following activities:
          • semi-automatic extraction of information from the Bulgarian Wiktionary
          • detection of the missing senses of processed lemmas in BulTreeBank
          • selection of lemmas from the BulTreeBank Frequency List
      BTB-WN extended with about 10 000 entries
      Mapping to English PWN

  12. • Generalization over verb usage within the Treebank
      • Annotated with senses from BTB-WN

  13. We integrate the following resources:
      • Treebank: Syntax, MorphoSyntax, Lemmas, Coreference, Ellipses
      • WordNet
      • Valency Lexicon
      • Wikipedia: Interface between Language and Encyclopedia
      End-to-End Training Models
