Introduction to arTenTen: A New Vast Corpus for Arabic Linguistic Processing


arTenTen is a new web-crawled corpus for Arabic of about 5.8 billion space-separated tokens. Compared with earlier Arabic corpora, which are mostly small, unavailable, or limited to newswire, it aims for greater size and variety. A 200-million-word portion is fully processed (tokenized, lemmatized, and POS-tagged), making it a useful resource for linguistic analysis and research.


Uploaded on Sep 30, 2024



Presentation Transcript


  1. arTenTen: A new, vast corpus for Arabic. Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel (MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz)

  2. We all want corpora to be: bigger; better; more text types; richer metadata; cleaner; better linguistic processing.

  3. Arabic. Since 2003: Arabic Gigaword (good on most fronts except variety; newswire only). Leeds 2005: Arabic web corpus (oldish). Others: mostly small, or not available, or newswire.

  4. arTenTen. TenTen family (see paper in main conference). Web crawled: Spiderling (Pomikalek and Suchomel, WAC 2012). Cleaning and deduplication: jusText, Onion (Pomikalek).

  5. Size: 5.8 billion space-separated tokens. Fully processed: 200M words, tokenised, lemmatised, and POS-tagged by MADA (Columbia U). Sketch grammar: new work (Belinkov). Varieties/dialects: we don't know yet.

  6. Availability: in Sketch Engine demo.

  7. Encoding: vertical format (the Sketch Engine input format). One word per line, twenty-nine tab-separated columns. Structural markup: XML.

  8. For each word: pref1tag, pref0, pref0tag, person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source, word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw, pref3, pref3tag, pref2, pref2tag, pref1
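
To make the encoding on slides 7 and 8 concrete, here is a minimal Python sketch of reading a vertical file: one token per line, tab-separated columns, and XML-style tags on their own lines for structure. It is an illustration under stated assumptions, not the project's own tooling. The function name `parse_vertical`, the three-column layout, and the sample lines are all hypothetical; the slides list the per-word fields but the transcript does not make their column order certain, so the column names are passed in as a parameter rather than hard-coded.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple, Union

# A line consisting only of an opening, closing, or self-closing tag,
# e.g. <doc id="1">, </s>, <g/>, is treated as structural markup.
TAG_RE = re.compile(r"^</?[A-Za-z][^>]*>$")

Item = Tuple[str, Union[str, Dict[str, str]]]


def parse_vertical(lines: Iterable[str], columns: List[str]) -> Iterator[Item]:
    """Yield ("struct", tag_text) or ("token", {column: value}) per input line.

    `columns` is the assumed left-to-right column order of the file.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if TAG_RE.match(line.strip()):
            yield ("struct", line.strip())
            continue
        values = line.split("\t")
        # Pad short rows so every token has every column.
        values += [""] * (len(columns) - len(values))
        yield ("token", dict(zip(columns, values)))


# Hypothetical three-column layout and placeholder data (not real corpus content).
columns = ["word", "lemma", "tag"]
sample = [
    '<doc id="example">',
    "<s>",
    "word1\tlemma1\tTAG1",
    "word2\tlemma2\tTAG2",
    "</s>",
    "</doc>",
]

for kind, item in parse_vertical(sample, columns):
    print(kind, item)
```

For the full corpus, `columns` would hold all twenty-nine field names in whatever order the files actually use. Streaming line by line, as above, avoids loading a multi-billion-token file into memory.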
