Introduction to arTenTen: A New Vast Corpus for Arabic Linguistic Processing
arTenTen is a new web-crawled corpus for Arabic of about 5.8 billion space-separated tokens. Existing Arabic corpora are mostly newswire-only, dated, small, or unavailable; arTenTen aims to be both larger and more varied. A 200-million-word portion is fully processed (tokenized, lemmatized, and POS-tagged), and the corpus is available in Sketch Engine.
Presentation Transcript
arTenTen: A new, vast corpus for Arabic. Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel. MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz
We all want corpora to be: bigger; better (more text types, richer metadata, cleaner, better linguistic processing).
Arabic: Since 2003: Arabic Gigaword, good on most fronts except variety (newswire only). Leeds: 2005 Arabic web corpus (oldish). Others: mostly small, or not available, or newswire.
arTenTen: Part of the TenTen family (see paper in main conference). Web crawled: SpiderLing (Pomikalek and Suchomel, WAC 2012). Cleaning and deduplication: jusText, Onion (Pomikalek).
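To make the deduplication step concrete, here is a minimal sketch of paragraph-level deduplication by n-gram overlap. It is a simplified illustration only, not Onion's actual implementation: the function names, the n-gram size, and the threshold are all assumptions chosen for the example.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless more than `threshold` of its n-grams
    were already seen in earlier paragraphs.

    Hypothetical helper: n and threshold are illustrative values,
    not settings taken from the arTenTen pipeline.
    """
    seen = set()
    kept = []
    for text in paragraphs:
        grams = ngrams(text.split(), n)
        if not grams:
            # Too short to form a single n-gram; keep it rather than guess.
            kept.append(text)
            continue
        dup_ratio = len(grams & seen) / len(grams)
        if dup_ratio <= threshold:
            kept.append(text)
        seen |= grams
    return kept


paragraphs = ["a b c d e f", "a b c d e f", "x y z w"]
print(deduplicate(paragraphs))
```

On this toy input the exact repeat of the first paragraph is dropped and the other two are kept. A real web corpus would need a memory-efficient representation of the seen n-grams (for example, hashes) rather than a plain Python set.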
Size: 5.8 billion space-separated tokens. Fully processed: 200M words; tokenised, lemmatised, and POS-tagged by MADA (Columbia U). Sketch grammar: new work (Belinkov). Varieties/dialects: we don't know yet.
Availability: In Sketch Engine; demo.
Encoding: Vertical format (the Sketch Engine input format). One word per line, with twenty-nine tab-separated columns. Structural markup: XML.
For each word: word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw, pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag, person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source
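As an illustration of the vertical format described on these slides, the following sketch reads one-word-per-line, tab-separated input with XML structural markup. The column order (word first, source last) is an assumption inferred from the slide's field list, the `<doc>` tag and the token values in the sample are made up, and `parse_vertical` is a hypothetical helper rather than part of Sketch Engine.

```python
# Assumed column order for the 29 per-word attributes (not confirmed by the slides).
COLUMNS = [
    "word", "trans", "diac", "lemma", "lemma_ar",
    "non_voc_lemma", "non_voc_lemma_ar", "stem", "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag", "pref1", "pref1tag",
    "pref0", "pref0tag", "person", "aspect", "vox", "modus",
    "gender", "number", "state", "case", "enclitic", "gloss", "source",
]


def parse_vertical(lines):
    """Yield ("struct", tag) for XML markup lines and ("token", dict) for word lines.

    Simplification: any line that starts with "<" and ends with ">" is
    treated as structural markup.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            yield ("struct", line)
        else:
            values = line.split("\t")
            yield ("token", dict(zip(COLUMNS, values)))


# Made-up sample: one structure containing one token with placeholder values.
sample = [
    "<doc>\n",
    "\t".join(["kitAb"] + ["-"] * 28) + "\n",
    "</doc>\n",
]

for kind, item in parse_vertical(sample):
    if kind == "token":
        print(item["word"], item["lemma"], item["source"])
    else:
        print(item)
```

Mapping each line to a dictionary keyed by attribute name keeps later code readable (`item["lemma"]` rather than `values[3]`), at the cost of some memory per token.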