Introduction to arTenTen: A New Vast Corpus for Arabic Linguistic Processing

Slide Note

arTenTen is a new web-crawled corpus for Arabic of about 5.8 billion space-separated tokens. It aims to improve on existing Arabic corpora, which are mostly small, unavailable, or limited to newswire, by offering more text and greater variety. A 200-million-word portion is fully processed (tokenized, lemmatized, and POS-tagged with MADA), making it a valuable resource for linguistic analysis and research.

Uploaded on Sep 30, 2024

Presentation Transcript

arTenTen: A new, vast corpus for Arabic. Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel. MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz

We all want corpora to be: Bigger. Better: more text types, richer metadata, cleaner, better linguistic processing.

Arabic. Since 2003: Arabic Gigaword. Good on most fronts except variety: newswire only. Leeds 2005: Arabic web corpus (oldish). Others: mostly small, or not available, or newswire.

arTenTen. TenTen family: see paper in main conference. Web crawled: SpiderLing (Pomikalek and Suchomel, WAC 2012). Cleaning and deduplication: jusText, Onion (Pomikalek).

Size: 5.8 billion space-separated tokens. Fully processed: 200M words; tokenised, lemmatised, and POS-tagged by MADA (Columbia U). Sketch grammar: new work (Belinkov). Varieties/dialects: we don't know yet.

Availability In Sketch Engine demo

Encoding. Vertical format (the Sketch Engine input format): one word per line, with twenty-nine tab-separated columns. Structural markup: XML.

For each word: word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw, pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag, person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source.
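As a rough illustration of the vertical format described on the Encoding slide, the sketch below reads one token per line, splits it on tabs, and skips XML structural lines. The function name `read_vertical`, the sample tags, and above all the column order in `COLUMNS` are assumptions made for the example; the slides name the twenty-nine attributes but do not confirm their position in the file.

```python
from typing import Dict, Iterable, Iterator

# The 29 attribute names from the slide. Their ORDER here is an assumption
# and should be checked against the corpus documentation before use.
COLUMNS = [
    "word", "trans", "diac", "lemma", "lemma_ar", "non_voc_lemma",
    "non_voc_lemma_ar", "stem", "tag", "bw", "pref3", "pref3tag",
    "pref2", "pref2tag", "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number", "state",
    "case", "enclitic", "gloss", "source",
]


def read_vertical(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per token line of a vertical-format text.

    Simplification: a line with no tab that starts with '<' and ends
    with '>' is treated as structural XML markup and skipped.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        if "\t" not in line and line.startswith("<") and line.endswith(">"):
            continue  # structural markup such as <doc ...>, <s>, </s>
        values = line.split("\t")
        # zip stops at the shorter sequence, so short lines yield fewer keys
        yield dict(zip(COLUMNS, values))
```

With a real file this would be called as `read_vertical(open(path, encoding="utf-8"))`, where `path` is wherever the vertical file is stored. Returning a generator keeps memory use flat, which matters for a corpus of this size.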
