Welsh Natural Language Toolkit Overview

The Welsh Natural Language Toolkit (WNLT) is open-source software for Welsh NLP, offering tokenization, lemmatization, part-of-speech tagging, and named entity recognition. It can be used through a GUI, a CLI, or an API, which makes NLP tasks accessible to both technical and non-technical users. It also supports sentiment analysis and the processing of Welsh tweets from Twitter.

Uploaded on Sep 23, 2024

Presentation Transcript

Slide 1: Welsh Natural Language Toolkit
Daniel Williams

Slide 2: Overview
- Quick overview of WNLT (version 1)
- WNLT accessibility
- CymrIE for Twitter
- Improvements to WNLT
- Experiment using WNLT

Slide 3: Welsh Natural Language Toolkit
- Open source software for Welsh NLP
- CymrIE
- Tokenisation
- Lemmatisation
- Part of Speech
- Named Entity Recognition (NER): Person, Location, Organization, Money, Percent, Date and Address annotations

Slide 4: GATE Developer (GUI)
* Annotations produced by WNLT version 2.3.2

Slide 5: WNLT GUI

Slide 6: WNLT CLI (Command line interface)

Slide 7: WNLT API (Application Programming Interface)

Slide 8: WNLT Accessibility
- GATE Developer
- GUI
  - Easier for nontechnical users
  - Quick results
  - No configuration / setup needed
- CLI
  - Scripts
  - Helps automation
- API
  - Software developers
  - GATE Embedded developers
  - Integration with existing web, desktop, mobile applications

Slide 9: Twitter
- Facilitates sentiment analysis and NLP of Welsh tweets
- Noisy
- Twitter-specific metadata
  - Twitter URLs, Usernames and Hashtags
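The three kinds of Twitter-specific metadata named on this slide can be picked out of raw tweet text with simple patterns. The following is a minimal Python sketch under stated assumptions: the regular expressions are simplified illustrations, not the rules used by TwitIE or CymrIE, and the function name `find_twitter_entities` is hypothetical. The annotation names (UserID, Hashtag, URL) follow slide 14, and the sample text is the tweet shown on slide 11.

```python
import re

# Simplified, illustrative patterns (assumptions, not the TwitIE/CymrIE rules).
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def find_twitter_entities(text):
    """Return (kind, start, end, surface) tuples sorted by start offset."""
    found = []
    patterns = (("URL", URL_RE), ("UserID", USER_RE), ("Hashtag", HASHTAG_RE))
    for kind, pattern in patterns:
        for match in pattern.finditer(text):
            found.append((kind, match.start(), match.end(), match.group()))
    return sorted(found, key=lambda item: item[1])


tweet_text = (
    "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau "
    "#LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?"
)
entities = find_twitter_entities(tweet_text)
for kind, start, end, surface in entities:
    print(kind, start, end, surface)
```

Keeping the character offsets alongside each match mirrors how annotations are usually stored as spans over the text, so later steps can treat these spans differently from ordinary Welsh words.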

Slide 10: TwitIE Processing Resources

Slide 11: Annotation Set Transfer

Raw tweet JSON (truncated on the slide):

    {
      "text": "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?",
      "truncated": true,
      "in_reply_to_user_id": null,
      "in_reply_to_status_id": null,
      "favorited": false,
      "source": "<a href=\"http://twitter.com/\" rel=\"nofollow\">Twitter for iPhone</a>",
      "in_reply_to_screen_name": null,
      "in_reply_to_status_id_str": null,
      "id_str": "54691802283900928",
      "entities": {
        "user_mentions": [
          {
            "indices": [ 3, 19 ],
            "screen_name": "PostGradProblem",
            "id_str": "271572434",
            "name": "PostGradProblems",
            "id": 271572434

Tweet text passed on for processing:

    Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?
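This slide pairs the raw tweet JSON with the plain tweet text that the rest of the pipeline works on. The Python sketch below shows that separation in miniature: it reads the `text` field and uses the `indices` of each entry under `entities` to recover the matching span. It is an illustration only, not GATE's Annotation Set Transfer resource. The tweet is a shortened, hypothetical one built so that the offsets line up, and the helper name `entity_spans` is an assumption.

```python
import json

# Hypothetical, shortened tweet; offsets were chosen to match this text.
raw = """{
  "text": "Diolch @ucas_online #LefelA",
  "entities": {
    "user_mentions": [{"screen_name": "ucas_online", "indices": [7, 19]}],
    "hashtags": [{"text": "LefelA", "indices": [20, 27]}]
  }
}"""


def entity_spans(tweet):
    """Return (kind, start, end, surface) for every entity with indices."""
    text = tweet["text"]
    spans = []
    for kind, items in tweet.get("entities", {}).items():
        for item in items:
            start, end = item["indices"]
            spans.append((kind, start, end, text[start:end]))
    return sorted(spans, key=lambda span: span[1])


tweet = json.loads(raw)
spans = entity_spans(tweet)
print(tweet["text"])
for span in spans:
    print(span)
```

Only the `text` value would go on to tokenisation and tagging, while the spans derived from the metadata can be kept as ready-made annotations over that text.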

Slide 12: Tweet Language Identification

Metric       English  French  German  Dutch   Spanish  Welsh   Average
Accuracy     95.20%   96.36%  95.69%  97.34%  97.02%   99.07%  96.78%
Precision    83.00%   93.59%  80.99%  89.59%  93.97%   99.38%  90.09%
Recall       85.46%   84.25%  92.84%  91.32%  88.40%   96.40%  89.78%
NPV          97.43%   96.86%  98.72%  98.60%  97.62%   98.98%  98.03%
Specificity  96.92%   98.82%  96.19%  98.30%  98.82%   99.83%  98.15%
F1 Score     84.21%   88.67%  86.51%  90.45%  91.10%   97.87%  89.80%

~5,000 tweets
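The six measures in this table all follow from one-vs-rest confusion counts for each language. The sketch below uses their standard definitions in Python. The counts passed to `binary_metrics` are hypothetical, since the slide reports only percentages, and the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Scores for one language treated as the positive class.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "npv": tn / (tn + fn),          # negative predictive value
        "specificity": tn / (tn + fp),  # true negative rate
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts, for illustration only.
scores = binary_metrics(tp=80, fp=20, fn=10, tn=890)
print(scores)

# The English column's F1 follows from its precision and recall.
english_f1 = f1_score(83.00, 85.46)
print(round(english_f1, 2))
```

Recomputing F1 from the English precision and recall gives the 84.21% reported in the table, which is a quick consistency check on the reconstructed columns.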

Slide 13: Emoticons Gazetteer & Hashtag Tokenizer
=) 8) -> :)
#ColegauCymru
- Camel case
- Eurfa Dictionary
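The two ideas on this slide, mapping emoticon variants to one canonical form and splitting a camel-case hashtag into words, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the gazetteer holds only the two variants shown on the slide, the function names are hypothetical, and the dictionary-based splitting suggested by the Eurfa bullet is not shown.

```python
# Tiny stand-in gazetteer: only the two variants shown on the slide.
EMOTICON_GAZETTEER = {"=)": ":)", "8)": ":)"}


def normalise_emoticon(token):
    """Map a known emoticon variant to its canonical form."""
    return EMOTICON_GAZETTEER.get(token, token)


def split_camel_case(hashtag):
    """Split a hashtag at each lower-case to upper-case boundary."""
    body = hashtag.lstrip("#")
    parts, current = [], ""
    for ch in body:
        # str.isupper/islower are Unicode-aware, so accented letters work.
        if current and ch.isupper() and current[-1].islower():
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return parts


print(normalise_emoticon("=)"))
print(split_camel_case("#ColegauCymru"))
```

A hashtag written entirely in lower case has no case boundary to split on, which is presumably where a dictionary lookup would be needed instead.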

Slide 14: UserID, Hashtag and URL

Slide 15: Performance evaluation

Annotation      Precision  Recall  F-Measure
Token           87.7%      99.9%   93.4%
Token.category  71.4%      81.3%   76.0%
Token.lemma     69.7%      79.4%   74.2%

VBD, VBDP, VBDI, VDI, VBF -> VB
NNS, NNP, NNPS, NNM, NNF -> NN
JJR, JJS -> JJ

~2,200 Tokens
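The tag lists on this slide read as fine-grained part-of-speech tags being collapsed to three coarse tags (VB, NN, JJ). Assuming that reading, the Python sketch below builds the mapping exactly as listed and compares tags at the coarse level. The function names and the sample tag sequences are hypothetical, and how WNLT applied the mapping during evaluation is not stated on the slide.

```python
# Mapping as listed on the slide; any other tag is left unchanged.
COARSE_TAG = {
    **dict.fromkeys(["VBD", "VBDP", "VBDI", "VDI", "VBF"], "VB"),
    **dict.fromkeys(["NNS", "NNP", "NNPS", "NNM", "NNF"], "NN"),
    **dict.fromkeys(["JJR", "JJS"], "JJ"),
}


def coarse_tag(tag):
    """Collapse a fine-grained tag to its coarse tag, if one is listed."""
    return COARSE_TAG.get(tag, tag)


def coarse_agreement(gold_tags, predicted_tags):
    """Fraction of positions where gold and predicted coarse tags agree."""
    pairs = list(zip(gold_tags, predicted_tags))
    hits = sum(coarse_tag(g) == coarse_tag(p) for g, p in pairs)
    return hits / len(pairs)


# Hypothetical tag sequences, for illustration only.
gold = ["NNM", "VBD", "JJ", "IN"]
predicted = ["NNF", "VBF", "JJS", "DT"]
print(coarse_agreement(gold, predicted))
```

In the sample, the first three positions disagree on the fine-grained tag but agree once collapsed, so only the last position counts as an error.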

Slide 16: Twitter API

Slide 17: Twitter CymrIE in GUI

Slide 18: Improvements for NER
- Added more gazetteers
  - Obtained from Y Lolfa Cyf
  - Special events
  - Welsh chapels
  - Choirs
  - Papurau-bro
  - Welsh magazines
  - Person names

Slide 19: Improvements
- Improved JAPE rules for identifying Date, Person and Location annotations
- Interfaces in Welsh
  - WNLT GUI
  - WNLT CLI
- Welsh user guide

Slide 20: GUI in Welsh

Slide 21: CLI in Welsh

Slide 22: CymrIE NER Experiment
- Find: Person role, Location kind, Date kind
- ~7 heuristics for Person role
- ~18 heuristics for all
- Papurau-bro newsletters

Slide 23: CymrIE NER Experiment
- Labels on the annotated example: BRIDE, GROOM, SERVICE, SERVICE, MOTHER & FATHER OF BRIDE
- ~7 heuristics
- Papurau-bro newsletters

Slide 24: Wedding NER

Annotation       Precision  Recall
Bride            100%       67.7%
BrideFather      91.7%      52.4%
BrideMother      100%       54.2%
Groom            95.2%      62.5%
GroomFather      91.7%      57.9%
GroomMother      90%        47.4%
ServiceLocation  66.7%      58.3%
ServiceDate      96.3%      89.7%

* 32 news articles were used in the gold standard

Slide 25: Future work
- Add co-referencing processing resource to CymrIE
- Collaborate with users to further develop CymrIE's NER capabilities
- Further develop the NER pipeline for Wedding announcements:
  - More heuristics for identifying Person role, Location kind and Date kind
  - Machine learning
  - Ensemble methods

Slide 26: Acknowledgements
- Welsh Language Unit, Welsh Government
  - Gareth Morlais for further funding of the WNLT project
- Department of Information Studies, University College London
  - Andres Vlachidis for his help creating the Gold Standards and evaluation
- School of Welsh, Cardiff University
  - Benjamin Screen for his help creating the Twitter Gold Standard
- Y Lolfa
  - Y Lolfa for supplying a lot of gazetteers
- http://techiaith.cymru, Bangor University
  - Resource for Welsh tweets

Slide 27: Welsh Natural Language Toolkit
Daniel Williams, Dr. Daniel Cunliffe, Prof. Douglas Tudhope
Hypermedia research unit: http://hypermedia.research.southwales.ac.uk/kos/wnlt/
Sourceforge: https://sourceforge.net/projects/wnlt-project/

Slide 28: Extra slides

Slide 29: WNLT 1 CymrIE Performance - Evaluation

Gold Standard
- 2221 Tokens
- 230 Entities (Date, Location, Organization, Percent, Person)

Results
- Tokenizer: Recall 99%, Precision 98%, F1 99%
- POS: Recall 82%, Precision 81%, F1 81%
- Lemma: Recall 80%, Precision 79%, F1 80%
- NER: Recall 89%, Precision 86%, F1 87%

* Partial matches are weighted as half-matches (average mode)
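The footnote on half-weighted partial matches corresponds to the "average" scoring mode in GATE's evaluation tools, where a partially overlapping annotation counts as half a correct one. The Python sketch below shows that calculation. The counts are hypothetical, because the slide gives only the final percentages, and the function name is illustrative.

```python
def average_mode_scores(correct, partial, missing, spurious):
    """Precision, recall and F1 with partial matches counted as 0.5.

    correct  - responses that match a gold annotation exactly
    partial  - responses that overlap a gold annotation without matching it
    missing  - gold annotations with no response
    spurious - responses with no gold annotation
    Assumes all denominators are non-zero.
    """
    matched = correct + 0.5 * partial
    precision = matched / (correct + partial + spurious)
    recall = matched / (correct + partial + missing)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = average_mode_scores(
    correct=8, partial=2, missing=2, spurious=0
)
print(precision, recall, f1)
```

Giving partial matches a weight of 1 instead would produce the lenient scores, and a weight of 0 the strict scores, so the average mode sits midway between the two.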
