Welsh Natural Language Toolkit Overview
The Welsh Natural Language Toolkit (WNLT) is open-source software for Welsh NLP. It offers tokenization, lemmatization, part-of-speech tagging, and named entity recognition. It is available through a GUI, a CLI, and an API, so it can be used by both technical and non-technical users. It also supports sentiment analysis and NLP of Welsh tweets.
Welsh Natural Language Toolkit Daniel Williams 1
Overview Quick overview of WNLT (version 1) WNLT accessibility CymrIE for Twitter Improvements to WNLT Experiment using WNLT 2
Welsh Natural Language Toolkit Open source software for Welsh NLP CymrIE Tokenisation Lemmatisation Part of Speech Named Entity Recognition (NER) Person, Location, Organization, Money, Percent, Date and Address annotations 3
GATE Developer (GUI) *Annotations produced by WNLT version 2.3.2 4
WNLT GUI 5
WNLT CLI (Command line interface) 6
WNLT API (Application Programming Interface) 7
WNLT Accessibility GATE Developer GUI Easier for nontechnical users Quick results No configuration / setup needed CLI Scripts Helps automation API Software developers GATE Embedded developers Integration with existing web, desktop, mobile applications 8
Twitter Facilitates sentiment analysis and NLP of Welsh tweets Noisy Twitter specific metadata Twitter URLs, Usernames and Hashtags 9
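As a rough illustration of the Twitter-specific items listed on this slide (URLs, usernames and hashtags), the sketch below pulls them out of a tweet with regular expressions. It is not WNLT or TwitIE code. The function name `twitter_entities` and the patterns are assumptions made for this example, and the patterns are deliberately simple.

```python
import re

# Simplified, assumed patterns; real tweets need more careful handling.
URL_RE = re.compile(r"https?://\S+")
USERNAME_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def twitter_entities(text):
    """Return the URLs, usernames and hashtags found in a tweet's text."""
    return {
        "urls": URL_RE.findall(text),
        "usernames": USERNAME_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "mae tim @ucas_online ar gal i'ch helpu #LefelA http://t.co/pZqVSlcJrn ?"
    print(twitter_entities(sample))
```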
TwitIE Processing Resources 10
Annotation Set Transfer { "text": "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?", "truncated": true, "in_reply_to_user_id": null, "in_reply_to_status_id": null, "favorited": false, "source": "<a href=\"http://twitter.com/\" rel=\"nofollow\">Twitter for iPhone</a>", "in_reply_to_screen_name": null, "in_reply_to_status_id_str": null, "id_str": "54691802283900928", "entities": { "user_mentions": [ { "indices": [ 3, 19 ], "screen_name": "PostGradProblem", "id_str": "271572434", "name": "PostGradProblems", "id": 271572434 Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ? 11
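The slide pairs the raw tweet JSON with the plain tweet text. Outside GATE, the same separation of text from metadata can be sketched as below. This only illustrates the idea and is not how the Annotation Set Transfer resource is implemented. The payload is a cut-down, hypothetical version that keeps three of the fields shown on the slide, and `tweet_text` is a name invented for this example.

```python
import json

TWEET_TEXT = (
    "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau "
    "#LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?"
)

# Hypothetical, cut-down payload: only a few of the fields from the slide.
RAW_JSON = json.dumps(
    {"text": TWEET_TEXT, "truncated": True, "id_str": "54691802283900928"}
)


def tweet_text(payload):
    """Return only the 'text' field of a tweet, dropping the metadata."""
    return json.loads(payload)["text"]


if __name__ == "__main__":
    print(tweet_text(RAW_JSON))
```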
Tweet Language Identification
Metric        English  French   German   Dutch    Spanish  Welsh    Average
Accuracy      95.20%   96.36%   95.69%   97.34%   97.02%   99.07%   96.78%
Precision     83.00%   93.59%   80.99%   89.59%   93.97%   99.38%   90.09%
Recall        85.46%   84.25%   92.84%   91.32%   88.40%   96.40%   89.78%
NPV           97.43%   96.86%   98.72%   98.60%   97.62%   98.98%   98.03%
Specificity   96.92%   98.82%   96.19%   98.30%   98.82%   99.83%   98.15%
F1 Score      84.21%   88.67%   86.51%   90.45%   91.10%   97.87%   89.80%
~5,000 tweets 12
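For readers unfamiliar with the six measures reported for language identification, the sketch below computes them from one-vs-rest confusion counts using their standard definitions. The counts in the usage example are invented for illustration and are not the figures behind the slide. The function names are assumptions, and the code assumes no denominator is zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Standard one-vs-rest measures from confusion counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "npv": tn / (tn + fn),          # negative predictive value
        "specificity": tn / (tn + fp),  # true negative rate
        "f1": f1_score(precision, recall),
    }


if __name__ == "__main__":
    # Hypothetical counts for one language against all others.
    for name, value in binary_metrics(tp=90, fp=10, fn=10, tn=890).items():
        print(f"{name}: {value:.2%}")
```

Applied to the reported Welsh precision and recall (99.38% and 96.40%), `f1_score` gives a value that rounds to the 97.87% shown for Welsh.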
Emoticons Gazetteer & Hashtag Tokenizer =) 8) -> :) #ColegauCymru Camel case Eurfa Dictionary 13
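The two ideas on this slide can be sketched briefly: mapping emoticon variants to one normalised form, and splitting a camel-case hashtag into words. This is an illustrative sketch, not the WNLT implementation. The two-entry `EMOTICONS` table mirrors only the slide's example, the regular expression handles ASCII letters only (so accented Welsh letters are not covered), and the Eurfa dictionary lookup mentioned on the slide is not shown.

```python
import re

# Assumed mini-gazetteer, taken from the slide's "=) 8) -> :)" example.
EMOTICONS = {"=)": ":)", "8)": ":)"}

# Runs of capitals not followed by lower case, capitalised or lower-case
# words, and digit runs. ASCII only.
CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def normalise_emoticon(token):
    """Map a known emoticon variant to its normalised form."""
    return EMOTICONS.get(token, token)


def split_hashtag(hashtag):
    """Split a camel-case hashtag into its component words."""
    return CAMEL_RE.findall(hashtag.lstrip("#"))


if __name__ == "__main__":
    print(normalise_emoticon("=)"))
    print(split_hashtag("#ColegauCymru"))
```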
Performance evaluation
Annotation       Precision  Recall  F-Measure
Token            87.7%      99.9%   93.4%
Token.category   71.4%      81.3%   76.0%
Token.lemma      69.7%      79.4%   74.2%
VBD, VBDP, VBDI, VDI, VBF -> VB
NNS, NNP, NNPS, NNM, NNF -> NN
JJR, JJS -> JJ
~2,200 Tokens 15
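The tag lists on this slide read as fine-grained part-of-speech tags being collapsed to coarse ones (VB, NN, JJ) before scoring. That reading is an assumption, and the sketch below shows it together with the usual F-measure formula. The names `COLLAPSE`, `collapse_tag` and `f_measure` are invented for this example.

```python
# Assumed mapping, taken from the tag lists on the slide.
COLLAPSE = {}
COLLAPSE.update({tag: "VB" for tag in ("VBD", "VBDP", "VBDI", "VDI", "VBF")})
COLLAPSE.update({tag: "NN" for tag in ("NNS", "NNP", "NNPS", "NNM", "NNF")})
COLLAPSE.update({tag: "JJ" for tag in ("JJR", "JJS")})


def collapse_tag(tag):
    """Map a fine-grained tag to its coarse tag; other tags pass through."""
    return COLLAPSE.get(tag, tag)


def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(collapse_tag("NNPS"))
    print(round(f_measure(87.7, 99.9), 1))
```

The reported F-measures are consistent with this formula: the three precision and recall pairs in the table give values that round to 93.4, 76.0 and 74.2.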
Twitter API 16
Improvements for NER Added more gazetteers Obtained from Y Lolfa Cyf Special events Welsh chapels Choirs Papurau-bro Welsh magazines Person names 18
Improvements Improved JAPE rules for identifying Date, Person and Location annotations Interfaces in Welsh WNLT GUI WNLT CLI Welsh user guide 19
GUI in Welsh 20
CLI in Welsh 21
CymrIE NER Experiment Find: Person role Location kind Date kind ~ 7 heuristics for Person role ~ 18 heuristics for all Papurau-bro newsletters 22
CymrIE NER Experiment BRIDE GROOM SERVICE SERVICE MOTHER & FATHER OF BRIDE ~ 7 heuristics Papurau-bro newsletters 23
Wedding NER
Annotation        Precision  Recall
Bride             100%       67.7%
BrideFather       91.7%      52.4%
BrideMother       100%       54.2%
Groom             95.2%      62.5%
GroomFather       91.7%      57.9%
GroomMother       90%        47.4%
ServiceLocation   66.7%      58.3%
ServiceDate       96.3%      89.7%
* 32 news articles were used in the gold standard 24
Future work Add co-referencing processing resource to CymrIE Collaborate with users to further develop CymrIE's NER capabilities Further develop the NER pipeline for Wedding announcements: More heuristics for identifying Person role, Location kind and Date kind Machine learning Ensemble methods 25
Acknowledgements Welsh Language Unit, Welsh Government Gareth Morlais for further funding of the WNLT project Department of Information Studies University College London Andreas Vlachidis for his help creating the Gold Standards and evaluation School of Welsh Cardiff University Benjamin Screen for his help creating the Twitter Gold Standard Y Lolfa Y Lolfa for supplying a lot of gazetteers http://techiaith.cymru Bangor University Resource for Welsh tweets 26
Welsh Natural Language Toolkit Daniel Williams Dr. Daniel Cunliffe Prof. Douglas Tudhope Hypermedia research unit http://hypermedia.research.southwales.ac.uk/kos/wnlt/ Sourceforge https://sourceforge.net/projects/wnlt-project/ 27
WNLT 1 CymrIE Performance - Evaluation
Gold Standard: 2221 Tokens, 230 Entities (Date, Location, Organization, Percent, Person)
Results
Tokenizer: Recall 99%, Precision 98%, F1 99%
POS: Recall 82%, Precision 81%, F1 81%
Lemma: Recall 80%, Precision 79%, F1 80%
NER: Recall 89%, Precision 86%, F1 87%
* Partial matches are weighted as half matches (average mode) 29
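The footnote on partial matches can be made concrete with a short sketch: each partial match counts as half a correct match when computing precision and recall. This is one common formulation of "average" scoring, written here as an assumption and not taken from the WNLT evaluation code. The function name and the counts in the usage example are invented, and the code assumes no denominator is zero.

```python
def average_mode_scores(correct, partial, spurious, missing):
    """Precision, recall and F1 with partial matches weighted as half.

    correct  : system annotations that match the gold standard exactly
    partial  : system annotations that overlap a gold annotation
    spurious : system annotations with no gold counterpart
    missing  : gold annotations the system did not find
    Assumes every denominator is non-zero.
    """
    weighted = correct + 0.5 * partial
    precision = weighted / (correct + partial + spurious)
    recall = weighted / (correct + partial + missing)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p, r, f = average_mode_scores(correct=8, partial=2, spurious=1, missing=2)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```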