Welsh Natural Language Toolkit Overview


The Welsh Natural Language Toolkit (WNLT) is open-source software for Welsh NLP, offering tokenization, lemmatization, part-of-speech tagging, and named entity recognition. It is available through a GUI, a CLI, and an API, so it can be used by both technical and non-technical users. CymrIE for Twitter supports sentiment analysis and NLP of Welsh tweets.


Uploaded on Sep 23, 2024



Presentation Transcript


  1. Welsh Natural Language Toolkit. Daniel Williams

  2. Overview: quick overview of WNLT (version 1); WNLT accessibility; CymrIE for Twitter; improvements to WNLT; experiment using WNLT.

  3. Welsh Natural Language Toolkit: open-source software for Welsh NLP. CymrIE: tokenisation, lemmatisation, part of speech, and named entity recognition (NER) with Person, Location, Organization, Money, Percent, Date and Address annotations.

  4. GATE Developer (GUI). *Annotations produced by WNLT version 2.3.2

  5. WNLT GUI

  6. WNLT CLI (command-line interface)

  7. WNLT API (application programming interface)

  8. WNLT Accessibility. GATE Developer GUI: easier for non-technical users; quick results; no configuration / setup needed. CLI: scripts; helps automation. API: software developers; GATE Embedded developers; integration with existing web, desktop and mobile applications.

  9. Twitter: facilitates sentiment analysis and NLP of Welsh tweets; noisy; Twitter-specific metadata; Twitter URLs, usernames and hashtags.
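
To make the Twitter-specific items on this slide concrete, here is a minimal sketch of pulling URLs, usernames and hashtags out of a tweet's text with regular expressions. The patterns and the function name `twitter_entities` are illustrative assumptions, not the rules used by TwitIE or CymrIE; the sample text is the Welsh tweet shown later in the presentation.

```python
import re

# Illustrative patterns only (an assumption, not the toolkit's own rules);
# production tweet tokenisers cover many more edge cases.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def twitter_entities(text):
    """Return the URLs, user mentions and hashtags found in a tweet's text."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so that a "#fragment" inside a link is not
    # mistaken for a hashtag.
    stripped = URL_RE.sub(" ", text)
    return {
        "urls": urls,
        "mentions": MENTION_RE.findall(stripped),
        "hashtags": HASHTAG_RE.findall(stripped),
    }


tweet = (
    "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau "
    "#LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?"
)
entities = twitter_entities(tweet)
print(entities)
```

Separating these items from ordinary words early keeps them from being treated as Welsh vocabulary by later tagging steps.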

  10. TwitIE Processing Resources

  11. Annotation Set Transfer

      Tweet JSON (fragment, cut off on the slide):

      {
        "text": "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?",
        "truncated": true,
        "in_reply_to_user_id": null,
        "in_reply_to_status_id": null,
        "favorited": false,
        "source": "<a href=\"http://twitter.com/\" rel=\"nofollow\">Twitter for iPhone</a>",
        "in_reply_to_screen_name": null,
        "in_reply_to_status_id_str": null,
        "id_str": "54691802283900928",
        "entities": {
          "user_mentions": [
            {
              "indices": [ 3, 19 ],
              "screen_name": "PostGradProblem",
              "id_str": "271572434",
              "name": "PostGradProblems",
              "id": 271572434

      Tweet text:

      Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?
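
The slide pairs a tweet's JSON record with its bare text. The sketch below shows the general idea of separating the text to be analysed from the surrounding metadata, using only Python's standard `json` module. It is not GATE's Annotation Set Transfer resource; `RAW_TWEET` is a shortened, well-formed stand-in that keeps only a few of the fields shown on the slide, and `split_tweet` is a hypothetical helper name.

```python
import json

# Shortened stand-in for the tweet on the slide (an assumption for this
# example): only a few of the displayed fields are kept.
RAW_TWEET = """
{
  "text": "Os chi ddim yn rhy siwr am rywbeth yn ymwneud a'ch canlyniadau #LefelA mae tim @ucas_online ar gal i'ch helpu http://t.co/pZqVSlcJrn ?",
  "truncated": true,
  "favorited": false,
  "id_str": "54691802283900928"
}
"""


def split_tweet(raw):
    """Separate the text to be annotated from the rest of the tweet record."""
    record = json.loads(raw)
    text = record.pop("text")
    return text, record


text, metadata = split_tweet(RAW_TWEET)
print(text)
print(sorted(metadata))
```

Keeping the metadata alongside the text means results can still be traced back to the original tweet through fields such as `id_str`.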

  12. Tweet Language Identification (~5,000 tweets)

      Metric       English  French   German   Dutch    Spanish  Welsh    Average
      Accuracy     95.20%   96.36%   95.69%   97.34%   97.02%   99.07%   96.78%
      Precision    83.00%   93.59%   80.99%   89.59%   93.97%   99.38%   90.09%
      Recall       85.46%   84.25%   92.84%   91.32%   88.40%   96.40%   89.78%
      NPV          97.43%   96.86%   98.72%   98.60%   97.62%   98.98%   98.03%
      Specificity  96.92%   98.82%   96.19%   98.30%   98.82%   99.83%   98.15%
      F1 Score     84.21%   88.67%   86.51%   90.45%   91.10%   97.87%   89.80%
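
For readers unfamiliar with the metrics on this slide, the sketch below computes them from one-vs-rest confusion counts for a single language. These are the standard definitions; the counts in the example are made up for illustration and are not taken from the experiment, and the names `Counts`, `metrics` and `f1_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """One-vs-rest confusion counts for a single language."""
    tp: int  # tweets in the language, identified as such
    fp: int  # tweets in another language, wrongly identified as this one
    fn: int  # tweets in the language that were missed
    tn: int  # tweets in another language, correctly rejected


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics(c):
    """Standard classification metrics derived from the four counts."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn),
        "precision": precision,
        "recall": recall,
        "npv": c.tn / (c.tn + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "f1": f1_score(precision, recall),
    }


# Made-up counts, for illustration only.
example = metrics(Counts(tp=90, fp=10, fn=30, tn=870))
print(example)

# Consistency check against the reported Welsh precision and recall.
welsh_f1 = f1_score(0.9938, 0.9640)
print(round(welsh_f1, 4))
```

The last two lines recompute F1 from the Welsh precision and recall reported on the slide, which agrees with the reported 97.87% after rounding.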

  13. Emoticons Gazetteer & Hashtag Tokenizer. Emoticons: =), 8) -> :). Hashtags: #ColegauCymru; camel case; Eurfa dictionary.
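
The two ideas on this slide can be sketched briefly: mapping emoticon variants to one canonical form, and splitting a camel-case hashtag into words. This is an illustration under assumptions, not the toolkit's implementation: the gazetteer holds only the two variants shown on the slide, the regular expression handles ASCII letters only (so accented Welsh letters would need extra handling), and the dictionary-based splitting that the Eurfa dictionary presumably supports for hashtags without capital letters is not attempted.

```python
import re

# Hypothetical gazetteer: emoticon variant -> canonical form (from the slide).
EMOTICON_GAZETTEER = {"=)": ":)", "8)": ":)"}

# A word is a capital followed by lowercase letters, a run of capitals not
# followed by a lowercase letter, a run of lowercase letters, or digits.
CAMEL_RE = re.compile(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+")


def normalise_emoticon(token):
    """Map a known emoticon variant to its canonical form."""
    return EMOTICON_GAZETTEER.get(token, token)


def split_hashtag(hashtag):
    """Split a camel-case hashtag into its component words (ASCII only)."""
    return CAMEL_RE.findall(hashtag.lstrip("#"))


print(normalise_emoticon("=)"))
print(split_hashtag("#ColegauCymru"))
```

A hashtag written entirely in lowercase comes back as a single word with this approach, which is where a dictionary lookup would be needed.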

  14. UserID, Hashtag and URL

  15. Performance evaluation (~2,200 tokens)

      Annotation      Precision  Recall  F-Measure
      Token           87.7%      99.9%   93.4%
      Token.category  71.4%      81.3%   76.0%
      Token.lemma     69.7%      79.4%   74.2%

      Tag groups: VBD, VBDP, VBDI, VDI, VBF -> VB; NNS, NNP, NNPS, NNM, NNF -> NN; JJR, JJS -> JJ
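
The tag groups on this slide read as a many-to-one mapping from fine-grained part-of-speech tags to coarse ones. Assuming that reading, the sketch below shows how collapsing tags before comparison changes the agreement between a gold-standard and a predicted tag sequence. The helper names and the two short tag sequences are invented for illustration; only the mapping itself comes from the slide.

```python
# Fine-grained tag -> coarse tag, as listed on the slide.
COARSE_TAG = {
    **dict.fromkeys(["VBD", "VBDP", "VBDI", "VDI", "VBF"], "VB"),
    **dict.fromkeys(["NNS", "NNP", "NNPS", "NNM", "NNF"], "NN"),
    **dict.fromkeys(["JJR", "JJS"], "JJ"),
}


def collapse(tag):
    """Return the coarse tag for a fine-grained tag; other tags pass through."""
    return COARSE_TAG.get(tag, tag)


def tag_agreement(gold, predicted, coarse=False):
    """Fraction of positions where the two tag sequences agree."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        return 0.0
    if coarse:
        gold = [collapse(t) for t in gold]
        predicted = [collapse(t) for t in predicted]
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)


# Invented sequences, for illustration only.
gold = ["NNM", "VBF", "JJ", "NNP"]
predicted = ["NNF", "VBF", "JJS", "NN"]
fine = tag_agreement(gold, predicted)
coarse = tag_agreement(gold, predicted, coarse=True)
print(fine, coarse)
```

Comparing at the coarse level forgives disagreements within a group, such as the gender or number of a noun, while still penalising a noun tagged as a verb.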

  16. Twitter API

  17. Twitter CymrIE in GUI

  18. Improvements for NER: added more gazetteers, obtained from Y Lolfa Cyf: special events, Welsh chapels, choirs, Papurau-bro (community newspapers), Welsh magazines, person names.

  19. Improvements: improved JAPE rules for identifying Date, Person and Location annotations. Interfaces in Welsh: WNLT GUI, WNLT CLI. Welsh user guide.

  20. GUI in Welsh

  21. CLI in Welsh

  22. CymrIE NER Experiment. Find: Person role, Location kind, Date kind. ~7 heuristics for Person role; ~18 heuristics for all. Papurau-bro newsletters.

  23. CymrIE NER Experiment. Labels in the annotated example: BRIDE, GROOM, SERVICE, SERVICE, MOTHER & FATHER OF BRIDE. ~7 heuristics. Papurau-bro newsletters.

  24. Wedding NER

      Annotation       Precision  Recall
      Bride            100%       67.7%
      BrideFather      91.7%      52.4%
      BrideMother      100%       54.2%
      Groom            95.2%      62.5%
      GroomFather      91.7%      57.9%
      GroomMother      90%        47.4%
      ServiceLocation  66.7%      58.3%
      ServiceDate      96.3%      89.7%

      * 32 news articles were used in the gold standard

  25. Future work: add a co-referencing processing resource to CymrIE; collaborate with users to further develop CymrIE's NER capabilities; further develop the NER pipeline for wedding announcements (more heuristics for identifying Person role, Location kind and Date kind; machine learning; ensemble methods).

  26. Acknowledgements. Welsh Language Unit, Welsh Government: Gareth Morlais, for further funding of the WNLT project. Department of Information Studies, University College London: Andreas Vlachidis, for his help creating the Gold Standards and evaluation. School of Welsh, Cardiff University: Benjamin Screen, for his help creating the Twitter Gold Standard. Y Lolfa: for supplying a lot of gazetteers. http://techiaith.cymru (Bangor University): resource for Welsh tweets.

  27. Welsh Natural Language Toolkit. Daniel Williams, Dr. Daniel Cunliffe, Prof. Douglas Tudhope. Hypermedia research unit: http://hypermedia.research.southwales.ac.uk/kos/wnlt/ Sourceforge: https://sourceforge.net/projects/wnlt-project/

  28. Extra slides

  29. WNLT 1 CymrIE Performance - Evaluation

      Gold standard: 2221 tokens; 230 entities (Date, Location, Organization, Percent, Person)

      Component  Recall  Precision  F1
      Tokenizer  99%     98%        99%
      POS        82%     81%        81%
      Lemma      80%     79%        80%
      NER        89%     86%        87%

      * Partial matches are weighted as half matches (average mode)
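
The footnote says that partial matches count as half matches. A minimal sketch of that scoring scheme follows, assuming the usual four outcome counts for annotation comparison: correct, partial (overlapping but not identical spans), spurious and missing. The function name and the example counts are made up for illustration and are not the figures behind the slide.

```python
def average_mode_scores(correct, partial, spurious, missing):
    """Precision, recall and F1 with partial matches weighted as one half."""
    matched = correct + 0.5 * partial
    responses = correct + partial + spurious  # annotations the system produced
    keys = correct + partial + missing        # annotations in the gold standard
    precision = matched / responses if responses else 0.0
    recall = matched / keys if keys else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
p, r, f = average_mode_scores(correct=80, partial=10, spurious=10, missing=10)
print(p, r, f)
```

Weighting partial matches at one half gives the midpoint between a strict score, where partial matches count as errors, and a lenient score, where they count as correct.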
