Weak Supervision for NLP: Overcoming Labelled Data Challenges


Acquiring labelled data is one of the main obstacles to building NLP models, particularly for resource-poor languages and for evolving, domain-specific categories. This talk presents weak supervision techniques that address this scarcity through alternative annotation methods and by leveraging diverse data sources. The approaches discussed include collecting and annotating training data, fine-tuning existing models, in-context learning, and weak supervision for automatic data annotation in target domains.


Uploaded on Oct 09, 2024



Presentation Transcript


  1. No labelled data? No problem! Weak supervision for NLP Pierre Lison Norwegian Computing Center (NR) & University of Oslo plison@nr.no Invited talk, Peace Science Infrastructure August 15, 2023

  2. The data problem How can I get labelled data to train my model? Lack of labelled data remains one of the most critical challenges when applying ML/NLP today! It is especially acute for resource-poor languages, less common text domains, and tasks relying on domain-specific categories (which may change over time!)

  3. Possible solutions a. Collect & annotate our own training data and use it to train a new ML model + Provides high-quality data that can be used for analysis, training and evaluation - Expensive and time-consuming to collect (especially if the task is not static!)

  4. Possible solutions a. Collect & annotate our own training data and use it to train a new ML model b. Use fine-tuning to adapt an existing model + Requires fewer training examples (thanks to the transfer of knowledge from the initial model to the new one) - Still requires training data, and is hard to apply to the really large LLMs

  5. Possible solutions a. Collect & annotate our own training data and use it to train a new ML model b. Use fine-tuning to adapt an existing model c. Do in-context learning by adding examples to LLM prompts + Easy to apply, with surprisingly good performance even with just a few examples (few-shot learning) - Does not work for all tasks, and offers no interpretability

  6. Possible solutions a. Collect & annotate our own training data and use it to train a new ML model b. Use fine-tuning to adapt an existing model c. Do in-context learning by adding examples to LLM prompts d. Use weak supervision to automatically annotate data from our target domain (our focus here)

  7. Weak supervision Goal: train machine learning models using weaker/noisier supervision signals obtained from different sources, instead of relying on a single gold standard. Approach: 1. We start with some raw, unlabelled data. 2. We apply several labelling functions (heuristics etc.) to obtain some (imperfect) labels for our data. 3. We then aggregate those labels together. 4. Finally, we train our machine learning model on the aggregated labels. Note: this approach is not specific to NLP; it can be applied to a wide range of ML tasks (classification, regression, sequence labelling, relation extraction, etc.)
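The four steps above can be sketched in plain Python. This is an illustrative toy, not skweak's actual API: the two keyword-based labelling functions are made up, and simple majority voting stands in for the aggregation step.

```python
from collections import Counter

# 1. Raw, unlabelled data (a toy sentiment-classification task)
texts = ["great product, loved it", "terrible, broke after a day",
         "works fine", "awful customer service"]

# 2. Several (imperfect) labelling functions; None means "abstain"
def lf_positive_words(text):
    return "POS" if any(w in text for w in ("great", "loved", "fine")) else None

def lf_negative_words(text):
    return "NEG" if any(w in text for w in ("terrible", "awful", "broke")) else None

votes = [[lf(t) for lf in (lf_positive_words, lf_negative_words)] for t in texts]

# 3. Aggregate the labels (here: majority vote over non-abstentions)
def aggregate(item_votes):
    counts = Counter(v for v in item_votes if v is not None)
    return counts.most_common(1)[0][0] if counts else None

agg_labels = [aggregate(v) for v in votes]

# 4. The (text, aggregated label) pairs can now train any supervised model
training_data = [(t, y) for t, y in zip(texts, agg_labels) if y is not None]
```

In practice the aggregation step is more sophisticated (see the generative model later in the talk), but the overall pipeline has exactly this shape.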

  8. Illustration for sequence labelling

  9. Limitations Weak supervision makes it easier to obtain labels for existing data But it makes two assumptions: 1. You are able to collect some raw (unlabelled) input data for your task 2. You can define labelling functions based on expert knowledge / resources If those assumptions are not met, other approaches should be explored

  10. The skweak toolkit A versatile Python toolkit to automatically annotate training data using labelling functions. The results of all labelling functions are aggregated using a generative model. Can be applied to any type of sequence labelling or classification task! Tightly integrated with spaCy, an all-round Python library for running NLP pipelines

  11. Labelling functions skweak comes with a full-fledged, Python-based API to define & combine labelling functions for NLP tasks: heuristic rules, gazetteers, ML models, document-level constraints, etc. Input: a spaCy Doc. Output: labelled text spans (for text classification, the span is the full text)

  12. Labelling functions Crucially, labelling functions are allowed to abstain from predicting a label for a given data point Some labelling functions may rely on external data sources that are not available at runtime Example: historical data where the labelling function peeks into the future Crowdsourced labels may also be viewed as labelling functions Often beneficial to apply a broad range of labelling functions, with different trade-offs between precision and coverage
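To make the precision/coverage trade-off concrete, here is an illustrative sketch in plain Python (not skweak's actual API) of two labelling functions over a token list: a heuristic rule and a tiny gazetteer. Both abstain simply by yielding no spans; the names and data are hypothetical.

```python
import re

# Tiny gazetteer of known (first name, last name) pairs
PRESIDENTS = {("Barack", "Obama"), ("Joe", "Biden")}

def lf_year_heuristic(tokens):
    """Heuristic rule: 4-digit tokens starting with 19/20 are DATE spans.
    High coverage for years, but would mislabel e.g. street numbers."""
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"(19|20)\d{2}", tok):
            yield (i, i + 1, "DATE")

def lf_president_gazetteer(tokens):
    """Gazetteer: adjacent token pairs found in PRESIDENTS are PERSON spans.
    High precision, but only covers names listed in the gazetteer."""
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) in PRESIDENTS:
            yield (i, i + 2, "PERSON")

tokens = "Joe Biden was elected in 2020".split()
spans = [s for lf in (lf_year_heuristic, lf_president_gazetteer)
         for s in lf(tokens)]
```

Each span is a (start, end, label) triple over token indices; on inputs where neither rule fires, both functions abstain and contribute nothing.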

  13. Very simple example

  14. Aggregation model Generative model where the states are the true labels, each associated with multiple observations (one per labelling function). This statistical model is estimated iteratively in an unsupervised manner, without access to the true (gold standard) labels!

  15. Some experimental results Named Entity Recognition on the MUC-6 corpus (7 entity types). Results using 52 labelling functions: Lison, P., Barnes, J. and Hubin, A. (2021). skweak: Weak Supervision Made Easy for NLP. In ACL 2021 (Demonstrations).

  16. Some experimental results (2) Sentiment analysis on the Norwegian Review Corpus data: Lison, P., Barnes, J. and Hubin, A. (2021). skweak: Weak Supervision Made Easy for NLP. In ACL 2021 (Demonstrations).

  17. Conclusion Weak supervision is a powerful approach to train machine learning models in the absence of labelled data. It allows us to inject expert knowledge (from heuristics, existing resources, etc.) into the model. skweak provides an easy-to-use, yet powerful API to label training data for NLP tasks!

  18. Conclusion Interested? Test our toolkit at https://github.com/NorskRegnesentral/skweak. Ideas about new tasks/datasets/models are most welcome, and we have several ideas for extensions and improvements in the pipeline, so get in touch!
