Gradual Fine-Tuning for Low-Resource Domain Adaptation: Methods and Experiments

 
Gradual Fine-Tuning for Low-
Resource Domain Adaptation
 
 
 
Haoran Xu, Seth Ebner, Mahsa Yarmohammadi, Aaron White, Benjamin Van Durme, Kenton Murray
Dec. 2020
 
Roadmap
 
Background
Methods
Experiments
Conclusion
Background: Fine-Tuning
Fine-tuning a pre-trained model is usually better than training from scratch.
People would like to fine-tune pre-trained models to improve the performance of models in some specific
tasks.
 
Example 1:
Pre-trained BERT
Specific Task Model
Replace
Randomly initialized
embedding
 
e.g., sentiment analysis,
machine translation
 
Background: Fine-Tuning
 
Example 2: Domain Adaptation for Neural Machine Translation (
Chu et al., 2017)
Methods: Gradual Fine-tuning
Fine-tuning stage is often performed in 
one-step
: the pretrained model is directly trained on the
in-domain data.
 
Gradual fine-tuning
: a model is iteratively trained to convergence on data whose distribution progressively
approaches that of the in-domain data.
Pre-trained Model
Trained by out-of-
domain data (maybe
include in-domain data)
Fine-tuning
 
Trained by some out-of-
domain + in-domain data
Fine-tuning
Trained by in-domain
data
Fine-tuning
 
Trained by in-domain
data
 
 
😀
 
We found that 
the model performs better if it is eased toward the target domain rather than abruptly
shifting to it.
 
Methods: Gradual Fine-tuning
 
Inspired by curriculum learning (Bengio et al., 2009), we begin by training the model on data that contains a mix
of out-of-domain and in-domain instances, and then increase the concentration of target domain data in each
fine-tuning step.
Methods: Gradual Fine-tuning
 
Step 1: Mixed Domain Training
Find out out-of-domain data (mapped into the same
format as the in-domain data) and concatenate with
in-domain data to form a mixed domain dataset.
Portions of the target task schema corresponding to
fields not available in the out-of-domain data could
be masked in the mapped data.
 
Step 3: Fine-Tune on only in-domain data
Experiments: Dialogue State Tracking
 
Target Domain
 
4K
 
2K
 
0.5K
Experiments: Dialogue State Tracking
 
Metric:
Slot accuracy: the accuracy of predicting each
slot separately.
Joint accuracy: the percentage of turns in
which all slots are predicted correctly.
Results
Experiments: Dialogue State Tracking
Results
Experiments: Event Extraction
Event extraction involves predicting event triggers, event arguments, and argument roles.
Experiments: Event Extraction
 
Baseline:
trained only on Arabic data
trained on mixed data (Arabic + 1K English data)
trained on mixed data  plus one-step fine-tuning.
 
Metric:
TrigID
: a trigger is correctly identified if its
offsets find a match in the ground truth,
TrigC
: and it is correctly classified if their event
types match.
ArgID
: An argument is correctly identified if its
offsets and event type find a match in the
ground truth,
ArgC
: and it is correctly classified if their event
roles match.
Conclusion
 
Improvement
: gradual finetuning outperforms standard one-step
fine-tuning and can substantially improve the performance of models.
Easy to implement:
 gradual fine-tuning can be straightforwardly
applied to an existing codebase without changing the model
architecture or learning objective.
 
Thank you!
 
Background: Fine-Tuning
 
Example 2: Claim Detection (
Chakrabarty et al., 2019
)
Experiments: Dialogue State Tracking
Dialogue state tracking (DST): estimating at each dialogue turn the probability distribution over slot-
values enumerated in an ontology.
 
Ontology:
"
restaurant-price range
": [
        "expensive",
        "cheap",
],
"
restaurant-area
": [
        "south",
        "north",
        "east",
        "west",
]
 
Example
:
 
Dialogues:
Person 1: where do we eat tonight?
Person 2: Let’s go a noodle restaurant in
the 
east
 of Chinatown.
Person 1: Does it 
cheap
?
Person 2: Yes!
 
Slots:
"
restaurant-price range
": 
cheap
"
restaurant-area
": 
east
Slide Note
Embed
Share

This study presents the effectiveness of gradual fine-tuning in low-resource domain adaptation, highlighting the benefits of gradually easing a model towards the target domain rather than abrupt shifts. Inspired by curriculum learning, the approach involves training the model on a mix of out-of-domain and in-domain data, gradually increasing the concentration of target domain data in each fine-tuning step. The methodology involves mixed domain training, iteratively fine-tuning with reduced out-of-domain data, and final fine-tuning on in-domain data. Experimental results demonstrate improved model performance with this gradual fine-tuning strategy.


Uploaded on Jul 18, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Gradual Fine-Tuning for Low- Resource Domain Adaptation Haoran Xu, Seth Ebner, Mahsa Yarmohammadi, Aaron White, Benjamin Van Durme, Kenton Murray Dec. 2020

  2. Roadmap Background Methods Experiments Conclusion

  3. Background: Fine-Tuning Fine-tuning a pre-trained model is usually better than training from scratch. People would like to fine-tune pre-trained models to improve the performance of models in some specific tasks. Example 1: e.g., sentiment analysis, machine translation Specific Task Model Randomly initialized embedding Pre-trained BERT Replace

  4. Background: Fine-Tuning Example 2: Domain Adaptation for Neural Machine Translation (Chu et al., 2017)

  5. Methods: Gradual Fine-tuning Fine-tuning stage is often performed in one-step: the pretrained model is directly trained on the in-domain data. Trained by in-domain data Trained by out-of- domain data (maybe include in-domain data) Fine-tuning Pre-trained Model Trained by some out-of- domain + in-domain data Fine-tuning Trained by in-domain data Fine-tuning We found that the model performs better if it is eased toward the target domain rather than abruptly shifting to it. Gradual fine-tuning: a model is iteratively trained to convergence on data whose distribution progressively approaches that of the in-domain data.

  6. Methods: Gradual Fine-tuning Inspired by curriculum learning (Bengio et al., 2009), we begin by training the model on data that contains a mix of out-of-domain and in-domain instances, and then increase the concentration of target domain data in each fine-tuning step. Out-of-domain data: ?? In-domain data: ??

  7. Methods: Gradual Fine-tuning Step 1: Mixed Domain Training Find out out-of-domain data (mapped into the same format as the in-domain data) and concatenate with in-domain data to form a mixed domain dataset. Portions of the target task schema corresponding to fields not available in the out-of-domain data could be masked in the mapped data. Step 2: Iteratively Fine-Tuning Define a data schedule ? for decreasing amounts of out-of- domain data in each step, where S is defined as randomly down-sampling from the out-of-domain data used in the previous iteration. Fine-tune on the data selected by ?. Step 3: Fine-Tune on only in-domain data

  8. Experiments: Dialogue State Tracking Dataset: MultiWOZ v2.0 dataset (Budzianowski et al., 2018), which is a multi-domain conversational corpus with seven domains and 35 slots. Following Wu et al. (2019), we focus on five domains: restaurant, hotel, attraction, taxi, and train. 2198 single-domain dialogues and 5459 multi-domain dialogues. Settings: In-domain data for restaurant: 523 single-domain dialogues. In-domain data for hotel: 513 single-domain dialogues. Out-of-domain data: the rest of dialogues exluding the target domain. data schedule ?= 4K 2K 0.5K 0. Model: Slot-Utterance Matching Belief Tracker model (SUMBT) (Lee et al. 2019). Target Domain 4K 2K 0.5K

  9. Experiments: Dialogue State Tracking Results Baseline: the model trained only on in-domain data (no data augmentation) the model trained with the same settings as Lee et al. (2019), which has seen the full training set. one-step fine-tuning strategy (?: 4K 0) Metric: Slot accuracy: the accuracy of predicting each slot separately. Joint accuracy: the percentage of turns in which all slots are predicted correctly.

  10. Experiments: Dialogue State Tracking Results Blue ?: 4K 2K 0.5K 0 Purple ?: 2K 0.5K 0

  11. Experiments: Event Extraction Event extraction involves predicting event triggers, event arguments, and argument roles. Settings: Dataset: ACE 2005 corpus by considering Arabic as the target domain and English as the auxiliary domain. Data Processing: train/dev/test sets for Arabic are randomly selected. Model: we use the DYGIE++ framework (Wadden et al., 2019), which has shown state-of-the-art results. Model Modification: replace the BERT encoder with XLM-R to train models on monolingual and mixed bilingual datasets. Data Schedule ?: 1K 0.5K 0.2K 0 (refer to 85%, 35%, and 5% of total events/args in the English train set).

  12. Experiments: Event Extraction Baseline: trained only on Arabic data trained on mixed data (Arabic + 1K English data) trained on mixed data plus one-step fine-tuning. Metric: TrigID: a trigger is correctly identified if its offsets find a match in the ground truth, TrigC: and it is correctly classified if their event types match. ArgID: An argument is correctly identified if its offsets and event type find a match in the ground truth, ArgC: and it is correctly classified if their event roles match.

  13. Conclusion Improvement: gradual finetuning outperforms standard one-step fine-tuning and can substantially improve the performance of models. Easy to implement: gradual fine-tuning can be straightforwardly applied to an existing codebase without changing the model architecture or learning objective.

  14. Thank you!

  15. Background: Fine-Tuning Example 2: Claim Detection (Chakrabarty et al., 2019)

  16. Experiments: Dialogue State Tracking Dialogue state tracking (DST): estimating at each dialogue turn the probability distribution over slot- values enumerated in an ontology. Example: Ontology: Dialogues: Slots: "restaurant-price range": [ "expensive", "cheap", ], "restaurant-area": [ "south", "north", "east", "west", ] Person 1: where do we eat tonight? Person 2: Let s go a noodle restaurant in the east of Chinatown. Person 1: Does it cheap? Person 2: Yes! "restaurant-price range": cheap "restaurant-area": east

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#