Applications of Transformer Neural Networks in Assessment Overview

Dive into the world of Transformer Neural Networks with insights on their applications in assessment tasks. Explore the evolution of NLP methods and understand why transformers enable more accurate scoring and feedback. Uncover key concepts and processes involved in model pretraining for language tasks.



Presentation Transcript


  1. Applications of Transformer Neural Networks in Assessment (Sue Lottridge)

  2. Agenda
     - Overview of Transformers
     - Transformer applications
     - Lessons learned

  3. Evolution of NLP methods at CAI (2015-2022)
     [Timeline] Crisis Paper Detection: LSA, then RNN, then TRF; Essay Scoring: LSA, then TRF; Short Answer Scoring: RNN, then TRF; Speech Scoring: TRF; Writing Feedback: in progress

  4. Why transformers? They allow us to:
     - score things that we could not otherwise score
     - score more accurately than we could before
     - expand our work into more than just scoring, like feedback
     They act as a high-powered LSA, taking into consideration:
     - word usage in context (word embeddings)
     - word order
     - almost all words in a response (i.e., minimizes the out-of-vocabulary problem)

  5. Transformer Models: Overview

  6. Transformer Key Concepts
     - Transfer Learning: model is pretrained on one task and then fine-tuned on a new task
     - Word Order: assign a numeric value to the word in each position and use it in the modeling task
     - Tokenization: vocabulary is a mix of words, subwords, and characters
     - Contextual Embeddings: token modeling uses the full sentence versus a small window of words
     - Feed-Forward Networks: support back propagation and batch gradient descent (i.e., more stable estimation of weights)
     - Self-Attention: uses novel techniques (essentially dot products) to allow the engine to learn relationships between words
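
     To make the self-attention concept above concrete, here is a minimal sketch of scaled dot-product attention (the "essentially dot products" idea, in the spirit of Vaswani et al., 2017). The sequence length, embedding size, and random projection matrices are illustrative assumptions, not values from the presentation.

    # Minimal scaled dot-product self-attention; all shapes and weights are toy values.
    import torch
    import torch.nn.functional as F

    seq_len, d_model = 6, 16                  # 6 tokens, 16-dim embeddings (assumed)
    x = torch.randn(seq_len, d_model)         # token embeddings (with position information added)

    W_q = torch.randn(d_model, d_model)       # learned query/key/value projections
    W_k = torch.randn(d_model, d_model)
    W_v = torch.randn(d_model, d_model)

    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / d_model ** 0.5         # dot products between every pair of tokens
    weights = F.softmax(scores, dim=-1)       # how much each token attends to every other token
    contextual = weights @ V                  # context-aware representation of each token
    print(contextual.shape)                   # torch.Size([6, 16])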

  7. Model Pretraining (Building a Language Model)
     - Identify Corpus: identify a corpus of text (e.g., Wikipedia, BooksCorpus, or Common Crawl)
     - Prepare Data: divide into sentences or pairs of sentences and cleanse the data
     - Tokenize: build a tokenization model/scheme to define a vocabulary and word order
     - Define Task: determine the task (e.g., predict masked words, predict the next sentence)
     - Train: train and monitor results (typically perplexity or cross entropy)

  8. Model Pretraining Details
     - Code is available to do this (e.g., the Hugging Face libraries); a minimal sketch follows below
     - Requires a large corpus (many GB)
     - Train on GPU/TPU; training can take multiple days
     - The perplexity metric depends upon the task and corpus
       - Inverse of the probability of the best prediction
       - No clear thresholds on quality; the decision on when a model is good enough is qualitative
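
     As a rough illustration of the pretraining recipe on the last two slides, the sketch below trains a masked language model from scratch with the Hugging Face transformers and datasets libraries. The corpus file name ("corpus.txt"), model size, and hyperparameters are assumptions for illustration, not the settings used at CAI.

    # Minimal masked-language-model pretraining sketch (illustrative settings only).
    import math
    from datasets import load_dataset
    from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    # Reuse an existing tokenizer for brevity; in practice a vocabulary would be
    # built from the corpus itself (see the byte pair encoding slide).
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM(BertConfig())     # randomly initialized, BERT-base-sized model

    # "corpus.txt" is a hypothetical file with one cleansed sentence per line.
    corpus = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
    tokenized = corpus.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])

    # The collator masks 15% of tokens; the training task is to predict them.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

    args = TrainingArguments(output_dir="mlm_pretrain", num_train_epochs=1,
                             per_device_train_batch_size=16, logging_steps=500)
    trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
    trainer.train()

    # Perplexity is the exponential of the cross-entropy loss on a held-out set
    # (the training set is reused here only to keep the sketch short).
    print("perplexity:", math.exp(trainer.evaluate(eval_dataset=tokenized)["eval_loss"]))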

  9. Vocabulary Creation via Byte Pair Encoding
     A recursive technique that builds words and sub-words based upon frequency of occurrence (not morphology), using work from Gage (1994). Iteratively:
     - Tokenize the text into words
     - Divide each word into characters
     - Get counts of all pairs of consecutive characters/tokens
     - Pick the top pair and rebuild the vocabulary using that merged pair
     - Continue for N iterations
     Then pick the top N tokens (often 30,000 to 50,000) and map each to an index (i.e., token A mapped to index 0). A minimal sketch of the merge loop follows below.
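
     Below is a minimal sketch of frequency-based byte pair encoding over word counts, following the steps listed above; the toy corpus and the number of merge passes are illustrative assumptions.

    # Minimal byte pair encoding merge loop (toy corpus, illustrative only).
    import re
    from collections import Counter

    def pair_counts(vocab):
        """Count every pair of adjacent symbols across the space-separated vocabulary."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        return pairs

    def merge(pair, vocab):
        """Rebuild the vocabulary with the chosen symbol pair merged into one symbol."""
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Steps 1-2: tokenize text into words, divide each word into characters.
    text = "life for me ain't been no crystal stair"      # toy corpus (assumed)
    vocab = Counter(" ".join(word) for word in text.split())

    # Steps 3-5: repeatedly pick the most frequent adjacent pair and rebuild the vocabulary.
    for _ in range(100):                                   # N merge passes
        pairs = pair_counts(vocab)
        if not pairs:
            break
        vocab = merge(max(pairs, key=pairs.get), vocab)

    print(vocab)   # frequent words end up as single tokens; rare ones remain sub-word pieces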

  10. Vocabulary after 100 Passes
      ' ', ',', 'And', 't', '.', 'you', 'I', 'been', 'no', 'the', 'ain', 's', 'on', 'se', 'for', 'me', 'crystal', 'stair', 'in', 'it', ' ', 'climbin', 'goin', 'Don', 'still', 'Well', 'son', 'll', 'tell', ':', 'Life', 'It', 'had', 'tacks', 'splinters', 'boards', 'torn', 'up', 'places', 'with', 'car p et', 'f l o or', 'B ar e', 'B u t', 'all', 'ti me', 'a', '-', 'r e ac h in', 'l a nd in', 'tur n in', 'c or n ers', 's o m e ti m es', 'd ar k', 'W h ere', 'th ere', 'li g h t', 'S o', 'bo y', 'd on', 'tur n', 'b ack', 's et', 'd o w n', 'st e p s', 'C a u se', 'f ind s', 'k ind e r', 'h ar d', 'f all', 'no w', 'F or', 'h o n e y', 'li fe'

      Vocabulary after 200 Passes
      ' ', ',', 'And', 't', '.', 'you', 'I', 'been', 'no', 'the', 'ain', 's', 'on', 'se', 'for', 'me', 'crystal', 'stair', 'in', 'it', ' ', 'climbin', 'goin', 'Don', 'still', 'Well', 'son', 'll', 'tell', ':', 'Life', 'It', 'had', 'tacks', 'splinters', 'boards', 'torn', 'up', 'places', 'with', 'carpet', 'floor', 'Bare', 'But', 'all', 'time', 'a', '-', 'reachin', 'landin', 'turnin', 'corners', 'sometimes', 'dark', 'Where', 'there', 'light', 'So', 'boy', 'don', 'turn', 'back', 'set', 'down', 'steps', 'Cause', 'finds', 'kinder', 'hard', 'fall', 'now', 'For', 'honey', 'life'

      (After 100 merge passes, many words are still split into sub-word pieces, e.g., 'car p et', 'f l o or'; after 200 passes, nearly all have been merged into whole-word tokens.)

  11. Model Fine-Tuning
      - Remove the layer for the original training task and replace it with a layer that predicts the new task
      - Use only one token (e.g., the [CLS] token) as the prediction input; this token is assumed to hold the information of the text
      - Use the initial weights from the original training task and update weights via back-propagation
      https://mccormickml.com/2019/07/22/BERT-fine-tuning/
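
      Here is a minimal sketch of that fine-tuning setup, assuming a BERT-style checkpoint and a four-category score scale (both illustrative choices): the pretraining head is replaced with a new classification layer, the [CLS] representation feeds that layer, and all weights are updated by back-propagation.

    # Minimal fine-tuning sketch: a new classification head on a pretrained encoder.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # The classification head is new and randomly initialized; the encoder keeps its
    # pretrained weights. num_labels=4 is an assumed four-point score scale.
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Toy batch of one scored response (hypothetical data).
    responses = ["The author supports the claim with two pieces of evidence."]
    scores = torch.tensor([3])

    batch = tokenizer(responses, padding=True, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**batch, labels=scores)   # prediction is based on the [CLS] token's representation
    outputs.loss.backward()                   # back-propagation updates head and encoder weights
    optimizer.step()
    print(outputs.logits.argmax(dim=-1))      # predicted score category for the batch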

  12. Performance
      What evidence is there that these models work?
      - Perplexity: quality of prediction on a test set
      - Performance on language understanding tasks
      - Performance on more generalized tasks
      Issues:
      - Perplexity is difficult to compare across corpora and tasks
      - Researchers have used the GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) benchmarks to examine fine-tuned transformer performance

  13. Language Understanding Tasks
      Single-Sentence Tasks
      - Is the sentence grammatical? (CoLA)
      - What is the sentiment of the sentence? (SST-2)
      Similarity/Paraphrasing Tasks
      - Is this a paraphrase of this sentence? (MRPC)
      - Is this pair of questions semantically similar? (QQP)
      - How similar are these pairs of sentences on a scale from 1 to 5? (STS-B)
      Sentence Inference Tasks
      - Does that sentence follow this sentence? (MNLI)
      - Does this paragraph contain the answer to this question? (QNLI)
      - Is the relationship between sentence 2 and sentence 1 one of entailment, contradiction, or neutrality? (RTE)
      - Is this the referent for the pronoun in this sentence? (WNLI)

  14. The Many Flavors of Transformers
      - BERT (Devlin et al., 2018)
        Data: Wikipedia, BooksCorpus (3,200M words)
        Model: self-attention
        Task: predict masked word; next sentence prediction
      - ELECTRA (Clark et al., 2020)
        Data: Wikipedia, BooksCorpus
        Model: self-attention
        Task: predict which tokens were replaced by a masked language model
      - ConvBERT (Jiang et al., 2020)
        Data: OpenWebText (32 GB)
        Model: convolutional heads replace self-attention
        Task: predict which tokens were replaced by a masked language model
      - XLNet (Yang et al., 2020)
        Data: Wikipedia, BooksCorpus, Giga5, ClueWeb, Common Crawl (33B words)
        Model: autoregression (product of word probabilities)
        Task: predict masked word
      - DeBERTa (He et al., 2021)
        Data: Wikipedia, BooksCorpus, OpenWebText, Stories (78 GB)
        Model: separates position and word embeddings in prediction
        Task: predict masked word

  15. Transformer Performance on GLUE Tasks (Wang et al., 2020)

  16. Transformer Models: Performance at CAI

  17. Crisis Paper Detection (ELECTRA)
      Threshold set: ~200k responses used to set thresholds for when to flag a piece of text as a potential alert.
      Alert set: 1,000 responses which were labeled as alerts; we examine how many of these responses are labeled as potential alerts at various thresholds.

      Threshold    Normal Responses (%)   Alerts (%)
      0.0001527    4%                     99.5%
      0.0004495    2%                     99.0%
      0.0012648    1%                     98.8%
      0.0030495    0.50%                  98.2%
      0.0058723    0.30%                  97.6%
      0.0149332    0.15%                  96.2%
      0.0239027    0.10%                  94.5%
      0.0497377    0.05%                  91.6%

  18. Reading Comprehension (NAEP Results)
      https://github.com/NAEP-AS-Challenge/info/blob/952ee82cc52163b17bccc1f12174ce246b8dff9b/results.md

  19. EL Speech Scoring
      Grade Span   Word Error Rate           Quadratic Weighted Kappa   Exact Agreement %
                   Wav2Vec 2.0   UniSpeech   H1H2    HSAS    Diff       H1H2   HSAS   Diff
      1            71%           40%         0.88    0.82    -0.06      81%    74%    -7%
      2-3          71%           34%         0.87    0.89     0.02      74%    79%     5%
      4-5          67%           30%         0.81    0.83     0.02      68%    74%     6%
      6-8          65%           25%         0.82    0.85     0.03      65%    74%     9%
      9-12         59%           22%         0.79    0.84     0.05      67%    77%     9%
      All Grades   64%           27%         0.81    0.84     0.03      68%    75%     7%

  20. Argument Modeling
      Demarcating elements of argumentation in examinee writing for the persuasive genre.
      - Lead: an introduction that begins with a statistic, a quotation, a description, or some other device to grab the reader's attention and point toward the thesis
      - Position: an opinion or conclusion on the main question
      - Claim: a claim that supports the position
      - Counterclaim: a claim that refutes another claim or gives an opposing reason to the position
      - Rebuttal: a claim that refutes a counterclaim
      - Evidence: ideas or examples that support claims, counterclaims, or rebuttals
      - Concluding Statement: a concluding statement that restates the claims

  21. Argumentation Results with ELECTRA
      Argumentation Label   N      Acc.   Kappa   F1     Prec.   Rec.   Average Number of Labels
      No label              1672   96%    0.40    0.35   0.44    0.37   15.2
      claim                 1672   84%    0.39    0.47   0.48    0.54   48.8
      concluding            1672   95%    0.73    0.69   0.70    0.72   40.6
      counterclaim          1672   97%    0.25    0.10   0.11    0.11   7.4
      evidence              1672   80%    0.57    0.78   0.80    0.79   166.1
      lead                  1672   97%    0.64    0.38   0.40    0.39   21.3
      position              1672   96%    0.64    0.65   0.69    0.67   18.1
      rebuttal              1672   98%    0.17    0.06   0.07    0.06   5.5
      Average               1672   93%    0.53    0.43   0.46    0.46
      Weighted average      1672   85%    0.58    0.72   0.77    0.71

  22. Transformer Models: Learnings

  23. Lessons Learned (1)
      - In the aggregate, the various English-language models perform comparably across item types
        - We choose models by use-case needs (i.e., small and memory-efficient) after many experiments
        - Most models use similar corpora and slightly different architectures
      - Training data are important
        - Corpus language should be similar to the scoring task
        - Size of the data used to pretrain (many GB)
        - Size of the sample used to fine-tune (n ~ 2,000)

  24. Lessons Learned (2)
      - Models vary in their performance by item type
        - ELA: perform above humans
        - Math: perform similar to humans
        - Social Studies: slightly underperform
        - Science: moderately underperform
      - Speech models perform comparably to humans for scoring
        - Transcribers have widely varying performance, and performance varies by age of student

  25. Lessons Learned: When Models Fail
      - Spanish-language items: models greatly underperform
        - Poor match of student-entered text against the pretrained model vocabulary
        - Too little data on which to build a pretrained model
        - Fine-tuning dataset is very small (n ~ 800)
      - Generic models: generally, these don't work when meaning is critical
        - Exception: grammar/conventions

  26. Generic Modeling Results (NAEP)
      https://github.com/NAEP-AS-Challenge/info/blob/952ee82cc52163b17bccc1f12174ce246b8dff9b/results.md

  27. Benefits
      - Fine-tuning transformers involves fewer decisions: model choice, learning rate, batch size, number of epochs
      - Consistent performance across items within a task type: if they work well for a few items, transformers will likely work well across items
      - Flexible across a variety of predictive tasks
      - Lots of open-source resources (code, models, data)

  28. Issues to Consider: Length and Auditing
      - Limitations on input length (e.g., 256 or 512 tokens); responses will be truncated unless this is managed
      - Debugging/auditing: when models do fail or when questions arise about them, it can be difficult to discern the source of the issue
        - Saliency maps via LIME/SHAP or gradient methods
        - Examination of how responses tokenize into the vocabulary: are key terms divided into word pieces? (A minimal audit sketch follows below.)
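
      A minimal sketch of that kind of audit, assuming a Hugging Face tokenizer, a 512-token limit, and a few hypothetical rubric terms: it flags responses that would be truncated and key terms that the vocabulary splits into word pieces.

    # Minimal length/tokenization audit (model name, limit, and terms are assumptions).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    MAX_LEN = 512

    response = "Photosynthesis uses sunlight to convert carbon dioxide and water into glucose."
    n_tokens = len(tokenizer(response, truncation=False)["input_ids"])
    if n_tokens > MAX_LEN:
        print(f"Response is {n_tokens} tokens; it will be truncated at {MAX_LEN}.")

    # Check whether rubric-critical terms survive as single tokens.
    for term in ["photosynthesis", "chloroplast", "glucose"]:
        pieces = tokenizer.tokenize(term)
        if len(pieces) > 1:
            print(f"'{term}' is split into word pieces: {pieces}")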

  29. Issues to Consider: Deployment
      - Depends upon the use case
        - Number of models to support
        - Latency/throughput requirements
        - Reproducibility of training and scoring results
      - Lack of common Pythonic approaches for web-service deployment (parallel scoring, memory management, caching, etc.)

  30. References (1)
      Clark, K., Luong, M.-T., Le, Q., & Manning, C. (2020). ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv. https://doi.org/10.48550/arXiv.2003.10555
      Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://doi.org/10.48550/arXiv.1810.04805
      Gage, P. (1994). A new algorithm for data compression. The C Users Journal.
      He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv. https://arxiv.org/pdf/2006.03654.pdf
      Jiang, Z., Yu, W., Zhou, D., Chen, Y., Feng, J., & Yan, S. (2020). ConvBERT: Improving BERT with span-based dynamic convolution. Paper presented at the 34th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada. arXiv. https://arxiv.org/pdf/2008.02496.pdf

  31. References (2)
      Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv. https://doi.org/10.48550/arXiv.1706.03762
      Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2019). SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv. https://arxiv.org/pdf/1905.00537.pdf
      Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv. https://doi.org/10.48550/arXiv.1804.07461
      Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. (2020). XLNet: Generalized autoregressive pretraining for language understanding. arXiv. https://arxiv.org/pdf/1906.08237.pdf
