Enhancing Spoken Language Understanding with Word Confusion Networks
Explore the integration of word confusion networks into large language model inputs to improve spoken language understanding in the face of ASR errors and transcription ambiguities. The work leverages ASR lattices for richer input representations and investigates how performance varies with model size, input representation, and prompt engineering.
Presentation Transcript
Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks
Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke
Motivation
- Large language models yield increasingly strong results on a wide array of natural language understanding (NLU) tasks.
- However, many applications need spoken language understanding (SLU).
- Baseline approach for applying NLU techniques to SLU: speech → ASR → (noisy, 1-best) transcript → LLM → generation. For example, the utterance "what is the opposite of nuclear fission?" may be recognized as "what is the opposite of nuclear fishing?".
- LLM output is sensitive to such ASR errors.
- Lattices from ASR output contain rich information about transcript candidates and potential word confusions. Can we represent these lattices in LLM input?
An example (from ChatGPT-3.5)
With the correct transcript, "what is the opposite of nuclear fission?", the model answers: "The opposite of nuclear fission is nuclear fusion..."
With the ASR error, "what is the opposite of nuclear fishing?", it answers: "The term nuclear fishing is not a widely recognized or commonly used term. Therefore, there isn't a widely accepted term for it. However, if we break down the term nuclear fishing into its components, we can explore potential opposites..."
An example (from ChatGPT-3.5), continued
A simple representation of the ASR confusion, "what is the opposite of nuclear fission/fishing?", recovers the intended answer: "The term you might be looking for is nuclear fusion. While nuclear fission involves splitting atomic nuclei to release energy, nuclear fusion is the process of combining atomic nuclei to form a heavier nucleus and release even greater amounts of energy."
An example (from a smaller model)
But encoding alternatives doesn't always work for smaller models: given "what is the opposite of nuclear fission/fishing?", bloomz-560m answers "nuclear weapons".
Research questions
- How well do LLMs deal with ASR ambiguity?
- How can we improve performance (without fine-tuning)? Input representations; prompt engineering.
- How does performance vary with model size?
Lattices vs. word confusion networks
[Figure: a full ASR lattice alongside the corresponding word confusion network (WCN), which collapses lattice arcs into a linear sequence of word slots.]
Extracting lattices/WCNs
- Extract n-best (n = 10) hypotheses using an off-the-shelf Emformer-based RNN-T model (https://pytorch.org/audio/main/generated/torchaudio.models.emformer_rnnt_model).
- Convert n-best lists → lattices → WCNs using Kaldi [1].

Example WCN (one slot per position, candidate words with posteriors):
  where 0.6 | were 0.3 | wear 0.1
  was 0.8 | with 0.1 | a 0.1
  super 0.7 | <eps> 0.3
  bowl 0.7 | superbowl 0.3
  fifty 1.0
  held 1.0

[1] Povey et al., "The Kaldi Speech Recognition Toolkit," ASRU 2011.
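As an illustration, n-best decoding with the pretrained torchaudio Emformer RNN-T pipeline might look like the sketch below. The bundle name, the input file, and the hypothesis tuple layout follow the public torchaudio API but are assumptions, not the authors' code; the Kaldi n-best-to-WCN alignment step is not shown.

```python
import torchaudio

# Pretrained Emformer RNN-T pipeline (assumed bundle; the slides only say
# "off-the-shelf Emformer-based RNN-T model").
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
feature_extractor = bundle.get_feature_extractor()
decoder = bundle.get_decoder()          # RNNTBeamSearch
token_processor = bundle.get_token_processor()

waveform, sample_rate = torchaudio.load("question.wav")  # hypothetical file
features, length = feature_extractor(waveform.squeeze(0))

# Beam search returns a list of hypotheses; beam_width=10 plays the role
# of the paper's n-best (n=10) list.
hypotheses = decoder(features, length, beam_width=10)
for hyp in hypotheses:
    # In torchaudio, a hypothesis is a tuple: index 0 holds token IDs,
    # index 3 the hypothesis score.
    print(token_processor(hyp[0]), hyp[3])
```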
From WCNs to LLM inputs
Reference: "where was super bowl fifty held"
Join each slot's word options with "/" or "|", optionally pruning words below a posterior threshold (a minimal sketch follows):
- p ≥ 0:   where|were|wear was|with|a super bowl|superbowl fifty held
- p ≥ 0.3: where|were was super bowl|superbowl fifty held
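A minimal sketch of this joining-plus-pruning step, assuming the WCN is represented as a list of slots mapping candidate words to posteriors (the slot data structure and the <eps> handling are illustrative, not the paper's code):

```python
# WCN for the reference "where was super bowl fifty held", taken from the
# slide's example; "<eps>" marks the empty word (a skippable slot).
wcn = [
    {"where": 0.6, "were": 0.3, "wear": 0.1},
    {"was": 0.8, "with": 0.1, "a": 0.1},
    {"super": 0.7, "<eps>": 0.3},
    {"bowl": 0.7, "superbowl": 0.3},
    {"fifty": 1.0},
    {"held": 1.0},
]

def wcn_to_text(wcn, threshold=0.0, sep="|"):
    """Join each slot's surviving words with `sep`, dropping low-posterior
    words and the empty word; slots that prune to nothing are skipped."""
    slots = []
    for slot in wcn:
        words = [w for w, p in sorted(slot.items(), key=lambda kv: -kv[1])
                 if p >= threshold and w != "<eps>"]
        if words:
            slots.append(sep.join(words))
    return " ".join(slots)

print(wcn_to_text(wcn))                 # where|were|wear was|with|a super bowl|superbowl fifty held
print(wcn_to_text(wcn, threshold=0.3))  # where|were was super bowl|superbowl fifty held
```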
NMSQA dataset [2]
- Originally developed for extractive spoken question answering: SQuAD v1.1 data (contexts with questions and answers) was passed through a TTS system to produce spoken train and dev sets.
- Our application: document-grounded QA with transcribed speech questions.
- Evaluation metrics: unigram F1, span exact match (sketched below).
- WER: 22.6% (1-best), 15.9% (n-best oracle with n = 10).

[2] Lin et al., "DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering," Interspeech 2022. https://arxiv.org/pdf/2203.04911
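For reference, SQuAD-style unigram F1 and exact match can be computed roughly as follows; this is a sketch, and the paper's exact text normalization may differ (e.g., SQuAD also strips articles).

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase and strip punctuation before tokenizing on whitespace.
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def unigram_f1(prediction, reference):
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))
```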
LLM input (0-shot)
0-shot prompt with WCN instruction:

{context}
Based on the above context, answer the following question. The question comes from a speech recognition system, and the | symbol indicates potential options for each word. Please keep your answer as concise as possible, including only the words which answer the question.
Q: {question}
A:

Example ground-truth context: "in october 2014, it was announced that beyonce with her management company parkwood entertainment would be partnering with london-based fashion retailer ..."
Example question (with a WCN confusion): "what type of closing/clothing does parkwood topshop athletic ltd produce?"
The transcript source for {question} is one of {ground truth, n-best oracle, 1-best, WCN}.
LLM input (1-shot)
1-shot prompt with WCN instruction:

You will be given sets of contexts and questions, and you are expected to generate an answer to the question. The question comes from a speech recognition system, and the | symbol indicates potential options for each word. Please keep your response as concise as possible, including only the words which answer the question.

Context: {ic_c}
Question: {ic_q}
Answer: {ic_a}

Context: {context}
Question: {question}
Answer:

The transcript source ({ground truth, n-best oracle, 1-best, WCN}) is matched between the in-context question {ic_q} and the test question {question}.
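A plausible harness for filling the 1-shot template and querying gpt-3.5-turbo is sketched below. The slides specify only the prompt text; the use of the openai client and the decoding parameters are assumptions.

```python
from openai import OpenAI  # assumes the openai package and an API key are configured

ONE_SHOT_TEMPLATE = (
    "You will be given sets of contexts and questions, and you are expected to "
    "generate an answer to the question. The question comes from a speech "
    "recognition system, and the | symbol indicates potential options for each "
    "word. Please keep your response as concise as possible, including only the "
    "words which answer the question.\n\n"
    "Context: {ic_c}\nQuestion: {ic_q}\nAnswer: {ic_a}\n\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)

client = OpenAI()

def answer(context, question, ic_c, ic_q, ic_a):
    prompt = ONE_SHOT_TEMPLATE.format(
        ic_c=ic_c, ic_q=ic_q, ic_a=ic_a, context=context, question=question
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-ish decoding; an assumption
    )
    return response.choices[0].message.content.strip()
```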
Zero-shot: baseline/oracle results (NMSQA dev set)

Transcript source   bloomz-3b F1 / EM   ChatGPT F1 / EM
Ground truth        67.1 / 35.6         71.7 / 42.5
n-best oracle       61.5 / 33.7         68.4 / 40.4
1-best              58.4 / 32.5         64.5 / 36.5

- Ground-truth transcripts are the absolute upper bound, but we consider n-best oracle transcripts a more realistic upper bound for a fixed ASR system.
- 1-best transcripts serve as the baseline.
- Models used: bloomz-3b, a decoder-only model pretrained on multilingual natural language and programming languages, plus supervised fine-tuning on task-oriented datasets; and ChatGPT (gpt-3.5-turbo).
Zero-shot results (NMSQA dev set)

Transcript source           bloomz-3b F1 / EM   ChatGPT F1 / EM
Ground truth                67.1 / 35.6         71.7 / 42.5
n-best oracle               61.5 / 33.7         68.4 / 40.4
1-best                      58.4 / 32.5         64.5 / 36.5
WCN (| separator)           53.6 / 31.5         57.4 / 31.0
WCN (| separator, p ≥ 0.3)  56.4 / 31.0         63.2 / 35.3

- With both models, 1-best transcripts yield better results than WCNs.
- Posterior filtering of WCNs helps, but not enough.
1-shot results (NMSQA dev set, ChatGPT)

Transcript source           F1     Exact match
Ground truth                76.2   47.9
n-best oracle               72.4   45.1
1-best                      70.2   43.0
WCN (| separator)           69.6   44.1
WCN (| separator, p ≥ 0.3)  70.5   44.7

- ChatGPT results improve across the board: compared to the zero-shot results above, +4-7% (abs) for non-confusion-network transcripts and +9-18% (abs) for confusion-network transcripts.
- On exact match, WCNs close 52% of the gap between 1-best and n-best oracle ((44.1 − 43.0)/(45.1 − 43.0) ≈ 0.52); pruned WCNs close 81% ((44.7 − 43.0)/(45.1 − 43.0) ≈ 0.81).
- Posterior filtering still helps.
Cases where confusion networks help
E.g., when WCNs recover words that are missing from the 1-best. Question variants and ChatGPT answers:
- Ground truth: "which article allows the european council to govern mergers between firms?" → article one hundred two
- 1-best: "which article allows the european council to govern murders between fern?" → article one hundred thirty nine
- WCN: "which article allows the european council to govern murders|merders|merjurs between fern|fir|f|firm|far?" → article one hundred two
Cases where confusion networks hurt
E.g., when the 1-best hypothesis is close to the reference (the 1-best hypothesis equals the n-best oracle in 83% of questions):
- Ground truth: "what are the eight comb rows on the outer surface called?" → swimming plates
- 1-best: "what are the eight comb rows on the outer surface?" → swimming plates
- WCN: "what are the eight|a calm|comb rows|roads on the outer surface?" → eight comb rows
Cases where filtering by posteriors helps
E.g., when the raw WCN has many word alternatives:
- Ground truth: "when did tesla go to tomingaj?" → in eighteen seventy four
- 1-best: "when did tessa go to mean guards?" → there is no information provided about tesla going to mean guards in the given context
- WCN: "when did tessler|tessa go to mean menage|manguard|engage|mingart|menguot|manguards|guards?" → in tomingaj
- WCN (p ≥ 0.3): "when did tessler|tessa go to mean menage?" → in eighteen seventy four
Prompt engineering: WCN instruction
Does including the confusion-network instruction ("the | symbol indicates potential options for each word") in the 0-shot prompt matter? NMSQA dev set performance (ChatGPT, WCN p ≥ 0.3 transcripts):

CN instruction in prompt?   F1     Exact match
No                          57.4   26.8
Yes                         63.2   35.3

Providing confusion-network instructions in the prompt greatly improves results.
Prompt engineering: in-context examples
Providing an in-context example in the prompt (as in the 1-shot template above) could help via:
1. Demonstration of the QA task
2. Demonstration of the confusion-network notation
Using a ground-truth transcript for the in-context example provides (1); using a WCN provides both (1) and (2).
Prompt engineering: in-context examples (NMSQA dev set, ChatGPT, all 1-shot)

1-shot transcript type   Transcript source   F1     Exact match
1-best                   1-best              70.2   43.0
Ground truth             WCN (p ≥ 0.3)       70.2   44.8
WCN (p ≥ 0.3)            WCN (p ≥ 0.3)       70.5   44.7

- For F1, using a WCN in the in-context example is helpful.
- For exact match, using either ground truth or a WCN in the in-context example produces similar results.
Conclusions
- Strong LLMs can process ASR confusion networks as inputs for downstream tasks.
- Recipe: one-shot instruction by example is necessary to improve upon 1-best results; use WCN instructions in the prompt; prune WCNs based on word posteriors.
- It's hard to beat 1-best ASR output at low error rates, but error recovery occurs at intermediate-to-high error rates.
Thank you!
Downstream effects of ASR accuracy
[Figure: ChatGPT 1-shot exact match (%) by WER bin; ticks indicate bin upper bounds.]
- Below ~70% WER: ground truth > n-best oracle > WCN > 1-best.
- Among very-high-WER examples, confusion-network transcripts do the worst.
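A hypothetical sketch of the binned analysis behind this figure: compute each example's 1-best WER with the jiwer package (an assumption; the slides don't say which WER tool was used), bucket examples by WER, and average exact match per bucket.

```python
from collections import defaultdict
import jiwer

def em_by_wer_bin(examples, bin_edges=(0.0, 0.25, 0.5, 0.7, 1.0)):
    """examples: iterable of (reference, hypothesis, exact_match) triples.
    bin_edges are bin upper bounds, mirroring the figure's tick labels."""
    buckets = defaultdict(list)
    for ref, hyp, em in examples:
        wer = jiwer.wer(ref, hyp)
        # Place the example in the first bin whose upper bound covers its WER;
        # WER can exceed 1.0 with insertions, so fall back to the last bin.
        upper = next((b for b in bin_edges if wer <= b), bin_edges[-1])
        buckets[upper].append(em)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```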
Examples of varying WER: 0% WER
- Ground truth: "what do these teachers not do?" → teach by rote
- 1-best: "what do these teachers not do?" → teach by rote
- WCN: "what do these teacher knocked|knock to|due?" → knock
Confusion nets have the potential to add noise to a perfect 1-best, and in some cases can even delete words.
Examples of varying WER: intermediate (~25%) WER
- Ground truth: "when was his article published in century magazine?" → nineteen hundred
- 1-best: "one was his article published in century magaz?" → the problem of increasing human energy
- WCN: "one|when was his article published in century|sanctary magazine|magaz?" → nineteen hundred
Confusion networks can recover words that were deleted in the 1-best and present them as options.
Examples of varying WER: very high WER
- Ground truth: "who did denver beat in the afc championship?" → the new england patriots
- 1-best: "who did them for beating the a f sea champions?" → the new england patriots
- WCN: "who did them for beating in the a|aye f i see|sea champions|championship?" → the arizona cardinals
In very-high-WER scenarios, the confusion networks can be full of options (even after posterior filtering), which leads to excessive noise.