Enhancing Spoken Language Understanding with Word Confusion Networks

 
Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke

Motivation
 
Large language models yield increasingly strong results on a wide array of natural language understanding (NLU) tasks.
However, many applications need spoken language understanding (SLU).
Baseline approach for applying NLU techniques to SLU: speech → ASR → (noisy, 1-best) ASR transcript → LLM → model generation.
[Figure: pipeline diagram in which the spoken query "what is the opposite of nuclear fission?" is recognized as "what is the opposite of nuclear fishing?" before reaching the LLM, leaving the model's output in question.]
LLM output is sensitive to ASR errors.
Lattices from ASR output contain rich information about transcript candidates and potential word confusions. Can we represent these lattices in LLM input?
An example (from ChatGPT-3.5)

Correct transcript: "what is the opposite of nuclear fission?"
Response: "The opposite of nuclear fission is nuclear fusion..."

With ASR error: "what is the opposite of nuclear fishing?"
Response: "The term 'nuclear fishing' is not a widely recognized or commonly used term. Therefore, there isn't a widely accepted term for it. However, if we break down the term 'nuclear fishing' into its components, we can explore potential opposites…"
An example (from ChatGPT-3.5)

Simple representation of ASR confusion: "what is the opposite of nuclear fission/fishing?"
Response: "The term you might be looking for is 'nuclear fusion.' While nuclear fission involves splitting atomic nuclei to release energy, nuclear fusion is the process of combining atomic nuclei to form a heavier nucleus and release even greater amounts of energy…"
An example (from smaller model)

But encoding alternatives doesn't always work for smaller models: given "what is the opposite of nuclear fission/fishing?", bloomz-560m answers "nuclear weapons".
 
Research questions
 
How well do LLMs deal with ASR ambiguity?
How can we improve performance (without fine-tuning)? Input representations; prompt engineering.
How does performance vary with model size?
Lattices vs. word confusion networks

[Figure: an ASR lattice (a directed graph of word hypotheses) shown next to the corresponding word confusion network (WCN), which flattens the lattice into a linear sequence of word-alternative slots.]
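Since the original figure is not reproduced here, the following minimal Python sketch (ours, not the deck's) shows one way the two structures might be held in memory; the words and posteriors mirror the toy example used on the next two slides.

```python
# A lattice is a directed graph whose arcs carry words and probabilities;
# different paths through it may have different lengths.
lattice_arcs = [
    # (from_state, to_state, word, probability)
    (0, 1, "where", 0.6), (0, 1, "were", 0.3), (0, 1, "wear", 0.1),
    (1, 2, "was", 0.8),   (1, 2, "with", 0.1), (1, 2, "a", 0.1),
    (2, 3, "super", 0.7), (2, 4, "superbowl", 0.3),   # paths of unequal length
    (3, 4, "bowl", 1.0),
    (4, 5, "fifty", 1.0), (5, 6, "held", 1.0),
]

# A word confusion network flattens the lattice into a linear sequence of
# slots; every path passes through every slot, and "<eps>" marks paths that
# carry no word at that position.
wcn = [
    {"where": 0.6, "were": 0.3, "wear": 0.1},
    {"was": 0.8, "with": 0.1, "a": 0.1},
    {"super": 0.7, "<eps>": 0.3},
    {"bowl": 0.7, "superbowl": 0.3},
    {"fifty": 1.0},
    {"held": 1.0},
]
```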
 
Extracting lattices/WCNs

Extract n-best (n=10) hypotheses using an off-the-shelf Emformer-based RNN-T model¹.
Convert n-best hypotheses → lattices → WCNs [1].

Example WCN for the reference "where was super bowl fifty held" (alternatives per slot, with posteriors):
where 0.6 / were 0.3 / wear 0.1
was 0.8 / with 0.1 / a 0.1
super 0.7 / <eps> 0.3
bowl 0.7 / superbowl 0.3
fifty 1.0
held 1.0

¹ https://pytorch.org/audio/main/generated/torchaudio.models.emformer_rnnt_model
[1] Povey et al., "The Kaldi speech recognition toolkit," ASRU 2011.
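The deck's pipeline converts n-best lists to lattices and then to confusion networks with Kaldi tools [1]. As a rough stand-in, the sketch below aligns each hypothesis to the top-1 hypothesis with Python's difflib and accumulates word posteriors per slot; the n-best posteriors in the example are made up for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def nbest_to_wcn(nbest):
    """nbest: list of (word_list, posterior) pairs, best hypothesis first.
    Returns one {word: posterior} dict per slot of the top hypothesis."""
    top_words, _ = nbest[0]
    slots = [defaultdict(float) for _ in top_words]
    for words, post in nbest:
        ops = SequenceMatcher(a=top_words, b=words, autojunk=False).get_opcodes()
        for tag, i1, i2, j1, j2 in ops:
            if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
                for k in range(i2 - i1):
                    slots[i1 + k][words[j1 + k]] += post
            elif tag == "delete":
                for i in range(i1, i2):                 # hypothesis skips these words
                    slots[i]["<eps>"] += post
            # insertions relative to the top-1, and unequal replacements,
            # are ignored in this simplification
    # normalize each slot so its alternatives sum to 1
    return [{w: p / sum(s.values()) for w, p in s.items()} for s in slots]

# Hypothetical n-best list with made-up posteriors:
nbest = [
    ("where was super bowl fifty held".split(), 0.5),
    ("were was superbowl fifty held".split(), 0.3),
    ("wear with super bowl fifty held".split(), 0.2),
]
for slot in nbest_to_wcn(nbest):
    print(slot)
```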
From WCNs to LLM inputs

Ref: where was super bowl fifty held
Join word options with "/" or "|":
p ≥ 0:   where|were|wear  was|with|a  super  bowl|superbowl  fifty  held
p ≥ 0.3: where|were  was  super  bowl|superbowl  fifty  held
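A minimal sketch, assuming the WCN is a list of {word: posterior} slots as above: render it as the "|"-joined LLM input, dropping alternatives whose posterior falls below a threshold and hiding epsilon (empty-word) arcs.

```python
def wcn_to_prompt_text(wcn, threshold=0.0, sep="|"):
    tokens = []
    for slot in wcn:
        words = [w for w, p in sorted(slot.items(), key=lambda kv: -kv[1])
                 if p >= threshold and w != "<eps>"]
        if words:                      # a slot may be pruned away entirely
            tokens.append(sep.join(words))
    return " ".join(tokens)

wcn = [{"where": 0.6, "were": 0.3, "wear": 0.1},
       {"was": 0.8, "with": 0.1, "a": 0.1},
       {"super": 0.7, "<eps>": 0.3},
       {"bowl": 0.7, "superbowl": 0.3},
       {"fifty": 1.0},
       {"held": 1.0}]
print(wcn_to_prompt_text(wcn))                 # where|were|wear was|with|a super bowl|superbowl fifty held
print(wcn_to_prompt_text(wcn, threshold=0.3))  # where|were was super bowl|superbowl fifty held
```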
NMSQA dataset [2]

Originally developed for extractive spoken question answering: SQuAD v1.1 data (contexts with questions and answers) was passed through a TTS system to produce the spoken train and dev sets.
Our application: document-grounded QA with transcribed speech questions.
Evaluation metrics: unigram F1, span exact match.
WER: 22.6% (1-best), 15.9% (n-best oracle with n=10).

[2] Lin et al., "DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering," Interspeech 2022. https://arxiv.org/pdf/2203.04911
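A minimal sketch (not the authors' evaluation script) of the two reported metrics: unigram F1 between predicted and reference answers, and span exact match. Only trivial normalization is shown here; the standard SQuAD script additionally strips punctuation and articles.

```python
from collections import Counter

def normalize(text):
    return " ".join(text.lower().split())

def unigram_f1(prediction, reference):
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

print(unigram_f1("nuclear fusion", "nuclear fusion"))         # 1.0
print(unigram_f1("it is nuclear fusion", "nuclear fusion"))   # ~0.67
print(exact_match("it is nuclear fusion", "nuclear fusion"))  # 0.0
```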
LLM input (0-shot)

0-shot prompt with WCN instruction:

{context} Based on the above context, answer the following question. The question comes from a speech recognition system, and the | symbol indicates potential options for each word. Please keep your answer as concise as possible, including only the words which answer the question. Q: {question} A:

Example ground-truth context: "in october 2014, it was announced that beyonce with her management company parkwood entertainment would be partnering with london-based fashion retailer…"
Example question (WCN rendering): "what type of closing/clothing does parkwood topshop athletic ltd produce?"
The question is rendered from one of four transcript sources: ground truth, n-best oracle, 1-best, or WCN.
LLM input (1-shot)

1-shot prompt with WCN instruction; the in-context example (ic_c, ic_q, ic_a) uses the same transcript rendering as the test question:

You will be given sets of contexts and questions, and you are expected to generate an answer to the question. The question comes from a speech recognition system, and the | symbol indicates potential options for each word. Please keep your response as concise as possible, including only the words which answer the question.
Context: {ic_c} Question: {ic_q} Answer: {ic_a}
Context: {context} Question: {question} Answer:

Example context and question are the same as on the previous slide.
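A minimal sketch of assembling the 1-shot prompt above; the template strings are taken from the slides, and the placeholder names (ic_c, ic_q, ic_a) mirror the slide's notation, while the example values are hypothetical.

```python
WCN_INSTRUCTION = (
    "You will be given sets of contexts and questions, and you are expected to "
    "generate an answer to the question. The question comes from a speech "
    "recognition system, and the | symbol indicates potential options for each "
    "word. Please keep your response as concise as possible, including only the "
    "words which answer the question."
)

def build_one_shot_prompt(context, question, ic_c, ic_q, ic_a):
    return (
        f"{WCN_INSTRUCTION}\n\n"
        f"Context: {ic_c}\nQuestion: {ic_q}\nAnswer: {ic_a}\n\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )

prompt = build_one_shot_prompt(
    context="in october 2014, it was announced that beyonce ...",
    question="what type of closing|clothing does parkwood topshop athletic ltd produce?",
    ic_c="<in-context passage>",
    ic_q="<in-context question, rendered the same way as the test question>",
    ic_a="<in-context answer>",
)
print(prompt)
```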
 
Zero-shot: baseline/oracle results

NMSQA dev set performance (F1 / exact match):

Transcript source   bloomz-3b      ChatGPT
Ground truth        67.1 / 35.6    71.7 / 42.5
n-best oracle       61.5 / 33.7    68.4 / 40.4
1-best              58.4 / 32.5    64.5 / 36.5

Ground truth transcripts are the absolute upper bound, but we consider n-best oracle transcripts a more realistic upper bound for a fixed ASR system.
Using 1-best transcripts serves as the baseline.
 
Models used:
bloomz-3b: decoder-only model, pretrained on multilingual natural language and programming languages, plus supervised fine-tuning on task-oriented datasets.
ChatGPT (gpt-3.5-turbo).
 
Zero-shot results

NMSQA dev set performance (F1 / exact match); WCNs use the "|" separator:

Transcript source   bloomz-3b      ChatGPT
Ground truth        67.1 / 35.6    71.7 / 42.5
n-best oracle       61.5 / 33.7    68.4 / 40.4
1-best              58.4 / 32.5    64.5 / 36.5
| (WCN)             53.6 / 31.5    57.4 / 31.0
| (p ≥ 0.3)         56.4 / 31.0    63.2 / 35.3

With both models, using 1-best transcripts yields better results than using WCNs.
 
Posterior filtering of WCNs (p ≥ 0.3) helps, but not enough to close the gap to 1-best.
1-shot results

NMSQA dev set performance (ChatGPT, F1 / exact match); WCNs use the "|" separator:

Transcript source   F1     Exact match
Ground truth        76.2   47.9
n-best oracle       72.4   45.1
1-best              70.2   43.0
| (WCN)             69.6   44.1
| (p ≥ 0.3)         70.5   44.7

ChatGPT results improve across the board compared to the zero-shot results above: 4-7% (abs) for non-confusion-network transcripts, 9-18% (abs) for confusion-network transcripts.
Posterior filtering still helps.
Raw WCNs closed 52% of the exact-match gap between 1-best and n-best oracle.
 
Posterior-filtered WCNs (p ≥ 0.3) closed 81% of the exact-match gap between 1-best and n-best oracle.
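The "gap closed" callouts are consistent with the exact-match column above: gap closed = (EM_WCN − EM_1-best) / (EM_n-best-oracle − EM_1-best), i.e. (44.1 − 43.0) / (45.1 − 43.0) ≈ 52% for raw WCNs and (44.7 − 43.0) / (45.1 − 43.0) ≈ 81% for posterior-filtered WCNs.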
 
Cases where confusion networks help

E.g. when WCNs recover words that are missing from the 1-best:

Ground truth: "which article allows the european council to govern mergers between firms?" → "article one hundred two"
1-best: "which article allows the european council to govern murders between fern?" → "article one hundred thirty nine"
WCN (|): "which article allows the european council to govern murders|merders|merjurs between fern|fir|f|firm|far?" → "article one hundred two"
 
Cases where confusion networks hurt

E.g. when the 1-best hypothesis is close to the reference (the 1-best hypothesis equals the n-best oracle in 83% of questions):

Ground truth: "what are the eight comb rows on the outer surface called?" → "swimming plates"
1-best: "what are the eight comb rows on the outer surface?" → "swimming plates"
WCN (|): "what are the eight|a calm|comb rows|roads on the outer surface?" → "eight comb rows"
 
Cases where filtering by posteriors helps

E.g. when the raw WCN has many word alternatives:

Ground truth: "when did tesla go to tomingaj?" → "in eighteen seventy four"
1-best: "when did tessa go to mean guards?" → "there is no information provided about tesla going to mean guards in the given context"
WCN (|): "when did tessler|tessa go to mean menage|manguard|engage|mingart|menguot|manguards|guards?" → "in tomingaj"
WCN (|, p ≥ 0.3): "when did tessler|tessa go to mean menage?" → "in eighteen seventy four"
Prompt engineering: WCN instruction

NMSQA dev set performance (ChatGPT, zero-shot, transcript source | with p ≥ 0.3):

WCN instruction in prompt?   F1     Exact match
No                           57.4   26.8
Yes                          63.2   35.3

Providing confusion-network instructions in the prompt greatly improves results.
Prompt engineering: in-context examples

Providing an in-context example in the prompt could help via:
1. Demonstration of the QA task
2. Demonstration of the confusion-network notation

Using a ground-truth transcript in the example does (1); using a WCN does (1)+(2).
 
NMSQA dev set performance (ChatGPT, all 1-shot):

In-context example source   Test transcript source   F1     Exact match
1-best                      1-best                   70.2   43.0
Ground truth                | (p ≥ 0.3)              70.2   44.8
| (p ≥ 0.3)                 | (p ≥ 0.3)              70.5   44.7

For F1, using a WCN in the in-context example is helpful.
For exact match, using either ground truth or WCN in the in-context example produces similar results.
 
Conclusions

Strong LLMs can process ASR confusion networks as inputs for downstream tasks.
Recipe: one-shot instruction by example is necessary to improve upon 1-best results; use WCN instructions in the prompt; prune WCNs based on word posteriors.
It's hard to beat 1-best ASR output for low error rates, but error recovery occurs at intermediate-to-high error rates.
 
Thank you!
 
 
Downstream effects of ASR accuracy

[Figure: ChatGPT 1-shot EM% by WER bin (ticks indicate bin upper bounds).]
Below ~70% WER: ground truth > n-best oracle > | (WCN) > 1-best.
Among very-high-WER examples, confusion-network transcripts do the worst.
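A minimal sketch (not the authors' analysis code) of the kind of breakdown plotted in that figure: compute per-question WER against the reference transcript and average exact match within WER bins. The bin edges here are made up; the slide only notes that ticks mark bin upper bounds.

```python
from statistics import mean

def word_error_rate(hyp_words, ref_words):
    """Word-level Levenshtein distance divided by reference length."""
    d = list(range(len(ref_words) + 1))
    for i, hw in enumerate(hyp_words, start=1):
        prev_diag, d[0] = d[0], i
        for j, rw in enumerate(ref_words, start=1):
            cur = min(d[j] + 1,                    # hypothesis word left unmatched
                      d[j - 1] + 1,                # reference word left unmatched
                      prev_diag + (hw != rw))      # substitution (or exact match)
            prev_diag, d[j] = d[j], cur
    return d[-1] / max(len(ref_words), 1)

def em_by_wer_bin(examples, bin_edges=(0.1, 0.25, 0.5, 0.75, 1.0, float("inf"))):
    """examples: dicts with 'hyp', 'ref', and 'em' (0 or 1); bins by WER upper bound."""
    bins = {edge: [] for edge in bin_edges}
    for ex in examples:
        wer = word_error_rate(ex["hyp"].split(), ex["ref"].split())
        for edge in bin_edges:
            if wer <= edge:
                bins[edge].append(ex["em"])
                break
    return {edge: (mean(ems) if ems else None) for edge, ems in bins.items()}

print(em_by_wer_bin([
    {"hyp": "what do these teachers not do", "ref": "what do these teachers not do", "em": 1},
    {"hyp": "one was his article published in century magaz",
     "ref": "when was his article published in century magazine", "em": 0},
]))
```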
 
Examples of varying WER: 0% WER

Ground truth: "what do these teachers not do?" → "teach by rote"
1-best: "what do these teachers not do?" → "teach by rote"
WCN (|): "what do these teacher knocked|knock to|due?" → "knock"

Confusion nets have the potential to add noise to a perfect 1-best, and in some cases can even delete words.
 
Examples of varying WER: intermediate (~25%) WER

Ground truth: "when was his article published in century magazine?" → "nineteen hundred"
1-best: "one was his article published in century magaz?" → "the problem of increasing human energy"
WCN (|): "one|when was his article published in century|sanctary magazine|magaz?" → "nineteen hundred"

Confusion networks can recover words which were deleted in the 1-best and present them as options.
 
Examples of varying WER: very high WER

Ground truth: "who did denver beat in the afc championship?" → "the new england patriots"
1-best: "who did them for beating the a f sea champions?" → "the new england patriots"
WCN (|): "who did them for beating in the a|aye f i see|sea champions|championship?" → "the arizona cardinals"

In very high WER scenarios, the confusion networks can be full of options (even after posterior filtering), which leads to excessive noise.
 
ATIS intent classification dataset [3]

Contains spoken requests designed to be answered via a corresponding database.
Heavily skewed class distribution.
Using the off-the-shelf RNN-T ASR model: 1-best WER 8.40%, n-best oracle WER 5.31%.

[3] Hemphill et al., "The ATIS Spoken Language Systems Pilot Corpus," HLT 1990. https://aclanthology.org/H90-1021
 
ATIS intent classification zero-shot results (accuracy)

bloomz-560m: 1-best is best.
bloomz-3b: 1-best and confusion-network performance are similar.
ChatGPT: confusion network is best.