Enhancing Spoken Language Understanding with Word Confusion Networks
Explore the integration of word confusion networks into large language model inputs to improve spoken language understanding in the face of ASR errors and transcription ambiguities. The work leverages ASR lattices for richer input representations and investigates how performance varies with model size, input representation, and prompt engineering.
Presentation Transcript
Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks
Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke
Motivation
- Large language models yield increasingly strong results on a wide array of natural language understanding (NLU) tasks.
- However, many applications need spoken language understanding (SLU).
- Baseline approach for applying NLU techniques to SLU: speech → ASR → (noisy, 1-best) transcript → LLM → generation. For example, the utterance "what is the opposite of nuclear fission?" may be recognized as "what is the opposite of nuclear fishing?".
- LLM output is sensitive to such ASR errors.
- Lattices from ASR output contain rich information about transcript candidates and potential word confusions. Can we represent these lattices in LLM input?
An example (from ChatGPT-3.5)
With the correct transcript, "what is the opposite of nuclear fission?", the model answers: "The opposite of nuclear fission is nuclear fusion..."
With the ASR error, "what is the opposite of nuclear fishing?", it answers: "The term nuclear fishing is not a widely recognized or commonly used term. Therefore, there isn't a widely accepted term for it. However, if we break down the term nuclear fishing into its components, we can explore potential opposites..."
An example (from ChatGPT-3.5), continued
A simple representation of the ASR confusion, "what is the opposite of nuclear fission/fishing?", recovers the intended answer: "The term you might be looking for is nuclear fusion. While nuclear fission involves splitting atomic nuclei to release energy, nuclear fusion is the process of combining atomic nuclei to form a heavier nucleus and release even greater amounts of energy."
An example (from a smaller model)
But encoding alternatives doesn't always work for smaller models: given "what is the opposite of nuclear fission/fishing?", bloomz-560m answers "nuclear weapons".
Research questions
- How well do LLMs deal with ASR ambiguity?
- How can we improve performance (without fine-tuning)? Input representations; prompt engineering.
- How does performance vary with model size?
Lattices vs. word confusion networks
[Figure: a full ASR lattice alongside the corresponding word confusion network (WCN), which collapses lattice arcs into a linear sequence of word slots.]
Extracting lattices/WCNs
- Extract n-best (n = 10) hypotheses using an off-the-shelf Emformer-based RNN-T model (https://pytorch.org/audio/main/generated/torchaudio.models.emformer_rnnt_model).
- Convert n-best lists → lattices → WCNs using Kaldi [1].

Example WCN (one slot per position, candidate words with posteriors):
  where 0.6 | were 0.3 | wear 0.1
  was 0.8 | with 0.1 | a 0.1
  super 0.7 | <eps> 0.3
  bowl 0.7 | superbowl 0.3
  fifty 1.0
  held 1.0

[1] Povey et al., "The Kaldi Speech Recognition Toolkit," ASRU 2011.
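As an illustration, n-best decoding with the pretrained torchaudio Emformer RNN-T pipeline might look like the sketch below. The bundle name, the input file, and the hypothesis tuple layout follow the public torchaudio API but are assumptions, not the authors' code; the Kaldi n-best-to-WCN alignment step is not shown.

```python
import torchaudio

# Pretrained Emformer RNN-T pipeline (assumed bundle; the slides only say
# "off-the-shelf Emformer-based RNN-T model").
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
feature_extractor = bundle.get_feature_extractor()
decoder = bundle.get_decoder()          # RNNTBeamSearch
token_processor = bundle.get_token_processor()

waveform, sample_rate = torchaudio.load("question.wav")  # hypothetical file
features, length = feature_extractor(waveform.squeeze(0))

# Beam search returns a list of hypotheses; beam_width=10 plays the role
# of the paper's n-best (n=10) list.
hypotheses = decoder(features, length, beam_width=10)
for hyp in hypotheses:
    # In torchaudio, a hypothesis is a tuple: index 0 holds token IDs,
    # index 3 the hypothesis score.
    print(token_processor(hyp[0]), hyp[3])
```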
From WCNs to LLM inputs
Reference: "where was super bowl fifty held"
Join each slot's word options with "/" or "|", optionally pruning words below a posterior threshold (a minimal sketch follows):
- p ≥ 0:   where|were|wear was|with|a super bowl|superbowl fifty held
- p ≥ 0.3: where|were was super bowl|superbowl fifty held
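A minimal sketch of this joining-plus-pruning step, assuming the WCN is represented as a list of slots mapping candidate words to posteriors (the slot data structure and the <eps> handling are illustrative, not the paper's code):

```python
# WCN for the reference "where was super bowl fifty held", taken from the
# slide's example; "<eps>" marks the empty word (a skippable slot).
wcn = [
    {"where": 0.6, "were": 0.3, "wear": 0.1},
    {"was": 0.8, "with": 0.1, "a": 0.1},
    {"super": 0.7, "<eps>": 0.3},
    {"bowl": 0.7, "superbowl": 0.3},
    {"fifty": 1.0},
    {"held": 1.0},
]

def wcn_to_text(wcn, threshold=0.0, sep="|"):
    """Join each slot's surviving words with `sep`, dropping low-posterior
    words and the empty word; slots that prune to nothing are skipped."""
    slots = []
    for slot in wcn:
        words = [w for w, p in sorted(slot.items(), key=lambda kv: -kv[1])
                 if p >= threshold and w != "<eps>"]
        if words:
            slots.append(sep.join(words))
    return " ".join(slots)

print(wcn_to_text(wcn))                 # where|were|wear was|with|a super bowl|superbowl fifty held
print(wcn_to_text(wcn, threshold=0.3))  # where|were was super bowl|superbowl fifty held
```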
NMSQA dataset [2]
- Originally developed for extractive spoken question answering: SQuAD v1.1 data (contexts with questions and answers) was passed through a TTS system to produce spoken train and dev sets.
- Our application: document-grounded QA with transcribed speech questions.
- Evaluation metrics: unigram F1, span exact match (sketched below).
- WER: 22.6% (1-best), 15.9% (n-best oracle with n = 10).

[2] Lin et al., "DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering," Interspeech 2022. https://arxiv.org/pdf/2203.04911
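For reference, SQuAD-style unigram F1 and exact match can be computed roughly as follows; this is a sketch, and the paper's exact text normalization may differ (e.g., SQuAD also strips articles).

```python
import re
from collections import Counter

def normalize(text):
    # Lowercase and strip punctuation before tokenizing on whitespace.
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

def unigram_f1(prediction, reference):
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))
```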
LLM input (0-shot)
0-shot prompt with WCN instruction:

{context}
Based on the above context, answer the following question. The question comes from a speech recognition system, and the | symbol indicates potential options for each word. Please keep your answer as concise as possible, including only the words which answer the question.
Q: {question}
A:

Example ground-truth context: "in october 2014, it was announced that beyonce with her management company parkwood entertainment would be partnering with london-based fashion retailer ..."
Example question (with a WCN confusion): "what type of closing/clothing does parkwood topshop athletic ltd produce?"
The transcript source for {question} is one of {ground truth, n-best oracle, 1-best, WCN}.
LLM input (1-shot)
1-shot prompt with WCN instruction:

You will be given sets of contexts and questions, and you are expected to generate an answer to the question. The question comes from a speech recognition system, and the | symbol indicates potential options for each word. Please keep your response as concise as possible, including only the words which answer the question.

Context: {ic_c}
Question: {ic_q}
Answer: {ic_a}

Context: {context}
Question: {question}
Answer:

The transcript source ({ground truth, n-best oracle, 1-best, WCN}) is matched between the in-context question {ic_q} and the test question {question}.
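A plausible harness for filling the 1-shot template and querying gpt-3.5-turbo is sketched below. The slides specify only the prompt text; the use of the openai client and the decoding parameters are assumptions.

```python
from openai import OpenAI  # assumes the openai package and an API key are configured

ONE_SHOT_TEMPLATE = (
    "You will be given sets of contexts and questions, and you are expected to "
    "generate an answer to the question. The question comes from a speech "
    "recognition system, and the | symbol indicates potential options for each "
    "word. Please keep your response as concise as possible, including only the "
    "words which answer the question.\n\n"
    "Context: {ic_c}\nQuestion: {ic_q}\nAnswer: {ic_a}\n\n"
    "Context: {context}\nQuestion: {question}\nAnswer:"
)

client = OpenAI()

def answer(context, question, ic_c, ic_q, ic_a):
    prompt = ONE_SHOT_TEMPLATE.format(
        ic_c=ic_c, ic_q=ic_q, ic_a=ic_a, context=context, question=question
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-ish decoding; an assumption
    )
    return response.choices[0].message.content.strip()
```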
Zero-shot: baseline/oracle results (NMSQA dev set)

Transcript source   bloomz-3b F1 / EM   ChatGPT F1 / EM
Ground truth        67.1 / 35.6         71.7 / 42.5
n-best oracle       61.5 / 33.7         68.4 / 40.4
1-best              58.4 / 32.5         64.5 / 36.5

- Ground-truth transcripts are the absolute upper bound, but we consider n-best oracle transcripts a more realistic upper bound for a fixed ASR system.
- 1-best transcripts serve as the baseline.
- Models used: bloomz-3b, a decoder-only model pretrained on multilingual natural language and programming languages, plus supervised fine-tuning on task-oriented datasets; and ChatGPT (gpt-3.5-turbo).
Zero-shot results (NMSQA dev set)

Transcript source           bloomz-3b F1 / EM   ChatGPT F1 / EM
Ground truth                67.1 / 35.6         71.7 / 42.5
n-best oracle               61.5 / 33.7         68.4 / 40.4
1-best                      58.4 / 32.5         64.5 / 36.5
WCN (| separator)           53.6 / 31.5         57.4 / 31.0
WCN (| separator, p ≥ 0.3)  56.4 / 31.0         63.2 / 35.3

- With both models, 1-best transcripts yield better results than WCNs.
- Posterior filtering of WCNs helps, but not enough.
1-shot results (NMSQA dev set, ChatGPT)

Transcript source           F1     Exact match
Ground truth                76.2   47.9
n-best oracle               72.4   45.1
1-best                      70.2   43.0
WCN (| separator)           69.6   44.1
WCN (| separator, p ≥ 0.3)  70.5   44.7

- ChatGPT results improve across the board: compared to the zero-shot results above, +4-7% (abs) for non-confusion-network transcripts and +9-18% (abs) for confusion-network transcripts.
- On exact match, WCNs close 52% of the gap between 1-best and n-best oracle ((44.1 − 43.0)/(45.1 − 43.0) ≈ 0.52); pruned WCNs close 81% ((44.7 − 43.0)/(45.1 − 43.0) ≈ 0.81).
- Posterior filtering still helps.
Cases where confusion networks help
E.g., when WCNs recover words that are missing from the 1-best. Question variants and ChatGPT answers:
- Ground truth: "which article allows the european council to govern mergers between firms?" → article one hundred two
- 1-best: "which article allows the european council to govern murders between fern?" → article one hundred thirty nine
- WCN: "which article allows the european council to govern murders|merders|merjurs between fern|fir|f|firm|far?" → article one hundred two
Cases where confusion networks hurt
E.g., when the 1-best hypothesis is close to the reference (the 1-best hypothesis equals the n-best oracle in 83% of questions):
- Ground truth: "what are the eight comb rows on the outer surface called?" → swimming plates
- 1-best: "what are the eight comb rows on the outer surface?" → swimming plates
- WCN: "what are the eight|a calm|comb rows|roads on the outer surface?" → eight comb rows
Cases where filtering by posteriors helps
E.g., when the raw WCN has many word alternatives:
- Ground truth: "when did tesla go to tomingaj?" → in eighteen seventy four
- 1-best: "when did tessa go to mean guards?" → there is no information provided about tesla going to mean guards in the given context
- WCN: "when did tessler|tessa go to mean menage|manguard|engage|mingart|menguot|manguards|guards?" → in tomingaj
- WCN (p ≥ 0.3): "when did tessler|tessa go to mean menage?" → in eighteen seventy four
Prompt engineering: WCN instruction
Does including the confusion-network instruction ("the | symbol indicates potential options for each word") in the 0-shot prompt matter? NMSQA dev set performance (ChatGPT, WCN p ≥ 0.3 transcripts):

CN instruction in prompt?   F1     Exact match
No                          57.4   26.8
Yes                         63.2   35.3

Providing confusion-network instructions in the prompt greatly improves results.
Prompt engineering: in-context examples
Providing an in-context example in the prompt (as in the 1-shot template above) could help via:
1. Demonstration of the QA task
2. Demonstration of the confusion-network notation
Using a ground-truth transcript for the in-context example provides (1); using a WCN provides both (1) and (2).
Prompt engineering: in-context examples (NMSQA dev set, ChatGPT, all 1-shot)

1-shot transcript type   Transcript source   F1     Exact match
1-best                   1-best              70.2   43.0
Ground truth             WCN (p ≥ 0.3)       70.2   44.8
WCN (p ≥ 0.3)            WCN (p ≥ 0.3)       70.5   44.7

- For F1, using a WCN in the in-context example is helpful.
- For exact match, using either ground truth or a WCN in the in-context example produces similar results.
Conclusions
- Strong LLMs can process ASR confusion networks as inputs for downstream tasks.
- Recipe: one-shot instruction by example is necessary to improve upon 1-best results; use WCN instructions in the prompt; prune WCNs based on word posteriors.
- It's hard to beat 1-best ASR output at low error rates, but error recovery occurs at intermediate-to-high error rates.
Thank you!
Downstream effects of ASR accuracy
[Figure: ChatGPT 1-shot exact match (%) by WER bin; ticks indicate bin upper bounds.]
- Below ~70% WER: ground truth > n-best oracle > WCN > 1-best.
- Among very-high-WER examples, confusion-network transcripts do the worst.
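A hypothetical sketch of the binned analysis behind this figure: compute each example's 1-best WER with the jiwer package (an assumption; the slides don't say which WER tool was used), bucket examples by WER, and average exact match per bucket.

```python
from collections import defaultdict
import jiwer

def em_by_wer_bin(examples, bin_edges=(0.0, 0.25, 0.5, 0.7, 1.0)):
    """examples: iterable of (reference, hypothesis, exact_match) triples.
    bin_edges are bin upper bounds, mirroring the figure's tick labels."""
    buckets = defaultdict(list)
    for ref, hyp, em in examples:
        wer = jiwer.wer(ref, hyp)
        # Place the example in the first bin whose upper bound covers its WER;
        # WER can exceed 1.0 with insertions, so fall back to the last bin.
        upper = next((b for b in bin_edges if wer <= b), bin_edges[-1])
        buckets[upper].append(em)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```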
Examples of varying WER: 0% WER
- Ground truth: "what do these teachers not do?" → teach by rote
- 1-best: "what do these teachers not do?" → teach by rote
- WCN: "what do these teacher knocked|knock to|due?" → knock
Confusion nets have the potential to add noise to a perfect 1-best, and in some cases can even delete words.
Examples of varying WER: intermediate (~25%) WER
- Ground truth: "when was his article published in century magazine?" → nineteen hundred
- 1-best: "one was his article published in century magaz?" → the problem of increasing human energy
- WCN: "one|when was his article published in century|sanctary magazine|magaz?" → nineteen hundred
Confusion networks can recover words that were deleted in the 1-best and present them as options.
Examples of varying WER: very high WER
- Ground truth: "who did denver beat in the afc championship?" → the new england patriots
- 1-best: "who did them for beating the a f sea champions?" → the new england patriots
- WCN: "who did them for beating in the a|aye f i see|sea champions|championship?" → the arizona cardinals
In very-high-WER scenarios, the confusion networks can be full of options (even after posterior filtering), which leads to excessive noise.