HyPoradise: Open Baseline for Generative Speech Recognition
Learn about HyPoradise, a dataset with 334K+ hypotheses-transcription pairs for speech recognition. Discover how large language models are used for error correction in both zero-shot and fine-tuning scenarios.
Presentation Transcript
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
Chen Chen, Yuchen Hu, Chao-Han Huck Yang, Sabato Marco Siniscalchi, Pin-Yu Chen, Eng Siong Chng
Nanyang Technological University, Singapore; Georgia Institute of Technology, USA; Norwegian University of Science and Technology, Norway; IBM Research AI, USA
https://arxiv.org/abs/2309.15701
Accepted to NeurIPS 2023, Datasets and Benchmarks Track
Introduction
The use of a language model (LM) in ASR helps compensate for gaps in linguistic knowledge, improving the accuracy of the recognition result. Two observations motivate looking beyond the 1-best hypothesis:
i) the discarded utterances (2nd- to N-th-best) often contain a better candidate with a lower word error rate (WER);
ii) the other discarded hypotheses can provide the right answer for the wrong tokens in the 1-best utterance.
HyPoradise
The dataset contains more than 334K hypotheses-transcription pairs collected from various ASR corpora covering the most common speech domains. Large language models (LLMs) are used for error correction in several scenarios, emulating the deployment of ASR systems in real-world settings (a data-loading sketch follows this list):
- Zero-shot learning
- Few-shot learning
- Fine-tuning
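As a rough illustration of how such pairs might be consumed, here is a minimal loading sketch; the JSON-lines layout and field names (`hypotheses`, `transcription`) are hypothetical placeholders, not the released schema.

```python
import json

def load_pairs(path):
    """Load hypotheses-transcription pairs from a JSON-lines file.

    Assumed (hypothetical) record layout:
        {"hypotheses": ["n-best candidate 1", ...], "transcription": "ground truth"}
    """
    pairs = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append((record["hypotheses"], record["transcription"]))
    return pairs

# Example (hypothetical file name):
# pairs = load_pairs("hyporadise_wsj_train.jsonl")
```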
ASR Systems
WavLM: well-trained ASR model on LibriSpeech, but suffers from domain mismatch
- 433M parameters
- fine-tuned on 960-hour LibriSpeech, with external LM rescoring
Whisper: universal ASR model, but lacking domain specificity
- 1,550M parameters
- trained on 680,000 hours of multilingual, weakly labeled speech data
The beam size was set to 60, and the top-5 utterances with the highest probabilities are selected as the N-best list (see the generation sketch below).
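A minimal sketch of how a 5-best list could be produced with beam search, assuming the Hugging Face `transformers` Whisper interface rather than the authors' exact pipeline:

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

def nbest_hypotheses(audio_array, sampling_rate=16000, n_best=5, beam_size=60):
    """Return the top-N beam-search candidates for one utterance."""
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_features,
            num_beams=beam_size,            # beam size 60 as on the slide
            num_return_sequences=n_best,    # keep the top-5 candidates
        )
    return processor.batch_decode(outputs, skip_special_tokens=True)
```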
LLMs for Error Correction (Zero-shot and Few-shot Scenarios)
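The prompt figure from this slide is not reproduced here. As a hedged sketch, zero-shot correction can be framed as a single prompt that lists the N-best hypotheses and asks the LLM for the true transcription; the prompt wording below is illustrative, not the paper's template, and the OpenAI v1 Python client is assumed.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def zero_shot_correct(hypotheses):
    """Ask an LLM to infer the true transcription from an N-best list."""
    listing = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "Below are the N-best hypotheses from a speech recognizer for one utterance.\n"
        f"{listing}\n"
        "Report the most likely true transcription, making only plausible corrections."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```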
LLMs for Error Correction (Fine-tuning Scenarios)
H2T-ft: fine-tunes all parameters of a neural correction model. A new term is added to the training criterion to encourage the correction model to preferentially consider tokens that appear in the N-best hypotheses list. The sequence-level objective is
$\mathcal{L}_{\mathrm{H2T}} = \sum_{t=1}^{T} \log P(y_t \mid \mathcal{Y}_N, \theta)$
where $y_t$ is the t-th token of the ground-truth transcription and $\mathcal{Y}_N$ is the N-best hypotheses list.
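A minimal sketch of the sequence-to-sequence step behind full fine-tuning, assuming the N-best list is concatenated into one source string and T5's built-in cross-entropy loss is used; the additional N-best-preference term described above is omitted.

```python
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

def h2t_loss(hypotheses, transcription):
    """Cross-entropy loss for mapping an N-best list to the true transcription."""
    source = " <sep> ".join(hypotheses)  # simple concatenation of the N-best list
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(transcription, return_tensors="pt", truncation=True).input_ids
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        labels=labels,
    )
    return outputs.loss  # back-propagate this during full fine-tuning (H2T-ft)
```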
LLMs for Error Correction (Fine-tuning Scenarios)
H2T-LoRA
Original forward pass: $h = W_0 x$
Modified forward pass: $h = W_0 x + B A x$, where $W_0$ is frozen and $B$, $A$ are trainable low-rank matrices.
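A minimal LoRA sketch in plain PyTorch, showing a frozen base weight plus a trainable low-rank update; in practice a library such as peft would wrap the LLM's projection layers, so this is only illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight W0 and a trainable low-rank update BA."""

    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                   # W0 stays frozen
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))   # BA = 0 at init
        self.scaling = alpha / rank

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(1024, 1024)
h = layer(torch.randn(2, 1024))  # only A and B receive gradients
```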
LLM Configurations
In-context learning: GPT-3.5-turbo (175B)
H2T-ft and H2T-LoRA: T5-large (0.75B), LLaMA-13B
Results of H2T-ft and H2T-LoRA (ASR: Whisper)
LM_rank is a vanilla Transformer LM trained on the in-domain transcriptions of the training set; it re-ranks the hypotheses according to perplexity.
The n-best oracle (o_nb): WER of the best candidate in the N-best hypotheses list (a computation sketch follows these definitions).
The compositional oracle (o_cp): the achievable WER when using all tokens in the N-best hypotheses list.
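A hedged sketch of the n-best oracle using the `jiwer` package; it averages per-utterance WER rather than pooling edit counts over the corpus, which is a simplification of how WER is usually reported. The compositional oracle requires recombining tokens across candidates and is not shown.

```python
import jiwer

def nbest_oracle_wer(references, nbest_lists):
    """Approximate o_nb: average WER of the best candidate in each N-best list."""
    total = 0.0
    for ref, hyps in zip(references, nbest_lists):
        total += min(jiwer.wer(ref, hyp) for hyp in hyps)  # pick the best candidate
    return total / len(references)

# Example:
# refs = ["turn on the light"]
# nbest = [["turn of the light", "turn on the light"]]
# nbest_oracle_wer(refs, nbest)  # -> 0.0, the 2nd candidate is exact
```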
Results of H2T-ft and H2T-LoRA (ASR: Whisper)
Significant improvements on ATIS and WSJ.
Error correction improves robustness to background noise and speaker accent (CHiME-4 and CV-accent).
H2T-LoRA usually yields better WER than H2T-ft, as the low-rank adapter lets the LLM keep its pre-trained knowledge and avoid over-fitting.
Results of H2T-ft and H2T-LoRA (ASR: Whisper)
An over-fitting phenomenon exists in the correction techniques, especially in H2T-ft, where all parameters are tunable.
The mean and variance of utterance length can also influence the WER result, e.g., SwitchBoard (large variance in length) and CORAAL (long-form speech).
Results of In-context Learning (ASR: WavLM)
The LLM shows human-like correction: it successfully infers the correct token from the pronunciation of "xinnepec" as well as the context "China's petrochemical". In fact, Sinopec is a petrochemical-related Chinese company.
Future Work
Since LLMs tend to perform error correction using tokens with similar pronunciation, more acoustic information could be included in the HP dataset, such as token-level confidence provided by the ASR engine.
Considering the different data amounts in each domain, more parameter-efficient training methods beyond low-rank adaptation should be explored for LLM tuning, e.g., model reprogramming, prompting, and cross-modal adaptation.
Conclusion
This work introduces a new ASR benchmark that utilizes LLMs to predict the transcription from N-best hypotheses. The dataset consists of more than 334K hypotheses-transcription pairs collected from 9 different public ASR corpora.