OWSM-CTC: An Open Encoder-Only Speech Foundation Model
"Explore OWSM-CTC, an innovative encoder-only model for diverse language speech-to-text tasks inspired by Whisper and OWSM. Learn about its non-autoregressive approach and implications for multilingual ASR, ST, and LID."
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe
Carnegie Mellon University, Honda Research Institute Japan
Speaker: Yu-Chen Kuan
OUTLINE
- Introduction
- OWSM-CTC
- Experiments
- Conclusion
Introduction
- The great success of LLMs has sparked growing interest in developing foundation models in various modalities.
- Recent studies have explored different approaches towards multilingual and multi-tasking speech foundation models.
- OpenAI's Whisper achieves strong results in multilingual ASR, ST, and LID.
- Recent work releases Open Whisper-style Speech Models (OWSM), which aim to reproduce Whisper-style training using public data and open-source toolkits.
Introduction
- Whisper and OWSM adopt the encoder-decoder architecture in an autoregressive manner, so they might hallucinate during inference, and decoding can be slow.
- Can we build a non-autoregressive, encoder-only model for speech-to-text generation in diverse languages and multiple tasks, like Whisper/OWSM?
- We propose OWSM-CTC, a novel encoder-only speech foundation model.
OWSM-CTC
[Figure: OWSM-CTC model architecture]
Speech encoder
- Let $X^{\text{speech}} \in \mathbb{R}^{T \times d}$ be the downsampled feature sequence. We prepend two special tokens (a language token and a task token) to the sequence.
- If the spoken language is known, the true language token is used as input; a special token <nolang> denotes an unknown language.
- During training, we randomly replace the true language token with <nolang> with probability 0.5, so that either one can be used at inference time.
Speech encoder
- The task token is <asr> for speech recognition and <st_lang> for translation into a target language.
- Encoder layers: the encoder is E-Branchformer.
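A minimal sketch of this input construction, assuming a PyTorch-style embedding table (the class name, vocabulary, and shapes are illustrative, not the OWSM-CTC implementation; only the <nolang> replacement with probability 0.5 comes from the slides):

```python
import torch
import torch.nn as nn

class SpecialTokenPrepender(nn.Module):
    """Prepend a language embedding and a task embedding to the speech features."""

    def __init__(self, vocab: dict, d_model: int = 1024):
        super().__init__()
        self.vocab = vocab  # e.g. {"<nolang>": 0, "<eng>": 1, "<asr>": 2, "<st_deu>": 3, ...}
        self.embed = nn.Embedding(len(vocab), d_model)

    def forward(self, x_speech: torch.Tensor, lang: str, task: str, training: bool = True):
        # x_speech: (batch, T, d) downsampled speech feature sequence
        if training and torch.rand(()) < 0.5:
            lang = "<nolang>"  # randomly hide the true language so <nolang> also works at inference
        ids = torch.tensor([self.vocab[lang], self.vocab[task]], device=x_speech.device)
        special = self.embed(ids).unsqueeze(0).expand(x_speech.size(0), -1, -1)
        return torch.cat([special, x_speech], dim=1)  # (batch, 2 + T, d)
```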
Speech encoder
- Compute the CTC loss using the final encoder output $X^{(N)}$ and an augmented reference $y^{\text{task}}$.
- The reference $y^{\text{task}}$ prepends <lang> and <task> to the original ground-truth text of the desired task.
- CTC loss: $\mathcal{L}_{\text{ctc}} = -\log P_{\text{ctc}}\left(y^{\text{task}} \mid X^{(N)}\right)$
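A rough sketch of this objective using torch.nn.CTCLoss; the tokenizer, projection layer, and variable names are assumptions, and only the idea of prepending <lang> and <task> to the reference comes from the slides:

```python
import torch
import torch.nn as nn

def augmented_reference(lang_id: int, task_id: int, text_ids: list) -> torch.Tensor:
    # y_task = [<lang>, <task>] + ground-truth token ids
    return torch.tensor([lang_id, task_id] + list(text_ids), dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def compute_ctc_loss(encoder_out, encoder_out_lens, vocab_proj, targets, target_lens):
    # encoder_out: (batch, T, d) final encoder states X^(N); vocab_proj maps d -> vocab size
    log_probs = vocab_proj(encoder_out).log_softmax(dim=-1)  # (batch, T, V)
    log_probs = log_probs.transpose(0, 1)                    # nn.CTCLoss expects (T, batch, V)
    return ctc_loss(log_probs, targets, encoder_out_lens, target_lens)
```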
Speech encoder
- Apply self-conditioned CTC at intermediate layers to alleviate the conditional independence assumption of CTC: the intermediate CTC predictions are fed back into the encoder, so later layers are conditioned on them.
- Intermediate CTC loss: the CTC loss averaged over the set $S$ of selected intermediate layers, $\mathcal{L}_{\text{interctc}} = \frac{1}{|S|} \sum_{n \in S} \mathcal{L}_{\text{ctc}}^{(n)}$.
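A sketch of the self-conditioning mechanism at one intermediate layer, following the usual self-conditioned CTC recipe (the layer structure and names here are assumptions, not the authors' exact code):

```python
import torch
import torch.nn as nn

class SelfConditionedCTCBlock(nn.Module):
    """Intermediate CTC head whose predictions are fed back into the encoder states."""

    def __init__(self, d_model: int = 1024, vocab_size: int = 30000):
        super().__init__()
        self.to_vocab = nn.Linear(d_model, vocab_size)   # intermediate CTC head
        self.to_hidden = nn.Linear(vocab_size, d_model)  # feedback projection

    def forward(self, x: torch.Tensor):
        # x: (batch, T, d) output of an intermediate encoder layer X^(n)
        inter_log_probs = self.to_vocab(x).log_softmax(dim=-1)  # used for the intermediate CTC loss
        x = x + self.to_hidden(inter_log_probs.exp())           # condition the following layers on the prediction
        return x, inter_log_probs
```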
Speech encoder
- The choice of the reference text at intermediate layers depends on the task.
- If the task is ST, we empirically find that the model cannot converge if we use the translated text as the reference at all intermediate layers.
- Instead, we utilize the ASR transcript at the first $N_{\text{ASR}}$ intermediate layers and the ST text at the remaining $N_{\text{ST}}$ layers, where $N_{\text{ASR}} + N_{\text{ST}} = |S| \le N - 1$.
Speech encoder
- The first $N_{\text{ASR}}$ intermediate CTC layers always perform ASR regardless of the task token (named "ASR-only CTC").
- The other intermediate CTC layers are multi-tasking: they perform ASR or ST according to the task token (named "task-specific" or "task-dependent" CTC).
- Overall training loss: a weighted combination of the final-layer CTC loss and the intermediate CTC loss.
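As a sketch of this combined objective, assuming the standard intermediate-CTC style weighting with a hyperparameter $\lambda$ (the exact form and weight are assumptions, not taken from the slides):

```latex
\mathcal{L} \;=\; (1 - \lambda)\,\mathcal{L}_{\text{ctc}}
\;+\; \frac{\lambda}{|S|} \sum_{n \in S} \mathcal{L}_{\text{ctc}}^{(n)}
```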
Prompt encoder
- Whisper-style models generate text conditioned on an optional text prompt.
- For encoder-decoder models like Whisper, the text prompt is a prefix to the autoregressive decoder.
- For our encoder-only model, we leverage a separate Transformer encoder to process the prompt and inject it into the speech encoder through cross-attention.
Prompt encoder
- If no prompt is provided, a special token <na> is used.
- Let $X^{\text{prompt}} \in \mathbb{R}^{T' \times d}$ be the output of the prompt encoder. We insert a cross-attention layer at a subset of layers $\mathcal{T}$ of the speech encoder, where the speech encoder states attend to $X^{\text{prompt}}$ as keys and values.
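A minimal sketch of such a prompt-injection layer (the normalization, residual placement, and head count are assumptions; the slides only state that selected speech encoder layers cross-attend to the prompt encoder output):

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Inject the prompt into one speech encoder layer via cross-attention."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x_speech: torch.Tensor, x_prompt: torch.Tensor):
        # x_speech: (batch, T, d) states of one speech encoder layer (queries)
        # x_prompt: (batch, T', d) prompt encoder output (keys and values)
        attn_out, _ = self.attn(query=x_speech, key=x_prompt, value=x_prompt)
        return self.norm(x_speech + attn_out)  # residual keeps the layer usable when the prompt is <na>
```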
Prompt encoder
- Training data is a mixture of public ASR and ST datasets; some of them provide unsegmented long audio, but the others only release segmented short audio.
- At training time, if a sample does not have a previous sentence, we use <na> as the prompt.
- Otherwise, we use either <na> or the previous sentence as the prompt, each with probability 0.5.
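The selection rule above in sketch form (the function and constant names are illustrative):

```python
import random

NA_TOKEN = "<na>"  # special "no prompt" token

def sample_prompt(previous_sentence):
    """Return the prompt used for one training sample."""
    if previous_sentence is None:
        return NA_TOKEN
    # otherwise use the previous sentence or <na>, each with probability 0.5
    return previous_sentence if random.random() < 0.5 else NA_TOKEN
```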
Model Size
Data format
- Training data is prepared using scripts publicly released by OWSM v3.1.
- It is a mixture of more than 25 public ASR and ST corpora covering 151 languages and various translation directions.
- The total audio duration is 180k hours.
- To create long-form data, consecutive utterances from the same audio recording are concatenated to a duration of no more than 30 seconds.
- The input audio to the model is always padded to a fixed length of 30 seconds.
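A sketch of this long-form data construction (the field names, 16 kHz sample rate, and greedy grouping strategy are assumptions; only the 30-second limit and fixed-length padding come from the slides):

```python
import numpy as np

MAX_SEC = 30.0
SAMPLE_RATE = 16000  # assumed sample rate

def make_long_form(utterances):
    """utterances: list of dicts with 'audio' (1-D np.ndarray) and 'text', from one recording, in temporal order."""
    groups, current, current_sec = [], [], 0.0
    for utt in utterances:
        sec = len(utt["audio"]) / SAMPLE_RATE
        if current and current_sec + sec > MAX_SEC:
            groups.append(current)              # start a new group once 30 s would be exceeded
            current, current_sec = [], 0.0
        current.append(utt)
        current_sec += sec
    if current:
        groups.append(current)

    target_len = int(MAX_SEC * SAMPLE_RATE)
    examples = []
    for group in groups:
        audio = np.concatenate([u["audio"] for u in group])
        audio = np.pad(audio, (0, max(0, target_len - len(audio))))  # always pad to a fixed 30 s
        examples.append({"audio": audio, "text": " ".join(u["text"] for u in group)})
    return examples
```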
Model Architecture
- Speech encoder: 27-layer E-Branchformer with a hidden size of 1024 and 16 attention heads.
  - 4 intermediate layers (6, 12, 15, and 21) are used for self-conditioned CTC.
  - The first three are ASR-only, while the others are task-specific.
- Prompt encoder: 4-layer Transformer with a hidden size of 512 and 8 attention heads.
  - Its output is injected into the speech encoder at every third layer.
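The layer bookkeeping implied by this configuration, as a sketch (indices are 1-based as on the slide; whether prompt cross-attention starts at layer 3 or elsewhere is an assumption):

```python
NUM_SPEECH_LAYERS = 27
INTER_CTC_LAYERS = [6, 12, 15, 21]               # self-conditioned CTC layers
ASR_ONLY_CTC_LAYERS = INTER_CTC_LAYERS[:3]       # always perform ASR, regardless of task token
TASK_SPECIFIC_CTC_LAYERS = INTER_CTC_LAYERS[3:]  # ASR or ST depending on the task token
PROMPT_XATTN_LAYERS = list(range(3, NUM_SPEECH_LAYERS + 1, 3))  # "every third layer" (assumed start)
```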
Implementation
- Toolkit: ESPnet
- Batch size per GPU is 4, with 64 NVIDIA A100 GPUs.
- Training time is approximately 300 hours.
- Adam optimizer with a piecewise linear learning rate schedule.
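One way to realize a piecewise linear learning rate schedule, as a sketch: the LR is linearly interpolated between breakpoints (the breakpoint values below are placeholders, not the values used to train OWSM-CTC):

```python
def piecewise_linear_lr(step, points=((0, 0.0), (30_000, 2e-4), (600_000, 0.0))):
    """Linearly interpolate the learning rate between (step, lr) breakpoints."""
    for (s0, lr0), (s1, lr1) in zip(points, points[1:]):
        if step <= s1:
            return lr0 + (lr1 - lr0) * (step - s0) / (s1 - s0)
    return points[-1][1]  # keep the final value after the last breakpoint
```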
Evaluation
- Compare the encoder-only OWSM-CTC with the previously released encoder-decoder OWSM v3.1 models, since they are trained on the same data.
Long-form speech recognition
- OWSM-CTC performs chunk-wise recognition in a fully parallel manner.
- The entire audio is first split into overlapping chunks of 30 s, where the overlapped regions serve as the left and right context.
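A sketch of this chunking step (the amount of left/right context and the 16 kHz rate are illustrative; only the 30 s window and the parallel, chunk-wise decoding come from the slides):

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK = 30 * SAMPLE_RATE     # 30 s window
CONTEXT = 2 * SAMPLE_RATE    # overlapped left/right context (illustrative value)
HOP = CHUNK - 2 * CONTEXT    # stride between consecutive chunks

def split_into_chunks(audio: np.ndarray) -> np.ndarray:
    chunks = []
    for start in range(0, max(1, len(audio) - 2 * CONTEXT), HOP):
        chunk = audio[start:start + CHUNK]
        chunk = np.pad(chunk, (0, CHUNK - len(chunk)))  # pad the last chunk to a full 30 s
        chunks.append(chunk)
    return np.stack(chunks)  # (num_chunks, CHUNK): all chunks can be decoded in one batch
```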
Robustness
- Whisper and OWSM v3.1 tend to generate some text that looks meaningful, while our OWSM-CTC only generates some punctuation marks without actual meaning.
Conclusion
- We propose OWSM-CTC, a novel encoder-only speech foundation model built upon 180k hours of public audio data and open-source toolkits.
- OWSM-CTC employs multi-task self-conditioned CTC for multilingual ASR, any-to-any ST, and LID.
- It achieves competitive performance on ASR and superior performance on ST, while being more robust and 3 to 4 times faster at inference time.
- It improves the long-form ASR WER with 20 times faster inference, thanks to batched parallel decoding.