OWSM-CTC: An Open Encoder-Only Speech Foundation Model

Presentation Transcript


  1. OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification. Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe (Carnegie Mellon University, Honda Research Institute Japan). Speaker: Yu-Chen Kuan

  2. OUTLINE: Introduction, OWSM-CTC, Experiments, Conclusion

  3. Introduction

  4. Introduction: The great success of LLMs has sparked growing interest in developing foundation models for various modalities. Recent studies have explored different approaches towards multilingual and multi-tasking speech foundation models. OpenAI's Whisper achieves strong results in multilingual ASR, ST, and LID. Recent work releases the Open Whisper-style Speech Models (OWSM), which aim to reproduce Whisper-style training using public data and open-source toolkits.

  5. Introduction: Because Whisper and OWSM adopt an encoder-decoder architecture and decode autoregressively, they may hallucinate during inference and their decoding can be slow. Can we build a non-autoregressive, encoder-only model for speech-to-text generation in diverse languages and multiple tasks, like Whisper/OWSM? We propose OWSM-CTC, a novel encoder-only speech foundation model.

  6. OWSM-CTC

  7. OWSM-CTC

  8. OWSM-CTC

  9. Speech encoder: Let X_speech ∈ R^(T×d) be the downsampled feature sequence; two special tokens are prepended to it. If the spoken language is known, the true language token is used as input; the special token <nolang> denotes an unknown language. During training, the true language token is randomly replaced with <nolang> with probability 0.5, so that either form can be used at inference.
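As a rough illustration of this step, here is a minimal PyTorch-style sketch; the function name, token IDs, and embedding table are hypothetical, not taken from the paper:

```python
import random

import torch


def prepend_special_tokens(x_speech, lang_id, nolang_id, task_id, embed,
                           training=True, p_nolang=0.5):
    """Prepend the language and task embeddings to the downsampled speech features.

    x_speech: (T, d) downsampled feature sequence
    lang_id / nolang_id / task_id: integer IDs of the <lang>, <nolang>, <task>
        special tokens (hypothetical IDs for illustration)
    embed: torch.nn.Embedding that maps special-token IDs to d-dim vectors
    """
    # During training, replace the true language token with <nolang> with
    # probability 0.5, so either form can be used at inference time.
    if training and random.random() < p_nolang:
        lang_id = nolang_id
    special = embed(torch.tensor([lang_id, task_id]))  # (2, d)
    return torch.cat([special, x_speech], dim=0)       # (2 + T, d)
```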

  10. OWSM-CTC

  11. Speech encoder: The task token is <asr> for speech recognition and <st_lang> for translation into a target language. Encoder layers: the speech encoder is an E-Branchformer.

  12. OWSM-CTC

  13. Speech encoder: Compute the CTC loss using the final encoder output X^(N) and an augmented reference y_task. The reference y_task prepends <lang> and <task> to the original ground-truth text of the desired task. CTC loss: L_CTC = −log P_CTC(y_task | X^(N)).
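A minimal sketch of this loss, assuming torch.nn.CTCLoss and illustrative special-token IDs (the helper name and blank ID are assumptions, not from the paper):

```python
import torch

# Standard CTC criterion; blank ID 0 is an assumption for illustration.
ctc_criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)


def augmented_ctc_loss(log_probs, input_lengths, text_ids, lang_id, task_id):
    """CTC loss against the augmented reference y_task = [<lang>, <task>, text ...].

    log_probs: (T, B, V) log-softmax outputs from the final encoder output X^(N)
    input_lengths: (B,) valid frame counts per utterance
    text_ids: list of 1-D LongTensors with the ground-truth token IDs of the desired task
    lang_id, task_id: IDs of the language and task special tokens (illustrative)
    """
    prefix = torch.tensor([lang_id, task_id])
    targets = [torch.cat([prefix, y]) for y in text_ids]
    target_lengths = torch.tensor([len(t) for t in targets])
    targets = torch.cat(targets)  # CTCLoss also accepts concatenated 1-D targets
    return ctc_criterion(log_probs, targets, input_lengths, target_lengths)
```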

  14. OWSM-CTC

  15. Speech encoder: Apply self-conditioned CTC at intermediate layers to alleviate the conditional independence assumption of CTC. Intermediate CTC loss: the CTC losses computed at the selected intermediate layers in the set S are averaged, L_inter = (1/|S|) Σ_{n∈S} −log P_CTC(y | X^(n)).
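A sketch of one self-conditioning step, assuming a linear CTC head and a posterior-to-hidden projection in PyTorch (module and method names are illustrative):

```python
import torch.nn as nn


class SelfConditionedCTCLayer(nn.Module):
    """One self-conditioning step at an intermediate encoder layer (sketch)."""

    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.to_vocab = nn.Linear(d_model, vocab_size)   # intermediate CTC head
        self.to_hidden = nn.Linear(vocab_size, d_model)  # posterior -> hidden projection

    def forward(self, x):
        # x: (B, T, d_model) hidden states after an intermediate encoder layer
        log_probs = self.to_vocab(x).log_softmax(dim=-1)  # used for the intermediate CTC loss
        # Self-conditioning: add the projected CTC posterior back onto the encoder
        # stream, which relaxes the conditional independence assumption of CTC.
        x = x + self.to_hidden(log_probs.exp())
        return x, log_probs
```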

  16. Speech encoder: The choice of the reference text for intermediate CTC depends on the task. If the task is ST, we empirically find that the model cannot converge when the translated text is used as the reference at all intermediate layers. Instead, the ASR transcript is used at the first N_ASR intermediate layers and the ST text at the remaining N_ST layers, where N_ASR + N_ST = |S| ≤ N − 1.

  17. OWSM-CTC

  18. Speech encoder: The first N_ASR intermediate CTC layers always perform ASR regardless of the task token (named "ASR-only CTC"). The other CTC layers are multi-tasking: they perform ASR or ST according to the task token (named "task-specific" or "task-dependent" CTC). Overall training loss: a weighted combination of the final-layer CTC loss and the averaged intermediate CTC losses, L = (1 − λ) L_CTC + λ L_inter.
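A sketch of how the per-layer references and the overall loss could be combined, assuming the intermediate log-probabilities are collected as a list in depth order; the helper name and the interpolation weight λ = 0.5 are illustrative assumptions:

```python
import torch

ctc_criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)


def owsm_ctc_total_loss(final_log_probs, inter_log_probs, input_lens,
                        asr_ref, task_ref, n_asr, lam=0.5):
    """Combine the final-layer CTC loss with the intermediate CTC losses.

    inter_log_probs: list of (T, B, V) log-probs from the intermediate layers in S
    asr_ref / task_ref: (targets, target_lengths) pairs; task_ref is the ASR
        transcript for <asr> and the translated text for <st_lang>
    n_asr: number of ASR-only intermediate CTC layers (N_ASR)
    lam: interpolation weight between the two terms (0.5 is an illustrative value)
    """
    # The final-layer CTC is always computed against the reference of the requested task.
    loss_final = ctc_criterion(final_log_probs, task_ref[0], input_lens, task_ref[1])

    # The first N_ASR intermediate layers always do ASR ("ASR-only CTC");
    # the remaining ones follow the task token ("task-specific CTC").
    inter_losses = []
    for i, lp in enumerate(inter_log_probs):
        ref = asr_ref if i < n_asr else task_ref
        inter_losses.append(ctc_criterion(lp, ref[0], input_lens, ref[1]))
    loss_inter = sum(inter_losses) / len(inter_losses)

    return (1 - lam) * loss_final + lam * loss_inter
```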

  19. OWSM-CTC

  20. Prompt encoder: Whisper-style models generate text conditioned on an optional text prompt. For encoder-decoder models like Whisper, the text prompt is a prefix to the autoregressive decoder. For our encoder-only model, we leverage a separate Transformer encoder to process the prompt and inject it into the speech encoder through cross-attention.

  21. Prompt encoder: If no prompt is provided, the special token <na> is used. Let X_prompt ∈ R^(T'×d) be the output of the prompt encoder; a cross-attention layer is inserted at a subset of speech-encoder layers, where the speech features attend to X_prompt (queries from the speech encoder, keys and values from the prompt encoder).
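A sketch of such a cross-attention block, assuming standard PyTorch multi-head attention and that the prompt output has already been projected to the speech hidden dimension (class and variable names are illustrative):

```python
import torch.nn as nn


class PromptCrossAttention(nn.Module):
    """Cross-attention block injected at selected speech-encoder layers (sketch):
    speech features are the queries, prompt-encoder outputs are the keys/values."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x_speech, x_prompt):
        # x_speech: (B, T, d)  hidden states of one speech-encoder layer
        # x_prompt: (B, T', d) prompt-encoder output (or the <na> embedding),
        #           assumed to be already projected to the speech dimension d
        attn_out, _ = self.attn(self.norm(x_speech), x_prompt, x_prompt,
                                need_weights=False)
        return x_speech + attn_out  # residual connection back into the speech encoder
```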

  22. Prompt encoder: The training data is a mixture of public ASR and ST datasets; some of them provide unsegmented long audio, while the others only release segmented short audio. At training time, if the sample does not have a previous sentence, we use <na>. Otherwise, we use either <na> or the actual previous sentence as the prompt, each with probability 0.5.
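A sketch of this prompt-sampling rule as it might look in a data loader (the function name is illustrative):

```python
import random


def sample_prompt(prev_text, p_prev=0.5):
    """Choose the text prompt for one training sample (sketch).

    prev_text: the previous sentence from the same recording, or None when the
        corpus only provides segmented short audio without context.
    """
    if prev_text is None:
        return "<na>"                # no context available
    # Otherwise use the actual previous sentence or <na> with equal probability,
    # so the model handles both prompted and unprompted inference.
    return prev_text if random.random() < p_prev else "<na>"
```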

  23. Experiments

  24. Model Size

  25. Data format: The training data is prepared using scripts publicly released by OWSM v3.1. It is a mixture of more than 25 public ASR and ST corpora covering 151 languages and various translation directions; the total audio duration is 180k hours. To create long-form data, consecutive utterances from the same audio recording are concatenated to a duration of no more than 30 seconds. The input audio to the model is always padded to a fixed length of 30 seconds.
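A sketch of this long-form grouping step, assuming utterances arrive as (waveform, duration, text) tuples in temporal order (the helper name is illustrative):

```python
def group_utterances(utterances, max_dur=30.0):
    """Group consecutive utterances of one recording into <= 30 s training samples (sketch).

    utterances: list of (waveform, duration_sec, text) tuples in temporal order.
    Returns a list of groups; each group is later concatenated and the resulting
    audio is zero-padded to the fixed 30 s model input length.
    """
    groups, current, current_dur = [], [], 0.0
    for wav, dur, text in utterances:
        if current and current_dur + dur > max_dur:
            groups.append(current)
            current, current_dur = [], 0.0
        current.append((wav, text))
        current_dur += dur
    if current:
        groups.append(current)
    return groups
```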

  26. Model Architecture: Speech encoder: a 27-layer E-Branchformer with a hidden size of 1024 and 16 attention heads; 4 intermediate layers (6, 12, 15, and 21) are used for self-conditioned CTC, of which the first three are ASR-only while the remaining one is task-specific. Prompt encoder: a 4-layer Transformer with a hidden size of 512 and 8 attention heads, injected into the speech encoder at every third layer.

  27. Implementation: Toolkit: ESPnet. The batch size per GPU is 4, and training uses 64 NVIDIA A100 GPUs; the training time is approximately 300 hours. Adam optimizer with a piecewise linear learning rate schedule.
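A sketch of a piecewise linear schedule using PyTorch's LambdaLR; the warm-up and total step counts passed in are assumptions, not the values used for OWSM-CTC:

```python
import torch


def piecewise_linear_lr(optimizer, warmup_steps, total_steps):
    """Piecewise linear schedule (sketch): linear warm-up to the peak learning
    rate, then linear decay towards zero."""
    def factor(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=factor)
```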

  28. Evaluation: We compare the encoder-only OWSM-CTC with the previously released encoder-decoder OWSM v3.1 models, since they are trained on the same data.

  29. Language identification

  30. Speech recognition: En

  31. Speech recognition: Multilingual

  32. Speech translation: X-En

  33. Speech translation: En-X

  34. Long-form speech recognition: OWSM-CTC performs chunk-wise recognition in a fully parallel manner. The entire audio is first split into overlapping 30 s chunks, where the overlapped regions serve as left and right context.
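A sketch of the chunking step, assuming 16 kHz audio held in a 1-D torch tensor and an illustrative 4 s of context on each side (the exact overlap is an assumption):

```python
import torch
import torch.nn.functional as F


def chunk_long_audio(wave, sr=16000, chunk_sec=30.0, context_sec=4.0):
    """Split a long recording into overlapping 30 s chunks for batched decoding (sketch).

    context_sec is an illustrative value: each chunk carries extra left/right
    context whose output is discarded when the chunk hypotheses are stitched.
    """
    chunk = int(chunk_sec * sr)
    context = int(context_sec * sr)
    hop = chunk - 2 * context              # stride between the "core" regions of the chunks
    chunks = []
    for core_start in range(0, wave.numel(), hop):
        start = max(0, core_start - context)
        seg = wave[start: start + chunk]
        if seg.numel() < chunk:            # zero-pad the last chunk to a full 30 s window
            seg = F.pad(seg, (0, chunk - seg.numel()))
        chunks.append(seg)
    # All chunks form one batch and are decoded fully in parallel by the CTC model.
    return torch.stack(chunks)             # (num_chunks, 30 s * sr)
```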

  35. Effect of text prompt

  36. Robustness: Whisper and OWSM v3.1 tend to hallucinate text that looks meaningful, while our OWSM-CTC only generates a few punctuation marks without actual meaning.

  37. Conclusion

  38. Conclusion: We propose OWSM-CTC, a novel encoder-only speech foundation model built upon 180k hours of public audio data and open-source toolkits. OWSM-CTC employs multi-task self-conditioned CTC for multilingual ASR, any-to-any ST, and LID. It achieves competitive performance on ASR and superior performance on ST, while being more robust and 3 to 4 times faster at inference. It also improves the long-form ASR WER, with 20 times faster inference due to batched parallel decoding.

  39. Thank You For Listening
