Knowledge Distillation for Streaming ASR Encoder with Non-streaming Layer
The research introduces a novel knowledge distillation (KD) method for transitioning from non-streaming to streaming ASR encoders by incorporating auxiliary non-streaming layers and a special KD loss function. This approach enhances feature extraction, improves robustness to frame misalignment, and allows for the use of unlabeled data. Experimental results show a 16% relative Word Error Rate (WER) reduction on the LibriSpeech test-other dataset, with an additional 3% reduction when leveraging large unlabeled data.
Presentation Transcript
Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer. Kyuhong Shim, Jinkyu Lee, Simyung Chang and Kyuwoong Hwang. Qualcomm AI Research, Qualcomm Korea YH, Seoul, Republic of Korea. {kshim, jinkyu, simychan, kyuwoong}@qti.qualcomm.com. Accepted to Interspeech 2023.
Outline 1. Introduction 2. Related Work 3. Distillation from Non-streaming Teacher to Streaming Student 4. Experimental Results 5. Discussion 6. Conclusion
1. Introduction Streaming ASR models, which enable on-device, real-time speech recognition, face significant limitations compared to non-streaming models.
1. Introduction Simple approach: minimize the Kullback-Leibler divergence (KLD) between the output token emission probabilities of the teacher and student. However, direct frame-to-frame distillation misguides the streaming model, because the audio-text frame alignments differ between the non-streaming and streaming models.
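For reference, a minimal PyTorch-style sketch of this simple frame-to-frame distillation objective could look as follows; the tensor shapes, function name, and temperature parameter are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def frame_level_kld(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Frame-to-frame KD loss; both inputs have shape (batch, frames, vocab)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), summed over frames and vocabulary,
    # averaged over the batch dimension.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```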
1. Introduction Dual-mode: a type of self-KD training in which the same model is shared for both streaming and non-streaming cases. Its limitation is that the student model must be the same size as the teacher.
1. Introduction This work proposes a novel KD method that applies layer-wise distillation on the encoder part of the model. Advantages over existing methods: faster training, better robustness to frame misalignment, and the ability to utilize unlabeled data.
1. Introduction First, insert auxiliary non-streaming layers into the streaming model.
1. Introduction Second, design a special KD loss function to enhance the feature extraction process
1. Introduction Compared to the baseline, our method achieves a 16% relative WER reduction on the LibriSpeech test-other dataset. By incorporating large unlabeled data as an additional training resource, we further reduce the WER by 3%.
2. Related Work - Distillation for Improving Streaming ASR The most popular approach is to distill the output token probabilities. Techniques for handling the alignment mismatch include manual shifting, multi-stage alignment matching, and CTC alignments.
2. Related Work - Distillation for Improving Streaming ASR Another line of research is dual-mode, where the non-streaming teacher and streaming student share the same parameter set. Its limitations: the teacher and the student must use the same (large) architecture, and the approach has not been tested in a chunk-wise streaming setting.
2. Related Work - Distillation for ASR Model Compression Transducer-based end-to-end ASR models have been actively studied for compression. Large self-supervised models have also applied KD to reduce their model size.
2. Related Work - Cascaded Non-streaming and Streaming Layers Cascade encoder: a special encoder design that stacks several non-streaming layers on top of the streaming layers. This work also attaches additional non-streaming layers, but they are used only for KD purposes and are removed after training is finished.
3. Distillation from Non-streaming Teacher to Streaming Student Input: speech frames $X = (x_1, x_2, \dots, x_T)$. Target: text tokens $Y = (y_1, y_2, \dots, y_U)$. Goal: maximize the likelihood $P(Y \mid X)$.
3. Distillation from Non-streaming Teacher to Streaming Student Output features of the $\ell$-th encoder layer: $h^{\ell} = (h^{\ell}_1, h^{\ell}_2, \dots, h^{\ell}_T)$. The next layer's $t$-th frame output: $h^{\ell+1}_t = f^{\ell+1}\big(h^{\ell}_{t-C_L}, \dots, h^{\ell}_t, \dots, h^{\ell}_{t+C_R}\big)$, where $C_L$ and $C_R$ are the left/right-context sizes. The context is effectively unlimited (the full utterance) for the non-streaming teacher and restricted for the streaming student.
3. Distillation from Non-streaming Teacher to Streaming Student Layer-wise KD is applied from the teacher encoder to the student encoder. To resolve the context mismatch, this paper inserts auxiliary non-streaming layers at selected layers of the student.
3. Distillation from Non-streaming Teacher to Streaming Student At the $\ell$-th layer, the output frame features of the auxiliary layer: $z^{\ell} = g^{\ell}(h^{\ell}_1, \dots, h^{\ell}_T)$. Three parts of $g^{\ell}$: 1. a simple projection layer, 2. a single Transformer layer, 3. a unidirectional LSTM layer. KD is then applied between the teacher layer and the auxiliary layer output $z^{\ell}$.
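As an illustration of this design, here is a minimal PyTorch-style sketch of an auxiliary non-streaming block containing the three parts named above; the composition order, dimensions, and module choices are assumptions, and the block is discarded after training.

```python
import torch
import torch.nn as nn

class AuxNonStreamingBlock(nn.Module):
    def __init__(self, student_dim: int = 256, teacher_dim: int = 512,
                 num_heads: int = 4):
        super().__init__()
        # 1. Simple projection to the teacher's feature dimension.
        self.proj = nn.Linear(student_dim, teacher_dim)
        # 2. Single Transformer layer with full (non-streaming) self-attention.
        self.transformer = nn.TransformerEncoderLayer(
            d_model=teacher_dim, nhead=num_heads, batch_first=True)
        # 3. Unidirectional LSTM used for the future-prediction (APC) loss.
        self.lstm = nn.LSTM(teacher_dim, teacher_dim, batch_first=True)

    def forward(self, student_feats: torch.Tensor):
        """student_feats: (batch, frames, student_dim) from the l-th student layer."""
        z = self.transformer(self.proj(student_feats))  # features for feat/attn KD
        apc_out, _ = self.lstm(z)                       # features for the APC loss
        return z, apc_out
```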
3. Distillation from Non-streaming Teacher to Streaming Student-Distillation Loss Feature similarity loss: $\mathcal{L}_{\text{feat}} = -\frac{1}{T}\sum_{t=1}^{T} \log \sigma\big(\cos(h^{\mathcal{T}}_t, z_t)\big)$, between the teacher feature $h^{\mathcal{T}}_t$ and the auxiliary-layer feature $z_t$ ($D$: feature vector dimension; $\sigma$: sigmoid activation function; $\cos(\cdot,\cdot)$: cosine similarity).
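A minimal sketch of this loss, assuming (batch, frames, dim) tensors for the teacher features and the auxiliary-layer features:

```python
import torch
import torch.nn.functional as F

def feature_similarity_loss(teacher_feats: torch.Tensor,
                            aux_feats: torch.Tensor) -> torch.Tensor:
    """Log-sigmoid of frame-wise cosine similarity, averaged over all frames."""
    cos = F.cosine_similarity(teacher_feats, aux_feats, dim=-1)  # (batch, frames)
    # -log(sigmoid(cos)) pushes the cosine similarity toward 1.
    return -F.logsigmoid(cos).mean()
```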
3. Distillation from Non-streaming Teacher to Streaming Student-Distillation Loss Self-attention loss: $\mathcal{L}_{\text{attn}} = \frac{1}{TH}\sum_{t=1}^{T}\sum_{h=1}^{H} \mathrm{KLD}\big(A^{\mathcal{T}}_{t,h} \,\|\, A^{\mathcal{S}}_{t,h}\big)$, where $H$ is the number of attention heads and $A^{\mathcal{T}}_{t,h}, A^{\mathcal{S}}_{t,h}$ are the teacher/student attention probability distributions of the $t$-th frame computed at the $h$-th attention head.
3. Distillation from Non-streaming Teacher to Streaming Student-Distillation Loss The attention probability distributions are computed with scaled dot-product attention, $A_{t,h} = \mathrm{softmax}\big(q_{t,h} K_h^{\top} / \sqrt{d}\big)$, where $q_{t,h}$ is the query vector of the $t$-th frame and $d$ is the attention head dimension. The per-frame, per-head KLD terms above are averaged to form the self-attention loss $\mathcal{L}_{\text{attn}}$.
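A minimal sketch of the self-attention loss, assuming the teacher and student (auxiliary-layer) attention maps are available as (batch, heads, frames, frames) probability tensors:

```python
import torch

def self_attention_loss(teacher_attn: torch.Tensor,
                        student_attn: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """KLD(teacher || student) per query frame and head, then averaged.
    Each row of the last dimension is a probability distribution over key frames."""
    kld = (teacher_attn * (torch.log(teacher_attn + eps)
                           - torch.log(student_attn + eps))).sum(dim=-1)
    return kld.mean()  # average over batch, heads, and query frames
```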
3. Distillation from Non-streaming Teacher to Streaming Student-Distillation Loss Future prediction loss: inspired by autoregressive predictive coding (APC), which uses the previous history to predict a future feature (several frames ahead). APC trains unidirectional LSTM layers to minimize the prediction difference and was originally proposed for self-supervised speech pretraining.
3. Distillation from Non-streaming Teacher to Streaming Student-Distillation Loss The output features of the unidirectional LSTM layer: $z_t = \mathrm{LSTM}(h_1, h_2, \dots, h_t)$. The APC loss is computed as $\mathcal{L}_{\text{APC}} = -\frac{1}{T-n}\sum_{t=1}^{T-n}\log \sigma\big(\cos(h^{\mathcal{T}}_{t+n}, z_t)\big)$, which has the same log-sigmoid cosine form as the feature similarity loss but matches the LSTM output at frame $t$ to the teacher feature $n$ frames in the future.
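A minimal sketch of this future-prediction loss; the future offset n is a hypothetical value for illustration, and the tensors are assumed to be (batch, frames, dim):

```python
import torch
import torch.nn.functional as F

def apc_future_prediction_loss(teacher_feats: torch.Tensor,
                               lstm_feats: torch.Tensor,
                               n: int = 3) -> torch.Tensor:
    """Match the LSTM output at frame t to the teacher feature at frame t + n."""
    target = teacher_feats[:, n:, :]   # teacher features n frames in the future
    pred = lstm_feats[:, :-n, :]       # LSTM summary of frames up to t
    cos = F.cosine_similarity(target, pred, dim=-1)
    return -F.logsigmoid(cos).mean()
```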
3. Distillation from Non-streaming Teacher to Streaming Student-Distillation Loss To maximize the efficiency of the APC loss, this work modifies the Transformer attention mask applied to the auxiliary non-streaming layers.
3. Distillation from Non-streaming Teacher to Streaming Student-Distillation Loss Total loss: $\mathcal{L}_{\text{total}} = \frac{1}{N_L}\sum_{i=1}^{N_L}\mathcal{L}^{(i)}_{\text{ASR}} + \frac{1}{N_L+N_U}\sum_{i=1}^{N_L+N_U}\big(\alpha\,\mathcal{L}^{(i)}_{\text{feat}} + \beta\,\mathcal{L}^{(i)}_{\text{attn}} + \gamma\,\mathcal{L}^{(i)}_{\text{APC}}\big)$, where $N_L, N_U$ are the numbers of labeled and unlabeled samples, and $\alpha = 0.01$, $\beta = 0.0005$, $\gamma = 0.005$.
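A minimal sketch of how the terms could be combined, assuming per-sample loss values are already available; the split between a labeled-only ASR loss and KD losses over all samples follows the reconstruction above, and the reduction details are assumptions:

```python
def total_loss(asr_losses, feat_losses, attn_losses, apc_losses,
               alpha=0.01, beta=0.0005, gamma=0.005):
    """asr_losses: per-sample ASR (transducer) losses for the N_L labeled samples.
    feat/attn/apc_losses: per-sample KD losses for all N_L + N_U samples."""
    n_labeled = max(len(asr_losses), 1)
    n_total = max(len(feat_losses), 1)
    asr_term = sum(asr_losses) / n_labeled
    kd_term = sum(alpha * f + beta * a + gamma * p
                  for f, a, p in zip(feat_losses, attn_losses, apc_losses)) / n_total
    return asr_term + kd_term
```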
4. Experimental Results-Setup Dataset: LibriSpeech-960h. Additionally exploit 6K hours of unlabeled data from the same domain (i.e., audiobook reading).
4. Experimental Results-Setup Model architecture: Conformer-Transducer ASR models. Student encoder: a stack of 16 Conformer layers, 256-dim features, 4 attention heads. Teacher encoder: a stack of 16 Conformer layers, 512-dim features, 8 attention heads.
4. Experimental Results-Setup For the streaming student, the convolution layers are set to be causal and the attention mask is restricted to only access chunks within the left-context limit. For the auxiliary layers, a single Transformer layer (self-attention + feed-forward) is used. KD is applied from the 4th, 8th, 12th, and 16th layers of the teacher to the 4th, 8th, 12th, and 16th layers of the student.
4. Experimental Results-Setup Training details: batch size 1024; peak learning rate 1.25e-3. The streaming student uses chunk-wise training and inference: chunk size 160 ms, left/right context sizes 640 ms and 0 ms. The labeled and unlabeled data are mixed at a 1:1 ratio to construct an epoch.
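A minimal sketch of a chunk-wise attention mask with a left-context limit, matching the chunk setting above; the encoder frame rate (and hence how many frames make up a 160 ms chunk or a 640 ms left context) is an assumption:

```python
import torch

def chunkwise_attention_mask(num_frames: int, chunk_size: int,
                             num_left_chunks: int) -> torch.Tensor:
    """Boolean (num_frames, num_frames) mask; True means the query may attend to the key.
    Frames inside the same chunk see each other (the chunk is processed at once);
    no chunks to the right are visible, and only num_left_chunks chunks to the left."""
    chunk_idx = torch.arange(num_frames) // chunk_size
    query_chunk = chunk_idx.unsqueeze(1)   # (num_frames, 1)
    key_chunk = chunk_idx.unsqueeze(0)     # (1, num_frames)
    return (key_chunk <= query_chunk) & (key_chunk >= query_chunk - num_left_chunks)

# Example: 160 ms chunks and a 640 ms left context with a 40 ms encoder frame
# rate (assumed) would give chunk_size=4 frames and num_left_chunks=4.
mask = chunkwise_attention_mask(num_frames=32, chunk_size=4, num_left_chunks=4)
```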
5. Discussion Teacher-to-student layer connection: to find the optimal connection between layers, identify layers with similar behaviors and connect them.
5. Discussion Relationship to pseudo-label generation: to utilize unlabeled data, several studies have suggested generating pseudo labels for such data using an off-the-shelf large ASR model, e.g., "Improving streaming automatic speech recognition with non-streaming model distillation on unsupervised data," ICASSP 2021.
6. Conclusion In this paper: a novel KD method for improving a small streaming student using a large non-streaming teacher. The key idea is to insert auxiliary non-streaming layers into the student, matching the context of the target features for knowledge distillation. A special layer-wise KD loss function was designed, which includes a future prediction task using the APC loss.