Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer

Kyuhong Shim, Jinkyu Lee, Simyung Chang and Kyuwoong Hwang
Qualcomm AI Research, Qualcomm Korea YH, Seoul, Republic of Korea
{kshim, jinkyu, simychan, kyuwoong}@qti.qualcomm.com

Accepted to Interspeech 2023
 
Outline
 
1. Introduction
2. Related Work
3. Distillation from Non-streaming Teacher to Streaming Student
4. Experimental Results
5. Discussion
6. Conclusion
 
1. Introduction
 
Streaming ASR models, required for on-device, real-time speech recognition, face significant limitations compared to non-streaming models.
 
1. Introduction
 
Simple approach: minimize the Kullback-Leibler divergence (KLD) between the output token emission probabilities of the teacher and the student.

However, direct frame-to-frame distillation can misguide the streaming model: the audio-text frame alignments differ between the non-streaming and streaming models.
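To make the "simple approach" concrete, below is a minimal sketch of frame-level KLD distillation over output token probabilities, assuming the teacher and student emit logits on the same frame grid (exactly the assumption that alignment mismatch breaks). The tensor shapes and the temperature parameter are illustrative, not taken from the paper.

```python
import torch.nn.functional as F

def frame_level_kld(teacher_logits, student_logits, temperature=1.0):
    """Frame-level KL divergence between teacher and student token distributions.

    teacher_logits, student_logits: (batch, frames, vocab) tensors assumed to
    lie on the same frame grid.
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target;
    # "batchmean" sums over frames and vocabulary, then divides by batch size.
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")
```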
 
1. Introduction
 
Dual-mode: a type of self-KD training that shares the same model for both streaming and non-streaming cases.

Limitation: the student model must be the same size as the teacher.
 
1. Introduction
 
This work proposes a novel KD method that applies layer-wise distillation on the encoder part of the model.
Advantages over existing methods:
Faster training
Better robustness to frame misalignment
Ability to utilize unlabeled data
 
1. Introduction
 
First, insert auxiliary non-streaming layers into the streaming model.
 
1. Introduction
 
Second, design a special KD loss function to enhance the feature extraction process.
 
1. Introduction
 
Compared to the baseline, our method achieves a 16% relative WER reduction on the LibriSpeech test-other dataset.

By incorporating large unlabeled data as an additional training resource, we can further reduce the WER by 3%.
 
2. Related Work - Distillation for Improving Streaming ASR

The most popular approach is to distill the output token probabilities:
Manual shifting
Multi-stage alignment matching
CTC alignments
 
2. Related Work - Distillation for Improving Streaming ASR

Another line of research, called dual-mode, has the non-streaming teacher and the streaming student share the same parameter set:
The teacher and the student must use the same (large) architecture
Has not been tested for the chunk-wise streaming setting
 
 
2. Related Work - Distillation for ASR Model Compression

Transducer-based end-to-end ASR models have been actively studied for compression.

Large self-supervised models have also used KD to reduce their model size.
 
2. Related Work - Cascaded Non-streaming and Streaming Layers

Cascade encoder: a special encoder design that stacks several non-streaming layers on top of the streaming layers.

This work also attaches additional non-streaming layers, but they are used only for KD and are removed after training is finished.
 
3. Distillation from Non-streaming Teacher to Streaming Student

Problem setup:
Input: speech frames X = (x_1, x_2, ..., x_T)
Target: text tokens Y = (y_1, y_2, ..., y_U)
Goal: maximize the likelihood P(Y | X)

3. Distillation from Non-streaming Teacher to Streaming Student

Encoder features:
Output features of the ℓ-th encoder layer: h^ℓ = (h^ℓ_1, h^ℓ_2, ..., h^ℓ_T)
Next layer's t-th frame output: h^{ℓ+1}_t = f^{ℓ+1}(h^ℓ_{t-Lc}, ..., h^ℓ_t, ..., h^ℓ_{t+Rc})
Lc and Rc are the left/right-context sizes; the non-streaming teacher and the streaming student differ in how much right context they can access.
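As a rough illustration of the context-limited frame computation above, the sketch below gathers the allowed window [t - Lc, t + Rc] for each frame before applying a per-frame layer function; `layer_fn` is a hypothetical stand-in for one encoder layer, not the paper's implementation.

```python
import torch

def limited_context_outputs(h, layer_fn, left_ctx, right_ctx):
    """Compute per-frame outputs where frame t only sees h[t-left_ctx : t+right_ctx+1].

    h: (T, D) features from the previous layer.
    layer_fn: callable mapping a (window, D) tensor to a single (D,) output
              (hypothetical stand-in for one encoder layer).
    A streaming encoder corresponds to a small right_ctx (e.g. 0); a
    non-streaming encoder effectively sees the whole utterance.
    """
    T = h.size(0)
    outputs = []
    for t in range(T):
        lo, hi = max(0, t - left_ctx), min(T, t + right_ctx + 1)
        outputs.append(layer_fn(h[lo:hi]))
    return torch.stack(outputs)
```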
 
3. Distillation from Non-streaming Teacher to Streaming Student

Layer-wise KD is applied from the teacher encoder to the student encoder.

To solve the context mismatch, this paper inserts auxiliary non-streaming layers at selected layers of the student.
 
3. Distillation from Non-streaming Teacher to Streaming Student

At the ℓ-th layer, the output frame features of the auxiliary layer: z^ℓ = g^ℓ(s^ℓ_1, ..., s^ℓ_T), where s^ℓ_t are the student's ℓ-th layer features.
Three parts of g^ℓ:
1. A simple projection layer
2. A single Transformer layer
3. A unidirectional LSTM layer
KD is then applied between the teacher layer features and the auxiliary layer outputs.
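A hedged sketch of the auxiliary block g^ℓ described above (projection, a single Transformer layer, a unidirectional LSTM), written with standard PyTorch modules. The dimensions (student 256-d projected toward the teacher's 512-d space) and the use of `nn.TransformerEncoderLayer` are assumptions for illustration; the paper's exact layer sizes are not recoverable from the slides.

```python
import torch.nn as nn

class AuxiliaryNonStreamingBlock(nn.Module):
    """Sketch of the auxiliary block: projection -> Transformer layer -> unidirectional LSTM."""

    def __init__(self, student_dim=256, teacher_dim=512, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)          # illustrative dimensions
        self.transformer = nn.TransformerEncoderLayer(
            d_model=teacher_dim, nhead=num_heads, batch_first=True
        )
        self.lstm = nn.LSTM(teacher_dim, teacher_dim, batch_first=True)

    def forward(self, s):
        # s: (batch, frames, student_dim) hidden states from a selected student layer.
        z = self.transformer(self.proj(s))   # non-streaming features z used for KD
        c, _ = self.lstm(z)                  # causal summary c used for the APC loss
        return z, c
```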
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Feature similarity loss:
L_feat = -(1/T) Σ_{t=1..T} log σ(cos(h_t, z_t))
σ: sigmoid activation function
cos(·,·): cosine similarity
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Self-attention loss:
L_attn = (1/(H·T)) Σ_{h=1..H} Σ_{t=1..T} KLD(a^{T,h}_t || a^{S,h}_t)
H: number of attention heads
a^{T,h}_t, a^{S,h}_t: teacher/student attention probability distributions of the t-th frame computed at the h-th attention head
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Attention probability of the t-th frame at the h-th head: a^h_t = softmax(q^h_t K^{h⊤} / √d)
q^h_t: query vector of the t-th frame
d: attention head dimension
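The two losses above can be sketched as follows; shapes and the clamping epsilon are illustrative, and the code assumes the teacher and student attention maps have matching shapes (e.g., after selecting corresponding heads), which the slides do not spell out.

```python
import torch
import torch.nn.functional as F

def feature_similarity_loss(h_teacher, z_aux):
    """L_feat = -(1/T) * sum_t log sigmoid(cos(h_t, z_t)); inputs of shape (T, D)."""
    cos = F.cosine_similarity(h_teacher, z_aux, dim=-1)          # (T,)
    return -torch.log(torch.sigmoid(cos)).mean()

def attention_kld_loss(attn_teacher, attn_student):
    """KL(teacher || student) over attention distributions, averaged over
    heads and query frames; inputs of shape (heads, T_query, T_key)."""
    kld = (attn_teacher * (attn_teacher.clamp_min(1e-8).log()
                           - attn_student.clamp_min(1e-8).log())).sum(dim=-1)
    return kld.mean()
```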
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss
 
Future prediction loss:
Inspired by autoregressive predictive coding (APC)
Uses the previous history to predict a future feature (several frames ahead)

APC:
Originally proposed for self-supervised speech pretraining
Trains unidirectional LSTM layers to minimize the difference between predicted and actual future features
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Output features of the unidirectional LSTM layer: c_t = LSTM(z_1, z_2, ..., z_t)
The APC loss is computed as:
L_APC = -(1/(T-n)) Σ_{t=1..T-n} log σ(cos(h_{t+n}, c_t))
(compare with L_feat = -(1/T) Σ_{t=1..T} log σ(cos(h_t, z_t)), which matches features at the same frame)
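A minimal sketch of the APC-style future prediction loss above; `n_future` (the prediction offset n) is illustrative, since the slides do not state its value.

```python
import torch
import torch.nn.functional as F

def apc_loss(h_teacher, c_lstm, n_future=3):
    """L_APC = -(1/(T-n)) * sum_{t=1..T-n} log sigmoid(cos(h_{t+n}, c_t)).

    h_teacher: (T, D) teacher features; c_lstm: (T, D) causal LSTM outputs.
    """
    # Align the causal summary at frame t with the teacher feature n frames ahead.
    cos = F.cosine_similarity(h_teacher[n_future:], c_lstm[:-n_future], dim=-1)
    return -torch.log(torch.sigmoid(cos)).mean()
```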
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss
 
To maximize the effectiveness of the APC loss, this work modifies the Transformer attention mask applied to the auxiliary non-streaming layers.
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Total loss:
L_total = L_RNN-T + α·L_feat + β·L_attn + γ·L_APC
N_L, N_U: the number of labeled and unlabeled samples; the transducer loss is averaged over the N_L labeled samples, while the KD losses are averaged over all N_L + N_U samples
α = 0.01, β = 0.0005, γ = 0.005
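Putting the pieces together, a sketch of the per-utterance loss combination with the weights quoted on the slide; treating the transducer term as absent for unlabeled utterances is the reading suggested by the slide, stated here as an assumption.

```python
# Loss weights quoted on the slide.
ALPHA, BETA, GAMMA = 0.01, 0.0005, 0.005

def combined_loss(rnnt_loss, feat_loss, attn_loss, apc_loss, is_labeled):
    """Combine the transducer loss (labeled utterances only) with the KD losses."""
    kd = ALPHA * feat_loss + BETA * attn_loss + GAMMA * apc_loss
    # Unlabeled utterances have no transcript, so they contribute only KD terms.
    return rnnt_loss + kd if is_labeled else kd
```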
 
4. Experimental Results - Setup

Dataset:
LibriSpeech-960h
6K hours of unlabeled data from the same domain (i.e., audiobook reading) are additionally exploited
 
4. Experimental Results - Setup

Model architecture: Conformer-Transducer ASR models (configurations summarized in the sketch below)

Teacher encoder:
A stack of 16 Conformer layers
512-dim features
8 attention heads

Student encoder:
A stack of 16 Conformer layers
256-dim features
4 attention heads
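For reference, the two encoder configurations can be written as simple dictionaries (hypothetical field names); note the dimension gap (512 vs. 256) that the auxiliary block's projection layer presumably bridges for feature-level KD.

```python
# Hypothetical config dictionaries mirroring the reported encoder sizes.
teacher_encoder = dict(layers=16, d_model=512, attention_heads=8)  # non-streaming teacher
student_encoder = dict(layers=16, d_model=256, attention_heads=4)  # streaming student
```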
 
4. Experimental Results - Setup

For the streaming student:
The convolution layers are set to be causal, and the attention mask is restricted so that each frame only accesses chunks within the left-context limit (see the mask sketch below)

For the auxiliary layers:
Use a single Transformer layer (self-attention + feed-forward)
Apply KD from the 4, 8, 12, 16-th layers of the teacher to the 4, 8, 12, 16-th layers of the student
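As referenced above, a minimal sketch of a chunk-wise attention mask with a left-context limit, using the boolean convention of torch.nn.MultiheadAttention (True = position is not allowed to attend). Frame-level sizes are illustrative; converting the 160 ms chunk and 640 ms left context into frame counts depends on the front-end stride, which is not given here.

```python
import torch

def chunkwise_attention_mask(num_frames, chunk_size, left_context_chunks):
    """Boolean (T, T) mask for chunk-wise streaming self-attention.

    A frame may attend to frames in its own chunk and up to
    `left_context_chunks` previous chunks; no future chunks are visible
    (right context = 0). Sizes are in frames, not milliseconds.
    """
    frame_chunk = torch.arange(num_frames) // chunk_size   # chunk index per frame
    q = frame_chunk.unsqueeze(1)                           # (T, 1) query chunk index
    k = frame_chunk.unsqueeze(0)                           # (1, T) key chunk index
    allowed = (k <= q) & (k >= q - left_context_chunks)
    return ~allowed                                        # True where attention is blocked
```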
 
4. Experimental Results - Setup

Training details:
Batch size: 1024
Peak learning rate: 1.25e-3
The streaming student uses chunk-wise training and inference
Chunk size: 160 ms
Left/right context size: 640 ms and 0 ms
Labeled and unlabeled data are mixed at a 1:1 ratio to construct an epoch
 
4. Experimental Results - ASR Performance

Compared to the baseline streaming model, the proposed method achieves a 16% relative WER reduction on the LibriSpeech test-other set; adding the unlabeled data yields a further 3% relative reduction.
 
5. Discussion

Teacher-to-Student Layer Connection
How to find the optimal connection between teacher and student layers?
One option is to identify layers with similar behaviors and connect them.
 
5. Discussion

Relationship to Pseudo Label Generation
To utilize unlabeled data, several studies have suggested generating pseudo labels for such data using an off-the-shelf large ASR model.

"Improving streaming automatic speech recognition with non-streaming model distillation on unsupervised data," in ICASSP 2021
 
6. Conclusion

In this paper:
A novel KD method for improving a small streaming student using a large non-streaming teacher

The key idea is to insert non-streaming auxiliary layers into the student, matching the context of the target features for knowledge distillation

Designed a special layer-wise KD loss function, which includes a future prediction task using the APC loss


