Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer

Kyuhong Shim, Jinkyu Lee, Simyung Chang and Kyuwoong Hwang
Qualcomm AI Research, Qualcomm Korea YH, Seoul, Republic of Korea
{kshim, jinkyu, simychan, kyuwoong}@qti.qualcomm.com

Accepted to Interspeech 2023
 
Outline
 
1. Introduction
2. Related Work
3. Distillation from Non-streaming Teacher to Streaming Student
4. Experimental Results
5. Discussion
6. Conclusion
 
1. Introduction
 
Streaming ASR models, required for on-device, real-time speech recognition, face significant limitations compared to non-streaming models.
 
1. Introduction
 
Simple approach: minimize the Kullback-Leibler divergence (KLD) between the output token emission probabilities of the teacher and the student.

However, direct frame-to-frame distillation can misguide the streaming model: the audio-text frame alignments differ between the non-streaming and streaming models.
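To make the "simple approach" concrete, below is a minimal sketch of frame-level KLD distillation over output token probabilities, assuming the teacher and student emit logits on the same frame grid (exactly the assumption that alignment mismatch breaks). The tensor shapes and the temperature parameter are illustrative, not taken from the paper.

```python
import torch.nn.functional as F

def frame_level_kld(teacher_logits, student_logits, temperature=1.0):
    """Frame-level KL divergence between teacher and student token distributions.

    teacher_logits, student_logits: (batch, frames, vocab) tensors assumed to
    lie on the same frame grid.
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target;
    # "batchmean" sums over frames and vocabulary, then divides by batch size.
    return F.kl_div(s_logprob, t_prob, reduction="batchmean")
```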
 
1. Introduction
 
Dual-mode: a type of self-KD training that shares the same model for both streaming and non-streaming cases.

Limitation: the student model must be the same size as the teacher.
 
1. Introduction
 
This work proposes a novel KD method that applies layer-wise distillation on the encoder part of the model.
Advantages over existing methods:
Faster training
Better robustness to frame misalignment
Ability to utilize unlabeled data
 
1. Introduction
 
First, insert auxiliary non-streaming layers into the streaming model.
 
1. Introduction
 
Second, design a special KD loss function to enhance the feature extraction process.
 
1. Introduction
 
Compared to the baseline, our method achieves a 16% relative WER reduction on the LibriSpeech test-other dataset.

By incorporating large unlabeled data as an additional training resource, we can further reduce the WER by 3%.
 
2. Related Work - Distillation for Improving Streaming ASR

The most popular approach is to distill the output token probabilities:
Manual shifting
Multi-stage alignment matching
CTC alignments
 
2. Related Work - Distillation for Improving Streaming ASR

Another line of research, called dual-mode, has the non-streaming teacher and the streaming student share the same parameter set:
The teacher and the student must use the same (large) architecture
Has not been tested for the chunk-wise streaming setting
 
 
2. Related Work - Distillation for ASR Model Compression

Transducer-based end-to-end ASR models have been actively studied for compression.

Large self-supervised models have also used KD to reduce their model size.
 
2. Related Work - Cascaded Non-streaming and Streaming Layers

Cascade encoder: a special encoder design that stacks several non-streaming layers on top of the streaming layers.

This work also attaches additional non-streaming layers, but they are used only for KD and are removed after training is finished.
 
3. Distillation from Non-streaming Teacher to Streaming Student

Problem setup:
Input: speech frames X = (x_1, x_2, ..., x_T)
Target: text tokens Y = (y_1, y_2, ..., y_U)
Goal: maximize the likelihood P(Y | X)

3. Distillation from Non-streaming Teacher to Streaming Student

Encoder features:
Output features of the ℓ-th encoder layer: h^ℓ = (h^ℓ_1, h^ℓ_2, ..., h^ℓ_T)
Next layer's t-th frame output: h^{ℓ+1}_t = f^{ℓ+1}(h^ℓ_{t-Lc}, ..., h^ℓ_t, ..., h^ℓ_{t+Rc})
Lc and Rc are the left/right-context sizes; the non-streaming teacher and the streaming student differ in how much right context they can access.
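As a rough illustration of the context-limited frame computation above, the sketch below gathers the allowed window [t - Lc, t + Rc] for each frame before applying a per-frame layer function; `layer_fn` is a hypothetical stand-in for one encoder layer, not the paper's implementation.

```python
import torch

def limited_context_outputs(h, layer_fn, left_ctx, right_ctx):
    """Compute per-frame outputs where frame t only sees h[t-left_ctx : t+right_ctx+1].

    h: (T, D) features from the previous layer.
    layer_fn: callable mapping a (window, D) tensor to a single (D,) output
              (hypothetical stand-in for one encoder layer).
    A streaming encoder corresponds to a small right_ctx (e.g. 0); a
    non-streaming encoder effectively sees the whole utterance.
    """
    T = h.size(0)
    outputs = []
    for t in range(T):
        lo, hi = max(0, t - left_ctx), min(T, t + right_ctx + 1)
        outputs.append(layer_fn(h[lo:hi]))
    return torch.stack(outputs)
```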
 
3. Distillation from Non-streaming Teacher to Streaming Student

Layer-wise KD is applied from the teacher encoder to the student encoder.

To solve the context mismatch, this paper inserts auxiliary non-streaming layers at selected layers of the student.
 
3. Distillation from Non-streaming Teacher to Streaming Student

At the ℓ-th layer, the output frame features of the auxiliary layer: z^ℓ = g^ℓ(s^ℓ_1, ..., s^ℓ_T), where s^ℓ_t are the student's ℓ-th layer features.
Three parts of g^ℓ:
1. A simple projection layer
2. A single Transformer layer
3. A unidirectional LSTM layer
KD is then applied between the teacher layer features and the auxiliary layer outputs.
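A hedged sketch of the auxiliary block g^ℓ described above (projection, a single Transformer layer, a unidirectional LSTM), written with standard PyTorch modules. The dimensions (student 256-d projected toward the teacher's 512-d space) and the use of `nn.TransformerEncoderLayer` are assumptions for illustration; the paper's exact layer sizes are not recoverable from the slides.

```python
import torch.nn as nn

class AuxiliaryNonStreamingBlock(nn.Module):
    """Sketch of the auxiliary block: projection -> Transformer layer -> unidirectional LSTM."""

    def __init__(self, student_dim=256, teacher_dim=512, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)          # illustrative dimensions
        self.transformer = nn.TransformerEncoderLayer(
            d_model=teacher_dim, nhead=num_heads, batch_first=True
        )
        self.lstm = nn.LSTM(teacher_dim, teacher_dim, batch_first=True)

    def forward(self, s):
        # s: (batch, frames, student_dim) hidden states from a selected student layer.
        z = self.transformer(self.proj(s))   # non-streaming features z used for KD
        c, _ = self.lstm(z)                  # causal summary c used for the APC loss
        return z, c
```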
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Feature similarity loss:
L_feat = -(1/T) Σ_{t=1..T} log σ(cos(h_t, z_t))
σ: sigmoid activation function
cos(·,·): cosine similarity
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Self-attention loss:
L_attn = (1/(H·T)) Σ_{h=1..H} Σ_{t=1..T} KLD(a^{T,h}_t || a^{S,h}_t)
H: number of attention heads
a^{T,h}_t, a^{S,h}_t: teacher/student attention probability distributions of the t-th frame computed at the h-th attention head
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Attention probability of the t-th frame at the h-th head: a^h_t = softmax(q^h_t K^{h⊤} / √d)
q^h_t: query vector of the t-th frame
d: attention head dimension
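The two losses above can be sketched as follows; shapes and the clamping epsilon are illustrative, and the code assumes the teacher and student attention maps have matching shapes (e.g., after selecting corresponding heads), which the slides do not spell out.

```python
import torch
import torch.nn.functional as F

def feature_similarity_loss(h_teacher, z_aux):
    """L_feat = -(1/T) * sum_t log sigmoid(cos(h_t, z_t)); inputs of shape (T, D)."""
    cos = F.cosine_similarity(h_teacher, z_aux, dim=-1)          # (T,)
    return -torch.log(torch.sigmoid(cos)).mean()

def attention_kld_loss(attn_teacher, attn_student):
    """KL(teacher || student) over attention distributions, averaged over
    heads and query frames; inputs of shape (heads, T_query, T_key)."""
    kld = (attn_teacher * (attn_teacher.clamp_min(1e-8).log()
                           - attn_student.clamp_min(1e-8).log())).sum(dim=-1)
    return kld.mean()
```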
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss
 
Future prediction loss:
Inspired by autoregressive predictive coding (APC)
Uses the previous history to predict a future feature (several frames ahead)

APC:
Originally proposed for self-supervised speech pretraining
Trains unidirectional LSTM layers to minimize the difference between predicted and actual future features
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Output features of the unidirectional LSTM layer: c_t = LSTM(z_1, z_2, ..., z_t)
The APC loss is computed as:
L_APC = -(1/(T-n)) Σ_{t=1..T-n} log σ(cos(h_{t+n}, c_t))
(compare with L_feat = -(1/T) Σ_{t=1..T} log σ(cos(h_t, z_t)), which matches features at the same frame)
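A minimal sketch of the APC-style future prediction loss above; `n_future` (the prediction offset n) is illustrative, since the slides do not state its value.

```python
import torch
import torch.nn.functional as F

def apc_loss(h_teacher, c_lstm, n_future=3):
    """L_APC = -(1/(T-n)) * sum_{t=1..T-n} log sigmoid(cos(h_{t+n}, c_t)).

    h_teacher: (T, D) teacher features; c_lstm: (T, D) causal LSTM outputs.
    """
    # Align the causal summary at frame t with the teacher feature n frames ahead.
    cos = F.cosine_similarity(h_teacher[n_future:], c_lstm[:-n_future], dim=-1)
    return -torch.log(torch.sigmoid(cos)).mean()
```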
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss
 
To maximize the effectiveness of the APC loss, this work modifies the Transformer attention mask applied to the auxiliary non-streaming layers.
 
3. Distillation from Non-streaming Teacher to Streaming Student - Distillation Loss

Total loss:
L_total = L_RNN-T + α·L_feat + β·L_attn + γ·L_APC
N_L, N_U: the number of labeled and unlabeled samples; the transducer loss is averaged over the N_L labeled samples, while the KD losses are averaged over all N_L + N_U samples
α = 0.01, β = 0.0005, γ = 0.005
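Putting the pieces together, a sketch of the per-utterance loss combination with the weights quoted on the slide; treating the transducer term as absent for unlabeled utterances is the reading suggested by the slide, stated here as an assumption.

```python
# Loss weights quoted on the slide.
ALPHA, BETA, GAMMA = 0.01, 0.0005, 0.005

def combined_loss(rnnt_loss, feat_loss, attn_loss, apc_loss, is_labeled):
    """Combine the transducer loss (labeled utterances only) with the KD losses."""
    kd = ALPHA * feat_loss + BETA * attn_loss + GAMMA * apc_loss
    # Unlabeled utterances have no transcript, so they contribute only KD terms.
    return rnnt_loss + kd if is_labeled else kd
```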
 
4. Experimental Results - Setup

Dataset:
LibriSpeech-960h
6K hours of unlabeled data from the same domain (i.e., audiobook reading) are additionally exploited
 
4. Experimental Results - Setup

Model architecture: Conformer-Transducer ASR models (configurations summarized in the sketch below)

Teacher encoder:
A stack of 16 Conformer layers
512-dim features
8 attention heads

Student encoder:
A stack of 16 Conformer layers
256-dim features
4 attention heads
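For reference, the two encoder configurations can be written as simple dictionaries (hypothetical field names); note the dimension gap (512 vs. 256) that the auxiliary block's projection layer presumably bridges for feature-level KD.

```python
# Hypothetical config dictionaries mirroring the reported encoder sizes.
teacher_encoder = dict(layers=16, d_model=512, attention_heads=8)  # non-streaming teacher
student_encoder = dict(layers=16, d_model=256, attention_heads=4)  # streaming student
```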
 
4. Experimental Results - Setup

For the streaming student:
The convolution layers are set to be causal, and the attention mask is restricted so that each frame only accesses chunks within the left-context limit (see the mask sketch below)

For the auxiliary layers:
Use a single Transformer layer (self-attention + feed-forward)
Apply KD from the 4, 8, 12, 16-th layers of the teacher to the 4, 8, 12, 16-th layers of the student
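As referenced above, a minimal sketch of a chunk-wise attention mask with a left-context limit, using the boolean convention of torch.nn.MultiheadAttention (True = position is not allowed to attend). Frame-level sizes are illustrative; converting the 160 ms chunk and 640 ms left context into frame counts depends on the front-end stride, which is not given here.

```python
import torch

def chunkwise_attention_mask(num_frames, chunk_size, left_context_chunks):
    """Boolean (T, T) mask for chunk-wise streaming self-attention.

    A frame may attend to frames in its own chunk and up to
    `left_context_chunks` previous chunks; no future chunks are visible
    (right context = 0). Sizes are in frames, not milliseconds.
    """
    frame_chunk = torch.arange(num_frames) // chunk_size   # chunk index per frame
    q = frame_chunk.unsqueeze(1)                           # (T, 1) query chunk index
    k = frame_chunk.unsqueeze(0)                           # (1, T) key chunk index
    allowed = (k <= q) & (k >= q - left_context_chunks)
    return ~allowed                                        # True where attention is blocked
```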
 
4. Experimental Results - Setup

Training details:
Batch size: 1024
Peak learning rate: 1.25e-3
The streaming student uses chunk-wise training and inference
Chunk size: 160 ms
Left/right context size: 640 ms and 0 ms
Labeled and unlabeled data are mixed at a 1:1 ratio to construct an epoch
 
4. Experimental Results - ASR Performance

Compared to the baseline streaming model, the proposed method achieves a 16% relative WER reduction on the LibriSpeech test-other set; adding the unlabeled data yields a further 3% relative reduction.
 
5. Discussion

Teacher-to-Student Layer Connection
How to find the optimal connection between teacher and student layers?
One option is to identify layers with similar behaviors and connect them.
 
5. Discussion

Relationship to Pseudo Label Generation
To utilize unlabeled data, several studies have suggested generating pseudo labels for such data using an off-the-shelf large ASR model.

"Improving streaming automatic speech recognition with non-streaming model distillation on unsupervised data," in ICASSP 2021
 
6. Conclusion

In this paper:
A novel KD method for improving a small streaming student using a large non-streaming teacher

The key idea is to insert non-streaming auxiliary layers into the student, matching the context of the target features for knowledge distillation

Designed a special layer-wise KD loss function, which includes a future prediction task using the APC loss


