Factorized and progressive knowledge distillation for CTC-based ASR models

Cited by: 0
Authors
Tian, Sanli [1 ,2 ]
Li, Zehan [1 ,2 ]
Lyv, Zhaobiao [3 ]
Cheng, Gaofeng [1 ]
Xiao, Qing [3 ]
Li, Ta [1 ,2 ]
Zhao, Qingwei [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] China Unicom Guangdong Ind Internet Co Ltd, Dongguan, Guangdong, Peoples R China
Keywords
End-to-end speech recognition; Connectionist temporal classification; Knowledge distillation; Neural network; Sequence
DOI
10.1016/j.specom.2024.103071
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Knowledge distillation (KD) is a popular model compression method that improves the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) based ASR models is challenging because of the peaky posterior property of CTC. In this paper, we propose to address this issue by treating non-blank and blank frames differently, for two main reasons. First, the non-blank frames in the teacher model's posterior matrix and hidden representations carry more acoustic and linguistic information than the blank frames, yet non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher's blank-frame posteriors exhibit irregular probability distributions, which negatively impacts the student model's learning. We therefore propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework with three incremental stages that help the student model gradually build up its knowledge. The first stage is a simple binary classification KD task in which the student learns to distinguish between non-blank and blank frames, since the two types of frames are handled separately in the subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher's posterior matrix through our proposed factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, the proposed method achieves a 22.5% relative CER reduction on the Aishell-1 dataset, a 23.0% relative WER reduction on the Tedlium-2 dataset, and a 17.6% relative WER reduction on the LibriSpeech dataset. To show the generality of our method, we also evaluate it on the hybrid CTC/Attention architecture as well as in scenarios with cross-model topology KD.
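As a concrete illustration of the factorization idea described in the abstract, the sketch below splits teacher posteriors into blank-dominant and non-blank-dominant frames and distills the two groups separately. It is a minimal sketch, not the authors' implementation of FKL: splitting frames by the teacher's argmax, the blank index, the temperature, and the binary blank-vs-rest KL used on blank frames are all assumptions made here for illustration.

```python
# Minimal sketch (assumptions noted above) of a factorized KL-style
# distillation loss for frame-level CTC posteriors.
import torch
import torch.nn.functional as F

def factorized_kl(student_logits, teacher_logits, blank_id=0, temperature=1.0):
    """student_logits, teacher_logits: (T, V) frame-level logits of one utterance."""
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # Assumption: a frame is "blank" if the teacher's most likely token is blank.
    is_blank = t_log_probs.argmax(dim=-1).eq(blank_id)

    # Non-blank frames: match the full teacher distribution (they carry the
    # acoustic/linguistic content but are only a small fraction of all frames).
    kl_per_frame = F.kl_div(s_log_probs, t_log_probs,
                            log_target=True, reduction="none").sum(-1)
    nonblank_loss = (kl_per_frame[~is_blank].mean()
                     if (~is_blank).any() else student_logits.new_zeros(()))

    # Blank frames: only match the blank-vs-rest probability mass (binary KL),
    # ignoring the teacher's irregular distribution over non-blank tokens there.
    t_blank = t_log_probs[is_blank, blank_id].exp().clamp(1e-6, 1 - 1e-6)
    s_blank = s_log_probs[is_blank, blank_id].exp().clamp(1e-6, 1 - 1e-6)
    blank_loss = ((t_blank * (t_blank / s_blank).log()
                   + (1 - t_blank) * ((1 - t_blank) / (1 - s_blank)).log()).mean()
                  if is_blank.any() else student_logits.new_zeros(()))

    return nonblank_loss + blank_loss
```

Averaging the two groups separately keeps the scarce non-blank frames from being dominated by the far more numerous blank frames, which is the learning imbalance the abstract refers to.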
Pages: 15
Related papers (50 in total)
  • [1] A CONTEXT-AWARE KNOWLEDGE TRANSFERRING STRATEGY FOR CTC-BASED ASR
    Lu, Ke-Han
    Chen, Kuan-Yu
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 60 - 67
  • [2] Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter
    Andrusenko, Andrei
    Laptev, Andrei
    Bataev, Vladimir
    Lavrukhin, Vitaly
    Ginsburg, Boris
    INTERSPEECH 2024, 2024, : 757 - 761
  • [3] DISTILLING ATTENTION WEIGHTS FOR CTC-BASED ASR SYSTEMS
    Moriya, Takafumi
    Sato, Hiroshi
    Tanaka, Tomohiro
    Ashihara, Takanori
    Masumura, Ryo
    Shinohara, Yusuke
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6894 - 6898
  • [4] InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR
    Nakagome, Yu
    Komatsu, Tatsuya
    Fujita, Yusuke
    Ichimura, Shuta
    Kida, Yusuke
    INTERSPEECH 2022, 2022, : 5140 - 5144
  • [5] INTER-KD: INTERMEDIATE KNOWLEDGE DISTILLATION FOR CTC-BASED AUTOMATIC SPEECH RECOGNITION
    Yoon, Ji Won
    Woo, Beom Jun
    Ahn, Sunghwan
    Lee, Hyeonseung
    Kim, Nam Soo
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 280 - 286
  • [6] Knowledge Distillation For CTC-based Speech Recognition Via Consistent Acoustic Representation Learning
    Tian, Sanli
    Deng, Keqi
    Li, Zehan
    Ye, Lingxuan
    Cheng, Gaofeng
    Li, Ta
    Yan, Yonghong
    INTERSPEECH 2022, 2022, : 2633 - 2637
  • [7] Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions
    Nozaki, Jumon
    Komatsu, Tatsuya
    INTERSPEECH 2021, 2021, : 3735 - 3739
  • [8] Cons-KD: Dropout-Robust Knowledge Distillation for CTC-Based Automatic Speech Recognition
    Yoon, Ji Won
    Lee, Hyeonseung
    Kang, Ju Yeon
    Kim, Nam Soo
    IEEE ACCESS, 2024, 12 : 131136 - 131146
  • [9] ON LATTICE-FREE BOOSTED MMI TRAINING OF HMM AND CTC-BASED FULL-CONTEXT ASR MODELS
    Zhang, Xiaohui
    Manohar, Vimal
    Zhang, David
    Zhang, Frank
    Shi, Yangyang
    Singhal, Nayan
    Chan, Julian
    Peng, Fuchun
    Saraf, Yatharth
    Seltzer, Mike
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 1026 - 1033
  • [10] Improving CTC-based Handwritten Chinese Text Recognition with Cross-Modality Knowledge Distillation and Feature Aggregation
    Wu, Shilian
    Li, Yongrui
    Wang, Zengfu
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 792 - 797