Factorized and progressive knowledge distillation for CTC-based ASR models

Cited: 0
Authors
Tian, Sanli [1 ,2 ]
Li, Zehan [1 ,2 ]
Lyv, Zhaobiao [3 ]
Cheng, Gaofeng [1 ]
Xiao, Qing [3 ]
Li, Ta [1 ,2 ]
Zhao, Qingwei [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] China Unicom Guangdong Ind Internet Co Ltd, Dongguan, Guangdong, Peoples R China
Keywords
End-to-end speech recognition; Connectionist temporal classification; Knowledge distillation; NEURAL-NETWORK; SEQUENCE;
DOI
10.1016/j.specom.2024.103071
Chinese Library Classification (CLC)
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Knowledge distillation (KD) is a popular model compression method that improves the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) based ASR models is challenging because of their peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently, for two main reasons. First, the non-blank frames in the teacher model's posterior matrix and hidden representations carry more acoustic and linguistic information than the blank frames, yet non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher's blank-frame posteriors exhibit irregular probability distributions, which negatively impacts the student model's learning. We therefore propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages that let the student model gradually build up its knowledge. The first stage is a simple binary classification KD task in which the student learns to distinguish between non-blank and blank frames, since the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher's posterior matrix through our proposed factorized KL-divergence (FKL), which applies different operations to blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves a 22.5% relative CER reduction on the Aishell-1 dataset, a 23.0% relative WER reduction on the Tedlium-2 dataset, and a 17.6% relative WER reduction on the LibriSpeech dataset. To show the generality of our method, we also evaluate it on the hybrid CTC/Attention architecture as well as in cross-model topology KD scenarios.
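To make the factorized-distillation idea concrete, below is a minimal PyTorch sketch of a frame-wise loss that treats blank and non-blank frames differently. It is an illustration under stated assumptions, not the paper's exact FKL: the function name factorized_kl, the blank_id and temperature arguments, the split by the teacher's frame-level argmax, and the blank-frame treatment (matching only the blank probability mass) are all choices made for this sketch.

import torch
import torch.nn.functional as F

def factorized_kl(student_logits, teacher_logits, blank_id=0, temperature=1.0):
    """Distill non-blank and blank frames separately for one utterance.

    student_logits, teacher_logits: tensors of shape (T, V), unnormalized
    per-frame scores over a vocabulary that includes the CTC blank token.
    """
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)

    # A frame counts as "blank" if the teacher's argmax at that frame is blank.
    is_blank = t_logp.argmax(dim=-1).eq(blank_id)

    zero = torch.zeros((), device=student_logits.device)

    # Non-blank frames: full-distribution KL, preserving the richer
    # acoustic and linguistic detail these frames carry.
    nb_loss = zero
    if (~is_blank).any():
        nb_loss = F.kl_div(s_logp[~is_blank], t_logp[~is_blank],
                           reduction="batchmean", log_target=True)

    # Blank frames: match only the blank probability mass, ignoring the
    # irregular distribution over non-blank tokens (a choice of this sketch).
    b_loss = zero
    if is_blank.any():
        t_blank = t_logp[is_blank, blank_id].exp()
        s_blank = s_logp[is_blank, blank_id].exp()
        b_loss = F.binary_cross_entropy(s_blank, t_blank)

    # Averaging the two terms keeps the many blank frames from
    # overwhelming the few non-blank frames.
    return 0.5 * (nb_loss + b_loss)

In practice such a frame-wise term would be added to the usual CTC training loss with a weighting factor; the balance between the two factorized terms is likewise a tunable choice.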
Pages: 15
Related papers (50 in total)
  • [31] IMPROVING VERY DEEP TIME-DELAY NEURAL NETWORK WITH VERTICAL-ATTENTION FOR EFFECTIVELY TRAINING CTC-BASED ASR SYSTEMS
    Li, Sheng
    Lu, Xugang
    Takashima, Ryoichi
    Shen, Peng
    Kawahara, Tatsuya
    Kawai, Hisashi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 77 - 83
  • [32] Study on CTC-based heavy haul train dispatch system
    School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China
    Tiedao Xuebao, 2008, 4 (1-5):
  • [33] CTC-BASED GENE EXPRESSION FOR PREDICTING RESISTANCE TO ABIRATERONE AND ENZALUTAMIDE IN MCRPC
    Chung, Jae-Seung
    Wang, Yugang
    Henderson, James
    Singhal, Udit
    Qiao, Yuanyuan
    Zaslavsky, Alexander
    Hovelson, Dan
    Feng, Felix
    Palapattu, Ganesh
    Taichman, Russell
    Chinnaiyan, Arul
    Tomlins, Scott
    Morgan, Todd
    JOURNAL OF UROLOGY, 2017, 197 (04): E1357 - E1357
  • [34] Text-Conditioned Character Segmentation for CTC-Based Text Recognition
    Tanaka, Ryohei
    Osada, Kunio
    Furuhata, Akio
    DOCUMENT ANALYSIS AND RECOGNITION, ICDAR 2021, PT III, 2021, 12823 : 142 - 156
  • [35] Cross-lingual adaptation of a CTC-based multilingual acoustic model
    Tong, Sibo
    Garner, Philip N.
    Bourlard, Herve
    SPEECH COMMUNICATION, 2018, 104 : 39 - 46
  • [36] INTERDECODER: USING ATTENTION DECODERS AS INTERMEDIATE REGULARIZATION FOR CTC-BASED SPEECH RECOGNITION
    Komatsu, Tatsuya
    Fujita, Yusuke
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 46 - 51
  • [37] DETECT III and IV - Individualized CTC-based therapy of metastatic breast cancer
    Polasik, A.
    Schramm, A.
    Friedl, T. W. P.
    Rack, B.
    Trapp, E.
    Fasching, P. A.
    Taran, F-A
    Hartkopf, A.
    Schneeweiss, A.
    Mueller, V.
    Aktas, B.
    Pantel, K.
    Meier-Stiegen, F.
    Wimberger, P.
    Janni, W.
    Fehm, T.
    CANCER RESEARCH, 2017, 77
  • [38] Multipatch Progressive Pansharpening With Knowledge Distillation
    Gong, Meiqi
    Zhang, Hao
    Xu, Han
    Tian, Xin
    Ma, Jiayi
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [39] CTC-Based Learning of Chroma Features for Score-Audio Music Retrieval
    Zalkow, Frank
    Mueller, Meinard
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 2957 - 2971
  • [40] Improving Efficiency and Performance Through CTC-Based Transformers for Mathematical Expression Recognition
    Anitei, Dan
    Parres, Daniel
    Sanchez, Joan Andreu
    Miguel Benedi, Jose
    DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT V, 2024, 14808 : 3 - 20