Factorized and progressive knowledge distillation for CTC-based ASR models

Cited by: 0
Authors
Tian, Sanli [1 ,2 ]
Li, Zehan [1 ,2 ]
Lyv, Zhaobiao [3 ]
Cheng, Gaofeng [1 ]
Xiao, Qing [3 ]
Li, Ta [1 ,2 ]
Zhao, Qingwei [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] China Unicom Guangdong Ind Internet Co Ltd, Dongguan, Guangdong, Peoples R China
Keywords
End-to-end speech recognition; Connectionist temporal classification; Knowledge distillation; NEURAL-NETWORK; SEQUENCE;
DOI
10.1016/j.specom.2024.103071
Chinese Library Classification
O42 [Acoustics];
Discipline Classification Codes
070206 ; 082403 ;
Abstract
Knowledge distillation (KD) is a popular model compression method that improves the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) based ASR models is challenging due to their peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently, for two main reasons. First, the non-blank frames in the teacher model's posterior matrix and hidden representations carry more acoustic and linguistic information than the blank frames, yet non-blank frames account for only a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher's blank-frame posteriors exhibit irregular probability distributions, which negatively impact the student model's learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages to help the student model gradually build up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, since the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher's posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operations on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves a 22.5% relative CER reduction on the Aishell-1 dataset, a 23.0% relative WER reduction on the Tedlium-2 dataset, and a 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate it on the hybrid CTC/Attention architecture as well as in scenarios with cross-model topology KD.
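The third-stage loss can be pictured as a frame-partitioned KL term. The sketch below is a minimal illustration of the idea described in the abstract, not the paper's exact FKL: it assumes frames are split by the teacher's top token (blank vs. non-blank) and that the KL divergence is averaged within each group before a weighted sum, so the scarce non-blank frames are not swamped by the many blank frames. The function name factorized_kl_sketch, the blank_id convention, and the weight alpha are illustrative assumptions, not quantities defined by the paper.

```python
# Illustrative sketch only: partitions teacher posteriors into blank and
# non-blank frames (by the teacher's top token) and averages the frame-wise
# KL term within each group separately. The specific split rule and the
# weighting are assumptions for illustration, not the paper's exact FKL.
import torch
import torch.nn.functional as F


def factorized_kl_sketch(student_logits, teacher_logits, blank_id=0, alpha=0.5):
    """student_logits, teacher_logits: (T, V) per-utterance CTC output logits."""
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)

    # Frame-wise KL(teacher || student), shape (T,)
    kl = torch.sum(teacher_logp.exp() * (teacher_logp - student_logp), dim=-1)

    # Split frames by whether the teacher predicts blank at that frame.
    is_blank = teacher_logp.argmax(dim=-1).eq(blank_id)

    nonblank_kl = kl[~is_blank].mean() if (~is_blank).any() else kl.new_zeros(())
    blank_kl = kl[is_blank].mean() if is_blank.any() else kl.new_zeros(())

    # Weight the two groups separately instead of averaging over all frames,
    # so non-blank frames contribute equally despite being far fewer.
    return alpha * nonblank_kl + (1.0 - alpha) * blank_kl
```

In practice such a term would be combined with the CTC loss and the earlier-stage objectives; the progressive three-stage schedule itself is not shown here.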
Pages: 15