IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR

Cited by: 28
Authors
Higuchi, Yosuke [1]
Inaguma, Hirofumi [2]
Watanabe, Shinji [3]
Ogawa, Tetsuji [1]
Kobayashi, Tetsunori [1]
Affiliations
[1] Waseda Univ, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
Non-autoregressive sequence generation; connectionist temporal classification; end-to-end speech recognition; end-to-end speech translation; SPEECH RECOGNITION;
DOI
10.1109/ICASSP39728.2021.9414198
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network architecture by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% -> 9.1% WER on WSJ). Moreover, Mask-CTC now achieves competitive results to AR models with no degradation of inference speed (< 0.1 RTF using CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.
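The general idea behind combining CTC with mask-predict can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `<blank>` and `<mask>` symbols, the confidence threshold, and the rule of keeping the highest frame probability per collapsed token are all assumptions made for illustration. The sketch collapses frame-level greedy CTC output and replaces low-confidence tokens with a mask that a decoder could later fill in. It also computes the relative WER reduction implied by the figures quoted in the abstract.

```python
from typing import List, Optional, Tuple

BLANK = "<blank>"  # hypothetical CTC blank symbol
MASK = "<mask>"    # hypothetical mask symbol


def ctc_greedy_with_mask(
    frames: List[Tuple[str, float]], threshold: float
) -> List[str]:
    """Collapse per-frame (token, probability) pairs the CTC way,
    then mask every token whose confidence is below `threshold`.

    Assumption: a collapsed token's confidence is the highest
    probability among its repeated frames.
    """
    tokens: List[str] = []
    confidences: List[float] = []
    prev: Optional[str] = None
    for token, prob in frames:
        if token != BLANK and token == prev:
            # Repeated frame of the same token: merge it.
            confidences[-1] = max(confidences[-1], prob)
        elif token != BLANK:
            # New token (first frame, or separated by a blank).
            tokens.append(token)
            confidences.append(prob)
        prev = token
    return [t if c >= threshold else MASK
            for t, c in zip(tokens, confidences)]


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction of an error rate, as a fraction."""
    return (before - after) / before


if __name__ == "__main__":
    frames = [("c", 0.9), ("c", 0.8), (BLANK, 0.99),
              ("a", 0.4), ("t", 0.95), ("t", 0.7)]
    print(ctc_greedy_with_mask(frames, threshold=0.5))
    # WER figures quoted in the abstract: 15.5% -> 9.1% on WSJ.
    print(round(100 * relative_reduction(15.5, 9.1), 1))
```

With the toy frames above, the low-confidence "a" is masked while "c" and "t" are kept. The WER figures from the abstract correspond to a relative reduction of roughly 41%.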
Pages: 8363-8367
Page count: 5
Related papers
50 in total
  • [31] DOES SPEECH ENHANCEMENT WORK WITH END-TO-END ASR OBJECTIVES?: EXPERIMENTAL ANALYSIS OF MULTICHANNEL END-TO-END ASR
    Ochiai, Tsubasa
    Watanabe, Shinji
    Katagiri, Shigeru
    [J]. 2017 IEEE 27TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2017,
  • [32] FAST-MD: FAST MULTI-DECODER END-TO-END SPEECH TRANSLATION WITH NON-AUTOREGRESSIVE HIDDEN INTERMEDIATES
    Inaguma, Hirofumi
    Dalmia, Siddharth
    Yan, Brian
    Watanabe, Shinji
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 922 - 929
  • [33] Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT
    Bai, Ye
    Yi, Jiangyan
    Tao, Jianhua
    Tian, Zhengkun
    Wen, Zhengqi
    Zhang, Shuai
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29 : 1897 - 1911
  • [34] Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM
    Futami, Hayato
    Inaguma, Hirofumi
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. INTERSPEECH 2022, 2022, : 3889 - 3893
  • [35] CTC-based Non-autoregressive Speech Translation
    Xu, Chen
    Liu, Xiaoqian
    Liu, Xiaowen
    Sun, Qingxuan
    Zhang, Yuhao
    Yang, Murun
    Dong, Qianqian
    Ko, Tom
    Wang, Mingxuan
    Xiao, Tong
    Ma, Anxiang
    Zhu, Jingbo
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 13321 - 13339
  • [36] SPEAKER ADAPTATION FOR END-TO-END CTC MODELS
    Li, Ke
    Li, Jinyu
    Zhao, Yong
    Kumar, Kshitiz
    Gong, Yifan
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 542 - 549
  • [37] Towards Lifelong Learning of End-to-end ASR
    Chang, Heng-Jui
    Lee, Hung-yi
    Lee, Lin-shan
    [J]. INTERSPEECH 2021, 2021, : 2551 - 2555
  • [38] Contextual Biasing for End-to-End Chinese ASR
    Zhang, Kai
    Zhang, Qiuxia
    Wang, Chung-Che
    Jang, Jyh-Shing Roger
    [J]. IEEE ACCESS, 2024, 12 : 92960 - 92975
  • [39] UNSUPERVISED MODEL ADAPTATION FOR END-TO-END ASR
    Sivaraman, Ganesh
    Casal, Ricardo
    Garland, Matt
    Khoury, Elie
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6987 - 6991
  • [40] End-to-End Topic Classification without ASR
    Dong, Zexian
    Liu, Jia
    Zhang, Wei-Qiang
    [J]. 2019 IEEE 19TH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT 2019), 2019,