IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR

Cited: 28
Authors
Higuchi, Yosuke [1 ]
Inaguma, Hirofumi [2 ]
Watanabe, Shinji [3 ]
Ogawa, Tetsuji [1 ]
Kobayashi, Tetsunori [1 ]
Affiliations
[1] Waseda Univ, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
Non-autoregressive sequence generation; connectionist temporal classification; end-to-end speech recognition; end-to-end speech translation; speech recognition
DOI
10.1109/ICASSP39728.2021.9414198
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
For real-world deployment of automatic speech recognition (ASR), the system should offer fast inference while keeping computational requirements low. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose enhancing the encoder network by adopting the recently proposed Conformer architecture. Next, we propose new training and decoding methods that introduce an auxiliary objective for predicting the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% -> 9.1% WER on WSJ). Moreover, Mask-CTC now achieves results competitive with AR models at no cost in inference speed (< 0.1 RTF on a CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.
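For readers unfamiliar with the decoding scheme the abstract summarizes, the following is a minimal, self-contained Python sketch of Mask-CTC-style inference: low-confidence tokens in the collapsed greedy CTC output are masked and then filled in over a few mask-predict iterations. The names (ctc_collapse, mask_ctc_decode, fill_fn, dummy_fill) and the confidence threshold are hypothetical placeholders, not the authors' code; in the actual system the masked positions are predicted by a trained conditional masked language model decoder, and the proposed length-prediction objective additionally allows insertions and deletions, which this sketch omits.

```python
# Minimal sketch of Mask-CTC-style decoding (NOT the authors' implementation).
# `fill_fn` stands in for a trained conditional masked language model decoder.

MASK = "<mask>"

def ctc_collapse(frame_ids, blank=0):
    """Collapse a frame-level CTC path: merge repeated labels, drop blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

def mask_ctc_decode(token_probs, fill_fn, threshold=0.9, iterations=2):
    """token_probs: [(token, confidence)] from the collapsed greedy CTC output.
    Low-confidence tokens are replaced with <mask> and refilled over a few
    mask-predict iterations, committing the most confident predictions first."""
    tokens = [tok if p >= threshold else MASK for tok, p in token_probs]
    for it in range(iterations):
        idx = [i for i, t in enumerate(tokens) if t == MASK]
        if not idx:
            break
        preds = fill_fn(tokens, idx)  # one (token, confidence) per masked slot
        # Commit the most confident predictions; fill everything on the last pass.
        n_commit = len(idx) if it == iterations - 1 else max(1, len(idx) // 2)
        order = sorted(range(len(idx)), key=lambda k: -preds[k][1])
        for k in order[:n_commit]:
            tokens[idx[k]] = preds[k][0]
    return tokens

# Toy usage: a dummy fill function that always predicts "a" with confidence 0.8.
dummy_fill = lambda tokens, idx: [("a", 0.8) for _ in idx]
print(ctc_collapse([0, 3, 3, 0, 0, 5, 5, 5, 0]))                      # -> [3, 5]
print(mask_ctc_decode([("h", 0.95), ("x", 0.40), ("t", 0.97)], dummy_fill))
# -> ['h', 'a', 't']
```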
Pages: 8363 - 8367
Number of pages: 5
Related papers
50 records in total
  • [1] Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
    Higuchi, Yosuke
    Watanabe, Shinji
    Chen, Nanxin
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    [J]. INTERSPEECH 2020, 2020, : 3655 - 3659
  • [2] Non-autoregressive Deliberation-Attention based End-to-End ASR
    Gao, Changfeng
    Cheng, Gaofeng
    Zhou, Jun
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. 2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [3] Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models
    Wang, Tianzi
    Fujita, Yuya
    Chang, Xuankai
    Watanabe, Shinji
    [J]. INTERSPEECH 2021, 2021, : 3755 - 3759
  • [4] BOUNDARY AND CONTEXT AWARE TRAINING FOR CIF-BASED NON-AUTOREGRESSIVE END-TO-END ASR
    Yu, Fan
    Luo, Haoneng
    Guo, Pengcheng
    Liang, Yuhao
    Yao, Zhuoyuan
    Xie, Lei
    Gao, Yingying
    Hou, Leijing
    Zhang, Shilei
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 328 - 334
  • [5] Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation
    Chuang, Shun-Po
    Chuang, Yung-Sung
    Chang, Chih-Chiang
    Lee, Hung-yi
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 1068 - 1077
  • [6] A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition
    Fan, Ruchao
    Chu, Wei
    Chang, Peng
    Alwan, Abeer
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1436 - 1448
  • [7] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Thebaud, Thomas
    Dehak, Najim
    Kowalczyk, Konrad
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
  • [8] Non-autoregressive End-to-End TTS with Coarse-to-Fine Decoding
    Wang, Tao
    Liu, Xuefei
    Tao, Jianhua
    Yi, Jiangyan
    Fu, Ruibo
    Wen, Zhengqi
    [J]. INTERSPEECH 2020, 2020, : 3984 - 3988
  • [9] Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems
    Kulkarni, Ajinkya
    Colotte, Vincent
    Jouvet, Denis
    [J]. INTERSPEECH 2022, 2022, : 4581 - 4585
  • [10] FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
    Wang, Yongqi
    Zhao, Zhou
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5678 - 5687