IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR

Cited by: 28
Authors
Higuchi, Yosuke [1]
Inaguma, Hirofumi [2]
Watanabe, Shinji [3]
Ogawa, Tetsuji [1]
Kobayashi, Tetsunori [1]
Affiliations
[1] Waseda Univ, Tokyo, Japan
[2] Kyoto Univ, Kyoto, Japan
[3] Johns Hopkins Univ, Baltimore, MD 21218 USA
Keywords
Non-autoregressive sequence generation; connectionist temporal classification; end-to-end speech recognition; end-to-end speech translation; SPEECH RECOGNITION;
DOI
10.1109/ICASSP39728.2021.9414198
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification code
070206; 082403;
Abstract
For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference speed, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network architecture by employing a recently proposed architecture called Conformer. Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% -> 9.1% WER on WSJ). Moreover, Mask-CTC now achieves competitive results to AR models with no degradation of inference speed (< 0.1 RTF using CPU). We also show a potential application of Mask-CTC to end-to-end speech translation.
Pages: 8363-8367
Page count: 5
Related papers
50 in total
  • [21] End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Dehak, Najim
    Kowalczyk, Konrad
    [J]. INTERSPEECH 2022, 2022, : 5090 - 5094
  • [22] IMPROVING NON-AUTOREGRESSIVE END-TO-END SPEECH RECOGNITION WITH PRE-TRAINED ACOUSTIC AND LANGUAGE MODELS
    Deng, Keqi
    Yang, Zehui
    Watanabe, Shinji
    Higuchi, Yosuke
    Cheng, Gaofeng
    Zhang, Pengyuan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8522 - 8526
  • [23] Streaming End-to-End ASR Using CTC Decoder and DRA for Linguistic Information Substitution
    Takagi, Tatsunari
    Ogawa, Atsunori
    Kitaoka, Norihide
    Wakabayashi, Yukoh
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1779 - 1783
  • [24] INVESTIGATING SEQUENCE-LEVEL NORMALISATION FOR CTC-LIKE END-TO-END ASR
    Zhao, Zeyu
    Bell, Peter
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7792 - 7796
  • [25] HIERARCHICAL CONDITIONAL END-TO-END ASR WITH CTC AND MULTI-GRANULAR SUBWORD UNITS
    Higuchi, Yosuke
    Karube, Keita
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7797 - 7801
  • [26] TWO-STAGE AUGMENTATION AND ADAPTIVE CTC FUSION FOR IMPROVED ROBUSTNESS OF MULTI-STREAM END-TO-END ASR
    Li, Ruizhi
    Sell, Gregory
    Hermansky, Hynek
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 229 - 235
  • [27] Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain
    Guo, Pengcheng
    Chang, Xuankai
    Watanabe, Shinji
    Xie, Lei
    [J]. INTERSPEECH 2021, 2021, : 3720 - 3724
  • [28] AN END-TO-END SPEECH ACCENT RECOGNITION METHOD BASED ON HYBRID CTC/ATTENTION TRANSFORMER ASR
    Gao, Qiang
    Wu, Haiwei
    Sun, Yanqing
    Duan, Yitao
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7253 - 7257
  • [29] CTC-Based End-To-End ASR for the Low Resource Sanskrit Language with Spectrogram Augmentation
    Anoop, C. S.
    Ramakrishnan, A. G.
    [J]. 2021 NATIONAL CONFERENCE ON COMMUNICATIONS (NCC), 2021, : 111 - 116
  • [30] Out-of-vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System
    Egorova, Ekaterina
    Vydana, Hari Krishna
    Burget, Lukas
    Cernocky, Jan
    [J]. INTERSPEECH 2021, 2021, : 2901 - 2905