A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition

Cited by: 6
Authors
Fan, Ruchao [1 ]
Chu, Wei [2 ]
Chang, Peng [2 ]
Alwan, Abeer [1 ]
Affiliations
[1] Univ Calif Los Angeles, Dept Elect & Comp Engn, Los Angeles, CA 90095 USA
[2] PAII Inc, Palo Alto, CA 94306 USA
Keywords
CTC alignment; non-autoregressive transformer; end-to-end ASR; intermediate loss; models
DOI
10.1109/TASLP.2023.3263789
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this article, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs with the acoustic boundary information offered by the CTC alignment. TAE can be obtained in parallel, resulting in a parallel generation of output tokens. During training, Viterbi alignment is used for TAE generation, and multiple training strategies are further explored to improve the word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between the training and testing processes. Experimental results show that CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs have similar functionality to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.
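The core idea in the abstract — using CTC alignment boundaries to turn frame-level encoder outputs into one acoustic embedding per output token — can be sketched as below. This is a minimal illustration under assumptions: the function name is hypothetical, and simple mean-pooling stands in for the paper's extraction mechanism (the actual model may use learned attention within each token's segment rather than averaging).

```python
import numpy as np

def tae_from_ctc_alignment(encoder_out, alignment, blank=0):
    """Pool encoder frames into token-level acoustic embeddings (TAEs)
    using a frame-level CTC alignment (e.g. a Viterbi alignment).

    encoder_out: (T, D) array of encoder hidden states.
    alignment:   length-T array of per-frame token ids, with `blank`
                 marking frames not assigned to any token.
    Returns an (N, D) array: one embedding per non-blank token segment,
    so the decoder can generate all N output tokens in parallel.
    """
    taes, start, prev = [], None, blank
    T = len(alignment)
    for t in range(T + 1):
        cur = alignment[t] if t < T else blank  # sentinel closes last segment
        if cur != prev:
            if prev != blank:
                # segment [start, t) belongs to one token: mean-pool its frames
                taes.append(encoder_out[start:t].mean(axis=0))
            start, prev = t, cur
    if not taes:
        return np.zeros((0, encoder_out.shape[1]))
    return np.stack(taes)
```

For example, with alignment `[blank, a, a, blank, b, b, b, blank]` the function returns two embeddings: the mean of frames 1-2 and the mean of frames 4-6. Because every segment is known in advance from the alignment, all TAEs (and hence all output tokens) can be computed in one parallel step, which is the source of the inference speedup over autoregressive decoding.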
Pages: 1436-1448
Page count: 13
Related papers
50 records
  • [1] Fan, Ruchao; Chu, Wei; Chang, Peng; Xiao, Jing. CASS-NAT: CTC Alignment-Based Single Step Non-Autoregressive Transformer for Speech Recognition. ICASSP 2021: 5889-5893.
  • [2] Tian, Zhengkun; Yi, Jiangyan; Tao, Jianhua; Bai, Ye; Zhang, Shuai; Wen, Zhengqi. Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition. INTERSPEECH 2020: 5026-5030.
  • [3] Gao, Zhifu; Zhang, Shiliang; McLoughlin, Ian; Yan, Zhijie. Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition. INTERSPEECH 2022: 2063-2067.
  • [4] Chuang, Shun-Po; Chuang, Yung-Sung; Chang, Chih-Chiang; Lee, Hung-yi. Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation. Findings of ACL-IJCNLP 2021: 1068-1077.
  • [5] Omachi, Motoi; Fujita, Yuya; Watanabe, Shinji; Wang, Tianzi. Non-Autoregressive End-to-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing. ICASSP 2022: 6772-6776.
  • [6] Li, Mohan; Doddipatla, Rama. Non-Autoregressive End-to-End Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding. IEEE SLT 2022: 390-397.
  • [7] Higuchi, Yosuke; Watanabe, Shinji; Chen, Nanxin; Ogawa, Tetsuji; Kobayashi, Tetsunori. Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict. INTERSPEECH 2020: 3655-3659.
  • [8] Higuchi, Yosuke; Inaguma, Hirofumi; Watanabe, Shinji; Ogawa, Tetsuji; Kobayashi, Tetsunori. Improved Mask-CTC for Non-Autoregressive End-to-End ASR. ICASSP 2021: 8363-8367.
  • [9] Dong, Fang; Qian, Yiyang; Wang, Tianlei; Liu, Peng; Cao, Jiuwen. A Transformer-Based End-to-End Automatic Speech Recognition Algorithm. IEEE Signal Processing Letters, 2023, 30: 1592-1596.
  • [10] Miao, Haoran; Cheng, Gaofeng; Gao, Changfeng; Zhang, Pengyuan; Yan, Yonghong. Transformer-Based Online CTC/Attention End-to-End Speech Recognition Architecture. ICASSP 2020: 6084-6088.