A CTC Alignment-Based Non-Autoregressive Transformer for End-to-End Automatic Speech Recognition

Cited by: 6
Authors
Fan, Ruchao [1]
Chu, Wei [2]
Chang, Peng [2]
Alwan, Abeer [1]
Affiliations
[1] Univ Calif Los Angeles, Dept Elect & Comp Engn, Los Angeles, CA 90095 USA
[2] PAII Inc, Palo Alto, CA 94306 USA
Keywords
CTC alignment; non-autoregressive transformer; end-to-end ASR; intermediate loss; models
DOI
10.1109/TASLP.2023.3263789
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED models, generate tokens one at a time and are therefore relatively slow during inference. In this article, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, the word embeddings of the autoregressive transformer (AT) are replaced with token-level acoustic embeddings (TAEs), which are extracted from the encoder outputs using the acoustic boundary information provided by the CTC alignment. Because TAEs can be obtained in parallel, the output tokens can also be generated in parallel. During training, the Viterbi alignment is used for TAE generation, and multiple training strategies are explored to improve word error rate (WER). During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between training and testing. Experimental results show that CASS-NAT achieves a WER close to that of the AT on various ASR tasks while providing an approximately 24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to the AT. We find that TAEs play a role similar to that of word embeddings in capturing grammatical structure, which suggests that some semantic information may be learned from TAEs without a language model.
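The abstract's central mechanism is the substitution of word embeddings with token-level acoustic embeddings cut out of the encoder output along CTC-alignment boundaries. The short PyTorch sketch below illustrates that idea under simplifying assumptions: boundaries are derived by collapsing a frame-level alignment, and each token's TAE is approximated by mean-pooling its frames, whereas the paper extracts TAEs with a learned token-acoustic extractor. All function names here are hypothetical, not the authors' code.

# A minimal sketch (an assumption, not the authors' implementation) of the
# core idea: collapse a frame-level CTC alignment into per-token acoustic
# boundaries and pool the encoder frames inside each boundary into one
# token-level acoustic embedding (TAE).
import torch

def ctc_alignment_to_boundaries(alignment, blank_id=0):
    """Turn a frame-level CTC alignment (LongTensor of shape [T]) into a list
    of (start_frame, end_frame_exclusive) segments, one per emitted token."""
    boundaries, start, prev = [], None, blank_id
    for t, label in enumerate(alignment.tolist()):
        if label != blank_id and label != prev:      # a new token begins here
            if start is not None:
                boundaries.append((start, t))        # close the previous token
            start = t
        elif label == blank_id and start is not None:
            boundaries.append((start, t))            # blank closes the token
            start = None
        prev = label
    if start is not None:                            # token running to the end
        boundaries.append((start, alignment.numel()))
    return boundaries

def extract_taes(encoder_out, alignment, blank_id=0):
    """Mean-pool encoder frames ([T, D]) inside each token segment, giving one
    acoustic embedding per output token ([U, D]). The segments are independent,
    so all TAEs can be computed in parallel."""
    segments = ctc_alignment_to_boundaries(alignment, blank_id)
    if not segments:
        return encoder_out.new_zeros(0, encoder_out.size(1))
    return torch.stack([encoder_out[s:e].mean(dim=0) for s, e in segments])

# Toy example: 8 encoder frames, hidden size 4, alignment "_ 5 5 _ 7 _ 3 3"
enc = torch.randn(8, 4)
ali = torch.tensor([0, 5, 5, 0, 7, 0, 3, 3])
print(extract_taes(enc, ali).shape)  # torch.Size([3, 4]): one TAE per token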
Pages: 1436-1448
Number of pages: 13
Related Papers
50 items in total
  • [21] Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models. Wang, Tianzi; Fujita, Yuya; Chang, Xuankai; Watanabe, Shinji. INTERSPEECH 2021, 2021: 3755-3759.
  • [22] Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture. Miao, Haoran; Cheng, Gaofeng; Zhang, Pengyuan; Yan, Yonghong. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1452-1465.
  • [23] Non-Autoregressive Transformer with Unified Bidirectional Decoder for Automatic Speech Recognition. Zhang, Chuan-Fei; Liu, Yan; Zhang, Tian-Hao; Chen, Song-Lu; Chen, Feng; Yin, Xu-Cheng. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 6527-6531.
  • [24] End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection. Yoshimura, Takenori; Hayashi, Tomoki; Takeda, Kazuya; Watanabe, Shinji. 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020: 6999-7003.
  • [25] Investigation of Transformer based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition. Zhang, Shiliang; Lei, Ming; Yan, Zhijie. INTERSPEECH 2019, 2019: 2180-2184.
  • [26] Hybrid Autoregressive and Non-Autoregressive Transformer Models for Speech Recognition. Tian, Zhengkun; Yi, Jiangyan; Tao, Jianhua; Zhang, Shuai; Wen, Zhengqi. IEEE Signal Processing Letters, 2022, 29: 762-766.
  • [27] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors. Rybicka, Magdalena; Villalba, Jesus; Thebaud, Thomas; Dehak, Najim; Kowalczyk, Konrad. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 3960-3973.
  • [28] CTC-based Non-autoregressive Speech Translation. Xu, Chen; Liu, Xiaoqian; Liu, Xiaowen; Sun, Qingxuan; Zhang, Yuhao; Yang, Murun; Dong, Qianqian; Ko, Tom; Wang, Mingxuan; Xiao, Tong; Ma, Anxiang; Zhu, Jingbo. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023): Long Papers, Vol. 1, 2023: 13321-13339.
  • [29] Semantic Mask for Transformer based End-to-End Speech Recognition. Wang, Chengyi; Wu, Yu; Du, Yujiao; Li, Jinyu; Liu, Shujie; Lu, Liang; Ren, Shuo; Ye, Guoli; Zhao, Sheng; Zhou, Ming. INTERSPEECH 2020, 2020: 971-975.
  • [30] Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT. Bai, Ye; Yi, Jiangyan; Tao, Jianhua; Tian, Zhengkun; Wen, Zhengqi; Zhang, Shuai. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1897-1911.