Non-autoregressive Deliberation-Attention based End-to-End ASR

Cited by: 1
Authors
Gao, Changfeng [1 ,2 ]
Cheng, Gaofeng [1 ,2 ]
Zhou, Jun [1 ,2 ]
Zhang, Pengyuan [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; end-to-end; non-autoregressive; highly paralleled;
DOI
10.1109/ISCSLP49672.2021.9362115
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Attention-based encoder-decoder end-to-end (E2E) automatic speech recognition (ASR) architectures have achieved state-of-the-art results on many ASR tasks. However, conventional attention-based E2E ASR models rely on an autoregressive decoder, which makes parallel computation during decoding difficult. In this paper, we propose a novel deliberation-attention (D-Att) based E2E ASR architecture, which replaces the autoregressive attention-based decoder with a non-autoregressive frame-level D-Att decoder and thus significantly accelerates GPU parallel decoding. The D-Att decoder differs from the conventional attention decoder in two respects: first, the D-Att decoder uses frame-level text embeddings (FLTE) generated by an auxiliary ASR model instead of the ground-truth transcripts or previous predictions required by the conventional attention decoder; second, the conventional attention decoder is trained in a left-to-right, label-synchronous manner, whereas the D-Att decoder is trained under the supervision of a connectionist temporal classification (CTC) loss and uses the FLTE to provide text information. Our experiments on the Aishell, HKUST, and WSJ benchmarks show that the proposed D-Att E2E ASR models match the performance of state-of-the-art autoregressive attention-based Transformer E2E ASR baselines while being 10 times faster with GPU parallel decoding.
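The speedup described in the abstract comes from CTC-style frame-level decoding, where each frame's prediction is independent of the others and can therefore be computed in parallel. The sketch below is not the authors' implementation; it is a minimal illustration of greedy CTC decoding (argmax per frame, then merging repeats and removing blanks), assuming the common convention that the blank token has index 0.

```python
# Illustrative sketch (not the paper's code): greedy CTC decoding, the
# non-autoregressive step that allows all frames to be decoded in parallel.
# Assumption: the blank token is at index 0 of each frame's score vector.

def ctc_greedy_decode(frame_logits, blank=0):
    """Collapse per-frame argmax predictions into a label sequence.

    frame_logits: list of per-frame score lists (one list per frame).
    Returns decoded label indices with repeats merged and blanks removed.
    """
    # Each frame's argmax is independent of every other frame, so unlike
    # an autoregressive decoder this step parallelizes trivially on a GPU.
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_logits]

    decoded, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:  # merge repeats, drop blanks
            decoded.append(idx)
        prev = idx
    return decoded


# Example: five frames whose per-frame argmaxes are 1, 1, 0, 2, 2
# collapse to the label sequence [1, 2].
scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
print(ctc_greedy_decode(scores))
```

In contrast, an autoregressive attention decoder must emit tokens one at a time, each conditioned on the previous output, which is what the proposed D-Att decoder avoids.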
Pages: 5