Non-autoregressive Deliberation-Attention based End-to-End ASR

Cited by: 1
Authors
Gao, Changfeng [1 ,2 ]
Cheng, Gaofeng [1 ,2 ]
Zhou, Jun [1 ,2 ]
Zhang, Pengyuan [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; end-to-end; non-autoregressive; highly paralleled;
DOI
10.1109/ISCSLP49672.2021.9362115
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Attention-based encoder-decoder end-to-end (E2E) automatic speech recognition (ASR) architectures have achieved state-of-the-art results on many ASR tasks. However, conventional attention-based E2E ASR models rely on an autoregressive decoder, which makes parallel computation during decoding difficult. In this paper, we propose a novel deliberation-attention (D-Att) based E2E ASR architecture, which replaces the autoregressive attention-based decoder with a non-autoregressive frame-level D-Att decoder and thus significantly accelerates GPU parallel decoding. The D-Att decoder differs from the conventional attention decoder in two aspects: first, the D-Att decoder uses frame-level text embeddings (FLTE) generated by an auxiliary ASR model instead of the ground-truth transcripts or previous predictions required by the conventional attention decoder; second, whereas the conventional attention decoder is trained in a left-to-right label-synchronous manner, the D-Att decoder is trained under the supervision of a connectionist temporal classification (CTC) loss and uses the FLTE to provide text information. Our experiments on the Aishell, HKUST, and WSJ benchmarks show that the proposed D-Att E2E ASR models achieve performance comparable to state-of-the-art autoregressive attention-based transformer E2E ASR baselines, while being 10 times faster with GPU parallel decoding.
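The speedup claim in the abstract rests on the difference between label-synchronous autoregressive decoding and frame-level non-autoregressive decoding trained with a CTC loss. The essence of the latter can be illustrated with a minimal greedy CTC decoding sketch (hypothetical tensors, not the authors' implementation): the per-frame argmax is a single fully parallel operation, followed only by a cheap collapse step, whereas an autoregressive decoder must run one sequential step per output token.

```python
import numpy as np

# Hypothetical frame-level log-probabilities from an encoder:
# shape (T, V) = (time frames, vocabulary with blank at index 0).
BLANK = 0
log_probs = np.log(np.array([
    [0.6, 0.3, 0.1],   # frame 0 -> blank
    [0.1, 0.8, 0.1],   # frame 1 -> token 1
    [0.1, 0.8, 0.1],   # frame 2 -> token 1 (repeat, collapsed)
    [0.7, 0.2, 0.1],   # frame 3 -> blank
    [0.1, 0.2, 0.7],   # frame 4 -> token 2
]))

def ctc_greedy_decode(log_probs: np.ndarray, blank: int = BLANK) -> list:
    """Greedy CTC decoding: take the per-frame argmax (parallel over
    all frames), then collapse repeats and drop blanks."""
    best = log_probs.argmax(axis=-1)   # one parallel op over all frames
    out, prev = [], blank
    for t in best:                     # cheap sequential post-processing
        if t != blank and t != prev:
            out.append(int(t))
        prev = t
    return out

print(ctc_greedy_decode(log_probs))   # -> [1, 2]
```

Because every frame's prediction is independent, batches of utterances can be decoded in one GPU pass; this is the property the D-Att decoder exploits.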
Pages: 5
Related Papers
50 items in total
  • [21] NON-AUTOREGRESSIVE END-TO-END APPROACHES FOR JOINT AUTOMATIC SPEECH RECOGNITION AND SPOKEN LANGUAGE UNDERSTANDING
    Li, Mohan
    Doddipatla, Rama
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 390 - 397
  • [22] NON-AUTOREGRESSIVE END-TO-END AUTOMATIC SPEECH RECOGNITION INCORPORATING DOWNSTREAM NATURAL LANGUAGE PROCESSING
    Omachi, Motoi
    Fujita, Yuya
    Watanabe, Shinji
    Wang, Tianzi
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6772 - 6776
  • [23] ETEH: Unified Attention-Based End-to-End ASR and KWS Architecture
    Cheng, Gaofeng
    Miao, Haoran
    Yang, Runyan
    Deng, Keqi
    Yan, Yonghong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1360 - 1373
  • [24] IMPROVING NON-AUTOREGRESSIVE END-TO-END SPEECH RECOGNITION WITH PRE-TRAINED ACOUSTIC AND LANGUAGE MODELS
    Deng, Keqi
    Yang, Zehui
    Watanabe, Shinji
    Higuchi, Yosuke
    Cheng, Gaofeng
    Zhang, Pengyuan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8522 - 8526
  • [25] End-to-End ASR with Adaptive Span Self-Attention
    Chang, Xuankai
    Subramanian, Aswin Shanmugam
    Guo, Pengcheng
    Watanabe, Shinji
    Fujita, Yuya
    Omachi, Motoi
    INTERSPEECH 2020, 2020, : 3595 - 3599
  • [26] UNSUPERVISED SPEAKER ADAPTATION USING ATTENTION-BASED SPEAKER MEMORY FOR END-TO-END ASR
    Sari, Leda
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7384 - 7388
  • [27] AN END-TO-END SPEECH ACCENT RECOGNITION METHOD BASED ON HYBRID CTC/ATTENTION TRANSFORMER ASR
    Gao, Qiang
    Wu, Haiwei
    Sun, Yanqing
    Duan, Yitao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7253 - 7257
  • [28] IMPROVING ATTENTION-BASED END-TO-END ASR SYSTEMS WITH SEQUENCE-BASED LOSS FUNCTIONS
    Cui, Jia
    Weng, Chao
    Wang, Guangsen
    Wang, Jun
    Wang, Peidong
    Yu, Chengzhu
    Su, Dan
    Yu, Dong
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 353 - 360
  • [29] FAST-MD: FAST MULTI-DECODER END-TO-END SPEECH TRANSLATION WITH NON-AUTOREGRESSIVE HIDDEN INTERMEDIATES
    Inaguma, Hirofumi
    Dalmia, Siddharth
    Yan, Brian
    Watanabe, Shinji
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 922 - 929
  • [30] Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR
    Maekaku, Takashi
    Fujita, Yuya
    Peng, Yifan
    Watanabe, Shinji
INTERSPEECH 2022, 2022: 1071 - 1075