Non-autoregressive Deliberation-Attention based End-to-End ASR

Cited by: 1
Authors
Gao, Changfeng [1 ,2 ]
Cheng, Gaofeng [1 ,2 ]
Zhou, Jun [1 ,2 ]
Zhang, Pengyuan [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; end-to-end; non-autoregressive; highly paralleled;
DOI
10.1109/ISCSLP49672.2021.9362115
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Attention-based encoder-decoder end-to-end (E2E) automatic speech recognition (ASR) architectures have achieved state-of-the-art results on many ASR tasks. However, conventional attention-based E2E ASR models rely on an autoregressive decoder, which makes parallel computation during decoding difficult. In this paper, we propose a novel deliberation-attention (D-Att) based E2E ASR architecture, which replaces the autoregressive attention-based decoder with a non-autoregressive frame-level D-Att decoder and thus significantly accelerates GPU parallel decoding. The D-Att decoder differs from the conventional attention decoder in two respects: first, the D-Att decoder uses a frame-level text embedding (FLTE) generated by an auxiliary ASR model, instead of the ground-truth transcripts or previous predictions required by the conventional attention decoder; second, whereas the conventional attention decoder is trained in a left-to-right, label-synchronous way, the D-Att decoder is trained under the supervision of a connectionist temporal classification (CTC) loss and relies on the FLTE to provide the text information. Our experiments on the Aishell, HKUST, and WSJ benchmarks show that the proposed D-Att E2E ASR models are comparable in performance to state-of-the-art autoregressive attention-based Transformer E2E ASR baselines, and are 10 times faster with GPU parallel decoding.
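The parallel-decoding property described in the abstract comes from the CTC-style frame-level readout: every frame's label is predicted independently, so the argmax over all frames can be computed at once, followed by the standard CTC collapse (merge repeated labels, drop blanks). The sketch below illustrates only that generic readout step, not the D-Att architecture itself; the logits, vocabulary, and blank index are invented for illustration.

```python
import numpy as np

BLANK = 0  # CTC blank index (assumed convention)

def ctc_greedy_decode(log_probs: np.ndarray) -> list:
    """Non-autoregressive greedy decode: one argmax per frame,
    computed for all frames at once (hence parallelizable),
    followed by the standard CTC collapse."""
    frame_labels = log_probs.argmax(axis=-1)  # every frame, no left-to-right dependency
    out, prev = [], BLANK
    for lab in frame_labels:
        # keep a label only when it starts a new run and is not blank
        if lab != prev and lab != BLANK:
            out.append(int(lab))
        prev = lab
    return out

# Toy frame-level scores for 6 frames over a 4-symbol vocabulary
# (blank, 'a'=1, 'b'=2, 'c'=3); the values are illustrative only.
logits = np.array([
    [0.1, 2.0, 0.0, 0.0],  # 'a'
    [0.1, 2.0, 0.0, 0.0],  # 'a' (repeat, merged by the collapse)
    [3.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 2.5, 0.0],  # 'b'
    [3.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 0.0, 2.2],  # 'c'
])
print(ctc_greedy_decode(logits))  # -> [1, 2, 3]
```

Because no step depends on the previous output symbol, the per-frame argmax maps directly onto a single batched GPU operation, which is the source of the decoding speedup the paper reports.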
Pages: 5