A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Citations: 0
Authors
Ma, Zhengrui [1, 3]
Fang, Qingkai [1, 3]
Zhang, Shaolei [1, 3]
Guo, Shoutao [1, 3]
Feng, Yang [1, 2, 3]
Zhang, Min [4 ]
Affiliations
[1] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing, Peoples R China
[2] Chinese Acad Sci, Key Lab AI Safety, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
[4] Soochow Univ, Sch Future Sci & Engn, Suzhou, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2x), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2x outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28x decoding speedup in offline generation.
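The abstract's description of blank and repeated tokens combined with CTC decoding can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ctc_collapse_stream`, the blank index `0`, and the integer token ids are all assumptions made for illustration. It shows only the standard greedy CTC collapse rule (merge consecutive repeats, then drop blanks) applied chunk by chunk, so that a chunk consisting only of blanks or repeats emits nothing and the output is effectively delayed.

```python
from typing import List, Optional, Sequence

BLANK = 0  # assumed blank index; the actual vocabulary layout is not given in the abstract


def ctc_collapse_stream(chunks: Sequence[Sequence[int]], blank: int = BLANK) -> List[List[int]]:
    """Apply greedy CTC collapsing to per-chunk token predictions.

    Each element of `chunks` holds the raw tokens predicted in parallel for
    one fixed-length speech chunk. Consecutive repeats are merged (also
    across chunk boundaries) and blanks are removed. The result lists the
    tokens newly emitted after each chunk.
    """
    emitted_per_chunk: List[List[int]] = []
    prev: Optional[int] = None  # last raw token seen, carried across chunks
    for chunk in chunks:
        emitted: List[int] = []
        for tok in chunk:
            # Emit only when the token is not blank and differs from the previous raw token.
            if tok != blank and tok != prev:
                emitted.append(tok)
            prev = tok
        emitted_per_chunk.append(emitted)
    return emitted_per_chunk


if __name__ == "__main__":
    # Hypothetical raw predictions for four speech chunks, four tokens each.
    raw = [[0, 5, 5, 0], [5, 7, 7, 0], [0, 0, 0, 0], [7, 7, 3, 3]]
    for i, out in enumerate(ctc_collapse_stream(raw)):
        print(f"chunk {i}: emitted {out}")
```

In this sketch the third chunk emits no tokens at all, which is one way a model that predicts blanks can wait for more source speech before committing to output.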
Pages: 1557-1575
Page count: 19
Related papers
50 in total
  • [1] ORTHROS: NON-AUTOREGRESSIVE END-TO-END SPEECH TRANSLATION WITH DUAL-DECODER
    Inaguma, Hirofumi
    Higuchi, Yosuke
    Duh, Kevin
    Kawahara, Tatsuya
    Watanabe, Shinji
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7503 - 7507
  • [2] Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation
    Chuang, Shun-Po
    Chuang, Yung-Sung
    Chang, Chih-Chiang
    Lee, Hung-yi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 1068 - 1077
  • [3] FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
    Wang, Yongqi
    Zhao, Zhou
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5678 - 5687
  • [4] End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification
    Libovicky, Jindrich
    Helcl, Jindrich
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 3016 - 3021
  • [5] Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Zhang, Shuai
    Wen, Zhengqi
    INTERSPEECH 2020, 2020, : 5026 - 5030
  • [6] Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
    Gao, Zhifu
    Zhang, Shiliang
    McLoughlin, Ian
    Yan, Zhijie
    INTERSPEECH 2022, 2022, : 2063 - 2067
  • [7] FAST-MD: FAST MULTI-DECODER END-TO-END SPEECH TRANSLATION WITH NON-AUTOREGRESSIVE HIDDEN INTERMEDIATES
    Inaguma, Hirofumi
    Dalmia, Siddharth
    Yan, Brian
    Watanabe, Shinji
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 922 - 929
  • [8] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Thebaud, Thomas
    Dehak, Najim
    Kowalczyk, Konrad
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
  • [9] Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System
    Shi, Xian
    Luo, Haoneng
    Gao, Zhifu
    Zhang, Shiliang
    Yan, Zhijie
    INTERSPEECH 2023, 2023, : 3247 - 3251
  • [10] NON-AUTOREGRESSIVE END-TO-END APPROACHES FOR JOINT AUTOMATIC SPEECH RECOGNITION AND SPOKEN LANGUAGE UNDERSTANDING
    Li, Mohan
    Doddipatla, Rama
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 390 - 397