A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Authors
Ma, Zhengrui [1, 3]
Fang, Qingkai [1, 3]
Zhang, Shaolei [1, 3]
Guo, Shoutao [1, 3]
Feng, Yang [1, 2, 3]
Zhang, Min [4]
Affiliations
[1] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing, Peoples R China
[2] Chinese Acad Sci, Key Lab AI Safety, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
[4] Soochow Univ, Sch Future Sci & Engn, Suzhou, Peoples R China
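Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract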
Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2x), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2x outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28x decoding speedup in offline generation.
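The abstract's point about blank and repeated tokens can be illustrated with a minimal sketch of chunk-wise CTC collapsing. This is not the authors' implementation: the function name `ctc_collapse_stream`, the blank index `0`, and the toy token ids are assumptions made for illustration. It only shows the standard CTC rule (merge consecutive repeats, then drop blanks) applied across chunk boundaries, so that a chunk made up entirely of blanks emits nothing and output is effectively deferred to a later chunk.

```python
BLANK = 0  # assumed blank index; the real vocabulary layout is not given in the abstract


def ctc_collapse_stream(chunks, blank=BLANK):
    """Yield the tokens newly emitted for each chunk of raw decoder outputs.

    Standard CTC collapsing: a token is emitted only if it is not blank and
    differs from the immediately preceding raw token. The previous raw token
    is carried across chunk boundaries so repeats spanning chunks are merged.
    """
    prev = blank  # last raw token seen so far
    for chunk in chunks:
        emitted = []
        for tok in chunk:
            if tok != blank and tok != prev:
                emitted.append(tok)
            prev = tok
        yield emitted


# Toy example: four chunks, each holding four raw (hypothetical) token ids.
chunks = [[0, 0, 5, 5], [5, 0, 5, 7], [0, 0, 0, 0], [7, 7, 2, 0]]
outputs = list(ctc_collapse_stream(chunks))
print(outputs)  # [[5], [5, 7], [], [7, 2]]
```

In this toy run the third chunk consists only of blanks, so nothing is emitted for it; the number of tokens released per chunk varies with how many blanks and repeats the decoder produces.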
Pages: 1557-1575
Page count: 19