A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Cited by: 0

Authors:

Ma, Zhengrui ^{[1,3]}

Fang, Qingkai ^{[1,3]}

Zhang, Shaolei ^{[1,3]}

Guo, Shoutao ^{[1,3]}

Feng, Yang ^{[1,2,3]}

Zhang, Min ^{[4]}

Affiliations:

[1] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing, Peoples R China

[2] Chinese Acad Sci, Key Lab AI Safety, Beijing, Peoples R China

[3] Univ Chinese Acad Sci, Beijing, Peoples R China

[4] Soochow Univ, Sch Future Sci & Engn, Suzhou, Peoples R China

Source:

PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2x), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2x outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28x decoding speedup in offline generation.
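
The CTC decoding step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows the standard CTC collapse rule (merge consecutive repeats, then drop blanks) applied chunk by chunk, so that a chunk consisting only of blanks emits nothing and the output is deferred to a later chunk. The names `BLANK`, `ctc_collapse`, and `stream_collapse`, and the toy token sequences, are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical blank symbol, chosen for this sketch


def ctc_collapse(tokens, blank=BLANK):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    return [tok for tok, _ in groupby(tokens) if tok != blank]


def stream_collapse(chunks, blank=BLANK):
    """Collapse chunk by chunk, carrying the last raw token across chunks.

    Returns one list of newly emitted tokens per input chunk. A chunk that
    contains only blanks (or repeats of the previous token) emits nothing.
    """
    prev = None
    outputs = []
    for chunk in chunks:
        emitted = []
        for tok in chunk:
            # Emit only when the token is new and is not the blank symbol.
            if tok != prev and tok != blank:
                emitted.append(tok)
            prev = tok
        outputs.append(emitted)
    return outputs


# Toy decoder outputs for three fixed-length speech chunks (assumed data).
chunks = [
    [BLANK, BLANK, BLANK],       # decoder waits: nothing is emitted
    ["hello", "hello", BLANK],   # repeated token collapses to one
    [BLANK, "world", "world"],
]
per_chunk = stream_collapse(chunks)
print(per_chunk)
```

In this toy run the first chunk yields no output, which is how emitting blanks lets a decoder postpone a token until more speech has arrived. Concatenating the per-chunk outputs gives the same result as collapsing the full sequence at once with `ctc_collapse`.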


Pages: 1557-1575

Page count: 19
