From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech-to-Speech Translation

Cited by: 0

Authors:

Liu, Danni [1]

Wang, Changhan [2]

Gong, Hongyu [2]

Ma, Xutai [2,3]

Tang, Yun [2]

Pino, Juan [2]

Affiliations:

[1] Maastricht Univ, Maastricht, Netherlands

[2] Meta AI, Menlo Pk, CA USA

[3] Johns Hopkins Univ, Baltimore, MD 21218 USA

Source:

INTERSPEECH 2022 | 2022

Keywords:

speech translation; text-to-speech; low-latency;

DOI:

10.21437/Interspeech.2022-10568

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Speech-to-speech translation (S2ST) converts input speech to speech in another language. A challenge of delivering S2ST in real time is the accumulated delay between the translation and speech synthesis modules. While incremental text-to-speech (iTTS) models have recently shown large quality improvements, they typically require additional future text inputs to reach optimal performance. In this work, we minimize the initial waiting time of iTTS by adapting the upstream speech translator to generate high-quality pseudo lookahead for the speech synthesizer. After mitigating the initial delay, we demonstrate that the duration of synthesized speech also plays a crucial role in latency. We formalize this as a latency metric and then present a simple yet effective duration-scaling approach for latency reduction. Our approaches consistently reduce latency by 0.2-0.5 seconds without sacrificing speech translation quality.

Pages: 1771-1775

Page count: 5
