Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Cited by: 2
Authors
Jia, Ye [1]
Ding, Yifan [1]
Bapna, Ankur [1]
Cherry, Colin [1]
Zhang, Yu [1]
Conneau, Alexis [1]
Morioka, Nobuyuki [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
Source
Keywords
speech-to-speech; speech translation; unsupervised pre-training; multi-task fine-tuning;
DOI
10.21437/Interspeech.2022-10938
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent work has demonstrated that the performance of such direct S2ST systems approaches that of conventional cascade S2ST when trained on comparable datasets. In practice, however, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs of the CVSS-C corpus improves by +13.6 BLEU (+113% relative) compared to the previous state of the art trained without additional data. The improvements on low-resource languages are even larger (+398% relative on average). Our comparative studies suggest future research directions for S2ST and speech representation learning.
Pages: 1721 - 1725
Number of pages: 5
Related papers
50 in total
  • [21] From Speech-to-Speech Translation to Automatic Dubbing
    Federico, Marcello
    Enyedi, Robert
    Barra-Chicote, Roberto
    Giri, Ritwik
    Isik, Umut
    Krishnaswamy, Arvindh
    Sawaf, Hassan
    [J]. 17TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE TRANSLATION (IWSLT 2020), 2020, : 257 - 264
  • [22] Multilingual speech-to-speech translation system: VoiceTra
    Matsuda, Shigeki
    Hu, Xinhui
    Shiga, Yoshinori
    Kashioka, Hideki
    Hori, Chiori
    Yasuda, Keiji
    Okuma, Hideo
    Uchiyama, Masao
    Sumita, Eiichiro
    Kawai, Hisashi
    Nakamura, Satoshi
    [J]. 2013 IEEE 14TH INTERNATIONAL CONFERENCE ON MOBILE DATA MANAGEMENT (MDM 2013), VOL 2, 2013, : 229 - 233
  • [23] UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
    Inaguma, Hirofumi
    Popuri, Sravya
    Kulikov, Ilia
    Chen, Peng-Jen
    Wang, Changhan
    Chung, Yu-An
    Tang, Yun
    Lee, Ann
    Watanabe, Shinji
    Pino, Juan
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15655 - 15680
  • [24] Speech-to-speech Low-resource Translation
    Liu, Hsiao-Chuan
    Day, Min-Yuh
    Wang, Chih-Chien
    [J]. 2023 IEEE 24TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE, IRI, 2023, : 91 - 95
  • [25] Semantic transfer in speech-to-speech machine translation
    Abb, B
    Buschbeck-Wolf, B
    Tschernitschek, C
    [J]. NATURAL LANGUAGE PROCESSING AND SPEECH TECHNOLOGY: RESULTS OF THE 3RD KONVENS CONFERENCE, 1996, : 123 - 136
  • [26] The ATR multilingual speech-to-speech translation system
    Nakamura, S
    Markov, K
    Nakaiwa, H
    Kikui, G
    Kawai, H
    Jitsuhiro, T
    Zhang, JS
    Yamamoto, H
    Sumita, E
    Yamamoto, S
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2006, 14 (02): : 365 - 376
  • [27] The impact of ASR on speech-to-speech translation performance
    Sarikaya, Ruhi
    Zhou, Bowen
    Povey, Daniel
    Afify, Mohamed
    Gao, Yuqing
    [J]. 2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PTS 1-3, 2007, : 1289 - +
  • [28] A speech-to-speech translation based interface for tourism
    Cettolo, M
    Corazza, A
    Lazzari, G
    Pianesi, F
    Pianta, E
    Tovena, LM
    [J]. INFORMATION AND COMMUNICATION TECHNOLOGIES IN TOURISM 1999, 1999, : 191 - 200
  • [29] Finite-state speech-to-speech translation
    Vidal, E
    [J]. 1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 111 - 114
  • [30] Incremental Dialog Clustering For Speech-to-Speech Translation
    Stallard, David
    Tsakalidis, Stavros
    Saleem, Shirin
    [J]. INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 428 - 431