Textless Speech-to-Speech Translation on Real Data

Cited by: 0
Authors
Lee, Ann [1]
Gong, Hongyu [1]
Duquenne, Paul-Ambroise [1]
Schwenk, Holger [1]
Chen, Peng-Jen [1]
Wang, Changhan [1]
Popuri, Sravya [1]
Adi, Yossi [1]
Pino, Juan [1]
Gu, Jiatao [1]
Hsu, Wei-Ning [1]
Affiliations
[1] Meta AI, Menlo Park, CA 94025 USA
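Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract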
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on unnormalized speech target. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs.
Pages: 860-872
Page count: 13