More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Cited by: 4
Authors
Hassid, Michael [1 ]
Ramanovich, Michelle Tadmor [1 ]
Shillingford, Brendan [2 ]
Wang, Miaosen [2 ]
Jia, Ye [1 ]
Remez, Tal [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] DeepMind, London, England
Keywords
DOI
10.1109/CVPR52688.2022.01033
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech that not only has prosodic variations such as natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show that our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented on the project page.
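To make the abstract's interface concrete, below is a minimal, hypothetical sketch of a TTS module that conditions on both a text sequence and per-frame video features, so that one acoustic frame is emitted per video frame. This is not the authors' published VDTTS architecture; the class name, module choices, dimensions, and fusion strategy are all illustrative assumptions.

```python
# Hypothetical sketch of a visually-driven TTS interface (illustrative only;
# names, dimensions, and the fusion strategy are assumptions, not VDTTS).
import torch
import torch.nn as nn


class ToyVisuallyDrivenTTS(nn.Module):
    def __init__(self, vocab_size=100, text_dim=128, video_feat_dim=512,
                 hidden_dim=256, n_mels=80):
        super().__init__()
        # Encode phoneme/character IDs.
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.text_enc = nn.LSTM(text_dim, hidden_dim, batch_first=True,
                                bidirectional=True)
        # Project precomputed per-frame visual features (e.g. from a face-crop
        # encoder) into the same hidden space as the text encoder output.
        self.video_proj = nn.Linear(video_feat_dim, 2 * hidden_dim)
        # Decode one mel-spectrogram frame per video frame, so the output is
        # aligned to the video timeline by construction.
        self.decoder = nn.LSTM(4 * hidden_dim, hidden_dim, batch_first=True)
        self.mel_out = nn.Linear(hidden_dim, n_mels)

    def forward(self, text_ids, video_feats):
        # text_ids:    (batch, text_len)            integer tokens
        # video_feats: (batch, n_frames, feat_dim)  per-frame visual features
        text_h, _ = self.text_enc(self.text_embed(text_ids))
        text_summary = text_h.mean(dim=1, keepdim=True)        # (B, 1, 2H)
        video_h = self.video_proj(video_feats)                  # (B, T, 2H)
        # Naive fusion: broadcast the pooled text summary over video frames.
        fused = torch.cat(
            [video_h, text_summary.expand(-1, video_h.size(1), -1)], dim=-1)
        dec_h, _ = self.decoder(fused)
        return self.mel_out(dec_h)                              # (B, T, n_mels)


if __name__ == "__main__":
    model = ToyVisuallyDrivenTTS()
    mels = model(torch.randint(0, 100, (2, 30)), torch.randn(2, 75, 512))
    print(mels.shape)  # torch.Size([2, 75, 80])
```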
Pages: 10577-10587
Page count: 11
Related papers
50 records in total
  • [41] Optimisation of artificial neural network topology applied in the prosody control in text-to-speech synthesis
    Sebesta, V
    Tucková, J
    SOFSEM 2000: THEORY AND PRACTICE OF INFORMATICS, 2000, 1963 : 420 - 430
  • [42] Prosody modeling for syllable based text-to-speech synthesis using feedforward neural networks
    Reddy, V. Ramu
    Rao, K. Sreenivasa
    NEUROCOMPUTING, 2016, 171 : 1323 - 1334
  • [43] Soft-computing Methods for Text-to-Speech Driven Avatars
    Malcangi, Mario
    MATHEMATICAL METHODS AND APPLIED COMPUTING, VOL 1, 2009 : 288+
  • [44] Improving the Prosody of RNN-based English Text-To-Speech Synthesis by Incorporating a BERT model
    Kenter, Tom
    Sharma, Manish
    Clark, Rob
    INTERSPEECH 2020, 2020, : 4412 - 4416
  • [45] Issues in Chinese prosody: conceptual foundations of a linguistically-motivated text-to-speech system for Mandarin
    Lavin, Richard S.
    PACLIC 16: LANGUAGE, INFORMATION, AND COMPUTATION, PROCEEDINGS, 2002, : 259 - 270
  • [46] Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-To-Speech
    Klimkov, Viacheslav
    Ronanki, Srikanth
    Rohnke, Jonas
    Drugman, Thomas
    INTERSPEECH 2019, 2019, : 4440 - 4444
  • [47] CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech
    Karlapati, Sri
    Moinet, Alexis
    Joly, Arnaud
    Klimkov, Viacheslav
    Saez-Trigueros, Daniel
    Drugman, Thomas
    INTERSPEECH 2020, 2020, : 4387 - 4391
  • [48] Statistical methods in data-driven modeling of Spanish prosody for text to speech
    Lopez-Gonzalo, E
    Rodriguez-Garcia, JM
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1377 - 1380
  • [49] Data-Driven Phrase Break Prediction for Bengali Text-to-Speech System
    Ghosh, Krishnendu
    Rao, K. Sreenivasa
    CONTEMPORARY COMPUTING, 2012, 306 : 118 - 129
  • [50] Beey: More than a Speech-to-Text Editor
    Weingartova, Lenka
    Volna, Veronika
    Balejova, Ewa
    INTERSPEECH 2021, 2021, : 958 - 959