More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Cited by: 4
Authors
Hassid, Michael [1]
Ramanovich, Michelle Tadmor [1]
Shillingford, Brendan [2]
Wang, Miaosen [2]
Jia, Ye [1]
Remez, Tal [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] DeepMind, London, England
DOI
10.1109/CVPR52688.2022.01033
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show that our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented at the project page.
Pages: 10577-10587
Number of pages: 11