More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Cited by: 4
Authors
Hassid, Michael [1 ]
Ramanovich, Michelle Tadmor [1 ]
Shillingford, Brendan [2 ]
Wang, Miaosen [2 ]
Jia, Ye [1 ]
Remez, Tal [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] DeepMind, London, England
Keywords
DOI
10.1109/CVPR52688.2022.01033
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Subject Classification Numbers
081104; 0812; 0835; 1405
Abstract
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech that not only has prosodic variations such as natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show that our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks, including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented at the project page.(1)
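To make the interface described in the abstract concrete, the sketch below shows one plausible way a visually driven TTS forward pass could be wired up in PyTorch: text tokens and precomputed per-frame face features are encoded separately and fused by cross-attention whose queries come from the video stream, so the predicted mel-spectrogram has one frame per video frame and is therefore time-aligned to the video. All module names, dimensions, and the fusion scheme here are assumptions made purely for illustration; the actual VDTTS architecture is described in the paper and differs in detail.

import torch
import torch.nn as nn


class VisuallyDrivenTTS(nn.Module):
    """Hypothetical sketch: text tokens + per-frame video features -> mel frames."""

    def __init__(self, vocab_size=256, video_feat_dim=512, hidden_dim=256, n_mels=80):
        super().__init__()
        # Text branch: token embedding followed by a bidirectional GRU encoder.
        self.text_embed = nn.Embedding(vocab_size, hidden_dim)
        self.text_encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True,
                                   bidirectional=True)
        self.text_reduce = nn.Linear(2 * hidden_dim, hidden_dim)
        # Video branch: linear projection of precomputed per-frame face features.
        self.video_proj = nn.Linear(video_feat_dim, hidden_dim)
        # Cross-attention with video frames as queries and text as keys/values,
        # so the output length follows the video (the source of synchronization).
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                                batch_first=True)
        # Decoder predicts one mel-spectrogram frame per video frame.
        self.decoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.mel_head = nn.Linear(hidden_dim, n_mels)

    def forward(self, text_tokens, video_feats):
        # text_tokens: (B, T_text) int64; video_feats: (B, T_video, video_feat_dim)
        text_h, _ = self.text_encoder(self.text_embed(text_tokens))
        text_h = self.text_reduce(text_h)                   # (B, T_text, H)
        video_h = self.video_proj(video_feats)              # (B, T_video, H)
        attended, _ = self.cross_attn(video_h, text_h, text_h)
        dec_in = torch.cat([video_h, attended], dim=-1)     # fuse both streams
        dec_h, _ = self.decoder(dec_in)
        return self.mel_head(dec_h)                         # (B, T_video, n_mels)


if __name__ == "__main__":
    model = VisuallyDrivenTTS()
    tokens = torch.randint(0, 256, (2, 40))   # dummy text token IDs
    frames = torch.randn(2, 75, 512)          # dummy per-frame face features
    print(model(tokens, frames).shape)        # torch.Size([2, 75, 80])

A neural vocoder (not sketched here) would then convert the predicted mel-spectrogram into a waveform, which is the usual final stage of such a pipeline.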
Pages: 10577-10587
Page count: 11
Related Papers
50 items in total
  • [21] Prosody model in a Mandarin Text-to-Speech System based on a hierarchical approach
    Pan, NH
    Jen, WT
    Yu, SS
    Yu, MS
    Huang, SY
    Wu, MJ
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000: 448-451
  • [22] Novel Eigenpitch-based Prosody Model for Text-to-Speech Synthesis
    Tian, Jilei
    Nurminen, Jani
    Kiss, Imre
    INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007: 313-316
  • [23] EXACT PROSODY CLONING IN ZERO-SHOT MULTISPEAKER TEXT-TO-SPEECH
    Lux, Florian
    Koch, Julia
    Vu, Ngoc Thang
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022: 962-969
  • [24] Modeling stylized invariance and local variability of prosody in text-to-speech synthesis
    Chu, Min
    Zhao, Yong
    Chang, Eric
    SPEECH COMMUNICATION, 2006, 48 (06) : 716 - 726
  • [25] High-Quality Prosody Generation in Mandarin Text-to-Speech System
    Guo, Qing
    Zhang, Jie
    Katae, Nobuyuki
    Yu, Hao
    FUJITSU SCIENTIFIC & TECHNICAL JOURNAL, 2010, 46 (01): 40-46
  • [27] Towards including prosody in a text-to-speech system for modern standard Arabic
    Ramsay, Allan
    Mansour, Hanady
    COMPUTER SPEECH AND LANGUAGE, 2008, 22 (01): 84-103
  • [28] Implementation of high quality text-to-speech using words and diphones
    Shukla, SR
    Barnwell, TP
    2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), PROCEEDINGS VOLS I-VI, 2001: 4020
  • [29] A statistical model with hierarchical structure for predicting prosody in a mandarin text-to-speech system
    Yu, MS
    Pan, NH
    JOURNAL OF THE CHINESE INSTITUTE OF ENGINEERS, 2005, 28 (03) : 385 - 399
  • [30] Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
    Zhang, Guangyan
    Merritt, Thomas
    Ribeiro, Manuel Sam
    Tura-Vecino, Biel
    Yanagisawa, Kayoko
    Pokora, Kamil
    Ezzerg, Abdelhamid
    Cygert, Sebastian
    Abbas, Ammar
    Bilinski, Piotr
    Barra-Chicote, Roberto
    Korzekwa, Daniel
    Lorenzo-Trueba, Jaime
    INTERSPEECH 2023, 2023: 27-31