More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Cited by: 4

Authors:

Hassid, Michael ^{[1]}

Ramanovich, Michelle Tadmor ^{[1]}

Shillingford, Brendan ^{[2]}

Wang, Miaosen ^{[2]}

Jia, Ye ^{[1]}

Remez, Tal ^{[1]}

Affiliations:

[1] Google Res, Mountain View, CA 94043 USA

[2] DeepMind, London, England

Source:

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022

DOI:

10.1109/CVPR52688.2022.01033

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show that our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented at the project page.

Pages: 10577-10587

Page count: 11
