TEXT2VIDEO: TEXT-DRIVEN TALKING-HEAD VIDEO SYNTHESIS WITH PERSONALIZED PHONEME-POSE DICTIONARY

Cited by: 11

Authors:

Zhang, Sibo [1]

Yuan, Jiahong [1]

Liao, Miao [1]

Zhang, Liangjun [1]

Affiliation:

[1] Baidu Res, Sunnyvale, CA 94089 USA

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Text-to-Video Synthesis; Multi-modal Processing; Phoneme-Pose; Generative Adversarial Networks; Generation

DOI:

10.1109/ICASSP43922.2022.9747380

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It needs only about 1 min of training data, which is significantly less than audio-driven approaches; 2) It is more flexible and not subject to vulnerability due to speaker variation; 3) It significantly reduces the preprocessing and training time from several days for audio-based methods to 4 hours, which is 10 times faster. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
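
The dictionary lookup and pose interpolation named in the abstract can be illustrated with a minimal sketch. The abstract does not give the interpolation scheme, the pose representation, or the phoneme set, so everything below is an assumption made for illustration: poses are flat lists of floats, interpolation is plain linear blending between consecutive key poses, and the names `interpolate_poses`, `phonemes_to_pose_sequence`, and `toy_dictionary` are hypothetical. The GAN stage that renders video frames from the poses is not shown.

```python
from typing import Dict, List, Sequence

# Hypothetical pose representation: a flat list of landmark coordinates.
Pose = List[float]


def interpolate_poses(key_poses: Sequence[Pose], frames_per_gap: int) -> List[Pose]:
    """Linearly interpolate between consecutive key poses.

    Each pair of neighbouring key poses contributes `frames_per_gap` frames
    (the first equals the starting key pose); the final key pose is appended
    once at the end. Linear blending is an assumption, not the paper's method.
    """
    if frames_per_gap < 1:
        raise ValueError("frames_per_gap must be at least 1")
    if not key_poses:
        return []
    frames: List[Pose] = []
    for start, end in zip(key_poses, key_poses[1:]):
        for step in range(frames_per_gap):
            t = step / frames_per_gap  # blend weight in [0, 1)
            frames.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    frames.append(list(key_poses[-1]))
    return frames


def phonemes_to_pose_sequence(
    phonemes: Sequence[str],
    dictionary: Dict[str, Pose],
    frames_per_gap: int = 2,
) -> List[Pose]:
    """Look up one key pose per phoneme, then interpolate between them."""
    key_poses = [dictionary[p] for p in phonemes]  # KeyError if a phoneme is missing
    return interpolate_poses(key_poses, frames_per_gap)


# Toy two-entry dictionary with made-up 2-D poses, for illustration only.
toy_dictionary: Dict[str, Pose] = {"HH": [0.0, 0.0], "AY": [1.0, 2.0]}
sequence = phonemes_to_pose_sequence(["HH", "AY"], toy_dictionary, frames_per_gap=2)
print(sequence)
```

In a pipeline of the kind the abstract describes, a pose sequence like `sequence` would be the input to the trained generator; a real system would also need phoneme timing to decide how many frames each transition receives, which this sketch replaces with a fixed `frames_per_gap`.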

Pages: 2659-2663

Number of pages: 5
