TEXT2VIDEO: TEXT-DRIVEN TALKING-HEAD VIDEO SYNTHESIS WITH PERSONALIZED PHONEME-POSE DICTIONARY

Cited by: 11
Authors
Zhang, Sibo [1 ]
Yuan, Jiahong [1 ]
Liao, Miao [1 ]
Zhang, Liangjun [1 ]
Affiliations
[1] Baidu Res, Sunnyvale, CA 94089 USA
Keywords
Text-to-Video Synthesis; Multi-modal Processing; Phoneme-Pose; Generative Adversarial Networks; Generation
DOI
10.1109/ICASSP43922.2022.9747380
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesize video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) It needs only about 1 minute of training data, significantly less than audio-driven approaches; 2) It is more flexible and not vulnerable to speaker variation; 3) It reduces the combined preprocessing and training time from several days for audio-based methods to 4 hours, roughly a 10x speedup. We perform extensive experiments to compare the proposed method with state-of-the-art talking face generation methods on a benchmark dataset and datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
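The core idea the abstract describes, a dictionary mapping phonemes to poses with interpolation between consecutive entries, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy dictionary, the 3-D pose vectors, and the linear interpolation scheme are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical phoneme-pose dictionary: phoneme -> pose vector
# (e.g. yaw, pitch, roll). Values here are illustrative only.
phoneme_pose: Dict[str, List[float]] = {
    "AH": [0.0, 1.0, 0.0],
    "B":  [2.0, 0.0, 1.0],
    "IY": [1.0, 2.0, 0.5],
}

def interpolate_poses(phonemes: List[str], steps: int = 4) -> List[List[float]]:
    """Densify a phoneme sequence into a smooth pose sequence by
    linearly interpolating `steps` frames between each pair of
    consecutive phoneme poses (a stand-in for the paper's
    interpolated phoneme poses fed to the GAN)."""
    frames: List[List[float]] = []
    for a, b in zip(phonemes, phonemes[1:]):
        pa, pb = phoneme_pose[a], phoneme_pose[b]
        for t in range(steps):
            alpha = t / steps
            frames.append([(1 - alpha) * x + alpha * y
                           for x, y in zip(pa, pb)])
    frames.append(phoneme_pose[phonemes[-1]])  # close with the final pose
    return frames

# Two in-between steps per phoneme pair -> 2 pairs * 2 + 1 = 5 frames.
seq = interpolate_poses(["AH", "B", "IY"], steps=2)
```

In the full pipeline, a sequence like `seq` would condition a GAN generator to render the talking-head frames; the dictionary itself is built per speaker, which is why only about a minute of training data suffices.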
Pages: 2659-2663 (5 pages)