TEXT2VIDEO: TEXT-DRIVEN TALKING-HEAD VIDEO SYNTHESIS WITH PERSONALIZED PHONEME-POSE DICTIONARY

Cited by: 11
Authors
Zhang, Sibo [1]
Yuan, Jiahong [1]
Liao, Miao [1]
Zhang, Liangjun [1]
Affiliations
[1] Baidu Res, Sunnyvale, CA 94089 USA
Keywords
Text-to-Video Synthesis; Multi-modal Processing; Phoneme-Pose; Generative Adversarial Networks; GENERATION;
DOI
10.1109/ICASSP43922.2022.9747380
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has a number of advantages: 1) it needs only about 1 minute of training data, significantly less than audio-driven approaches; 2) it is more flexible and is not vulnerable to speaker variation; 3) it reduces preprocessing and training time from several days for audio-based methods to 4 hours, which is about 10 times faster. We perform extensive experiments to compare the proposed method with state-of-the-art talking-face generation methods on a benchmark dataset and on datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
Pages: 2659-2663
Page count: 5
Related papers
32 in total
  • [21] Spatial-Temporal Graphs for Cross-Modal Text2Video Retrieval
    Song, Xue
    Chen, Jingjing
    Wu, Zuxuan
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2914 - 2923
  • [22] One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing
    Wang, Ting-Chun
    Mallya, Arun
    Liu, Ming-Yu
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 10034 - 10044
  • [23] Utilizing Text-Video Relationships: A Text-Driven Multi-modal Fusion Framework for Moment Retrieval and Highlight Detection
    Zhou, Siyu
    Zhang, Fuwei
    Wang, Ruomei
    Su, Zhuo
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT X, 2025, 15040 : 254 - 268
  • [24] RESULTS OF A RCT: THE EFFECTS OF A VIDEO-DRIVEN AND TEXT-DRIVEN WEB-BASED OBESITY PREVENTION INTERVENTION
    Walthouwer, M. J. L.
    Oenema, A.
    Lechner, L.
    De Vries, H.
    INTERNATIONAL JOURNAL OF BEHAVIORAL MEDICINE, 2014, 21 : S151 - S151
  • [25] Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text
    Tandon, Pulkit
    Chandak, Shubham
    Pataranutaporn, Pat
    Liu, Yimeng
    Mapuranga, Anesu M.
    Maes, Pattie
    Weissman, Tsachy
    Sra, Misha
    IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2023, 41 (01) : 107 - 118
  • [26] Systematic development of a text-driven and a video-driven web-based computer-tailored obesity prevention intervention
    Michel Jean Louis Walthouwer
    Anke Oenema
    Katja Soetens
    Lilian Lechner
    Hein De Vries
    BMC Public Health, 13
  • [27] Systematic development of a text-driven and a video-driven web-based computer-tailored obesity prevention intervention
    Walthouwer, Michel Jean Louis
    Oenema, Anke
    Soetens, Katja
    Lechner, Lilian
    De Vries, Hein
    BMC PUBLIC HEALTH, 2013, 13
  • [28] Text driven face-video synthesis using GMM and spatial correlation
    Teferi, Dereje
    Faraj, Maycel L.
    Bigun, Josef
    IMAGE ANALYSIS, PROCEEDINGS, 2007, 4522 : 572 - +
  • [29] Use and Effectiveness of a Video- and Text-Driven Web-Based Computer-Tailored Intervention: Randomized Controlled Trial
    Walthouwer, Michel Jean Louis
    Oenema, Anke
    Lechner, Lilian
    de Vries, Hein
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2015, 17 (09) : e222
  • [30] 3-D Facial Priors Guided Local-Global Motion Collaboration Transforms for One-Shot Talking-Head Video Synthesis
    Chen, Yilei
    Zeng, Rui
    Xiong, Shengwu
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2024, 70 (01) : 132 - 143