TEXT2VIDEO: TEXT-DRIVEN TALKING-HEAD VIDEO SYNTHESIS WITH PERSONALIZED PHONEME-POSE DICTIONARY

Cited by: 11
Authors
Zhang, Sibo [1 ]
Yuan, Jiahong [1 ]
Liao, Miao [1 ]
Zhang, Liangjun [1 ]
Affiliations
[1] Baidu Res, Sunnyvale, CA 94089 USA
Keywords
Text-to-Video Synthesis; Multi-modal Processing; Phoneme-Pose; Generative Adversarial Networks; Generation
DOI
10.1109/ICASSP43922.2022.9747380
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
With the advance of deep learning, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has several advantages: 1) it needs only about one minute of training data, significantly less than audio-driven approaches; 2) it is more flexible and not vulnerable to speaker variation; 3) it significantly reduces the preprocessing and training time from several days for audio-based methods to 4 hours, roughly 10 times faster. We perform extensive experiments comparing the proposed method with state-of-the-art talking-face generation methods on a benchmark dataset and on datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
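Since the record gives no implementation details, the following is a minimal Python sketch of the dictionary-lookup-and-interpolation idea mentioned in the abstract: each phoneme is mapped to a key pose, and per-frame poses are obtained by linear interpolation between consecutive key poses. All names here (phoneme_pose_dict, interpolate_poses, the toy 3-dimensional pose vectors, fps=25) are illustrative assumptions, not the paper's actual data structures; the interpolated pose sequence would subsequently be rendered into video frames by the GAN generator, which this sketch does not cover.

import numpy as np

# Hypothetical dictionary mapping a phoneme label to a key pose
# (e.g. a flattened vector of facial landmark coordinates).
phoneme_pose_dict = {
    "AA":  np.array([0.10, 0.42, 0.73]),
    "B":   np.array([0.05, 0.38, 0.70]),
    "SIL": np.array([0.00, 0.35, 0.68]),  # silence / rest pose
}

def interpolate_poses(phonemes, durations, fps=25):
    """Linearly interpolate key poses between consecutive phonemes.

    phonemes  -- phoneme labels aligned to the input text
    durations -- duration of each phoneme in seconds
    Returns an array of per-frame poses for the video generator.
    """
    frames = []
    next_phonemes = phonemes[1:] + ["SIL"]  # end the sequence on the rest pose
    for p_cur, p_next, dur in zip(phonemes, next_phonemes, durations):
        start = phoneme_pose_dict[p_cur]
        end = phoneme_pose_dict[p_next]
        n = max(1, int(round(dur * fps)))  # frames covered by this phoneme
        for t in range(n):
            alpha = t / n
            frames.append((1.0 - alpha) * start + alpha * end)
    return np.stack(frames)

# Toy usage: a short phoneme sequence with per-phoneme durations in seconds.
pose_sequence = interpolate_poses(["SIL", "B", "AA"], [0.10, 0.08, 0.20])
print(pose_sequence.shape)  # (num_frames, pose_dim)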
Pages: 2659-2663
Number of pages: 5