TEXT2VIDEO: TEXT-DRIVEN TALKING-HEAD VIDEO SYNTHESIS WITH PERSONALIZED PHONEME-POSE DICTIONARY

Cited by: 11
Authors
Zhang, Sibo [1 ]
Yuan, Jiahong [1 ]
Liao, Miao [1 ]
Zhang, Liangjun [1 ]
Affiliations
[1] Baidu Res, Sunnyvale, CA 94089 USA
Keywords
Text-to-Video Synthesis; Multi-modal Processing; Phoneme-Pose; Generative Adversarial Networks; Generation
DOI
10.1109/ICASSP43922.2022.9747380
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
With the advance of deep learning, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has several advantages: 1) it needs only about one minute of training data, significantly less than audio-driven approaches; 2) it is more flexible and not vulnerable to speaker variation; 3) it significantly reduces the preprocessing and training time from several days for audio-based methods to 4 hours, roughly 10 times faster. We perform extensive experiments comparing the proposed method with state-of-the-art talking-face generation methods on a benchmark dataset and on datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
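Since the record gives no implementation details, the following is a minimal Python sketch of the dictionary-lookup-and-interpolation idea mentioned in the abstract: each phoneme is mapped to a key pose, and per-frame poses are obtained by linear interpolation between consecutive key poses. All names here (phoneme_pose_dict, interpolate_poses, the toy 3-dimensional pose vectors, fps=25) are illustrative assumptions, not the paper's actual data structures; the interpolated pose sequence would subsequently be rendered into video frames by the GAN generator, which this sketch does not cover.

import numpy as np

# Hypothetical dictionary mapping a phoneme label to a key pose
# (e.g. a flattened vector of facial landmark coordinates).
phoneme_pose_dict = {
    "AA":  np.array([0.10, 0.42, 0.73]),
    "B":   np.array([0.05, 0.38, 0.70]),
    "SIL": np.array([0.00, 0.35, 0.68]),  # silence / rest pose
}

def interpolate_poses(phonemes, durations, fps=25):
    """Linearly interpolate key poses between consecutive phonemes.

    phonemes  -- phoneme labels aligned to the input text
    durations -- duration of each phoneme in seconds
    Returns an array of per-frame poses for the video generator.
    """
    frames = []
    next_phonemes = phonemes[1:] + ["SIL"]  # end the sequence on the rest pose
    for p_cur, p_next, dur in zip(phonemes, next_phonemes, durations):
        start = phoneme_pose_dict[p_cur]
        end = phoneme_pose_dict[p_next]
        n = max(1, int(round(dur * fps)))  # frames covered by this phoneme
        for t in range(n):
            alpha = t / n
            frames.append((1.0 - alpha) * start + alpha * end)
    return np.stack(frames)

# Toy usage: a short phoneme sequence with per-phoneme durations in seconds.
pose_sequence = interpolate_poses(["SIL", "B", "AA"], [0.10, 0.08, 0.20])
print(pose_sequence.shape)  # (num_frames, pose_dim)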
Pages: 2659-2663
Number of pages: 5