Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Cited by: 2

Authors:

Wang, Zhangjing [1]

He, Wenzhi [1]

Wei, Yujiang [1]

Luo, Yupeng [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China

Source:

DISPLAYS | 2023 / Volume 80

Keywords:

Audio-visual; Talking face; Video synthesis; Multimodal; Speech-driven face animation; Cross-modality generation; BLIND QUALITY ASSESSMENT; SPEECH

DOI:

10.1016/j.displa.2023.102552

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Audio-visual cross-modality generation refers to the generation of audio or visual content based on input from another modality. One of the key tasks in this field is the generation of realistic talking facial videos from audio and head pose information, which has significant applications in human-computer interaction, virtual reality, and video production. However, previous work has limitations such as the inability to generate natural head poses or interact with audio, which compromises the realism and expressive power of the generated videos. This paper aims to address these issues and improve the state of the art in this field. To this end, we propose an autoregressive generation method called Flow2Flow and collect a large-scale, in-the-wild, solo-singing-themed audio-visual dataset called AVVS to investigate rhythmic head movement patterns. The Flow2Flow model involves a multimodal transformer block with cross-attention, which encodes audio features and historical head poses to establish potential audio-visual motion entanglement, and uses normalizing flows to generate future facial motion representation sequences. The generated motion representations are identity-independent, allowing the method to be transferred to any face identity. We model the motion of image content using warping flows generated from 3D keypoints based on the facial motion representation sequences, carefully manipulate animation generation, and estimate dense motion fields based on deformation flows using a neural rendering model to present photo-realistic talking facial videos. Experimental results show that our proposed method generates photo-realistic videos with natural head poses and lip-syncing, and we validate the effectiveness of our method compared to state-of-the-art methods on two public datasets.
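The abstract mentions normalizing flows conditioned on encoded audio and pose history, but does not state which flow architecture is used. As a rough illustration only, the sketch below shows one conditional affine coupling step (in the style of RealNVP, which is an assumption here and not confirmed by this record). The names `coupling_forward`, `coupling_inverse`, `toy_scale`, and `toy_shift` are hypothetical, and the toy functions stand in for what would be learned networks.

```python
import math

def toy_scale(cond, n):
    # Hypothetical stand-in for a learned network: bounded log-scale per output dim.
    return [math.tanh(0.1 * sum(cond))] * n

def toy_shift(cond, n):
    # Hypothetical stand-in for a learned network: shift per output dim.
    return [0.5 * sum(cond)] * n

def coupling_forward(x, context, scale_fn=toy_scale, shift_fn=toy_shift):
    """Transform the second half of x, conditioned on the first half and a context vector.

    Returns the transformed vector and the log-determinant of the Jacobian.
    """
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    cond = x_a + context  # list concatenation: untouched half plus conditioning features
    s = scale_fn(cond, len(x_b))
    t = shift_fn(cond, len(x_b))
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    return x_a + y_b, sum(s)

def coupling_inverse(y, context, scale_fn=toy_scale, shift_fn=toy_shift):
    """Exactly invert coupling_forward given the same context."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    cond = y_a + context
    s = scale_fn(cond, len(y_b))
    t = shift_fn(cond, len(y_b))
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return y_a + x_b

# Usage: a 4-dim motion vector and a 2-dim context (e.g. encoded audio/pose features).
motion = [0.2, -0.1, 0.4, 0.3]
context = [1.0, 0.5]
latent, log_det = coupling_forward(motion, context)
recovered = coupling_inverse(latent, context)
```

Because the first half passes through unchanged, the same scale and shift can be recomputed at inversion time, which is what makes the step exactly invertible with a cheap log-determinant.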

Citations

Pages: 13

13 entries in total

[1] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Zhou, Hang
Liu, Yu
Liu, Ziwei
Luo, Ping
Wang, Xiaogang
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9299 - 9306
[2] Expressive Talking Head Generation with Granular Audio-Visual Control
Liang, Borong
Pan, Yan
Guo, Zhizhi
Zhou, Hang
Hong, Zhibin
Han, Xiaoguang
Han, Junyu
Liu, Jingtuo
Ding, Errui
Wang, Jingdong
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 3377 - 3386
[3] Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset
Zhang, Zhimeng
Li, Lincheng
Ding, Yu
Fan, Changjie
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 3660 - 3669
[4] Audio-Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model
Lee, Yong-Hyeok
Jang, Dong-Won
Kim, Jae-Bin
Park, Rae-Hong
Park, Hyung-Min
APPLIED SCIENCES-BASEL, 2020, 10 (20): : 1 - 18
[5] Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Xuan, Hanyu
Luo, Lei
Zhang, Zhenyu
Yang, Jian
Yan, Yan
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7878 - 7888
[6] Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
Zhu, Hao
Huang, Huaibo
Li, Yi
Zheng, Aihua
He, Ran
PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 2362 - 2368
[7] Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
Zhou, Hang
Sun, Yasheng
Wu, Wayne
Loy, Chen Change
Wang, Xiaogang
Liu, Ziwei
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4174 - 4184
[8] MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
Cheng, Xize
Jin, Tao
Huang, Rongjie
Li, Linjun
Lin, Wang
Wang, Zehan
Wang, Ye
Liu, Huadai
Yin, Aoxiong
Zhao, Zhou
2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15689 - 15699
[9] Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing
Lin, Yan-Bo
Tseng, Hung-Yu
Lee, Hsin-Ying
Lin, Yen-Yu
Yang, Ming-Hsuan
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[10] AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation
Sun, Yasheng
Chu, Wenqing
Zhou, Hang
Wang, Kaisiyuan
Koike, Hideki
IEEE ACCESS, 2024, 12 : 57288 - 57301
