Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head

Cited by: 2
Authors
Wang, Zhangjing [1 ]
He, Wenzhi [1 ]
Wei, Yujiang [1 ]
Luo, Yupeng [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Informat & Commun Engn, Chengdu 611731, Peoples R China
Keywords
Audio-visual; Talking face; Video synthesis; Multimodal; Speech-driven face animation; Cross-modality generation; BLIND QUALITY ASSESSMENT; SPEECH
DOI
10.1016/j.displa.2023.102552
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Audio-visual cross-modality generation refers to generating audio or visual content from input in another modality. A key task in this field is generating realistic talking face videos from audio and head-pose information, which has significant applications in human-computer interaction, virtual reality, and video production. However, previous work has limitations such as the inability to generate natural head poses or to make them interact with the audio, which compromises the realism and expressiveness of the generated videos. This paper aims to address these issues and advance the state of the art in this field. To this end, we propose an autoregressive generation method called Flow2Flow and collect AVVS, a large-scale in-the-wild solo-singing audio-visual dataset, to investigate rhythmic head-movement patterns. The Flow2Flow model contains a multimodal transformer block with cross-attention that encodes audio features and historical head poses to establish a latent audio-visual motion entanglement, and it uses normalizing flows to generate future facial motion representation sequences. The generated motion representations are identity-independent, allowing the method to be transferred to any face identity. We model the motion of image content with warping flows derived from 3D keypoints based on the facial motion representation sequences, carefully manipulate the animation generation, and estimate dense motion fields from the deformation flows with a neural rendering model to produce photo-realistic talking face videos. Experimental results show that our method generates photo-realistic videos with natural head poses and accurate lip-syncing, and we validate its effectiveness against state-of-the-art methods on two public datasets.
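To make the described pipeline concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract names: a cross-attention transformer block that fuses audio features with historical head poses, and a conditional affine-coupling normalizing flow (RealNVP-style) that samples the next motion representation. All module names, dimensions, and the coupling design are illustrative assumptions, not the paper's released implementation.

    import torch
    import torch.nn as nn

    class CrossModalBlock(nn.Module):
        """Pose-history tokens attend to audio features (cross-attention),
        then self-refine. An assumed stand-in for the paper's multimodal
        transformer block, not the authors' code."""
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))
            self.n1, self.n2, self.n3 = (nn.LayerNorm(dim), nn.LayerNorm(dim),
                                         nn.LayerNorm(dim))

        def forward(self, pose_hist, audio_feat):
            x = pose_hist
            # Audio conditions the pose stream via cross-attention.
            x = self.n1(x + self.cross_attn(x, audio_feat, audio_feat)[0])
            x = self.n2(x + self.self_attn(x, x, x)[0])
            return self.n3(x + self.ffn(x))

    class AffineCoupling(nn.Module):
        """One conditional affine-coupling layer of a normalizing flow
        (RealNVP-style), shown in the sampling direction only."""
        def __init__(self, dim=256, cond_dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim // 2 + cond_dim, 512), nn.ReLU(),
                                     nn.Linear(512, dim))  # outputs scale and shift

        def forward(self, z, cond):
            z1, z2 = z.chunk(2, dim=-1)
            s, t = self.net(torch.cat([z1, cond], dim=-1)).chunk(2, dim=-1)
            # Invertible affine transform of half the dimensions.
            z2 = z2 * torch.exp(torch.tanh(s)) + t
            return torch.cat([z1, z2], dim=-1)

    # Autoregressive step: encode (audio, pose history), then push Gaussian
    # noise through the conditioned flow to sample the next identity-independent
    # motion code. Sequence lengths and feature sizes here are arbitrary.
    encoder = CrossModalBlock()
    flow = AffineCoupling()
    audio_feat = torch.randn(1, 50, 256)  # e.g. 50 frames of audio features
    pose_hist = torch.randn(1, 10, 256)   # last 10 head-pose / motion tokens
    cond = encoder(pose_hist, audio_feat).mean(dim=1)  # pooled multimodal context
    next_motion = flow(torch.randn(1, 256), cond)      # sampled motion representation

Conditioning the flow on a pooled multimodal context is one simple way to tie sampled head motion to the audio's rhythm; the actual model is autoregressive over longer histories. The abstract also describes warping image content by dense motion fields derived from deformation flows. A generic backward-warping helper of the kind such renderers build on (the paper's neural rendering model is more elaborate; this is only a sketch):

    import torch.nn.functional as F

    def warp(feature, flow_field):
        """Backward-warp a feature map (B, C, H, W) by a dense flow field
        (B, H, W, 2) given in normalized [-1, 1] coordinates."""
        B, C, H, W = feature.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        return F.grid_sample(feature, grid + flow_field, align_corners=True)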
Pages: 13
Related Papers
13 records in total
  • [1] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
    Zhou, Hang
    Liu, Yu
    Liu, Ziwei
    Luo, Ping
    Wang, Xiaogang
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9299 - 9306
  • [2] Expressive Talking Head Generation with Granular Audio-Visual Control
    Liang, Borong
    Pan, Yan
    Guo, Zhizhi
    Zhou, Hang
    Hong, Zhibin
    Han, Xiaoguang
    Han, Junyu
    Liu, Jingtuo
    Ding, Errui
    Wang, Jingdong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 3377 - 3386
  • [3] Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset
    Zhang, Zhimeng
    Li, Lincheng
    Ding, Yu
    Fan, Changjie
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 3660 - 3669
  • [4] Audio-Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model
    Lee, Yong-Hyeok
    Jang, Dong-Won
    Kim, Jae-Bin
    Park, Rae-Hong
    Park, Hyung-Min
    APPLIED SCIENCES-BASEL, 2020, 10 (20) : 1 - 18
  • [5] Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Luo, Lei
    Zhang, Zhenyu
    Yang, Jian
    Yan, Yan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 7878 - 7888
  • [6] Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
    Zhu, Hao
    Huang, Huaibo
    Li, Yi
    Zheng, Aihua
    He, Ran
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 2362 - 2368
  • [7] Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
    Zhou, Hang
    Sun, Yasheng
    Wu, Wayne
    Loy, Chen Change
    Wang, Xiaogang
    Liu, Ziwei
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4174 - 4184
  • [8] MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
    Cheng, Xize
    Jin, Tao
    Huang, Rongjie
    Li, Linjun
    Lin, Wang
    Wang, Zehan
    Wang, Ye
    Liu, Huadai
    Yin, Aoxiong
    Zhao, Zhou
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 15689 - 15699
  • [9] Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing
    Lin, Yan-Bo
    Tseng, Hung-Yu
    Lee, Hsin-Ying
    Lin, Yen-Yu
    Yang, Ming-Hsuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [10] AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation
    Sun, Yasheng
    Chu, Wenqing
    Zhou, Hang
    Wang, Kaisiyuan
    Koike, Hideki
    IEEE ACCESS, 2024, 12 : 57288 - 57301