One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning

Times Cited: 0
Authors
Wang, Suzhen [1 ]
Li, Lincheng [1 ]
Ding, Yu [1 ]
Yu, Xin [2 ]
Affiliations
[1] NetEase Fuxi AI Lab, Virtual Human Group, Hangzhou, People's Republic of China
[2] University of Technology Sydney, Sydney, NSW, Australia
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, the videos they create often suffer from unnatural mouth shapes and out-of-sync lips, because these methods struggle to learn a consistent speech style from different speakers. We observe that it is much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework that learns consistent correlations between audio and visual motions from a specific speaker and then transfers audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that infers talking motions, represented by keypoint-based dense motion fields, from input audio. In particular, since audio may come from different identities at deployment time, we incorporate phonemes to represent the audio signals. In this manner, our AVCT inherently generalizes to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker and thus allows us to readily manipulate face images of different identities. Since different face shapes lead to different motions, a motion field transfer module is employed to reduce the gap between the audio-driven dense motion fields of the training identity and of the one-shot reference. Once the dense motion field of the reference image is obtained, we employ an image renderer to generate its talking face video from an audio clip. Thanks to the learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state of the art in visual quality and lip synchronization.
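As an aid to reading the abstract, the following PyTorch code is a minimal sketch of the described pipeline: phoneme inputs pass through a transformer (standing in for AVCT) that predicts per-frame keypoint motions, which are expanded into a dense motion field used to warp the one-shot reference image before rendering. All module names, layer sizes, the keypoint count, and the soft-assignment warping scheme are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the abstract's pipeline: phonemes -> transformer
# -> per-frame keypoint motions -> dense motion field -> warped reference
# image -> renderer input. Names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVCTSketch(nn.Module):
    """Audio-Visual Correlation Transformer (illustrative stand-in)."""
    def __init__(self, n_phonemes=40, d_model=256, n_kp=10):
        super().__init__()
        # Phoneme embeddings keep the audio input identity-agnostic.
        self.phone_emb = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Predict a 2D offset per keypoint per frame.
        self.to_kp_motion = nn.Linear(d_model, n_kp * 2)

    def forward(self, phoneme_ids):                      # (B, T) phoneme indices
        h = self.encoder(self.phone_emb(phoneme_ids))    # (B, T, d_model)
        out = self.to_kp_motion(h)                       # (B, T, n_kp*2)
        return out.reshape(*out.shape[:-1], -1, 2)       # (B, T, n_kp, 2)

def dense_motion_from_keypoints(kp_pos, kp_motion, size=64, sigma=0.1):
    """Spread sparse keypoint offsets into a dense flow field by soft
    (Gaussian-like) assignment of pixels to keypoints -- an assumed
    stand-in for the paper's keypoint-based dense motion fields."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                            torch.linspace(-1, 1, size), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1)                      # (H, W, 2)
    d2 = ((grid[None] - kp_pos[:, None, None]) ** 2).sum(-1)  # (n_kp, H, W)
    w = torch.softmax(-d2 / sigma, dim=0)                     # pixel-to-keypoint weights
    return (w[..., None] * kp_motion[:, None, None]).sum(0)   # (H, W, 2)

# Usage: animate a one-shot reference image from a dummy phoneme sequence.
model = AVCTSketch()
motions = model(torch.randint(0, 40, (1, 25)))          # 25 frames of keypoint motion
kp_pos = torch.rand(10, 2) * 2 - 1                      # placeholder canonical keypoints
flow = dense_motion_from_keypoints(kp_pos, motions[0, 0])
ref = torch.rand(1, 3, 64, 64)                          # the one-shot reference image
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64),
                        torch.linspace(-1, 1, 64), indexing="ij")
base = torch.stack([xs, ys], dim=-1)[None]              # identity sampling grid
warped = F.grid_sample(ref, base + flow[None], align_corners=True)
# `warped` would then be refined by the image renderer into an output frame.
```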
Pages: 2531-2539
Page count: 9
Related Papers
16 in total
  • [1] Flow-guided One-shot Talking Face Generation with a High-resolution Audio-visual Dataset
    Zhang, Zhimeng; Li, Lincheng; Ding, Yu; Fan, Changjie
    CVPR 2021: 3660-3669
  • [2] Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
    Zhu, Hao; Huang, Huaibo; Li, Yi; Zheng, Aihua; He, Ran
    IJCAI 2020: 2362-2368
  • [3] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
    Zhou, Hang; Liu, Yu; Liu, Ziwei; Luo, Ping; Wang, Xiaogang
    AAAI 2019: 9299-9306
  • [4] AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation
    Sun, Yasheng; Chu, Wenqing; Zhou, Hang; Wang, Kaisiyuan; Koike, Hideki
    IEEE Access, 2024, 12: 57288-57301
  • [5] Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
    Zhou, Hang; Sun, Yasheng; Wu, Wayne; Loy, Chen Change; Wang, Xiaogang; Liu, Ziwei
    CVPR 2021: 4174-4184
  • [6] Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion
    Wang, Suzhen; Li, Lincheng; Ding, Yu; Fan, Changjie; Yu, Xin
    IJCAI 2021: 1098-1105
  • [7] Learning One-Shot Exemplar SVM from the Web for Face Verification
    Song, Fengyi; Tan, Xiaoyang
    Computer Vision - ACCV 2014, Pt. III, 2015, 9005: 408-422
  • [8] Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
    Wang, Zhangjing; He, Wenzhi; Wei, Yujiang; Luo, Yupeng
    Displays, 2023, 80
  • [9] Attentive One-Shot Meta-Imitation Learning From Visual Demonstration
    Bhutani, Vishal; Majumder, Anima; Vankadari, Madhu; Dutta, Samrat; Asati, Aaditya; Kumar, Swagat
    ICRA 2022: 8584-8590
  • [10] StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN
    Yin, Fei; Zhang, Yong; Cun, Xiaodong; Cao, Mingdeng; Fan, Yanbo; Wang, Xuan; Bai, Qingyan; Wu, Baoyuan; Wang, Jue; Yang, Yujiu
    Computer Vision - ECCV 2022, Pt. XVII, 13677: 85-101