Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation

Cited by: 141
Authors
Zhou, Hang [1 ]
Sun, Yasheng [2 ,3 ]
Wu, Wayne [2 ,4 ]
Loy, Chen Change [4 ]
Wang, Xiaogang [1 ]
Liu, Ziwei [4 ]
Affiliations
[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[2] SenseTime Res, Hong Kong, Peoples R China
[3] Tokyo Inst Technol, Tokyo, Japan
[4] Nanyang Technol Univ, S Lab, Singapore, Singapore
Keywords
DOI
10.1109/CVPR46437.2021.00416
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
While accurate lip synchronization has been achieved for arbitrary-subject audio-driven talking face generation, the problem of how to efficiently drive the head pose remains. Previous methods rely on pre-estimated structural information such as landmarks and 3D parameters, aiming to generate personalized rhythmic movements. However, the inaccuracy of such estimated information under extreme conditions would lead to degradation problems. In this paper, we propose a clean yet effective framework to generate pose-controllable talking faces. We operate on non-aligned raw face images, using only a single photo as an identity reference. The key is to modularize audio-visual representations by devising an implicit low-dimension pose code. Substantially, both speech content and head pose information lie in a joint non-identity embedding space. While speech content information can be defined by learning the intrinsic synchronization between audio-visual modalities, we identify that a pose code will be complementarily learned in a modulated convolution-based reconstruction framework. Extensive experiments show that our method generates accurately lip-synced talking faces whose poses are controllable by other videos. Moreover, our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
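The abstract describes splitting the representation into identity features from a single reference photo, speech content learned from audio-visual synchronization, and an implicit low-dimensional pose code, which are recombined in a modulated convolution-based generator. The following is a minimal PyTorch sketch of that idea only, not the authors' released implementation; all module names, tensor sizes (e.g., a 12-D pose code and an 80x16 mel-spectrogram input), and the simplified StyleGAN2-style modulated convolution are assumptions made for illustration.

```python
# Illustrative sketch of the modularized audio-visual representation described in
# the abstract. NOT the paper's official code; shapes and modules are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """Convolution whose weights are scaled by a per-sample condition vector
    (StyleGAN2-style), used here to inject the joint non-identity code."""
    def __init__(self, in_ch, out_ch, style_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.affine = nn.Linear(style_dim, in_ch)  # condition -> per-channel scale
        self.pad = k // 2

    def forward(self, x, style):
        b, c, h, w = x.shape
        scale = self.affine(style).view(b, 1, c, 1, 1)            # (B,1,Cin,1,1)
        wgt = self.weight.unsqueeze(0) * scale                    # modulate
        demod = torch.rsqrt((wgt ** 2).sum(dim=(2, 3, 4)) + 1e-8)  # demodulate
        wgt = wgt * demod.view(b, -1, 1, 1, 1)
        wgt = wgt.view(-1, c, *self.weight.shape[2:])
        x = x.view(1, b * c, h, w)                                 # grouped conv trick
        out = F.conv2d(x, wgt, padding=self.pad, groups=b)
        return out.view(b, -1, h, w)

class TalkingFaceSketch(nn.Module):
    """Identity comes from one reference photo; speech content (from audio) and an
    implicit low-dimensional pose code form a joint non-identity condition."""
    def __init__(self, id_dim=256, content_dim=256, pose_dim=12):
        super().__init__()
        self.id_enc = nn.Sequential(nn.Conv2d(3, 64, 7, 2, 3), nn.ReLU(),
                                    nn.Conv2d(64, id_dim, 3, 2, 1), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(80 * 16, content_dim), nn.ReLU())
        self.pose_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, pose_dim))
        self.mod_conv = ModulatedConv2d(id_dim, 64, content_dim + pose_dim)
        self.to_rgb = nn.Conv2d(64, 3, 3, 1, 1)

    def forward(self, ref_img, mel, pose_src_img):
        id_feat = self.id_enc(ref_img)            # identity features from one photo
        content = self.audio_enc(mel)             # lip-sync content from audio
        pose = self.pose_enc(pose_src_img)        # implicit pose code from a driving frame
        cond = torch.cat([content, pose], dim=1)  # joint non-identity embedding
        return torch.tanh(self.to_rgb(self.mod_conv(id_feat, cond)))

# Usage: one frame conditioned on audio (content) and another video's frame (pose).
ref = torch.randn(2, 3, 224, 224)
mel = torch.randn(2, 1, 80, 16)
pose_frame = torch.randn(2, 3, 224, 224)
frame = TalkingFaceSketch()(ref, mel, pose_frame)  # (2, 3, 56, 56) low-res output
```

In the paper's framing the pose code is learned implicitly inside the reconstruction objective rather than from estimated landmarks or 3D parameters; in this sketch a driving frame simply stands in as the pose source.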
Pages: 4174 - 4184
Page count: 11
Related Papers (50 in total)
  • [1] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
    Zhou, Hang
    Liu, Yu
    Liu, Ziwei
    Luo, Ping
    Wang, Xiaogang
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9299 - 9306
  • [2] Audio-visual talking face detection
    Li, MK
    Li, DG
    Dimitrova, N
    Sethi, I
    2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL II, PROCEEDINGS, 2003, : 473 - 476
  • [3] Multi-Modal Driven Pose-Controllable Talking Head Generation
    Sun, Kuiyuan
    Liu, Xiaolong
    Li, Xiaolong
    Zhao, Yao
    Wang, Wei
    ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20 (12)
  • [4] Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation
    Sun, Yasheng
    Zhou, Hang
    Liu, Ziwei
    Koike, Hideki
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 1018 - 1024
  • [5] Arbitrary Talking Face Generation via Attentional Audio-Visual Coherence Learning
    Zhu, Hao
    Huang, Huaibo
    Li, Yi
    Zheng, Aihua
    He, Ran
    PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 2362 - 2368
  • [6] An audio-visual imposture scenario by talking face animation
    Karam, W
    Mokbel, C
    Greige, H
    Aversano, G
    Pelachaud, C
    Chollet, G
    NONLINEAR SPEECH MODELING AND APPLICATIONS, 2005, 3445 : 365 - 369
  • [7] AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation
    Sun, Yasheng
    Chu, Wenqing
    Zhou, Hang
    Wang, Kaisiyuan
    Koike, Hideki
    IEEE ACCESS, 2024, 12 : 57288 - 57301
  • [8] Expressive Talking Head Generation with Granular Audio-Visual Control
    Liang, Borong
    Pan, Yan
    Guo, Zhizhi
    Zhou, Hang
    Hong, Zhibin
    Han, Xiaoguang
    Han, Junyu
    Liu, Jingtuo
    Ding, Errui
    Wang, Jingdong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 3377 - 3386
  • [9] Audio-Visual Face Reenactment
    Agarwal, Madhav
    Mukhopadhyay, Rudrabha
    Namboodiri, Vinay
    Jawahar, C. V.
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 5167 - 5176
  • [10] Audio-visual speech synchrony measure for talking-face identity verification
    Bredin, Herve
    Chollet, Gerard
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PTS 1-3, 2007, : 233 - +