Multi-Modal Driven Pose-Controllable Talking Head Generation

Cited by: 0
Authors
Sun, Kuiyuan [1 ]
Liu, Xiaolong [1 ]
Li, Xiaolong [1 ]
Zhao, Yao [1 ]
Wang, Wei [1 ]
Affiliations
[1] Institute of Information Science, Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing Jiaotong University, Beijing, China
DOI
10.1145/3673901
CLC Classification
G2 [Information and Knowledge Dissemination];
Discipline Codes
05 ; 0503 ;
Abstract
Talking head generation, which drives a source image to produce a talking video using information from another modality, has made great progress in recent years. However, two main issues remain: (1) existing methods are designed to exploit only a single driving modality, and (2) most methods cannot control head pose. To address these problems, we propose a novel framework that utilizes multi-modal information to generate a talking head video while achieving arbitrary head-pose control from a movement sequence. Specifically, first, to extend the driving information to multiple modalities, multi-modal inputs are encoded into a unified semantic latent space, from which expression parameters are generated. Second, to disentangle attributes, a 3D Morphable Model (3DMM) is used to obtain identity information from the source image, and translation and rotation information from the target image. Third, to control head pose and mouth shape, the source image is warped by a motion field generated from the expression, translation, and angle parameters. Finally, all of the above parameters are used to render a landmark map, and the warped source image is combined with the landmark map to generate a refined talking head video. Experimental results demonstrate that the proposed method achieves state-of-the-art performance in visual quality, lip-audio synchronization, and head-pose control. © 2024 Copyright held by the owner/author(s)
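The abstract describes a four-stage pipeline (modality encoding, 3DMM attribute disentanglement, motion-field warping, landmark-guided rendering). The following is a minimal structural sketch of that data flow only; every function here is a stub with hypothetical names and shapes, not the authors' implementation.

```python
# Conceptual sketch of the pipeline described in the abstract.
# All names, dimensions, and return values are illustrative stubs.
import numpy as np

def encode_modality(signal: np.ndarray) -> np.ndarray:
    """Map any driving modality (audio, text, ...) into a shared
    semantic latent space; stubbed as mean-pooling to 64 dims."""
    return np.resize(signal.mean(axis=0), 64)

def latent_to_expression(latent: np.ndarray) -> np.ndarray:
    """Predict 3DMM expression parameters from the unified latent."""
    return latent[:32]  # stub: first 32 dims stand in for expression

def fit_3dmm(image: np.ndarray) -> dict:
    """Stub 3DMM fit. In the paper, identity comes from the source
    image and rotation/translation from the target image."""
    return {
        "identity": np.zeros(80),
        "rotation": np.zeros(3),
        "translation": np.zeros(3),
    }

def warp(source: np.ndarray, expression, rotation, translation) -> np.ndarray:
    """Motion-field warping stub: a dense motion field built from
    expression/rotation/translation would move source pixels."""
    return source  # identity warp as placeholder

def render_landmarks(identity, expression, rotation, translation, hw):
    """Project 3DMM parameters to a landmark map of size hw."""
    return np.zeros(hw)

def generate_frame(source_img, driving_signal, target_img):
    latent = encode_modality(driving_signal)
    expr = latent_to_expression(latent)
    src_params = fit_3dmm(source_img)   # identity
    tgt_params = fit_3dmm(target_img)   # pose (rotation/translation)
    warped = warp(source_img, expr,
                  tgt_params["rotation"], tgt_params["translation"])
    lmk = render_landmarks(src_params["identity"], expr,
                           tgt_params["rotation"], tgt_params["translation"],
                           source_img.shape[:2])
    # final generator combines warped image + landmark map
    # (stubbed here as channel-wise concatenation)
    return np.concatenate([warped, lmk[..., None]], axis=-1)

frame = generate_frame(np.zeros((256, 256, 3)),   # source image
                       np.zeros((100, 40)),       # e.g. audio features
                       np.zeros((256, 256, 3)))   # target pose frame
print(frame.shape)  # (256, 256, 4)
```

The point of the sketch is the separation of concerns: expression is driven by the (arbitrary-modality) signal, while head pose is taken from the target sequence, which is what enables pose control independent of the driving audio or text.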
Related Papers
50 results
  • [1] Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation
    Zhou, Hang
    Sun, Yasheng
    Wu, Wayne
    Loy, Chen Change
    Wang, Xiaogang
    Liu, Ziwei
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 4174 - 4184
  • [2] Audio-Semantic Enhanced Pose-Driven Talking Head Generation
    Liu M.
    Li D.
    Li Y.
    Song X.
    Nie L.
IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34 (11): 1 - 1
  • [3] Asymmetry-aware bilinear pooling in multi-modal data for head pose estimation
    Chen, Jiazhong
    Li, Qingqing
    Ren, Dakai
    Cao, Hua
    Ling, Hefei
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2023, 110
  • [4] Unified losses for multi-modal pose coding and regression
    Johnson, Leif
    Cooper, Joseph
    Ballard, Dana
    2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013,
  • [5] A Multi-Modal Story Generation Framework with AI-Driven Storyline Guidance
    Kim, Juntae
    Heo, Yoonseok
    Yu, Hogeon
    Nang, Jongho
    ELECTRONICS, 2023, 12 (06)
  • [6] Attention driven multi-modal similarity learning
    Gao, Xinjian
    Mu, Tingting
    Goulermas, John Y.
    Wang, Meng
    INFORMATION SCIENCES, 2018, 432 : 530 - 542
  • [7] High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning
    Xu, Chao
    Zhu, Junwei
    Zhang, Jiangning
    Han, Yue
    Chu, Wenqing
    Tai, Ying
    Wang, Chengjie
    Xie, Zhifeng
    Liu, Yong
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6609 - 6619
  • [8] A Multi-Modal Chinese Poetry Generation Model
    Liu, Dayiheng
    Guo, Quan
    Li, Wubo
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [9] Deep Fusion for Multi-Modal 6D Pose Estimation
    Lin, Shifeng
    Wang, Zunran
    Zhang, Shenghao
    Ling, Yonggen
    Yang, Chenguang
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2023, : 1 - 10
  • [10] Multi-Modal Pose Representations for 6-DOF Object Tracking
Majcher, Mateusz
Kwolek, Bogdan
    Journal of Intelligent & Robotic Systems, 110 (4)