Multi-Modal Driven Pose-Controllable Talking Head Generation

Cited by: 0
Authors
Sun, Kuiyuan [1 ]
Liu, Xiaolong [1 ]
Li, Xiaolong [1 ]
Zhao, Yao [1 ]
Wang, Wei [1 ]
Affiliations
[1] Institute of Information Science, Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing Jiaotong University, Beijing, China
Keywords
DOI
10.1145/3673901
Chinese Library Classification
G2 [Information and Knowledge Dissemination]
Discipline Classification Code
05; 0503
Abstract
Talking head generation, which drives a source image to produce a talking video from driving information in another modality, has made great progress in recent years. However, two main issues remain: (1) existing methods are designed to use only a single modality of driving information, and (2) most methods cannot control head pose. To address these problems, we propose a novel framework that uses multi-modal information to generate a talking head video while achieving arbitrary head pose control via a movement sequence. Specifically, first, to extend the driving information to multiple modalities, multi-modal inputs are encoded into a unified semantic latent space from which expression parameters are generated. Second, to disentangle attributes, a 3D Morphable Model (3DMM) is used to obtain identity information from the source image, and translation and rotation information from the target image. Third, to control head pose and mouth shape, the source image is warped by a motion field generated from the expression, translation, and angle parameters. Finally, all of the above parameters are used to render a landmark map, and the warped source image is combined with the landmark map to generate a refined talking head video. Our experimental results demonstrate that the proposed method achieves state-of-the-art performance in visual quality, lip-audio synchronization, and head pose control. © 2024 Copyright held by the owner/author(s)
Related Papers
50 records in total
  • [21] Automatic report generation based on multi-modal information
    Jing Zhang
    Xiaoxue Li
    Weizhi Nie
    Yuting Su
    Multimedia Tools and Applications, 2017, 76: 12005-12015
  • [22] Multi-modal human robot interaction for map generation
    Saito, H
    Ishimura, K
    Hattori, M
    Takamori, T
    SICE 2002: PROCEEDINGS OF THE 41ST SICE ANNUAL CONFERENCE, VOLS 1-5, 2002: 2721-2724
  • [23] Multi-modal visual tracking based on textual generation
    Wang, Jiahao
    Liu, Fang
    Jiao, Licheng
    Wang, Hao
    Li, Shuo
    Li, Lingling
    Chen, Puhua
    Liu, Xu
    INFORMATION FUSION, 2024, 112
  • [24] Multi-modal human robot interaction for map generation
    Ghidary, SS
    Nakata, Y
    Saito, H
    Hattori, M
    Takamori, T
    IROS 2001: PROCEEDINGS OF THE 2001 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-4: EXPANDING THE SOCIETAL ROLE OF ROBOTICS IN THE NEXT MILLENNIUM, 2001: 2246-2251
  • [25] Generation of Visual Representations for Multi-Modal Mathematical Knowledge
    Wu, Lianlong
    Choi, Seewon
    Raggi, Daniel
    Stockdill, Aaron
    Garcia, Grecia Garcia
    Colarusso, Fiorenzo
    Cheng, Peter C. H.
    Jamnik, Mateja
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024: 23850-23852
  • [26] Collaborative Diffusion for Multi-Modal Face Generation and Editing
    Huang, Ziqi
    Chan, Kelvin C. K.
    Jiang, Yuming
    Liu, Ziwei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 6080-6090
  • [27] Multi-modal feature robotic arm grasping pose detection with attention mechanism
    Chu H.-Y.
    Leng Q.-Q.
    Zhang X.-Q.
    Chang Z.-Y.
    Shao Y.-H.
    Kongzhi yu Juece/Control and Decision, 2024, 39(03): 777-785
  • [28] Human head detection using multi-modal object features
    Luo, Y
    Murphey, YL
    Khairallah, F
    PROCEEDINGS OF THE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS 2003, VOLS 1-4, 2003: 2134-2139
  • [29] Recovering 6D Object Pose: A Review and Multi-modal Analysis
    Sahin, Caner
    Kim, Tae-Kyun
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT VI, 2019, 11134: 15-31
  • [30] Head motion generation for speech-driven talking avatar
    Xie, L. (lxie@nwpu.edu.cn), Tsinghua University, 53