Multi-Modal Driven Pose-Controllable Talking Head Generation

Cited by: 0
Authors
Sun, Kuiyuan [1 ]
Liu, Xiaolong [1 ]
Li, Xiaolong [1 ]
Zhao, Yao [1 ]
Wang, Wei [1 ]
Affiliations
[1] Institute of Information Science, Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing Jiaotong University, Beijing, China
Keywords
DOI
10.1145/3673901
Chinese Library Classification
G2 [Information and Knowledge Dissemination]
Discipline Classification Code
05; 0503
Abstract
Talking head generation, which drives a source image to produce a talking video from driving information in another modality, has made great progress in recent years. However, two main issues remain: (1) existing methods are designed to use only a single modality of driving information, and (2) most methods cannot control head pose. To address these problems, we propose a novel framework that uses multi-modal information to generate a talking head video while achieving arbitrary head pose control via a movement sequence. Specifically, first, to extend the driving information to multiple modalities, multi-modal inputs are encoded into a unified semantic latent space from which expression parameters are generated. Second, to disentangle attributes, a 3D Morphable Model (3DMM) is used to obtain identity information from the source image, and translation and rotation information from the target image. Third, to control head pose and mouth shape, the source image is warped by a motion field generated from the expression, translation, and angle parameters. Finally, all of the above parameters are used to render a landmark map, and the warped source image is combined with the landmark map to generate a refined talking head video. Our experimental results demonstrate that the proposed method achieves state-of-the-art performance in visual quality, lip-audio synchronization, and head pose control. © 2024 Copyright held by the owner/author(s)
Related Papers
50 records in total
  • [21] Automatic report generation based on multi-modal information
    Jing Zhang
    Xiaoxue Li
    Weizhi Nie
    Yuting Su
    Multimedia Tools and Applications, 2017, 76: 12005-12015
  • [22] Multi-modal human robot interaction for map generation
    Saito, H
    Ishimura, K
    Hattori, M
    Takamori, T
    SICE 2002: PROCEEDINGS OF THE 41ST SICE ANNUAL CONFERENCE, VOLS 1-5, 2002: 2721-2724
  • [23] Multi-modal visual tracking based on textual generation
    Wang, Jiahao
    Liu, Fang
    Jiao, Licheng
    Wang, Hao
    Li, Shuo
    Li, Lingling
    Chen, Puhua
    Liu, Xu
    INFORMATION FUSION, 2024, 112
  • [24] Multi-modal human robot interaction for map generation
    Ghidary, SS
    Nakata, Y
    Saito, H
    Hattori, M
    Takamori, T
    IROS 2001: PROCEEDINGS OF THE 2001 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, VOLS 1-4: EXPANDING THE SOCIETAL ROLE OF ROBOTICS IN THE NEXT MILLENNIUM, 2001: 2246-2251
  • [25] Generation of Visual Representations for Multi-Modal Mathematical Knowledge
    Wu, Lianlong
    Choi, Seewon
    Raggi, Daniel
    Stockdill, Aaron
    Garcia, Grecia Garcia
    Colarusso, Fiorenzo
    Cheng, Peter C. H.
    Jamnik, Mateja
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024: 23850-23852
  • [26] Collaborative Diffusion for Multi-Modal Face Generation and Editing
    Huang, Ziqi
    Chan, Kelvin C. K.
    Jiang, Yuming
    Liu, Ziwei
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023: 6080-6090
  • [27] Multi-modal feature robotic arm grasping pose detection with attention mechanism
    Chu H.-Y.
    Leng Q.-Q.
    Zhang X.-Q.
    Chang Z.-Y.
    Shao Y.-H.
    Kongzhi yu Juece/Control and Decision, 2024, 39(03): 777-785
  • [28] Human head detection using multi-modal object features
    Luo, Y
    Murphey, YL
    Khairallah, F
    PROCEEDINGS OF THE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS 2003, VOLS 1-4, 2003: 2134-2139
  • [29] Recovering 6D Object Pose: A Review and Multi-modal Analysis
    Sahin, Caner
    Kim, Tae-Kyun
    COMPUTER VISION - ECCV 2018 WORKSHOPS, PT VI, 2019, 11134: 15-31
  • [30] Head motion generation for speech-driven talking avatar
    Xie, L. (lxie@nwpu.edu.cn), Tsinghua University, 53